
OpenAI Unveils GPT‑4o: An Omni‑Capable Multimodal Model Offered Free to All Users

OpenAI introduced GPT‑4o, a free, omni‑capable multimodal model that processes text, audio, and images in a single network and responds with near‑human latency. The launch featured striking live demos, and the model will soon be available through a discounted API, marking a significant step forward in end‑to‑end multimodal AI.


OpenAI announced its latest flagship model, GPT‑4o, which is free for all users and accepts any combination of text, audio, and image inputs to generate corresponding outputs, living up to the "omni" (all‑capable) in its name.

The model can respond to audio in as little as 232 ms (320 ms on average), matching human conversational response times, and supports seamless voice, vision, and text interaction without noticeable delay.

Live demonstrations showed GPT‑4o picking up on a speaker's breathing rhythm, responding with richer intonation, handling mid‑sentence interruptions, and sustaining real‑time, video‑call‑like conversations.

All ChatGPT Plus features, including vision, web browsing, memory, code execution, and the GPT Store, are now available to free users, and the GPT‑4o API will be offered at a 50% discount with twice the request speed.
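
For readers who want to try the new model once API access lands, the snippet below is a minimal sketch assuming the official `openai` Python SDK (v1+) and an `OPENAI_API_KEY` environment variable. The model identifier `gpt-4o` comes from the announcement; the prompt and surrounding script are purely illustrative.

```python
# Minimal sketch: calling GPT-4o through OpenAI's official Python SDK.
# Assumes `pip install openai` (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the GPT-4o launch in one sentence."},
    ],
)

print(response.choices[0].message.content)
```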

During the launch, CTO Mira Murati and President Greg Brockman presented live demos, including a translation scenario in which the model acted as a real‑time interpreter between English and Italian, and a playful exchange in which two instances of ChatGPT (one legacy, one the new model with vision) conversed and sang together.

The demos highlighted GPT‑4o's end‑to‑end training: unlike the previous three‑stage pipeline (speech‑to‑text → GPT‑3.5/4 → text‑to‑speech), the new model processes all modalities within a single neural network, eliminating the old system's average latencies of 2.8 seconds (GPT‑3.5) and 5.4 seconds (GPT‑4).
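
To make the architectural difference concrete, the sketch below contrasts the two designs. It is hypothetical pseudocode with stubbed latencies, not OpenAI's actual implementation; only the roughly 320 ms average figure comes from the announcement, and every function name and timing here is an illustrative placeholder.

```python
# Illustrative sketch (not OpenAI's code) of why chaining three models
# adds latency while a single end-to-end model avoids it.
import time

def speech_to_text(audio):   # stage 1: transcription model (loses tone, emotion)
    time.sleep(0.5)          # placeholder latency
    return "transcript"

def gpt4_generate(text):     # stage 2: text-only language model
    time.sleep(4.0)          # placeholder latency
    return "reply text"

def text_to_speech(text):    # stage 3: synthesis model (flat delivery)
    time.sleep(0.9)          # placeholder latency
    return b"reply audio"

def legacy_voice_mode(audio):
    """Three chained models: latencies add up at each handoff."""
    return text_to_speech(gpt4_generate(speech_to_text(audio)))

def omni_voice_mode(audio):
    """One network maps audio directly to audio in a single pass."""
    time.sleep(0.32)         # GPT-4o's reported average response time
    return b"reply audio"

for fn in (legacy_voice_mode, omni_voice_mode):
    start = time.perf_counter()
    fn(b"user audio")
    print(f"{fn.__name__}: {time.perf_counter() - start:.2f} s")
```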

Benchmarks show GPT‑4o surpassing specialized models such as Whisper‑v3 in speech translation and outperforming Gemini 1.0 Ultra and Claude 3 Opus in visual understanding.

A scholar quoted in the article remarked that a demo this successful is worth a thousand papers.

The article also reminds readers of the upcoming Google I/O conference on May 15 and hints at further OpenAI announcements in the near future.

Tags: multimodal AI, large language model, OpenAI, AI research, GPT-4o, voice interaction
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
