
OpenAI Unveils GPT‑4o: An Omni‑Capable Multimodal Model Offered Free to All Users

OpenAI introduced GPT‑4o, a free, omni‑capable multimodal model that processes text, audio, and images in a single network and responds with near‑human latency. The launch featured striking live demos, and the model will soon be available through a discounted API, marking a significant step forward in end‑to‑end multimodal AI.


OpenAI announced its latest flagship model, GPT‑4o, which is free for all users and accepts any combination of text, audio, and image inputs to generate corresponding outputs, living up to the "omni" (all‑capable) in its name.

The model can respond to audio in as little as 232 ms (320 ms on average), matching human conversational response times, and supports seamless voice, vision, and text interaction without noticeable delay.

Live demonstrations showed GPT‑4o picking up on a speaker's breathing rhythm, responding with richer intonation, handling mid‑sentence interruptions, and sustaining real‑time, video‑call‑like conversations.

All ChatGPT Plus features, including vision, web browsing, memory, code execution, and the GPT Store, are now available to free users, and the GPT‑4o API will be offered at a 50% discount with twice the request speed.
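
For readers who want to try the new model once API access lands, the snippet below is a minimal sketch assuming the official `openai` Python SDK (v1+) and an `OPENAI_API_KEY` environment variable. The model identifier `gpt-4o` comes from the announcement; the prompt and surrounding script are purely illustrative.

```python
# Minimal sketch: calling GPT-4o through OpenAI's official Python SDK.
# Assumes `pip install openai` (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the GPT-4o launch in one sentence."},
    ],
)

print(response.choices[0].message.content)
```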

During the launch, CTO Mira Murati and President Greg Brockman presented live demos, including a translation scenario in which the model acted as a real‑time interpreter between English and Italian, and a playful exchange in which two instances of ChatGPT (one legacy, one the new model with vision) conversed and sang together.

The demos highlighted GPT‑4o's end‑to‑end training: unlike the previous three‑stage pipeline (speech‑to‑text → GPT‑3.5/4 → text‑to‑speech), the new model processes all modalities within a single neural network, eliminating the old system's average latencies of 2.8 seconds (GPT‑3.5) and 5.4 seconds (GPT‑4).
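
To make the architectural difference concrete, the sketch below contrasts the two designs. It is hypothetical pseudocode with stubbed latencies, not OpenAI's actual implementation; only the roughly 320 ms average figure comes from the announcement, and every function name and timing here is an illustrative placeholder.

```python
# Illustrative sketch (not OpenAI's code) of why chaining three models
# adds latency while a single end-to-end model avoids it.
import time

def speech_to_text(audio):   # stage 1: transcription model (loses tone, emotion)
    time.sleep(0.5)          # placeholder latency
    return "transcript"

def gpt4_generate(text):     # stage 2: text-only language model
    time.sleep(4.0)          # placeholder latency
    return "reply text"

def text_to_speech(text):    # stage 3: synthesis model (flat delivery)
    time.sleep(0.9)          # placeholder latency
    return b"reply audio"

def legacy_voice_mode(audio):
    """Three chained models: latencies add up at each handoff."""
    return text_to_speech(gpt4_generate(speech_to_text(audio)))

def omni_voice_mode(audio):
    """One network maps audio directly to audio in a single pass."""
    time.sleep(0.32)         # GPT-4o's reported average response time
    return b"reply audio"

for fn in (legacy_voice_mode, omni_voice_mode):
    start = time.perf_counter()
    fn(b"user audio")
    print(f"{fn.__name__}: {time.perf_counter() - start:.2f} s")
```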

Benchmarks show GPT‑4o surpassing specialized models such as Whisper‑v3 in speech translation and outperforming Gemini 1.0 Ultra and Claude 3 Opus in visual understanding.

A scholar quoted in the article remarked that a demo this successful is worth a thousand papers.

The article also reminds readers of the upcoming Google I/O conference on May 15 and hints at further OpenAI announcements in the near future.

Tags: multimodal AI, large language model, OpenAI, AI research, GPT-4o, voice interaction
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
