
Google Gemini: Native Multimodal Model That Outperforms GPT‑4 on Benchmarks

Google's Gemini, a native multimodal model trained on TPU v4 and v5e and reportedly containing around a trillion parameters, was announced on December 6, 2023. According to its technical report, Gemini Ultra surpasses GPT-4 on 30 of 32 academic benchmarks and is the first model to exceed human-expert performance on MMLU. Gemini also powers the new AlphaCode 2 code-generation system.

Smart Era Software Development

Dec 7, 2023


Google unveiled Gemini on December 6, 2023, as a "native multimodal" large language model positioned against OpenAI's GPT-4. The model handles text, images, audio, video, and code from the outset, rather than stitching together separate modality-specific models.

Gemini was pre-trained jointly on multiple modalities and then fine-tuned with additional multimodal data, which lets it understand and reason across diverse inputs. The Gemini technical report does not disclose a parameter count; outside estimates put the model at roughly a trillion parameters, trained with about five times the compute used for GPT-4. Training ran on Google's Tensor Processing Units (TPU v4 and v5e).

In an evaluation on 32 academic benchmarks, Gemini Ultra outperformed GPT-4 on 30. On the Massive Multitask Language Understanding (MMLU) benchmark it scored 90.0%, making it the first model to exceed human-expert performance (89.8%), and on the MMMU multimodal benchmark it scored 59.4%. It also led on image benchmarks without help from an external OCR system, which points to strong multimodal reasoning.

Examples of the model's multimodal capabilities include summarizing a non-English audio clip followed by an English one, and answering a cooking question by interpreting spoken instructions together with a photo of the ingredients, then guiding the user step by step.

Building on Gemini, Google released AlphaCode 2, a code-generation system that solves nearly twice as many competitive-programming problems as the original AlphaCode and is estimated to perform better than about 85% of competition participants. As described in the AlphaCode 2 technical report, it combines a family of policy models, a sampling mechanism that produces diverse code samples, filtering, clustering, and a scoring model that selects the best solution.
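
The filter, cluster, and score stages described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not AlphaCode 2's implementation: the Candidate type, the function names, the probe inputs, and the length-based score are all hypothetical stand-ins (the real system samples candidates from Gemini-based models and uses a learned scoring model), and the cap of 10 submissions is only an illustrative default.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for one sampled program."""
    source: str
    run: Callable[[int], int]


def safe_run(candidate: Candidate, x: int):
    """Run a candidate, mapping any crash to an 'error' marker."""
    try:
        return candidate.run(x)
    except Exception:
        return "error"


def filter_candidates(candidates: Sequence[Candidate],
                      public_tests: Sequence[Tuple[int, int]]) -> List[Candidate]:
    """Filtering: keep only candidates that pass every public example test."""
    return [c for c in candidates
            if all(safe_run(c, x) == expected for x, expected in public_tests)]


def cluster_by_behavior(candidates: Sequence[Candidate],
                        probe_inputs: Sequence[int]) -> List[List[Candidate]]:
    """Clustering: group candidates whose outputs agree on all probe inputs."""
    clusters = defaultdict(list)
    for c in candidates:
        signature = tuple(safe_run(c, x) for x in probe_inputs)
        clusters[signature].append(c)
    return list(clusters.values())


def select_submissions(candidates, public_tests, probe_inputs,
                       score: Callable[[Candidate], float],
                       max_submissions: int = 10) -> List[Candidate]:
    """Filter, cluster, keep the largest clusters, pick the top-scored member of each."""
    survivors = filter_candidates(candidates, public_tests)
    clusters = cluster_by_behavior(survivors, probe_inputs)
    clusters.sort(key=len, reverse=True)  # largest behavioral clusters first
    return [max(cluster, key=score) for cluster in clusters[:max_submissions]]


# Toy task: "return the square of x". These hand-written candidates stand in
# for sampled programs.
candidates = [
    Candidate("x*x", lambda x: x * x),
    Candidate("x**2", lambda x: x ** 2),
    Candidate("abs(x)*abs(x)", lambda x: abs(x) * abs(x)),
    Candidate("x+x", lambda x: x + x),            # fails the second public test
    Candidate("x*x if x >= 0 else -x*x",
              lambda x: x * x if x >= 0 else -x * x),  # wrong on negatives
    Candidate("1//0", lambda x: 1 // 0),          # always crashes
]
public_tests = [(2, 4), (3, 9)]
probe_inputs = [-2, 0, 5]

# Placeholder score: prefer shorter source. A real system would use a model.
picked = select_submissions(candidates, public_tests, probe_inputs,
                            score=lambda c: -len(c.source))
print([c.source for c in picked])
```

In this toy run, two candidates are filtered out, the three behaviorally identical squares form the largest cluster, and the candidate that is wrong on negative inputs survives as its own smaller cluster. That is why clustering on extra inputs matters: passing the public tests alone does not distinguish the two behaviors.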

Google says that on its own TPUs Gemini runs faster and more cheaply than earlier models such as PaLM. The Cloud TPU v5p accelerator announced alongside it is intended to speed up Gemini's further development and to support large-scale AI workloads for developers and enterprises.

Google executives, including CEO Sundar Pichai and DeepMind co‑founder Demis Hassabis, view Gemini as the beginning of a larger project aimed at achieving more general AI capabilities. While the technical report does not disclose detailed architecture or training data, external commentary (e.g., Oren Etzioni) acknowledges Gemini’s superiority on current benchmarks while noting that future models like GPT‑5 may remain competitive.

