xAI's Grok 3 Model: Benchmarks, Reasoning, and Industry Reactions

Elon Musk’s xAI introduced the Grok 3 family, trained on roughly 200,000 GPUs and offered in standard, mini and Reasoning versions. xAI claims top‑slot performance on math, science and coding benchmarks, ahead of Google Gemini, DeepSeek V3, Claude and OpenAI GPT‑4o. Pricing starts at $30 per month, and the launch has drawn both praise for its speed and criticism for lingering hallucinations and ethical sensitivities.

Java Tech Enthusiast

Feb 19, 2025

On February 18 (Beijing time), Elon Musk's AI company xAI unveiled the Grok 3 series, claiming it outperforms Google Gemini, DeepSeek V3, Claude and OpenAI GPT‑4o on mathematics, science and coding benchmarks.

The model was trained on an estimated 200,000 GPUs, reportedly about 263 times the compute used for DeepSeek V3, and Musk called it “the smartest AI on Earth.”

Grok 3 is actually a family of models. Its lightweight variant, Grok 3 mini, sacrifices some accuracy for faster responses. Not all versions are publicly released yet; a voice mode was announced but postponed for about a week.

According to xAI engineers, Grok 3 surpasses GPT‑4o, Gemini 2 Pro, DeepSeek V3 and Claude 3.5 on the AIME (competition math) and GPQA (doctoral‑level physics, biology and chemistry) benchmarks.

On the Chatbot Arena (LMSYS) leaderboard, an early version of Grok 3 scored 1,402 points, beating Gemini 2.0 Flash, ChatGPT‑4o, DeepSeek R1 and other leading models. It was the first large model to break the 1,400‑point barrier.
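To give a feel for what a gap in Arena scores means, here is a minimal sketch. It assumes the leaderboard's ratings can be read on the conventional Elo scale (base 10, divisor 400). The 1,402 figure comes from the text above. The rival rating of 1,380 is a purely hypothetical value chosen for illustration, not a reported score.

```python
def expected_win_probability(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of model A against model B (base 10, scale 400)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


GROK3_EARLY_SCORE = 1402.0   # reported in the article
HYPOTHETICAL_RIVAL = 1380.0  # illustrative assumption, not a reported score

p = expected_win_probability(GROK3_EARLY_SCORE, HYPOTHETICAL_RIVAL)
print(f"Expected preference rate vs. a {HYPOTHETICAL_RIVAL:.0f}-rated model: {p:.1%}")
```

Under this assumption, a lead of a couple dozen points corresponds to only a slightly better than even chance of being preferred in a head‑to‑head vote, so crossing 1,400 is best read as a milestone rather than a large margin.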

Reasoning capabilities have also been added. xAI released Grok‑3 Reasoning Beta and Grok‑3 mini Reasoning, which, like OpenAI’s o3‑mini and DeepSeek R1, check their own answers before responding.

According to xAI, Grok‑3 Reasoning outperforms o3‑mini‑high on several popular benchmarks, including the new AIME 2025 math test.

These reasoning models are accessible through the Grok app, where users can choose the “Think” mode for standard queries or the “Big Brain” mode that allocates extra compute for more demanding problems.

xAI also introduced DeepSearch, a new search‑plus‑agent feature that crawls the web and X (formerly Twitter) to provide concise answers, comparable to Perplexity’s Deep Research.

Pricing: the SuperGrok subscription costs $30 per month or $300 per year and unlocks unlimited reasoning, DeepSearch queries and unrestricted image generation.
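As a quick check on the two price points, the sketch below compares twelve monthly payments with the annual plan. Both prices are taken from the text above. The function name is only an illustrative helper.

```python
MONTHLY_PRICE_USD = 30   # from the article
ANNUAL_PRICE_USD = 300   # from the article


def annual_plan_savings(monthly: float, annual: float) -> tuple[float, float]:
    """Return (absolute saving, saving as a fraction) of the annual plan
    compared with paying the monthly price twelve times."""
    twelve_months = 12 * monthly
    saving = twelve_months - annual
    return saving, saving / twelve_months


saving, fraction = annual_plan_savings(MONTHLY_PRICE_USD, ANNUAL_PRICE_USD)
print(f"Annual plan saves ${saving:.0f} per year ({fraction:.1%})")
# prints: Annual plan saves $60 per year (16.7%)
```

In other words, the annual plan is equivalent to getting two of the twelve months free.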

Musk said the delayed voice mode should arrive in roughly a week, and that Grok 3 will soon be available, together with DeepSearch, through an enterprise API.

Industry reactions vary. TMT investor Gavin Baker praised xAI as the “SR‑71 of AI labs.” NYU professor Gary Marcus criticized the launch as incremental, noting a lack of genuine innovation. Former OpenAI researcher Andrej Karpathy, who had early access to the model, shared mixed hands‑on results: strong performance on certain reasoning tasks (e.g., generating a Catan‑style hex grid) but failures on others (e.g., emoji‑variant puzzles, complex ethical queries). He placed Grok 3 roughly at the level of OpenAI’s o1‑pro, ahead of DeepSeek‑R1 and Gemini 2.0 Flash.

Overall, Grok 3 with its reasoning mode reaches frontier performance comparable to OpenAI’s most advanced models, yet it still exhibits hallucinations, sensitivity to ethical prompts, and occasional gaps in knowledge.
