Do Scaling Laws Still Hold? Analyzing Grok‑3, Deepseek and LLM Training Trends

The article examines whether pre‑training scaling laws remain valid, compares Grok‑3’s architecture and training strategy with Deepseek models, and explores how different scaling approaches—pre‑training, RL‑based, and test‑time—affect the cost‑effectiveness and intelligence of large language models.

NewBeeNLP

Feb 21, 2025

1. Does the Pre‑training Scaling Law Still Hold?

Pre‑training scaling laws still apply. The perceived “wall” comes from a shortage of fresh training data, which slows progress rather than imposing a hard ceiling.

Under the Chinchilla scaling law, enlarging the base model can still improve performance even when no new data is available, but each increase buys less improvement, so the cost‑benefit ratio becomes poor.
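
To make the diminishing-returns claim concrete, here is a minimal sketch using the parametric loss form from the Chinchilla paper (Hoffmann et al., 2022), L(N, D) = E + A/N^α + B/D^β. The constants are that paper's published fit. The fixed 1.4-trillion-token budget and the model sizes are illustrative assumptions, not figures from this article.

```python
# Parametric loss from Hoffmann et al. (2022): L(N, D) = E + A / N**alpha + B / D**beta
# Constants are the published fit from that paper; treat the outputs as illustrative only.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss for a model with n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


FIXED_TOKENS = 1.4e12  # assumed fixed data budget (no new data available)
MODEL_SIZES = [70e9, 140e9, 280e9, 560e9]  # hypothetical sizes, doubling each step

losses = [chinchilla_loss(n, FIXED_TOKENS) for n in MODEL_SIZES]
# Loss reduction bought by each successive doubling of model size.
gains = [prev - cur for prev, cur in zip(losses, losses[1:])]

for n, loss in zip(MODEL_SIZES, losses):
    print(f"N = {n / 1e9:>5.0f}B  predicted loss = {loss:.4f}")
print("loss reduction per doubling:", [round(g, 4) for g in gains])
```

With data held fixed, the predicted loss keeps falling as the model grows, but every doubling of parameters (and therefore of training cost) buys a smaller reduction than the one before it.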

Consequently, practitioners prioritize higher‑ROI scaling methods such as Test‑time Scaling and RL‑Scaling over pre‑training scaling.

If higher‑ROI methods become saturated, reverting to pre‑training scaling (larger models) remains an option, albeit a low‑efficiency fallback.

More GPU resources do not directly improve the best possible model quality, but they dramatically shorten experimentation cycles for new ideas, algorithms, or data mixes.

2. Grok‑3 Base Model (Compared with Deepseek V3)

Grok‑3’s public benchmarks focus only on mathematics, science, and code, omitting broader evaluations like MMLU, suggesting limited general‑purpose gains.

To boost mathematical and coding abilities, a common approach is to distill long chain‑of‑thought (CoT) data from a stronger model (e.g., Deepseek R1) into the base model during post‑training or even pre‑training. This requires only a few hundred gigabytes of data and modest compute.

OpenAI’s upcoming GPT 4.5 is expected to follow a similar distillation strategy, using CoT data to raise its base model’s intelligence.

Grok‑3 consumes roughly ten times the compute of Grok‑2. Under Chinchilla‑optimal training, a 10× compute budget implies growing both data volume and model size by about 3× (√10 ≈ 3.2), though current practice favors smaller models trained on more data to reduce inference cost.

If the reported 10× compute increase is accurate, two scenarios arise: (a) a massive expansion of multimodal data (e.g., from 10 TB to 30 TB) paired with a ~3× increase in model size, or (b) modest data growth with a 4–5× jump in model size to absorb the extra compute. Either way, Grok‑3 likely falls in the 200‑500 B parameter range.
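
The arithmetic behind both scenarios can be checked with the common approximation that training compute is proportional to the product of parameter count and training tokens (C ≈ 6·N·D). That approximation, and the 2× and 2.5× data-growth values used for scenario (b), are assumptions chosen for illustration; the article itself gives only the 10× compute figure.

```python
import math

COMPUTE_MULTIPLIER = 10.0  # reported Grok-3 vs. Grok-2 compute ratio (from the article)


def model_scale_for(data_scale: float, compute_scale: float = COMPUTE_MULTIPLIER) -> float:
    """Model-size multiplier that absorbs compute_scale, assuming C is proportional to N * D."""
    return compute_scale / data_scale


# Chinchilla-style balanced split: N and D each grow by sqrt(compute multiplier).
balanced = math.sqrt(COMPUTE_MULTIPLIER)
print(f"balanced:     data x{balanced:.2f}, model x{model_scale_for(balanced):.2f}")

# Scenario (a): data triples (10 -> 30 in the article's units).
print(f"scenario (a): data x3.00, model x{model_scale_for(3.0):.2f}")

# Scenario (b): modest data growth; 2x and 2.5x are hypothetical values.
for data_scale in (2.0, 2.5):
    print(f"scenario (b): data x{data_scale:.2f}, model x{model_scale_for(data_scale):.2f}")
```

Under these assumptions, tripling the data leaves room for a model about 3.3× larger, while data growth of 2× to 2.5× leaves room for a 4× to 5× larger model, which matches the ranges quoted above.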

Grok‑3 appears to rely on the traditional “scale‑up” method, which is low‑ROI compared with RL‑Scaling, raising the question of why the developers accept the inefficiency.

3. Grok‑3 Logical‑Reasoning Variant (Deep‑Thinking Version)

The deep‑thinking version matches or exceeds the performance of OpenAI’s o3‑mini, making it one of the strongest publicly known LLMs in this niche.

A plausible hypothesis: during post‑training, RL‑Scaling benefits increase with larger base models, so expanding the base model size (despite low pre‑training ROI) is justified to amplify the gains from RL‑Scaling.

Deepseek R1 is open‑source and performs well, but its large size makes deployment difficult. A similar rationale may explain that size: a larger base model enables more effective deep‑thinking fine‑tuning.

If this hypothesis holds, the hierarchy of scaling‑law ROI remains Test‑time > RL > Pre‑train. Test‑time scaling’s ceiling depends on RL‑Scaling, which in turn depends on pre‑training scaling. When RL and Test‑time ceilings are reached, scaling the base model again could raise the next‑level ceiling, potentially forming a recursive path toward higher‑level AGI capabilities.
