Tag

AI benchmarking

3 views collected around this technical thread.

DataFunTalk
Jun 9, 2025 · Artificial Intelligence

Can AI Models Pass the Chinese Math Gaokao? A Fair, Objective Test

The author conducts a transparent, objective assessment of several large language models on the 2025 Chinese national math exam, converting all questions to LaTeX, applying strict Gaokao scoring rules, and revealing each model's strengths and weaknesses across single‑choice, multiple‑choice, and fill‑in‑the‑blank items.

AI benchmarking · Gaokao · large language models
0 likes · 7 min read
Data Thinking Notes
May 29, 2025 · Artificial Intelligence

DeepSeek‑R1‑0528: How the New Open‑Source LLM Outperforms Gemini and Claude

DeepSeek‑R1‑0528, the latest open‑source 660B LLM, dramatically improves coding, reasoning, and long‑context abilities, matching or surpassing top models like Gemini 2.5 Pro and Claude 4 in benchmarks and real‑world tests, while offering faster, more stable, and fully executable outputs.

AI benchmarking · DeepSeek · code generation
0 likes · 7 min read
DataFunSummit
May 4, 2023 · Artificial Intelligence

LLM Ranking Arena: Elo‑Based Competitive Evaluation of Open‑Source Chatbots

A recent study by the LMSYS organization introduces an Elo‑rated, 1v1 battle arena for large language models, ranking open‑source chatbots like Vicuna, Koala, and ChatGLM, while discussing the limitations of traditional benchmarks and the advantages of crowd‑sourced, scalable evaluation.

AI benchmarking · Chatbot Arena · Elo rating
0 likes · 7 min read
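The Elo scheme behind such 1v1 battle arenas can be sketched in a few lines. This is a minimal, generic illustration of the Elo update rule; the K-factor of 32 and starting rating of 1000 are assumptions for the example, not LMSYS's actual settings:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return both models' updated ratings after one battle.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two models start at 1000; model A wins the first head-to-head vote.
a, b = update(1000.0, 1000.0, 1.0)
print(round(a), round(b))  # 1016 984: A gains 16 points, B loses 16
```

Running many crowd-sourced battles through this update converges each model toward a rating that predicts its head-to-head win rate, which is what lets the arena rank chatbots without a fixed benchmark set.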