Tagged articles
2 articles
Page 1 of 1
Old Zhang's AI Learning
Old Zhang's AI Learning
Apr 29, 2026 · Artificial Intelligence

Top 10 Open‑Source LLM Benchmarks: Scores, Rankings, and What They Test

This article walks through ten mainstream open‑source large‑model benchmarks—SWE‑bench Verified and Pro, MMLU‑Pro, GPQA Diamond, HLE, AIME, HMMT, olmOCR‑bench, Terminal‑Bench 2.0, and EvasionBench—explaining their data, evaluation metrics, current leading models, and the capability dimensions they reveal.

AI evaluationLLM benchmarksMMLU-Pro
0 likes · 20 min read
Top 10 Open‑Source LLM Benchmarks: Scores, Rankings, and What They Test
Old Zhang's AI Learning
Old Zhang's AI Learning
Apr 3, 2026 · Artificial Intelligence

Qwopus3.5‑v3: From Reason‑Then‑Act to Act‑Then‑Refine – Claude‑Opus Distillation Turns Qwen3.5 into a Tool‑Using Agent

The newly released Qwopus3.5‑v3 model combines higher‑quality reasoning chains, dedicated tool‑calling reinforcement learning, and an act‑then‑refine paradigm, delivering a 5‑point HumanEval boost, a 1.43‑point MMLU‑Pro gain, 31.7% faster inference and 24% lower token cost, while remaining runnable on a 3090 or a 16 GB MacBook, with easy deployment via GGUF, LM Studio, Ollama or llama.cpp.

Claude OpusHumanEvalMMLU-Pro
0 likes · 12 min read
Qwopus3.5‑v3: From Reason‑Then‑Act to Act‑Then‑Refine – Claude‑Opus Distillation Turns Qwen3.5 into a Tool‑Using Agent