May 14, 2026 · Artificial Intelligence

General 365: Meituan LongCat’s Open‑Source Benchmark Redefines LLM Reasoning Evaluation

The General 365 benchmark, built from 365 original seed questions and 1,095 variants across eight reasoning challenges, reveals that most mainstream large language models struggle with everyday logical tasks, achieving at most 62.8% accuracy and requiring far more tokens than on traditional subject‑specific tests.

AI reasoning · General 365 · LLM evaluation

