MiniMax M3 Beats GPT‑5.5 in Programming and Goes Open‑Source

MiniMax M3, a Chinese‑developed LLM, combines a new sparse‑attention architecture (MSA), native multimodal support, and a million‑token context window to match or surpass top closed‑source models on programming and agent benchmarks. It also achieved a 9.4× speedup on an FP8 GEMM kernel in an autonomous optimization run, and an open‑source release is planned.

SuanNi

Jun 1, 2026

MiniMax M3 is a newly released large language model from China that advances programming ability, million‑token context, and native multimodal support at the same time. It is slated for open‑source release.

Practical Validation

The M3 team asked the model to autonomously reproduce "Learning Dynamics of LLM Finetuning", a paper that received an ICLR 2025 Outstanding Paper Award. Over a 12‑hour run the model produced 18 code commits and 23 experiment figures. It reproduced the paper’s SFT probability trends, observed the DPO squeezing effect, and confirmed the proposed Extend mitigation.

FP8 GEMM Optimization

FP8 matrix‑multiply (GEMM) is a major bottleneck in LLM inference. Implementing a production‑grade FP8 GEMM kernel on NVIDIA Hopper GPUs typically requires 1–2 weeks of senior‑engineer effort. Starting from a bare Triton skeleton, M3 autonomously explored optimization paths, ultimately raising Hopper FP8 hardware utilization from 7.6 % to 71.3 % and delivering a 9.4× speedup.
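The two utilization figures and the speedup can be checked against each other. The sketch below assumes the speedup was measured on the same GPU and the same problem size, so runtime scales inversely with utilization; the article does not say how the numbers were measured. The `peak_tflops` parameter is a placeholder to be filled in from the GPU datasheet, not a value taken from the article.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Operations in an (m x k) @ (k x n) GEMM: one multiply and one add per term."""
    return 2 * m * n * k


def utilization(m: int, n: int, k: int, seconds: float, peak_tflops: float) -> float:
    """Fraction of peak throughput achieved by a kernel that ran for `seconds`.

    `peak_tflops` is an assumption supplied by the caller (datasheet value).
    """
    achieved_tflops = gemm_flops(m, n, k) / seconds / 1e12
    return achieved_tflops / peak_tflops


def speedup_from_utilization(before: float, after: float) -> float:
    """On fixed hardware and problem size, runtime is inversely proportional to utilization."""
    return after / before


# Utilization figures reported in the article: 7.6% before, 71.3% after.
reported = speedup_from_utilization(0.076, 0.713)
print(f"{reported:.2f}x")  # prints "9.38x", which rounds to the reported 9.4x
```

Under that assumption the ratio comes out at about 9.38, consistent with the 9.4× figure.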

Fully Autonomous Benchmark Loop

With only a task description, a benchmark script, and an un‑runnable Triton skeleton, M3 executed a 24‑hour continuous process that performed 147 benchmark submissions and 1,959 tool calls. It iteratively progressed from a baseline implementation through autotune configuration, bottleneck diagnosis, CUDA‑Graph integration, persistent‑kernel rewrite, and host‑side scheduling, each step validated by benchmark feedback without human intervention.
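The benchmark‑driven loop described above can be pictured roughly as follows. This is a minimal sketch, not the harness MiniMax used: `optimize`, the candidate names, and the toy scores are all hypothetical, and a real agent would generate each candidate from the previous benchmark feedback rather than iterate over a fixed list.

```python
from typing import Callable, Iterable, Optional, Tuple


def optimize(
    candidates: Iterable[str],
    benchmark: Callable[[str], Optional[float]],
) -> Tuple[Optional[str], float, int]:
    """Submit each candidate to the benchmark and keep the best-scoring one.

    `benchmark` returns a throughput-like score (higher is better), or None
    when the candidate fails to build or produces wrong results.
    """
    best_name: Optional[str] = None
    best_score = float("-inf")
    submissions = 0
    for name in candidates:
        submissions += 1
        score = benchmark(name)
        if score is None:
            continue  # failed run: discard and move on
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score, submissions


# Illustrative placeholder scores only; these are NOT measurements from the article.
toy_scores = {
    "baseline": 1.0,
    "autotune": 2.5,
    "broken_variant": None,  # hypothetical candidate that fails validation
    "persistent_kernel": 4.0,
}
result = optimize(toy_scores, toy_scores.get)
print(result)  # ('persistent_kernel', 4.0, 4)
```

The point of the design is that every change is accepted or rejected by the benchmark itself, so no human has to judge intermediate steps.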

MSA Sparse Attention for Million‑Token Context

To overcome the quadratic cost of full attention, M3 introduces MiniMax Sparse Attention (MSA), which adds a pre‑filter stage that partitions KV pairs before attention is computed. Compared with DSA and MoBA, MSA selects KV blocks at a finer granularity, yielding higher effective context coverage. At the hardware level, MSA uses a KV‑outer‑gather‑Q pattern that reads each KV block once with contiguous memory access, delivering over 4× speedup versus Flash‑Sparse‑Attention and flash‑moba. Under a 1 M‑token window, at least 512 K tokens remain usable, and per‑token compute drops to 1/20 of that of previous models. Prefill is more than 9× faster and decoding more than 15× faster, while quality matches full attention in most tests.
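The article does not describe how MSA scores or selects blocks, so the following is a generic block‑sparse attention sketch rather than MSA itself. It assumes a single query, contiguous KV blocks, and a pre‑filter that ranks blocks by the query's dot product with each block's mean key (a common choice in block‑selection methods). All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(
    q: Vector,
    keys: Sequence[Vector],
    values: Sequence[Vector],
    block_size: int,
    top_k: int,
) -> List[float]:
    """Attend only to the `top_k` KV blocks that the pre-filter ranks highest."""
    # 1. Partition KV positions into contiguous blocks.
    blocks = [
        range(start, min(start + block_size, len(keys)))
        for start in range(0, len(keys), block_size)
    ]

    # 2. Pre-filter: score each block by q . mean(keys in block). Assumed rule.
    def block_score(block: range) -> float:
        mean_key = [sum(keys[i][d] for i in block) / len(block) for d in range(len(q))]
        return dot(q, mean_key)

    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(i for b in ranked[:top_k] for i in blocks[b])

    # 3. Ordinary scaled softmax attention, restricted to the selected positions.
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(len(values[0]))
    ]


q = [1.0, 0.0]
keys = [[0.0, 0.0], [0.0, 0.0], [5.0, 0.0], [5.0, 0.0]]
values = [[1.0], [1.0], [9.0], [9.0]]
sparse = block_sparse_attention(q, keys, values, block_size=2, top_k=1)
full = block_sparse_attention(q, keys, values, block_size=2, top_k=2)
```

With `top_k` equal to the number of blocks the function reduces to full attention; with a smaller `top_k` the softmax cost scales with the number of selected positions instead of the full sequence length, which is where the per‑token savings of any block‑sparse scheme come from.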

Programming and Agent Capabilities

On SWE‑Bench Pro, M3 surpasses GPT‑5.5 and Gemini 3.1 Pro and approaches Opus 4.7. It also outperforms Opus 4.7 on SVG‑Bench and achieves the highest score on the BrowseComp agent benchmark (83.5 vs 79.3). On Claw‑Eval, an end‑to‑end evaluation framework for autonomous agents, M3 takes the top rank. To bridge the gap between single‑turn benchmarks and real‑world multi‑turn development, the M3 team built an interactive user simulator that mimics developer behavior (adding requirements, discussing solutions, giving corrective feedback, and switching tasks), so that agents learn to collaborate proactively.

Native Multimodal Training

M3 was trained from step 0 on sequences that interleave text with images and other modalities, a strategy the team found more impactful than simple image‑only augmentation. The pre‑training token count was scaled to one quadrillion, with text and visual data ingested together so that multimodal understanding is embedded deeply in the model. On the OmniDocBench multimodal document‑understanding suite, M3 surpasses Gemini 3.1 Pro. The model supports image, video, and computer‑use interactions.

MiniMax Code Agent Product

MiniMax Code, built on the open‑source projects OpenCode and Pi Agent, leverages M3’s long‑context, programming, and multimodal strengths. It decomposes complex tasks into multi‑stage, concurrent, dynamically adjustable workflows executed by a team of agents. With a producer‑verifier harness, the agent team can run autonomously for days, and the multimodal ability enables computer‑use tasks such as opening ERP clients and batch‑entering invoices.
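The article names a producer‑verifier harness without describing its internals. One plausible reading, sketched below under that assumption, is a loop in which a producer agent proposes work and a verifier agent either accepts it or returns feedback for the next attempt. `run_with_verifier` and the toy agents are hypothetical stand‑ins, not part of MiniMax Code.

```python
from typing import Callable, Optional, Tuple


def run_with_verifier(
    produce: Callable[[Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    max_rounds: int = 5,
) -> Tuple[Optional[str], int]:
    """Alternate producer and verifier until the verifier accepts.

    `produce` receives the previous feedback (None on the first round) and
    returns an attempt. `verify` returns None to accept the attempt, or a
    feedback string to reject it.
    """
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        attempt = produce(feedback)
        feedback = verify(attempt)
        if feedback is None:
            return attempt, round_no
    return None, max_rounds  # budget exhausted without an accepted attempt


# Toy agents standing in for model calls (hypothetical behavior).
def toy_produce(feedback: Optional[str]) -> str:
    return "patch" if feedback is None else "patch with tests"


def toy_verify(attempt: str) -> Optional[str]:
    return None if "tests" in attempt else "add tests"


outcome = run_with_verifier(toy_produce, toy_verify)
print(outcome)  # ('patch with tests', 2)
```

Separating the two roles means a long run does not depend on the producer judging its own output, and the round budget bounds how long a single task can loop.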

Pricing, API, and Open‑Source Release

At comparable pricing, M3 offers roughly 15× the usage of Claude subscriptions. Existing users retain their plans and can switch between M2 and M3. The API has two pricing tiers based on context length. It supports a thinking mode for complex reasoning and a non‑thinking mode for low‑latency tasks; both modes share the same pricing and can be toggled per request. M3’s API is available now, and the open‑source model weights and technical report are to be released on HuggingFace and GitHub.
