2025 Large Model Service Performance Report: Near‑100% Success, Rising Throughput, and Falling Prices

The 2025 monitoring report by AIIA and the China Academy of Information and Communications Technology evaluates 42 large‑model services across 13 MaaS platforms, revealing near‑100% call success rates, significant TPS growth, sub‑second latency, increasing open‑source model adoption, and a gradual decline in service pricing.

Software Engineering 3.0 Era

With rapid iteration of large‑model technology and the expansion of Model‑as‑a‑Service (MaaS) platforms, the China AI Industry Alliance (AIIA) and the China Academy of Information and Communications Technology (CAICT) launched a public‑cloud large‑model performance monitoring project in September 2024 and released the 2025 results.

1. Service stability and success rate

From 1 January to 31 December 2025, 42 original‑vendor large‑model services (38 domestic, 4 overseas) were monitored. Domestic models achieved an average call success rate of 99.9 % in December, with 68 % of models reaching 100 % success; all four foreign models also achieved 100 % success.

Success rate chart

2. Throughput (TPS) growth

Most services showed an upward trend in tokens per second (TPS), especially in Q4. The domestic‑model average rose from 29 TPS in February to 50.5 TPS in December, which the report gives as a 67 % rise; month‑over‑month growth in Q4 averaged 8 %. The foreign models GPT and Claude averaged 51.35 TPS.
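The growth figures can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names `percent_change` and `compound_growth` are not from the report. Applied to the rounded averages quoted above (29 and 50.5 TPS), it yields roughly 74 % rather than 67 %, so the report's figure may rest on unrounded values or a different baseline. An 8 % monthly rate compounded over the three months of Q4 comes to about 26 %.

```python
def percent_change(start: float, end: float) -> float:
    """Relative change from start to end, in percent."""
    if start == 0:
        raise ValueError("start must be non-zero")
    return (end - start) / start * 100.0


def compound_growth(monthly_rate_pct: float, months: int) -> float:
    """Total growth in percent after compounding a monthly rate."""
    return ((1.0 + monthly_rate_pct / 100.0) ** months - 1.0) * 100.0


# Rounded domestic averages as quoted in the text.
feb_tps = 29.0
dec_tps = 50.5

print(f"Feb to Dec change: {percent_change(feb_tps, dec_tps):.1f} %")
print(f"Q4 compounded at 8 %/month: {compound_growth(8.0, 3):.1f} %")
```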

TPS chart

3. First‑token latency (TTFT)

Average time to first token (TTFT) remained below 1 second for most models throughout the year, with a marked improvement in Q4. In December, 76 % of domestic models recorded a TTFT under 1 second and 29 % under 0.5 seconds; the median was 0.58 seconds, substantially lower than in earlier quarters. The foreign models GPT and Claude both stayed below 0.5 seconds.

TTFT chart

4. Open‑source model adoption

Open‑source large models are increasingly preferred by developers. Among the monitored MaaS platforms, DeepSeek achieved a 100 % deployment rate, followed by Kimi and Qwen (both 91 %), MiniMax (73 %), GLM (64 %), GPT (55 %) and Llama (27 %).
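A deployment rate here is simply the share of platforms that offer a given model family. The minimal sketch below assumes a hypothetical catalog mapping each platform to the families it serves; the platform names and the `deployment_rate` helper are invented for illustration. The quoted percentages are consistent with counts out of 11 platforms (for example 10/11 ≈ 91 % and 3/11 ≈ 27 %), but that denominator is an inference, not something the report states.

```python
def deployment_rate(catalogs: dict[str, set[str]], family: str) -> float:
    """Percentage of platforms whose catalog includes the given model family."""
    if not catalogs:
        raise ValueError("no platforms in catalog")
    offering = sum(1 for families in catalogs.values() if family in families)
    return offering / len(catalogs) * 100.0


# Hypothetical data: 11 placeholder platforms, all offering DeepSeek,
# and the first three also offering Llama.
catalogs = {f"platform_{i}": {"DeepSeek"} for i in range(11)}
for i in range(3):
    catalogs[f"platform_{i}"].add("Llama")

for family in ("DeepSeek", "Llama"):
    print(f"{family}: {deployment_rate(catalogs, family):.0f} %")
```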

Open‑source model deployment rates

5. Service pricing

Domestic model pricing generally fell below 10 CNY per million tokens, while foreign models remained expensive: GPT 5.2 priced at 33.7 CNY per million tokens and Claude Opus 4.5 at 70 CNY per million tokens.
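To make the per‑million‑token prices concrete, the sketch below estimates what a fixed workload would cost at each quoted price. It assumes a single blended price per million tokens (the report as summarized here does not say whether input and output tokens are priced separately), and the monthly volume is a made‑up figure for illustration.

```python
def cost_cny(tokens: int, price_per_million_cny: float) -> float:
    """Cost in CNY of processing `tokens` at a flat price per million tokens."""
    return tokens / 1_000_000 * price_per_million_cny


# Prices as quoted in the text; 10 CNY is the stated ceiling for most domestic models.
prices_cny = {
    "domestic (upper bound)": 10.0,
    "GPT 5.2": 33.7,
    "Claude Opus 4.5": 70.0,
}

monthly_tokens = 250_000_000  # hypothetical workload

for name, price in prices_cny.items():
    print(f"{name}: {cost_cny(monthly_tokens, price):,.0f} CNY")
```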

Pricing trend chart

6. Context length trends

Longer context windows are gaining traction. Models with 32 K context account for ~33 % of the monitored set, while 128 K and 256 K contexts together represent ~47.6 %, a 10 % increase over the first half of the year.

Context length distribution

II. MaaS platform engineering improvements

The monitoring covered 13 MaaS platforms (11 domestic, 2 overseas) offering DeepSeek‑R1 and DeepSeek‑V3 (including V3.1 and V3.2) APIs from February to December 2025.

1. DeepSeek performance evolution

DeepSeek‑R1 TTFT improved from 3.07 s (Feb) to 1.02 s (Sep), and V3 TTFT dropped from 2.4 s (Feb) to 1.35 s (Dec). Throughput rose from 17.86 to 37.29 tokens/s for R1 and from 19.55 to 33.27 tokens/s for V3. The Amazon and Google cloud platforms achieved a TTFT of around 0.8 s, with throughput of 96.19 and 113.63 tokens/s respectively.

DeepSeek TTFT trend
DeepSeek TPS trend

2. System stability

DeepSeek service success rates have exceeded 99 % since March. R1 improved from 87.01 % (Feb) to 99.63 % (Dec), and V3 rose from 94.05 % to 99.83 %.

Success rate trend

III. Future work

The AIIA MaaS working group will continue the "Public‑Cloud Large‑Model Service Promotion Plan", expand monitoring scope, enhance multimodal monitoring (voice, image, video), and refine analysis dimensions. It also aims to build a real‑time monitoring dashboard, provide customized performance testing for enterprises, and analyze market share to guide industry development.

Monitoring methodology

Three methods were employed:

Daily monitoring: automated tests from four Chinese cloud nodes and a Silicon Valley node at five fixed times each day, sending three requests of varying length and measuring TTFT and TPS.

Concentrated monitoring: weekly batch of 300 mixed‑length requests to assess call success rate.

Manual monitoring: collection of disclosed pricing, RPM and TPM figures from platform documentation.

All services used streaming output with default settings.
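The metrics named above can be derived from a streaming response by timestamping each chunk as it arrives. The following is a minimal sketch under stated assumptions, not the report's actual tooling: the stream is modeled as any iterable yielding the token count of each chunk (a real client would wrap a platform's streaming API), TPS is computed over the time after the first chunk (the report does not give its exact formula), and the names `measure_stream`, `StreamMetrics` and `success_rate` are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class StreamMetrics:
    ttft_s: float       # seconds from request start to the first chunk
    tps: float          # tokens per second after the first chunk
    total_tokens: int   # all tokens received, including the first chunk


def measure_stream(
    chunks: Iterable[int],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamMetrics:
    """Consume a stream of per-chunk token counts and time it."""
    start = clock()
    first_at = None
    last_at = start
    first_tokens = 0
    total = 0
    for n_tokens in chunks:
        now = clock()  # arrival time of this chunk
        if first_at is None:
            first_at = now
            first_tokens = n_tokens
        total += n_tokens
        last_at = now
    if first_at is None:
        raise ValueError("stream produced no chunks")
    decode_time = last_at - first_at
    # Tokens generated after the first chunk, over the time they took.
    tps = (total - first_tokens) / decode_time if decode_time > 0 else 0.0
    return StreamMetrics(ttft_s=first_at - start, tps=tps, total_tokens=total)


def success_rate(outcomes: Iterable[bool]) -> float:
    """Percentage of requests that succeeded, e.g. over a weekly batch of 300."""
    results = list(outcomes)
    if not results:
        raise ValueError("no requests recorded")
    return sum(results) / len(results) * 100.0


if __name__ == "__main__":
    # Deterministic demo: a fake clock stands in for real network timing.
    ticks = iter([0.0, 0.4, 0.9, 1.4])
    print(measure_stream([1, 12, 12], clock=lambda: next(ticks)))
    print(f"{success_rate([True] * 299 + [False]):.2f} %")
```

Injecting the clock keeps the measurement logic testable without network calls; in real monitoring the default `time.perf_counter` would be used and the iterable would be the live response stream.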

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
