Tagged articles

84 articles

May 31, 2026 · Artificial Intelligence

vLLM 0.22 Release: Production-Ready DeepSeek V4 and Extreme KV Cache Compression

The vLLM 0.22 stable release introduces production‑grade DeepSeek V4 support, massive kernel fusions, up to 10‑20× speedups, Batch Invariance with 28.9% latency gain, a Rust front‑end, multi‑level KV cache offload that can double context length, and broad hardware coverage across NVIDIA, AMD, CPU and RISC‑V, making it a pivotal upgrade for inference infrastructure teams.

Batch Invariance · DeepSeek V4 · Inference Optimization

0 likes · 13 min read

Xiaomi Tech

May 30, 2026 · Artificial Intelligence

How Xiaomi’s MiMo V2.5 Achieves 99% API Price Cut with Full‑Stack Inference Optimizations

The MiMo‑V2.5 series combines Hybrid Sliding‑Window Attention, Mixture‑of‑Experts and multimodal support with a complete redesign of KVCache management, tiered caching, prefix‑tree logic and scheduling, compressing KVCache to about one‑seventh of full‑attention models and delivering up to 40% faster Prefill, 30% lower TTFT and dramatically reduced inference costs that enable a 99% API price reduction.

Hybrid SWA · Inference Optimization · KVCache

0 likes · 12 min read

Xiaomi Tech

May 26, 2026 · Artificial Intelligence

MiMo V2.5 API Gets Permanent Price Cut and Token Plan Overhaul – Incentive Program Ends

MiMo announces a permanent price reduction of up to 99% for its V2.5 API, a 5‑8× usage boost under its Token Plan billing, a full reset of all Token Plan quotas, and the conclusion of its Hundred‑Trillion Token Creator Incentive Program, effective May 27, 2026.

AI infrastructure · API pricing · Inference Optimization

0 likes · 5 min read

Tencent Technical Engineering

May 24, 2026 · Artificial Intelligence

How Tsinghua & Tencent Mixed‑X Won the MLSys 2026 MoE Inference Challenge with a 4.1× Speedup

The Tsinghua‑Tencent Mixed‑X team captured the MLSys 2026 MoE inference optimization championship by analyzing NPU bottlenecks, redesigning data movement, applying expert‑level sharding, continuous DMA, PSUM batching, and an Agent‑based optimizer, achieving a 4.1× end‑to‑end speedup while preserving bit‑level output fidelity.

Agent optimizer · Inference Optimization · MLSys 2026

0 likes · 14 min read

ZhiKe AI

May 23, 2026 · Artificial Intelligence

Zhipu AI Unveils GLM-5.1-HighSpeed, Achieving 400 Tokens/s and 6× Faster Generation

On May 22, 2026, Zhipu AI released the GLM‑5.1‑HighSpeed variant, which generates up to 400 tokens per second, over six times the speed of the standard GLM‑5.1 and twice that of Google’s Gemini‑3.5‑Flash, thanks to multi‑dimensional inference, attention and sequence‑parallel optimizations while preserving full model capabilities.

GLM-5.1-HighSpeed · Inference Optimization · LLM

0 likes · 3 min read

SuanNi

May 22, 2026 · Artificial Intelligence

How GLM‑5.1‑highspeed Achieves 7× Faster Inference to Become the World’s Fastest Flagship Model

On May 22, Zhipu launched the GLM‑5.1‑highspeed API, delivering 400 tokens per second—about 7× faster than the original model and twice as fast as Gemini 3.5 Flash—through a three‑layer optimization that rewrites the MoE inference path, introduces dynamic scheduling, and leverages TileRT’s AOT engine to cut latency while preserving full flagship capabilities.

GLM-5.1 · Inference Optimization · Large Language Model

0 likes · 10 min read

IT Services Circle

May 17, 2026 · Artificial Intelligence

60 Essential AI Terms Every Programmer Should Master

This article walks programmers through 60 core AI concepts—from the basics of large language models and tokens to advanced topics like prompt engineering, retrieval‑augmented generation, fine‑tuning, and inference optimization—organized into progressive skill levels and illustrated with concrete examples and code snippets.

AI · Inference Optimization · RAG

0 likes · 25 min read

Machine Learning Algorithms & Natural Language Processing

May 14, 2026 · Artificial Intelligence

Elastic Speculative Decoding Breaks Large‑Model Inference Bottlenecks

The paper introduces ECHO, an elastic speculative decoding framework that treats token verification as a global budget‑scheduling problem, uses sparse confidence gating and a two‑level priority scheduler, and demonstrates up to 14.4% throughput gains for high‑concurrency LLM serving.

Inference Optimization · Speculative Decoding · elastic budget

0 likes · 14 min read

Machine Heart

May 11, 2026 · Artificial Intelligence

How PRISM Enables Efficient Test‑Time Scaling for Discrete Diffusion Language Models

The article analyzes how the PRISM framework redesigns test‑time scaling for discrete diffusion language models by replacing costly Best‑of‑N sampling with a three‑stage hierarchical search, local branching via partial remasking, and self‑verified feedback, achieving large accuracy gains on math and code benchmarks while cutting inference compute by up to four‑fold.

Discrete Diffusion · Hierarchical Search · Inference Optimization

0 likes · 11 min read

Lao Guo's Learning Space

May 7, 2026 · Artificial Intelligence

Gemma 4 MTP Deep Dive: Speculative Decoding & KV‑Cache Sharing for 3× Faster Inference

The article explains why large‑language‑model inference is bottlenecked by memory bandwidth, then details Google’s Gemma 4 MTP technique, which uses a small draft model with speculative decoding and a shared KV‑Cache to parallelize token prediction, achieving up to three‑fold speed gains without any loss in output quality, and provides step‑by‑step local deployment instructions.

Gemma 4 · Inference Optimization · KV Cache

0 likes · 11 min read

Machine Learning Algorithms & Natural Language Processing

May 2, 2026 · Artificial Intelligence

RouteMoA: Dynamic Routing Without Pre‑Inference for Efficient Multi‑Agent Mixtures

RouteMoA moves model selection ahead of inference by using a lightweight scorer to predict each model's suitability from the query, dramatically cutting computation cost and latency while preserving or improving accuracy, as demonstrated on a 15‑model pool with up to 90% cost reduction and 64% latency reduction.

ACL 2026 · Dynamic Routing · Inference Optimization

0 likes · 9 min read

PaperAgent

Apr 29, 2026 · Artificial Intelligence

Skill‑Driven Reasoning Cuts Tokens by Up to 59% While Boosting Accuracy

The article introduces the TRS (Thinking with Reasoning Skills) framework, which distills historical LLM reasoning traces into reusable skill cards, enabling offline skill‑base construction and online retrieval that dramatically reduces token consumption (6‑59%) and often improves accuracy on math and coding tasks.

Inference Optimization · Reasoning Skills · TRS

0 likes · 13 min read

Machine Heart

Apr 25, 2026 · Artificial Intelligence

Can Multi-Model Co-Evolution Shatter the Single-Model Ceiling? Squeeze Evolve Achieves Validator-Free SOTA Inference

The paper introduces Squeeze Evolve, a validator‑free multi‑model evolutionary framework that orchestrates diverse large language models to break the performance ceiling of any single model, delivering up to 23‑point accuracy improvements and 1.4‑3.3× cost reductions across math, vision, and scientific benchmarks.

AI research · Inference Optimization · Squeeze Evolve

0 likes · 8 min read

Huawei Cloud Developer Alliance

Apr 23, 2026 · Artificial Intelligence

Kimi K2.6 Launches on Huawei Cloud – Experience the New AI Model Today

On April 20, the open‑source Kimi K2.6 model debuted with industry‑leading code generation, long‑range task execution and a 300‑agent cluster, while Huawei Cloud’s KV‑Cache‑Aware scheduling cuts TTFT by 10% and enables free, one‑click API access for developers.

AI agent · Huawei Cloud · Inference Optimization

0 likes · 4 min read

Machine Learning Algorithms & Natural Language Processing

Apr 21, 2026 · Artificial Intelligence

Can Linear Attention Complete Prefill-as-a-Service for Cross‑Datacenter Heterogeneous PD Separation?

The article analyzes why the massive KVCache bandwidth required by heterogeneous prefill/decode (PD) separation cannot be solved at the system level, proposes a Prefill‑as‑a‑Service architecture that leverages linear‑attention models to cut KVCache generation, and validates the design with a 1‑trillion‑parameter Kimi Linear deployment that achieves 54% higher throughput and 64% lower P90 TTFT across a 100 Gbps inter‑datacenter link.

Heterogeneous PD · Inference Optimization · KVCache

0 likes · 7 min read

Geek Labs

Apr 20, 2026 · Artificial Intelligence

A Complete Open‑Source Guide to LLM Internals: From Tokenization to Inference Optimization

This open‑source tutorial breaks down large language model internals into 11 detailed topics—covering BPE tokenization, attention mathematics, backpropagation, transformer architecture, KV‑Cache, Paged and Flash Attention, and frontier techniques—each with numeric derivations and Python code, making it ideal for developers and interview preparation.

Flash Attention · Inference Optimization · KV Cache

0 likes · 5 min read

Xiaomi Tech

Apr 10, 2026 · Artificial Intelligence

Xiaomi AI’s 8× Faster Mobile Inference and OCR‑Free 80‑Page Document Understanding at ACL 2026

Xiaomi’s AI team announced seven ACL 2026 papers that span low‑bit KV‑cache quantization for 8.3× faster LLM inference, OCR‑free multi‑page document VQA, a new attention‑basin analysis, non‑autoregressive spoken dialogue generation, a comprehensive mobile‑agent benchmark, a success‑rate‑aware training policy, and a progressive universal information‑extraction framework.

Inference Optimization · benchmark · dialogue generation

0 likes · 12 min read

AI Tech Publishing

Apr 5, 2026 · Artificial Intelligence

Why the First Token Is Slow: A Deep Dive into KV Cache for LLM Inference

The article explains how KV cache eliminates redundant computations in autoregressive LLM generation, detailing the attention mechanism, the O(n²) waste of recomputing K and V, the cache‑based solution, its impact on time‑to‑first‑token, and the memory‑vs‑speed trade‑off.

Inference Optimization · KV Cache · LLM

0 likes · 7 min read
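
The redundancy that a KV cache removes can be shown with a toy single-head attention loop. This is only a sketch under stated assumptions: the embeddings, the weight matrices `WQ`, `WK`, `WV`, and the helper names are hypothetical, and the token embeddings are fixed up front instead of being produced by a model. It compares how many K/V projections each strategy performs and checks that both give the same attention outputs.

```python
import math

def project(x, w):
    """Multiply a row vector x (length d) by a d x d matrix w."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all given positions."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def decode(embeddings, wq, wk, wv, use_cache):
    """Return (per-step outputs, number of token positions projected to K/V)."""
    outputs, projections = [], 0
    keys, values = [], []
    for t, x in enumerate(embeddings):
        if use_cache:
            # Only the newest token is projected; earlier K/V are reused.
            keys.append(project(x, wk))
            values.append(project(x, wv))
            projections += 1
        else:
            # Every step re-projects the whole prefix from scratch.
            keys = [project(e, wk) for e in embeddings[: t + 1]]
            values = [project(e, wv) for e in embeddings[: t + 1]]
            projections += t + 1
        outputs.append(attend(project(x, wq), keys, values))
    return outputs, projections

# Hypothetical toy inputs: 4 token embeddings of width 2.
EMB = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
WQ = [[1.0, 0.0], [0.0, 1.0]]
WK = [[0.5, 0.1], [0.2, 0.3]]
WV = [[1.0, 2.0], [3.0, 4.0]]

cached_out, cached_cost = decode(EMB, WQ, WK, WV, use_cache=True)
naive_out, naive_cost = decode(EMB, WQ, WK, WV, use_cache=False)
```

For n tokens the uncached loop projects n(n+1)/2 positions while the cached loop projects n, at the price of keeping every key and value in memory.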

DeepHub IMBA

Apr 2, 2026 · Artificial Intelligence

Speculative Decoding Explained: Small Draft Model + One‑Shot Verification

The article details how speculative decoding—using a fast small model to draft tokens and a large model to verify them—overcomes the memory‑bandwidth bottleneck of autoregressive inference, introduces SSD’s self‑draft and tree‑verification stages, presents real‑world benchmark gains, and shows how to enable it in vLLM.

GPU memory bandwidth · Inference Optimization · SSD

0 likes · 14 min read
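
The basic draft-then-verify loop can be sketched in a few lines. This is a minimal greedy variant, not the SSD self-draft or tree-verification scheme the article covers: `target_next` and `draft_next` are hypothetical stand-ins for the large and small models, and the verification loop models what a real engine would do in one batched forward pass of the large model.

```python
def target_next(seq):
    """Hypothetical 'large' model: deterministic greedy next token."""
    return (3 * seq[-1] + 1) % 10

def draft_next(seq):
    """Hypothetical 'small' model: agrees with the target except after token 4."""
    return 0 if seq[-1] == 4 else target_next(seq)

def greedy_decode(prompt, n_new):
    """Baseline: one target pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq, n_new

def speculative_decode(prompt, n_new, k=4):
    """Draft k tokens cheaply, then verify them against the target."""
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens autoregressively.
        draft, ctx = [], list(seq)
        for _ in range(k):
            token = draft_next(ctx)
            draft.append(token)
            ctx.append(token)
        # 2. The target checks all k positions (counted as a single pass).
        target_passes += 1
        ctx = list(seq)
        for token in draft:
            expected = target_next(ctx)
            ctx.append(expected)  # always keep the target's own token
            if expected != token:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every draft token was accepted; the pass yields one bonus token.
            ctx.append(target_next(ctx))
        seq = ctx
    return seq[: len(prompt) + n_new], target_passes

spec_seq, spec_passes = speculative_decode([1], 10)
base_seq, base_passes = greedy_decode([1], 10)
```

Because a mismatching draft token is always replaced by the target's own choice, the output matches plain greedy decoding with the target model; only the number of target passes changes. Implementations that sample instead of decoding greedily use a rejection-sampling acceptance rule in place of the equality check.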

Machine Learning Algorithms & Natural Language Processing

Mar 28, 2026 · Artificial Intelligence

GigaWorld-Policy Boosts Inference Speed 10× and Success Rate 30%

The newly released GigaWorld-Policy world‑action model replaces traditional video‑prediction‑heavy WAM designs with an action‑centered architecture, achieving a ten‑fold inference speedup, ten‑fold training efficiency gain, and a 30% increase in real‑robot task success rate while reducing memory usage compared with Motus and Cosmos‑Policy.

Action-Centered Architecture · Inference Optimization · Multimodal Learning

0 likes · 8 min read

Old Zhang's AI Learning

Mar 26, 2026 · Artificial Intelligence

Google’s TurboQuant Cuts KV‑Cache Memory 8× and Boosts LLM Inference Speed

Google’s TurboQuant reduces KV‑Cache memory by up to 4.6×, speeds 3‑bit attention computation up to 8× on H100, and delivers near‑zero accuracy loss across long‑context benchmarks, with open‑source implementations for Metal, vLLM and llama.cpp.

Google · Inference Optimization · KV Cache

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

Mar 7, 2026 · Artificial Intelligence

How Princeton’s DYSCO Decoder Boosts Long-Context Reasoning by 25% Without Fine‑Tuning

The DYSCO (Dynamic Attention‑Scaling Decoding) algorithm, introduced by Danqi Chen’s team at Princeton together with NYU, restores performance on long‑context tasks without any fine‑tuning, delivering up to a 25% relative gain on 128K‑token benchmarks while adding only about 3.8% extra FLOPs.

DYSCO · Dynamic Attention Scaling · Inference Optimization

0 likes · 10 min read

SuanNi

Feb 27, 2026 · Artificial Intelligence

How Dual‑Channel Loading Doubles LLM Inference Throughput

The article analyzes the storage‑bandwidth bottleneck of agent‑style large language models, explains why traditional pre‑fill and decode architectures underutilize network resources, and details a dual‑channel loading and smart scheduling design that unlocks idle bandwidth, achieving up to 1.9× higher throughput in both offline and online inference workloads.

AI infrastructure · Dual-Channel Loading · Inference Optimization

0 likes · 14 min read

Data Party THU

Feb 25, 2026 · Artificial Intelligence

Why Multimodal LLMs Miss Tiny Objects—and How to Fix It

This article analyzes why multimodal large language models often fail to detect small objects, identifies three core bottlenecks, and presents a four‑tiered optimization roadmap—from zero‑cost inference tricks to data augmentation, model fine‑tuning, and engineering safeguards—backed by three real‑world case studies and actionable guidelines.

Inference Optimization · data augmentation · model fine-tuning

0 likes · 20 min read

AI Engineering

Feb 16, 2026 · Artificial Intelligence

Qwen3.5-397B: 397B‑Parameter Multimodal LLM Boosts Inference Speed 8‑19×

Alibaba’s Qwen3.5-397B-A17B, a 397‑billion‑parameter open‑source multimodal LLM, combines mixed linear attention with a sparse MoE architecture to achieve 8.6‑19× higher decoding throughput than Qwen3‑Max, supports 201 languages, and can be deployed via vLLM, Docker, Transformers, or SGLang with various optimization presets.

Inference Optimization · Large Language Model · Sparse MoE

0 likes · 8 min read

DeWu Technology

Feb 11, 2026 · Artificial Intelligence

How Generative Models Transform Re‑ranking Architecture for Faster, More Diverse Recommendations

This article examines the evolution of re‑ranking systems from traditional pointwise models to a two‑stage generation‑evaluation framework, compares autoregressive and non‑autoregressive generative approaches, details inference speed optimizations with GPU and model‑server upgrades, and outlines a future end‑to‑end sequence generation architecture enhanced by reinforcement learning and contrastive learning.

AI · Generative Models · Inference Optimization

0 likes · 14 min read

Tencent Technical Engineering

Jan 30, 2026 · Artificial Intelligence

Can Rendering Thought Chains as Images Speed Up LLM Reasoning?

This article introduces Render‑of‑Thought (RoT), a novel paradigm that compresses chain‑of‑thought reasoning into visual embeddings using frozen vision encoders, achieving 3‑4× token reduction, faster inference, and improved interpretability while requiring minimal pre‑training.

Inference Optimization · Latent Space · Multimodal

0 likes · 12 min read

Alibaba Cloud Developer

Jan 26, 2026 · Artificial Intelligence

How We Scaled a 3.5B MoE LLM for Real‑Time Search Relevance

This article details the engineering challenges and solutions for deploying a 3.5 billion‑parameter MoE LLM in Taobao's search relevance pipeline, covering large‑batch scheduling, dynamic load balancing, intra‑batch KV‑Cache reuse, and MoE kernel tuning to meet sub‑second latency requirements.

Inference Optimization · KV Cache · LLM

0 likes · 15 min read

Alibaba Cloud Developer

Dec 23, 2025 · Artificial Intelligence

How Hybrid Transformer‑Mamba Architectures Overcome KVCache Challenges in Large‑Model Inference

This article explains how SGLang’s hybrid model design combines Transformer attention with Mamba state‑space layers, introduces a dual‑pool memory architecture and elastic allocation, and presents specialized prefix‑cache and speculative‑decoding techniques that together enable efficient, scalable inference for long‑context large language models.

Inference Optimization · KVCache · SGLang

0 likes · 22 min read

MaGe Linux Operations

Dec 19, 2025 · Artificial Intelligence

Boost vLLM Inference Throughput by 40% with Three Simple Config Tweaks

Starting from the observation that only a few vLLM settings materially affect performance, this guide shows how adjusting gpu_memory_utilization and max_num_batched_tokens and enabling chunked prefill can raise Qwen2.5‑72B‑Instruct throughput from ~1800 to over 2500 tokens/s and improve latency, and it provides comprehensive deployment, monitoring, and troubleshooting instructions.

Docker · GPU · Inference Optimization

0 likes · 30 min read
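
As a rough illustration of where those three settings go, the sketch below assembles a `vllm serve` command line. The flag names are vLLM's server arguments; the numeric values, the tensor-parallel size, and the helper `build_vllm_serve_command` are assumptions chosen for illustration, not the settings reported in the article.

```python
import shlex

def build_vllm_serve_command(model, gpu_memory_utilization,
                             max_num_batched_tokens, tensor_parallel_size):
    """Assemble a `vllm serve` command with the three tuning knobs set."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        # Fraction of each GPU's memory vLLM may claim (weights + KV cache).
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        # Upper bound on tokens scheduled per engine step.
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        # Split long prefills into chunks that can be batched with decodes.
        "--enable-chunked-prefill",
    ]
    return shlex.join(args)

# Illustrative values only; tune them against your own hardware and workload.
COMMAND = build_vllm_serve_command(
    model="Qwen/Qwen2.5-72B-Instruct",
    gpu_memory_utilization=0.92,
    max_num_batched_tokens=8192,
    tensor_parallel_size=4,
)
print(COMMAND)
```

Building the command in code keeps the tuned values in one reviewable place, which helps when comparing throughput before and after a change.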

Tencent Cloud Developer

Dec 9, 2025 · Artificial Intelligence

How Do Large Language Models Turn Text into Math? A Deep Dive into Transformers

This article walks through the complete workflow of AI large language models, from turning user queries into token matrices via tokenization and embedding, through the Transformer’s self‑attention and multi‑head mechanisms, to decoding logits into human‑readable text, while also covering position encoding, long‑context strategies, generation parameters, and practical engineering tips.

Inference Optimization · Self-Attention · Transformer

0 likes · 29 min read

Tencent Technical Engineering

Oct 31, 2025 · Artificial Intelligence

How SpecExit Cuts LLM Reasoning Chains by 66% and Boosts Inference Speed 2.5×

SpecExit combines speculative sampling with a lightweight draft model to predict early‑exit signals, shortening the reasoning chains of large reasoning models by up to two‑thirds and achieving up to 2.5× end‑to‑end inference acceleration on vLLM without sacrificing accuracy.

AI efficiency · Early Stopping · Inference Optimization

0 likes · 12 min read

Baidu Intelligent Cloud Tech Hub

Oct 28, 2025 · Artificial Intelligence

How Baidu’s New MTP Inference Code Doubles DeepSeek‑V3.2 Throughput

Baidu Baige and the SGLang community have open‑sourced a production‑tested MTP inference engine that boosts DeepSeek‑V3.2 decoding speed by over two times while delivering exceptional stability, thanks to a DSA‑optimized architecture that predicts multiple tokens in a single forward pass.

AI · DSA · DeepSeek

0 likes · 4 min read

Data Party THU

Oct 21, 2025 · Artificial Intelligence

Can Linear‑Time LSTMs Beat Transformers? Scaling Laws Reveal the Answer

The paper presents a systematic scaling‑law study of the linear‑time xLSTM architecture versus quadratic‑time Transformers, evaluating parameter‑data loss surfaces, optimal model size under equal FLOP budgets, and inference latency components, and shows that xLSTM consistently offers better cost‑effectiveness across diverse contexts and budgets.

Inference Optimization · Linear Time Complexity · Transformer

0 likes · 11 min read

AntTech

Oct 13, 2025 · Artificial Intelligence

How dInfer Accelerates Diffusion LLM Inference Over 10× Faster Than Fast‑dLLM

Ant Group's open‑source dInfer framework dramatically speeds up diffusion language model inference—achieving more than a ten‑fold boost over Fast‑dLLM, surpassing autoregressive baselines, and delivering 1011 tokens per second on HumanEval—by tackling computational cost, KV‑cache invalidation, and parallel decoding challenges through modular system‑level innovations.

AI Performance · Diffusion Language Model · Inference Optimization

0 likes · 11 min read

DataFunSummit

Oct 8, 2025 · Artificial Intelligence

How EasyRec Boosts Recommendation Training and Inference Performance

This article explains the EasyRec recommendation system’s training and inference architecture, detailing optimization techniques such as embedding parallelism, CPU/GPU placement, XLA and TRT fusion, online learning pipelines, network compression, and real‑world deployment results that dramatically improve throughput and latency.

AI infrastructure · EasyRec · Inference Optimization

0 likes · 15 min read

DataFunTalk

Sep 30, 2025 · Artificial Intelligence

DeepSeek‑V3.2‑Exp Unveiled: Million‑Token Context, Sparse Attention, and Cost‑Effective Inference

DeepSeek‑V3.2‑Exp, the latest experimental large‑language model, is open‑sourced with a paper, featuring a million‑token context window, a new sparse attention mechanism, GRPO‑enhanced reasoning, and detailed cost‑analysis showing up to ten‑fold inference savings.

DeepSeek · GRPO · Inference Optimization

0 likes · 5 min read

AntTech

Sep 14, 2025 · Artificial Intelligence

Ring-mini-2.0: How a 16B MoE Model Delivers 128K Context and 500+ Tokens/s

Ring-mini-2.0 is a high‑performance inference MoE model that activates only 1.4 B of its 16 B total parameters, matching the quality of dense models under 10 B while supporting a 128 K context length and generation speeds of over 300 tokens/s.

AI · Inference Optimization · MoE

0 likes · 4 min read

DataFunSummit

Sep 11, 2025 · Artificial Intelligence

How Meituan’s MTGR is Redefining Generative Recommendation at Scale

This article explains why Meituan introduced a generative recommendation model, describes the MTGR architecture, data organization, training and inference engines built on TorchRec and TensorRT, reports performance gains and cost reductions, and outlines future directions such as simplifying the recommendation funnel and cross‑business heterogeneous modeling.

Generative Recommendation · Inference Optimization · MTGR

0 likes · 15 min read

Eric Tech Circle

Sep 10, 2025 · Artificial Intelligence

Deploy High‑Performance Local LLMs with vLLM: A Step‑by‑Step Guide

This article walks through installing and configuring vLLM for local large language model inference, compares it with Ollama and LM Studio, details environment setup, model download, testing scripts, and shows how to expose an OpenAI‑compatible API for production use.

Inference Optimization · Large Language Model · ModelScope

0 likes · 11 min read

Efficient Ops

Sep 2, 2025 · Artificial Intelligence

Inside Meituan’s LongCat‑Flash‑Chat: 560B‑Parameter MoE Model with Ultra‑Fast Inference

Meituan has open‑sourced LongCat‑Flash‑Chat, a 560‑billion‑parameter Mixture‑of‑Experts model that activates only a fraction of its weights per token, delivering mainstream‑level performance, high inference speed, and low cost for complex agent applications.

Artificial Intelligence · Inference Optimization · Large Language Model

0 likes · 4 min read

Baobao Algorithm Notes

Sep 2, 2025 · Artificial Intelligence

How LongCat‑Flash Achieves Record Speed and Efficiency for a 560B MoE Model

LongCat‑Flash is a 560‑billion‑parameter Mixture‑of‑Experts LLM that combines a dynamic zero‑computation expert design, shortcut‑connected MoE communication, variance‑aligned scaling, and a three‑stage agent‑centric pre‑training pipeline, delivering over 100 TPS on H800 GPUs at a cost of $0.70 per million tokens.

Artificial Intelligence · Inference Optimization · Large Language Model

0 likes · 23 min read

Architects' Tech Alliance

Aug 18, 2025 · Artificial Intelligence

How Large Model Training Dominates Compute and What New Techniques Can Change It

This article explains why pre‑training large AI models consumes 90‑99% of total compute, describes the full training and inference pipelines, introduces resource‑saving strategies such as PD‑separation, and reviews market trends and infrastructure challenges shaping the next generation of AI systems.

AI infrastructure · AI training · GPU architecture

0 likes · 13 min read

AIWalker

Aug 4, 2025 · Artificial Intelligence

Can Lumina-mGPT 2.0 Replace Diffusion Models? A Deep Dive into Its Autoregressive Power

Lumina-mGPT 2.0 is a decoder‑only autoregressive image model trained from scratch that rivals diffusion systems like DALL·E 3 in quality while offering unified multimodal tokenization, flexible multi‑task generation, and several inference‑speed tricks, yet it still faces licensing, scaling and sampling‑time challenges.

AI model analysis · Inference Optimization · Lumina-mGPT

0 likes · 22 min read

Tencent Technical Engineering

Jul 18, 2025 · Artificial Intelligence

From CPUs to GPUs: How Traditional Backend Skills Power Modern AI Infrastructure

This article explores the evolution of AI infrastructure, comparing it with traditional backend systems, and details how hardware shifts to GPU-centric designs, software adaptations like deep learning frameworks, and engineering challenges in model training and inference can be addressed using established backend methodologies.

AI infrastructure · GPU computing · Inference Optimization

0 likes · 19 min read

Tencent Technical Engineering

Jul 11, 2025 · Artificial Intelligence

How DeepSeek Achieved 15,800+ Tokens/s: Full‑Stack Inference Optimizations

This article details the Angel‑HCF team's end‑to‑end DeepSeek inference optimizations—including PD separation, multi‑layer MTP, EP and DP parallelism, hardware‑aware kernels, and load‑balancing strategies—that boost throughput to over 15,800 tokens per second while keeping per‑token latency under 50 ms.

AI Performance · DeepSeek · GPU utilization

0 likes · 13 min read

DataFunSummit

Jul 4, 2025 · Artificial Intelligence

How EasyRec Boosts Recommendation Performance: Training, Inference, and Online Learning Optimizations

This article explains the EasyRec recommendation system's training and inference architecture, details a series of optimizations for both CPU and GPU pipelines, and describes the online learning workflow that enables real‑time model updates across large‑scale e‑commerce scenarios.

AI · Inference Optimization · Online Learning

0 likes · 16 min read

DataFunSummit

Jun 20, 2025 · Artificial Intelligence

EasyRec Deep Dive: Training & Inference Architecture, Optimizations, and Online Learning

This article explains EasyRec's end‑to‑end recommendation system, covering its training‑inference architecture, a series of CPU/GPU and distributed optimizations, and a real‑time online‑learning pipeline that together improve throughput, latency, and model freshness.

AI infrastructure · Distributed computing · Inference Optimization

0 likes · 15 min read

Alibaba Cloud Developer

Jun 10, 2025 · Artificial Intelligence

How AI Application Architectures Evolve: From Simple LLM Calls to Guardrails, Routing, and Agents

This article traces the evolution of AI application architectures—from the earliest minimal user‑LLM interaction to advanced designs featuring context enhancement, input/output guardrails, intent routing, model gateways, caching strategies, agent capabilities, monitoring, and inference performance optimizations—providing practical insights and references for developers.

AI Architecture · Agent · Caching

0 likes · 21 min read

Meituan Technology Team

May 15, 2025 · Artificial Intelligence

How Meituan’s MTGR Framework Achieved 65× Faster Inference with Scaling Laws

Meituan’s recommendation team introduced the MTGR framework, aligning traditional DLRM features with a unified HSTU‑based Transformer to explore scaling laws, delivering a 65‑fold increase in model FLOPs, 12% lower inference cost, and record gains in online CTR and order volume across its food‑delivery platform.

Inference Optimization · MTGR · Recommendation Systems

0 likes · 26 min read

Baidu Geek Talk

May 12, 2025 · Artificial Intelligence

One‑Click Deployment of Qwen3 Large Models on Baidu Baige AI Platform

This guide explains how to use Baidu Baige's AI heterogeneous computing platform to deploy the eight‑model Qwen3 family—including dense and MoE variants—via a one‑click process, covering resource configuration, inference acceleration options, and post‑deployment service access.

AI · Baidu Baige · Cloud AI

0 likes · 4 min read

AI Algorithm Path

May 1, 2025 · Artificial Intelligence

Uncovering the Secrets of LLM Inference Optimization

This article dissects the major bottlenecks of large‑language‑model serving—prefill vs. decode, sparsity, memory bandwidth, KV‑cache growth—and walks through concrete engineering tricks such as paged attention, radix‑tree KV caches, compressed attention, speculative decoding, FlexGen weight scheduling, FastServe queuing, plus a runnable vLLM code snippet.

FastServe · FlexGen · Inference Optimization

0 likes · 18 min read


JD Retail Technology

Apr 22, 2025 · Artificial Intelligence

Generative Large‑Model Architecture for JD Advertising: Practices, Challenges, and Optimization

JD’s advertising platform replaces rule‑based recall with a generative large‑model pipeline that unifies e‑commerce knowledge, multimodal user intent, and semantic IDs across recall, coarse‑ranking, fine‑ranking and creative optimization, while meeting sub‑100 ms latency and sub‑¥1‑per‑million‑token cost through quantization, parallelism, caching, and joint generative‑discriminative inference, delivering double‑digit performance gains and paving the way for domain‑specific foundation models.

Advertising · Distributed Systems · Inference Optimization

0 likes · 20 min read


Architects' Tech Alliance

Apr 13, 2025 · Artificial Intelligence

Deploying DeepSeek LLMs On-Premises: Step‑by‑Step Guide and Hardware Sizing

This article provides a comprehensive technical guide for privately deploying DeepSeek large language models, covering model and runtime parameter selection, hardware sizing calculations, software stack preparation, inference service setup, performance tuning, and security monitoring considerations.

AI hardware sizing · DeepSeek · Inference Optimization

0 likes · 14 min read


Ops Development & AI Practice

Mar 19, 2025 · Artificial Intelligence

Can Cache‑Augmented Generation Outperform RAG? A Deep Dive into LLM Efficiency

Cache‑augmented generation (CAG) preloads documents into LLM context using KV caches to eliminate retrieval latency, offering faster inference for static knowledge bases, while RAG remains more flexible for dynamic or large corpora; this article compares their definitions, performance, implementation steps, and future prospects.

CAG · Cache Augmentation · Inference Optimization

0 likes · 11 min read


Baidu Tech Salon

Mar 13, 2025 · Artificial Intelligence

How PaddlePaddle 3.0 Boosts Large‑Model Inference with 4‑Bit Quantization and MLA Optimizations

PaddlePaddle 3.0 introduces a full‑stack inference engine that supports FP8, INT8, and 4‑bit quantization for popular LLMs such as DeepSeek V3/R1, delivers up to 2× token throughput on a single H800 GPU, and provides detailed deployment scripts for single‑node and multi‑node setups, including MTP speculative decoding and SageAttention for long‑sequence acceleration.

Docker · Inference Optimization · MLA

0 likes · 13 min read


Java Architecture Diary

Mar 7, 2025 · Artificial Intelligence

Boost Inference Efficiency with QwQ-32B: Benchmarks, Resource Savings, and Java Integration

QwQ-32B, Alibaba’s new inference‑optimized large language model built on the Qwen2.5 architecture, outperforms DeepSeek‑R1 across math reasoning, code generation, and safety benchmarks while requiring only 24 GB of VRAM. The article provides detailed performance data, resource‑efficiency analysis, and step‑by‑step Java and Ollama integration instructions.

Function Calling · Inference Optimization · Java integration

0 likes · 7 min read


Architects' Tech Alliance

Feb 27, 2025 · Artificial Intelligence

How Inspur Metabrain R1 Server Enables 1000+ Concurrent Users for DeepSeek 671B via SGLang Optimization

The Inspur Metabrain R1 inference server, equipped with FP8 acceleration and a 1128 GB HBM3e memory pool, has been tightly integrated with SGLang 0.4.3 to run the 671‑billion‑parameter DeepSeek R1 model, delivering over 1,000 concurrent user sessions and up to 3,976 tokens/s throughput.

AI server · DeepSeek · Inference Optimization

0 likes · 5 min read


Architects' Tech Alliance

Feb 12, 2025 · Industry Insights

DeepSeek’s Technical Innovations: MoE Architecture, Efficient Inference, and Multimodal Capabilities

The article analyzes DeepSeek’s recent breakthroughs—including its Mixture‑of‑Experts architecture, cost‑effective inference optimizations, high‑accuracy multimodal processing, and open‑source collaboration—while also offering a curated bundle of technical e‑books covering AI chips, networking, storage, and more.

Artificial Intelligence · DeepSeek · Industry Insights

0 likes · 4 min read


DeWu Technology

Feb 12, 2025 · Artificial Intelligence

Edge Intelligence for Intelligent Video Cover Recommendation

The article describes an edge‑based video‑cover recommendation system for DeWu that leverages the MNN SDK and a lightweight MobileNetV3 model, performing on‑device inference with quantization and parallel processing to automatically select high‑quality covers, achieving sub‑second latency and boosting click‑through rates by up to 18%.

Edge AI · Inference Optimization · Model Deployment

0 likes · 12 min read


JD Retail Technology

Feb 12, 2025 · Artificial Intelligence

Accelerating Generative Recommendation with NVIDIA TensorRT‑LLM in JD Advertising

JD Advertising accelerates its generative‑recall recommendation system by integrating NVIDIA TensorRT‑LLM, which simplifies the pipeline, injects LLM knowledge, scales to billions of parameters, and delivers over five‑fold throughput gains, one‑fifth the cost, and significant CTR improvements in both recommendation and search.

Inference Optimization · LLM · Recommendation Systems

0 likes · 13 min read


DataFunTalk

Jan 26, 2025 · Artificial Intelligence

58.com’s LingXi Large Language Model Platform: Development, Deployment, and Performance Optimizations

Since the launch of ChatGPT, 58.com has built a Model‑as‑a‑Service platform called LingXi that trains and serves domain‑specific large language models, supports over a hundred internal scenarios with daily inference exceeding ten million calls, and continuously improves performance through quantization, GPU optimization, model miniaturization, and advanced AI applications such as interview assistants, voice agents, and RAG‑enabled agents.

AI Platform · AI applications · Inference Optimization

0 likes · 9 min read


JD Tech Talk

Jan 14, 2025 · Artificial Intelligence

Advantages and Engineering Implementation of Generative Recommendation Systems Using Large Language Models

This article explains how generative recommendation systems powered by large language models simplify the recommendation pipeline, integrate world knowledge, benefit from scaling laws, and require specialized engineering optimizations such as TensorRT‑LLM deployment, inference acceleration, and hybrid model strategies to achieve low latency and high throughput in real‑world e‑commerce scenarios.

AI · Inference Optimization · LLM

0 likes · 10 min read


Baobao Algorithm Notes

Jan 3, 2025 · Artificial Intelligence

How DeepSeek-V3 Achieves Massive Scale with FP8, MoE, and System Optimizations

The article examines DeepSeek‑V3’s architecture and training pipeline, highlighting its use of MLA and a highly granular MoE design, pioneering FP8 mixed‑precision training, fine‑grained per‑tile quantization, advanced parallelism strategies, and inference optimizations such as prefill‑decode (PD) separation and NanoFlow to achieve unprecedented efficiency on limited GPU resources.

DeepSeek-V3 · FP8 · Inference Optimization

0 likes · 10 min read


DataFunSummit

Dec 31, 2024 · Artificial Intelligence

How Momo Leverages Large Model Technology to Transform Business and R&D Processes

This article explains how Momo utilizes large language model technologies to revamp its AI application paradigm, achieve efficient inference through quantization and prefix caching, build a workflow‑based model platform, and outline future plans for framework optimization and multimodal support.

AI Platform · Inference Optimization · Momo

0 likes · 16 min read


DataFunSummit

Nov 22, 2024 · Artificial Intelligence

EasyRec Recommendation Algorithm Training and Inference Optimization

This article presents a comprehensive overview of EasyRec’s recommendation system architecture, detailing training and inference optimizations, embedding parallelism, CPU/GPU placement strategies, online learning pipelines, and network compression techniques that together improve scalability, latency, and cost efficiency.

Distributed Systems · EasyRec · Inference Optimization

0 likes · 15 min read


Alibaba Cloud Big Data AI Platform

Sep 26, 2024 · Artificial Intelligence

How Alibaba Cloud’s PAI Tackles Large‑Model Training and Inference Challenges in 2024

At the 2024 Yunqi Conference, Alibaba Cloud’s AI Infra experts detailed the latest challenges of large‑model deployment—such as hardware costs, resource management, and software‑hardware coordination—and introduced PAI’s new capabilities, including stability tools, automated distributed training, reinforcement‑learning frameworks, inference optimizations, and integrated big‑data AI solutions.

AI Infra · Big Data Integration · Inference Optimization

0 likes · 14 min read


Sohu Tech Products

Aug 28, 2024 · Artificial Intelligence

EasyRec Recommendation Algorithm Training and Inference Optimization

EasyRec, Alibaba Cloud’s modular recommendation framework, unifies configurable data, embedding, dense, and output layers on MaxCompute, EMR, and DLC, and speeds training with deduplication, EmbeddingParallel sharding, lock‑free hash tables, GPU embeddings, and AMX BF16, while inference benefits from operator fusion, low‑precision AVX/AMX kernels, compact caches, batch merging, and network compression, enabling real‑time online learning and delivering higher recommendation quality at lower cost in e‑commerce.

Alibaba Cloud · EasyRec · Inference Optimization

0 likes · 14 min read

DataFunTalk

Aug 26, 2024 · Artificial Intelligence

EasyRec Recommendation Algorithm Training and Inference Optimization

This article presents a comprehensive overview of EasyRec's recommendation system architecture, detailing training and inference optimizations, distributed deployment strategies, operator fusion techniques, online learning pipelines, and network-level improvements to enhance performance and scalability.

AI · Inference Optimization · Training Optimization

0 likes · 15 min read

Baobao Algorithm Notes

Aug 26, 2024 · Artificial Intelligence

Master Essential LLM Engineering Skills: Transform, Model, and Infer with Custom Scripts

This guide presents a hands‑on curriculum of core large‑model engineering tasks—including model conversion scripts, custom modeling wrappers, multi‑model inference utilities, and channel‑aware loss tracking—to help practitioners build practical, reusable tools without deep theoretical overhead.

AI Engineering · Inference Optimization · Python scripting

0 likes · 8 min read


Baidu Tech Salon

May 15, 2024 · Artificial Intelligence

Accelerating Large Model Training and Inference with Baidu Baige AIAK‑LLM

Baidu Baige’s AIAK‑LLM suite accelerates large‑model training and inference by boosting Model FLOPS Utilization through techniques such as TP communication overlap, hybrid recompute, zero‑offload, automatic parallel‑strategy search, multi‑chip support, and inference‑specific optimizations, achieving over 60% speedup and seamless Hugging Face integration.

AI infrastructure · AIAK-LLM · Baidu Baige

0 likes · 26 min read


Baidu Geek Talk

May 15, 2024 · Artificial Intelligence

Accelerating Large Model Training and Inference with Baidu Baige AIAK‑LLM: Challenges, Techniques, and Optimizations

The talk outlines how Baidu’s Baige AIAK‑LLM suite tackles the exploding compute demands of trillion‑parameter models by boosting Model FLOPS Utilization through advanced parallelism, memory‑saving recompute, zero‑offload, adaptive scheduling, and cross‑chip orchestration, delivering 30‑60% training and inference speedups and a unified cloud product.

AI infrastructure · Baidu · Inference Optimization

0 likes · 25 min read


iQIYI Technical Product Team

Mar 15, 2024 · Artificial Intelligence

Optimizing GPU Inference for CTR Models: Kernel Fusion, Multi‑Stream Execution, and Batch Merging

By fusing sparse‑feature operators, enabling multi‑stream execution, consolidating data copies, and merging inference batches, iQIYI reduced GPU CTR model latency to CPU‑level, boosted throughput over sixfold, and cut operational costs by more than 40%, overcoming launch‑overhead bottlenecks.

CTR · GPU · Inference Optimization

0 likes · 10 min read


Alibaba Cloud Developer

Feb 20, 2024 · Artificial Intelligence

Boost LLM Inference Speed with KV‑Cache Reuse and Speculative Sampling

This article explains two production‑grade optimization techniques for large language model inference—KV‑cache reuse across multi‑turn dialogues and speculative sampling with a small draft model—detailing their design, implementation, and performance impact.

AI · Inference Optimization · KV Cache

0 likes · 14 min read


Alibaba Cloud Big Data AI Platform

May 30, 2023 · Artificial Intelligence

Boost Stable Diffusion Inference with PAI-Blade: LoRA & ControlNet Optimization

This article explains how to use PAI-Blade to accelerate Stable Diffusion inference by optimizing LoRA and ControlNet components, detailing configuration steps, code modifications, benchmark results on A100/A10 GPUs, and integration with both Diffusers and the popular Stable-Diffusion-webui, highlighting performance gains and memory savings.

ControlNet · GPU Benchmark · Inference Optimization

0 likes · 8 min read


Alibaba Cloud Big Data AI Platform

May 29, 2023 · Artificial Intelligence

How PAI‑Blade Supercharges Stable Diffusion Inference on GPUs

This article explains how PAI‑Blade, built on the BladeDISC compiler and BlaDNN library, dramatically reduces latency and memory usage for Stable Diffusion inference, provides step‑by‑step usage examples with code, shows performance gains on A100 and A10 GPUs, and outlines future optimization directions.

GPU · Inference Optimization · PAI-Blade

0 likes · 9 min read


Alimama Tech

Nov 2, 2022 · Artificial Intelligence

Optimizing GPU Utilization for Multimedia AI Services with high_service

The article presents high_service, a high‑performance inference framework that boosts GPU utilization in multimedia AI services by separating CPU‑heavy preprocessing from GPU inference, employing priority‑based auto‑scaling, multi‑tenant sharing, and TensorRT‑accelerated models to eliminate GIL bottlenecks, reduce waste, and adapt to fluctuating traffic, with future work targeting automated bottleneck detection and further CPU‑GPU offloading.

Auto Scaling · GPU utilization · High Performance Computing

0 likes · 19 min read


DataFunSummit

Apr 19, 2022 · Artificial Intelligence

DeepSpeed‑MoE: End‑to‑End Training and Inference Solutions for Mixture‑of‑Experts Models

This article reviews DeepSpeed‑MoE, an end‑to‑end system that introduces new MoE architectures, model‑compression techniques, and highly optimized inference pipelines, detailing its motivation, design of PR‑MoE (Pyramid‑MoE and Residual‑MoE), distributed parallel strategies, communication and kernel optimizations, and performance gains over dense baselines.

AI · DeepSpeed · Inference Optimization

0 likes · 11 min read


Baidu Geek Talk

Apr 1, 2022 · Artificial Intelligence

How Paddle Lite & PaddleSlim Supercharge Edge AI Inference Performance

With the rapid rise of edge computing, deploying AI models for tasks like object detection, OCR, and speech recognition on resource‑constrained devices faces speed challenges; the upgraded Paddle Lite inference engine and PaddleSlim compression tools claim up to 23% faster inference and significant model size reductions, offering a practical solution.

AI deployment · Edge AI · Inference Optimization

0 likes · 6 min read


DataFunTalk

Dec 25, 2020 · Artificial Intelligence

Exploring Pretraining Model Optimization and Deployment Challenges in NLP

This article reviews the evolution of pretraining models in NLP, discusses the practical challenges of deploying large models such as inference latency, knowledge integration, and task adaptation, and presents Xiaomi’s optimization techniques including knowledge distillation, low‑precision inference, operator fusion, and multi‑granularity segmentation for dialogue systems.

BERT · Dialogue Systems · Inference Optimization

0 likes · 15 min read


58 Tech

Nov 20, 2020 · Artificial Intelligence

Evolution and Practice of the 58.com AI Algorithm Platform (WPAI)

The article details the development, architecture, and optimization of 58.com’s AI algorithm platform (WPAI), covering its background, overall design, large‑scale distributed machine learning, deep‑learning platform features, inference performance enhancements, GPU resource scheduling improvements, and future directions.

AI Platform · GPU scheduling · Inference Optimization

0 likes · 15 min read


DataFunTalk

Jul 7, 2020 · Artificial Intelligence

Optimizing Pretrained Language Model Inference: Lessons from the NLPCC Small Model Competition and Deployment at Xiaomi

This article shares the Xiaomi AI Lab NLP team's experience in the NLPCC lightweight language model competition, discusses efficiency challenges of large pretrained models like BERT, and details practical inference optimizations—including model distillation, batching, FP16 quantization, and FasterTransformer integration—that dramatically reduce latency and hardware costs in production.

AI · BERT · Inference Optimization

0 likes · 15 min read


iQIYI Technical Product Team

Dec 21, 2018 · Artificial Intelligence

CPU-Based Optimization of Deep Learning Inference Services

To alleviate GPU scarcity, iQIYI’s cloud platform migrated deep‑learning inference to CPUs and applied system‑level (MKL‑DNN, OpenVINO), application‑level, and algorithm‑level optimizations—tuning threads, batch size, NUMA, pruning and quantization—delivering 1‑9× speedups across thousands of cores while preserving latency and accuracy.

CPU · Inference Optimization · MKL-DNN

0 likes · 14 min read


Alibaba Cloud Developer

Sep 28, 2017 · Artificial Intelligence

How Alipay’s xNN Brings Deep Learning to Millions of Mobile Devices

This article explains how Alipay’s xNN engine overcomes mobile deep‑learning challenges through aggressive model compression, lightweight SDK design, algorithm‑ and instruction‑level optimizations, enabling high‑accuracy AI inference on a wide range of Android and iOS devices with minimal app‑size impact.

Alipay · Inference Optimization · Model Compression

0 likes · 13 min read
