Efficient Long-Sequence Modeling: Linear & Sparse Attention, MegaKernels, RL Tricks

This article reviews recent 2025 advances in long‑sequence LLM inference, covering Kimi Linear attention, DuoAttention and DeepSeek Sparse Attention, MegaKernel and MPK designs for kernel‑level efficiency, reinforcement‑learning rollout optimizations, and the Tawa deep‑learning compiler framework.

Deep Learning CompilerLLM optimizationLinear Attention

0 likes · 22 min read

Efficient Long-Sequence Modeling: Linear & Sparse Attention, MegaKernels, RL Tricks

JD Tech

Mar 18, 2024 · Artificial Intelligence

High‑Performance Inference Architecture: Distributed Graph Heterogeneous Computing Framework and GPU Multi‑Stream Optimization

The article describes how JD’s advertising team tackled the high‑concurrency, low‑latency challenges of online recommendation inference by designing a distributed graph heterogeneous computing framework, optimizing GPU kernel launches with TensorBatch, deep‑learning compiler techniques, and a multi‑stream GPU architecture, achieving significant throughput and latency improvements.

AI inferenceDeep Learning CompilerDistributed computing

0 likes · 14 min read

High‑Performance Inference Architecture: Distributed Graph Heterogeneous Computing Framework and GPU Multi‑Stream Optimization

JD Cloud Developers

Mar 14, 2024 · Artificial Intelligence

How JD Retail Boosted Online Recommendation Inference with Distributed Heterogeneous Computing

This article details JD Retail's ad‑tech team's deep‑compute optimizations—including a distributed graph‑based heterogeneous framework, GPU‑focused inference engine enhancements, TensorBatch request aggregation, deep‑learning compiler bucket pre‑compilation, asynchronous compilation, and multi‑stream GPU processing—to overcome high‑concurrency, low‑latency online recommendation challenges.

Deep Learning CompilerDistributed computingGPU inference

0 likes · 14 min read

How JD Retail Boosted Online Recommendation Inference with Distributed Heterogeneous Computing

JD Retail Technology

Jan 25, 2024 · Artificial Intelligence

Optimizing High‑Concurrency Online Inference for Recommendation Models with Distributed Heterogeneous Computing and GPU Acceleration

This article describes how JD Retail's advertising technology team tackled the high‑compute demands of modern recommendation models by designing a distributed graph‑partitioned heterogeneous computing framework, introducing TensorBatch request aggregation, leveraging deep‑learning compiler bucketing and asynchronous compilation, and implementing a multi‑stream GPU architecture to dramatically improve online inference throughput and latency.

Deep Learning CompilerDistributed computingGPU Acceleration

0 likes · 13 min read

Optimizing High‑Concurrency Online Inference for Recommendation Models with Distributed Heterogeneous Computing and GPU Acceleration