Tagged articles

45 articles

Machine Learning Algorithms & Natural Language Processing

May 29, 2026 · Artificial Intelligence

RTPurbo: >97% Sparsity and 9× Faster Long-Context LLM Inference with Minimal Training

The article presents RTPurbo, a lightweight two‑stage training method that converts full‑attention LLMs into highly sparse models with over 97% sparsity, achieving up to 9.36× prefill and 2.01× decode speedups while preserving near‑lossless accuracy across long‑context benchmarks up to 512K tokens.

Dynamic Token Selection · Kernel Optimization · LLM inference

0 likes · 17 min read

Machine Learning Algorithms & Natural Language Processing

May 28, 2026 · Artificial Intelligence

Solo Development of GQLA: Challenging DeepSeek’s MLA and DSA

This article presents GQLA, a single‑author variant of MLA that eliminates three hardware‑related drawbacks of MLA, demonstrates how it achieves balanced compute‑memory performance on both high‑end H100 and more modest H20 GPUs, and details conversion methods (TransGQLA) and sparse extensions with concrete benchmark results.

GQLA · LLM · MLA

0 likes · 16 min read

SuanNi

May 27, 2026 · Artificial Intelligence

Inside Grok-5 and MiniMax-M3: Massive Model Upscale and New Sparse Attention Gains

The article reveals that xAI’s upcoming Grok-5 (Grok V9-Medium) will feature a 1.5-trillion-parameter model trained with extensive Cursor programming data, while MiniMax-M3 introduces a new sparse-attention architecture that boosts pre-fill speed by 9.7× and decode speed by 15.6×, highlighting a strategic partnership between SpaceX, Cursor, and xAI.

AI models · Cursor · Grok-5

0 likes · 5 min read

Data Party THU

May 16, 2026 · Artificial Intelligence

SubQ Beats Transformers: 12‑Million‑Token Context Model at Only 5% of Opus Cost

The article analyzes SubQ, a new LLM architecture using Subquadratic Sparse Attention (SSA) to achieve a 12‑million‑token context window with linear compute scaling, delivering up to 52× speedup and costing just 5% of Opus while matching dense‑attention performance on long‑context benchmarks.

SSA · Sparse Attention · SubQ

0 likes · 14 min read

Data Party THU

May 10, 2026 · Artificial Intelligence

SpikingBrain 2.0 Breaks Long‑Sequence and Low‑Power Bottlenecks in Brain‑Inspired LLMs

The Chinese Academy of Sciences unveils SpikingBrain 2.0‑5B, a brain‑inspired large model that uses dual‑space sparse attention and dual activation (FP8 and INT8‑Spiking) to cut training cost by over tenfold, achieve up to 15× speedup on long sequences, and match Qwen‑3 performance while drastically reducing power consumption.

Large Language Model · Sparse Attention · SpikingBrain2.0

0 likes · 10 min read

Machine Heart

May 6, 2026 · Artificial Intelligence

Beyond Transformers: SubQ Achieves 12‑Million‑Token Context at Just 5% of Opus Cost

The SubQ model introduces Subquadratic Sparse Attention (SSA), a content‑dependent routing mechanism that reduces attention complexity to linear, enabling a 12‑million‑token context window with a 52.2× speedup and only 5% of Opus's cost, as demonstrated on MRCR v2, RULER, and SWE‑Bench benchmarks.

LLM · Sparse Attention · SubQ

0 likes · 14 min read

Lao Guo's Learning Space

Apr 30, 2026 · Artificial Intelligence

How DeepSeek V4’s CSA + HCA Break the Million‑Token Barrier

Traditional full attention cannot handle million‑token contexts because its compute grows quadratically with sequence length and its KV cache grows with every token. DeepSeek V4’s Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) compress tokens, index them sparsely, and compute attention precisely on the selected ones, cutting the KV cache to 10% and FLOPs to 27% while enabling a 1‑M token window on a single GPU.

CSA · HCA · KV cache compression

0 likes · 12 min read

Machine Heart

Apr 30, 2026 · Artificial Intelligence

Beyond DeepSeek V4: A Trillion‑Parameter LLM Trained End‑to‑End on Domestic Chips

The article analyzes how both DeepSeek V4 and Meituan's LongCat‑2.0‑P preview, each with trillion‑scale parameters and 1 M‑token context, were trained and served entirely on Chinese‑made accelerators, detailing memory optimizations, deterministic operators, MoE redesigns, and massive multi‑card clusters that show domestic compute can meet top‑tier AI workloads.

Deterministic Ops · Domestic AI Chip · Large Language Model

0 likes · 13 min read

Architects' Tech Alliance

Apr 29, 2026 · Artificial Intelligence

DeepSeek V4: Open‑Source Bombshell That Shakes Closed‑Source AI Giants

DeepSeek V4’s preview launch unveils two open‑source LLM variants—V4‑Pro with 1.6 T parameters and V4‑Flash with 284 B—both supporting a default 1 M‑token context, and introduces novel mHC residual scheduling, hybrid CSA/HCA sparse attention, and Muon optimizer tricks that together deliver top‑tier performance rivaling closed‑source models across coding, long‑text, and reasoning benchmarks.

DeepSeek · Large Language Model · Sparse Attention

0 likes · 10 min read

AI Era Action Guide

Apr 24, 2026 · Artificial Intelligence

DeepSeek-V4 Launches with 1M Token Context and Leading Open-Source Agent – A Chinese AI Milestone

DeepSeek has unveiled the V4 preview, offering two open‑source large language models—Pro (1.6 T parameters) and Flash (284 B)—both supporting 1 million‑token context, sparse‑attention efficiency gains, top‑ranked Agent capabilities, and competitive reasoning performance, marking a major milestone for Chinese AI.

1M token context · Agent · DeepSeek

0 likes · 5 min read

Tech Musings

Apr 24, 2026 · Artificial Intelligence

DeepSeek-V4 Unveiled: 1M Context Length and Ascend Compute Power

DeepSeek has launched the open‑source DeepSeek‑V4 series, offering Pro and Flash models with a 1 million token context window, a novel sparse attention mechanism, performance that rivals Opus 4.6 on coding and knowledge benchmarks, tiered pricing, and future cost reductions once Ascend 950 supernodes become widely available.

1M context · AI benchmarking · DeepSeek V4

0 likes · 5 min read

AI Insight Log

Apr 24, 2026 · Artificial Intelligence

DeepSeek V4 Unveiled: 1.6 T Parameters, Million‑Token Context, Fully Open‑Source

DeepSeek V4 introduces two open‑source MoE models—Pro and Flash—with up to 1.6 T parameters, 1 M token context, a new DSA sparse‑attention mechanism, extensive benchmark results, and a tiered pricing scheme, while remaining compatible with OpenAI and Anthropic APIs.

DeepSeek · Large Language Model · Open Source

0 likes · 9 min read

AI Engineering

Apr 24, 2026 · Artificial Intelligence

DeepSeek V4 Unveiled: How Its Million-Token Context Redefines Open-Source LLMs

DeepSeek released the V4 preview, introducing V4‑Pro (1.6 T parameters, 49 B activated parameters, 33 T tokens) and V4‑Flash (284 B parameters, 13 B activated parameters, 32 T tokens) with 1 M token context, a novel DSA sparse‑attention mechanism that reduces compute and memory, and performance that rivals top closed‑source models in agentic coding, world‑knowledge and reasoning benchmarks, while offering an API compatible with OpenAI and Anthropic.

DeepSeek · Large Language Model · OpenAI API Compatibility

0 likes · 5 min read

ArcThink

Apr 11, 2026 · Artificial Intelligence

DeepSeek V4 Preview: A Sovereign Shift Beyond Benchmarks

Developers can sift through official silence and industry leaks—internal statements, Ascend 950PR supply‑chain hints, and sparse‑attention innovations—to assess DeepSeek V4’s likely technical leaps, from million‑token context to native Ascend training, and its strategic impact on the open‑source AI landscape and CUDA independence.

AI model analysis · DeepSeek · Huawei Ascend

0 likes · 27 min read

Machine Learning Algorithms & Natural Language Processing

Mar 24, 2026 · Artificial Intelligence

A Comprehensive Guide to Major Attention Mechanisms: From MHA and GQA to MLA, Sparse and Hybrid Architectures

This article reviews and compares the most important attention variants used in modern large language models—including multi‑head attention, grouped‑query attention, multi‑head latent attention, sparse and sliding‑window attention, gated attention, and hybrid designs—detailing their motivations, memory trade‑offs, example architectures, and experimental findings.

Hybrid Architecture · LLM · MHA

0 likes · 29 min read

SuanNi

Feb 23, 2026 · Artificial Intelligence

How GLM‑5 Breaks New Ground with Sparse Attention and Asynchronous RL

GLM‑5, the 744‑billion‑parameter open‑source LLM, introduces DeepSeek Sparse Attention, Multi‑head Latent Attention, Muon Split optimizer, and a fully asynchronous agentic reinforcement‑learning framework, achieving state‑of‑the‑art performance on long‑context, code, math, and multimodal benchmarks while running efficiently on domestic Chinese chips.

GLM-5 · Sparse Attention · asynchronous reinforcement learning

0 likes · 12 min read

PaperAgent

Feb 15, 2026 · Artificial Intelligence

How MiniCPM‑SALA Merges Sparse and Linear Attention to Break Long‑Context Limits

MiniCPM‑SALA introduces a hybrid sparse‑linear attention architecture that reduces quadratic compute and memory costs, achieves state‑of‑the‑art performance on long‑context benchmarks, and delivers up to 3.5× faster inference than full‑attention models on sequences up to 1 million tokens.

LLM · Linear Attention · Model architecture

0 likes · 17 min read

Machine Learning Algorithms & Natural Language Processing

Feb 12, 2026 · Artificial Intelligence

Is the Transformer Paradigm Shifting? SALA Handles Million‑Token Context on RTX 5090

The article presents SALA, a sparse‑linear hybrid attention architecture that replaces full attention in 9B‑parameter models, achieving comparable accuracy while cutting compute and memory costs, enabling million‑token inference on a single RTX 5090 and delivering up to 3.5× speed‑up over Qwen3‑8B.

Hybrid Position Encoding · LLM efficiency · Linear Attention

0 likes · 18 min read

AI Insight Log

Feb 12, 2026 · Artificial Intelligence

GLM-5 Unveiled: 744B Parameters, Claude Opus 4.5‑Level Performance, Epic Agent Upgrade

Z.ai released the open‑source GLM‑5 model with 744 billion parameters, 28.5 T tokens of training data, and new Sparse Attention and Slime RL infrastructure, achieving top open‑source rankings and near‑Claude Opus 4.5 performance on Vending Bench 2 and CC‑Bench‑V2 while adding multi‑scenario agent capabilities.

GLM-5 · Large Language Model · Sparse Attention

0 likes · 6 min read

Machine Learning Algorithms & Natural Language Processing

Feb 10, 2026 · Artificial Intelligence

Inside GLM-5: 745B Parameters, DeepSeek‑style Sparse Attention, and a 60% Stock Surge

The GLM-5 architecture, uncovered from a GitHub PR, doubles the previous model to 745 B parameters, adopts DeepSeek‑V3 sparse attention and multi‑token prediction, features a 78‑layer MoE with 256 experts, supports a 202K‑token context window, and its rumored test model "Pony Alpha" sparked a 60% rise in Zhipu AI's stock amid a crowded AI release season.

AI Stock Impact · DeepSeek · GLM-5

0 likes · 6 min read

Baobao Algorithm Notes

Feb 4, 2026 · Artificial Intelligence

Efficient Long-Sequence Modeling: Linear & Sparse Attention, MegaKernels, RL Tricks

This article reviews recent 2025 advances in long‑sequence LLM inference, covering Kimi Linear attention, DuoAttention and DeepSeek Sparse Attention, MegaKernel and MPK designs for kernel‑level efficiency, reinforcement‑learning rollout optimizations, and the Tawa deep‑learning compiler framework.

Deep Learning Compiler · LLM optimization · Linear Attention

0 likes · 22 min read

Alibaba Cloud Developer

Jan 15, 2026 · Artificial Intelligence

How Hierarchical Sparse Attention Breaks KVCache Limits for Ultra‑Long Context LLMs

This article explains how a hierarchical sparse‑attention framework redesigns KVCache storage across GPU, CPU, and remote memory, eliminates bandwidth and capacity bottlenecks, and enables efficient inference for 128K‑token and larger contexts with dramatically reduced GPU memory usage and higher throughput.

Dynamic Sparse Attention · GPU memory optimization · Hierarchical Storage

0 likes · 20 min read

Baidu Geek Talk

Dec 24, 2025 · Artificial Intelligence

Context Parallelism Slashes TTFT by 80% for 128K-Token LLMs

The article explains how Baidu’s Baige team integrated a Context Parallelism strategy into DeepSeek V3.2, detailing the DSA architecture, the limitations of traditional tensor and sequence parallelism, and how CP distributes computation and memory across GPUs to achieve up to an 80 % reduction in time‑to‑first‑token (TTFT) latency for ultra‑long 128K‑token contexts.

Context Parallelism · DeepSeek · LLM

0 likes · 9 min read

Baidu Intelligent Cloud Tech Hub

Dec 24, 2025 · Artificial Intelligence

How Context Parallelism Slashes LLM First‑Token Latency by 80% for 128K Tokens

The article explains how the newly merged Context Parallelism (CP) technique in SGLang, combined with DeepSeek V3.2's Sparse Attention architecture, reduces first‑token latency by up to 80% and alleviates memory pressure for ultra‑long 128K‑token sequences, detailing both algorithmic innovations and engineering solutions.

AI infrastructure · Context Parallelism · LLM

0 likes · 10 min read

Baidu Geek Talk

Dec 10, 2025 · Artificial Intelligence

How Offloading Latent Cache Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

This report analyzes the memory bottleneck of DeepSeek‑V3.2‑Exp’s sparse‑attention decoder, proposes the Expanded Sparse Server (ESS) to offload the latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach dramatically improves decode throughput while keeping latency within acceptable limits.

Cache offload · GPU Memory · LLM inference

0 likes · 20 min read

Data Party THU

Dec 10, 2025 · Artificial Intelligence

How DeepSeek‑V3.2 Cuts Inference Cost and Boosts Agent Skills with Sparse Attention

DeepSeek's V3.2 release introduces a dual‑model lineup, a Sparse Attention architecture that halves long‑context inference cost, a post‑training reinforcement‑learning pipeline that exceeds 10% of pre‑training compute, and a revamped agent framework that dramatically improves tool‑use and reasoning performance across benchmarks.

DeepSeek · Large Language Model · Sparse Attention

0 likes · 11 min read

Fun with Large Models

Dec 5, 2025 · Artificial Intelligence

DeepSeek Math V2 & V3.2: A Plain‑Language Deep Dive into Core Innovations

This article provides a detailed, easy‑to‑understand analysis of DeepSeek‑Math‑V2’s self‑verification training method and DeepSeek‑V3.2’s GRPO framework, sparse‑attention DSA mechanism, massive agent data pipeline, and benchmark results that place both models among the world’s top open‑source large language models.

DeepSeek · GRPO · LLM

0 likes · 19 min read

Baidu Intelligent Cloud Tech Hub

Dec 4, 2025 · Artificial Intelligence

How Offloading Latent Cache to CPU Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

This report details the analysis of memory bottlenecks in DeepSeek‑V3.2‑Exp, proposes the Expanded Sparse Server (ESS) that offloads latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach, combined with cache‑warmup and overlap techniques, can double decoding throughput for long‑context inference.

Cache offload · GPU‑CPU optimization · LLM inference

0 likes · 21 min read

Baidu Intelligent Cloud Tech Hub

Nov 25, 2025 · Artificial Intelligence

Why DeepSeek‑V3.2‑Exp Lost Performance and How a Simple RoPE Fix Restored It

The Baidu Baige team discovered that DeepSeek‑V3.2‑Exp’s long‑context performance lagged behind the official report, traced the issue to a subtle RoPE layout mismatch in the open‑source inference demo, collaborated with DeepSeek to fix it, and verified that the model’s speed and accuracy fully recovered across multiple benchmarks.

AI infrastructure · DeepSeek · LLM inference

0 likes · 9 min read

Huawei Cloud Developer Alliance

Oct 31, 2025 · Artificial Intelligence

Beyond Transformers: Exploring Post‑Transformer Architectures for Long‑Sequence Modeling

This article reviews the emerging post‑Transformer research landscape, covering linear state‑space models, efficient attention approximations, MLP/conv/RNN hybrids, sparse and causal attention mechanisms, and outlines future trends that may complement or replace the classic Transformer architecture for handling ultra‑long sequences.

AI · Efficient Attention · Hybrid Architecture

0 likes · 17 min read

Data Party THU

Oct 25, 2025 · Artificial Intelligence

How InfLLM‑V2 Delivers Fast, Low‑Cost Sparse Attention for Long‑Context LLMs

InfLLM‑V2 introduces a zero‑parameter, train‑efficient sparse‑attention framework that dramatically speeds up long‑sequence processing while requiring only 5 B tokens for training, and the open‑source MiniCPM4.1 model demonstrates comparable performance to dense attention on both long‑text understanding and deep‑thinking benchmarks.

Efficiency · InfLLM-V2 · MiniCPM4.1

0 likes · 10 min read

BirdNest Tech Talk

Oct 14, 2025 · Artificial Intelligence

How DeepSeek’s Lightning Indexer Enables Efficient Sparse Attention for Long Texts

The article explains how DeepSeek’s Lightning Indexer acts as a memory‑filtering expert that computes index scores, selects the top‑k relevant tokens, and maps a compact formula to FP8 kernel code, cutting the number of tokens each query attends to from 128K to 2,048 on very long sequences.

DeepSeek · FP8 · Lightning Indexer

0 likes · 7 min read

Fun with Large Models

Sep 30, 2025 · Artificial Intelligence

DeepSeek-V3.2 Architecture Breakthrough: A 5‑Minute Guide to Its Core Features

The article introduces DeepSeek-V3.2, highlighting its new DeepSeek Sparse Attention (DSA) that boosts training and inference efficiency by up to 50%, cuts model usage costs dramatically, explains the updated API endpoints, and details the four‑stage post‑training pipeline that underpins the model’s performance improvements.

AI Architecture · DSA · DeepSeek-V3.2

0 likes · 8 min read

DataFunTalk

Sep 30, 2025 · Artificial Intelligence

DeepSeek‑V3.2‑Exp Unveiled: Million‑Token Context, Sparse Attention, and Cost‑Effective Inference

DeepSeek‑V3.2‑Exp, the latest experimental large‑language model, is open‑sourced with a paper, featuring a million‑token context window, a new sparse attention mechanism, GRPO‑enhanced reasoning, and detailed cost‑analysis showing up to ten‑fold inference savings.

DeepSeek · GRPO · Inference Optimization

0 likes · 5 min read

AI Frontier Lectures

Jul 19, 2025 · Artificial Intelligence

How Researchers Made Large Language Models Forget or Amplify Specific Concepts

A new study from Meta and NYU reveals a two‑step technique—SAMD to locate concept‑specific attention heads and SAMI to scale their influence—enabling precise, low‑cost editing of transformer models for tasks ranging from factual recall to safety control.

AI safety · Sparse Attention · concept control

0 likes · 11 min read

AI Algorithm Path

May 1, 2025 · Artificial Intelligence

Uncovering the Secrets of LLM Inference Optimization

This article dissects the major bottlenecks of large‑language‑model serving—prefill vs. decode, sparsity, memory bandwidth, KV‑cache growth—and walks through concrete engineering tricks such as paged attention, radix‑tree KV caches, compressed attention, speculative decoding, FlexGen weight scheduling, FastServe queuing, plus a runnable vLLM code snippet.

FastServe · FlexGen · Inference Optimization

0 likes · 18 min read

AIWalker

Feb 25, 2025 · Artificial Intelligence

Sliding Tile Attention speeds up HunyuanVideo DiT generation 3.5×

Sliding Tile Attention (STA) replaces costly full‑3D attention in video DiT models with a block‑wise sliding‑window scheme, achieving up to 10× attention speedup and a 3.53× end‑to‑end generation boost for HunyuanVideo without quality loss, as demonstrated by extensive benchmarks and kernel analyses.

GPU optimization · HunyuanVideo · Sliding Tile Attention

0 likes · 16 min read

Architect

Feb 24, 2025 · Artificial Intelligence

Inside MoBA: A Sparse Attention Framework for 10‑Million‑Token Contexts

The article details the development, architectural evolution, and practical challenges of MoBA—a sparse attention framework inspired by Mixture‑of‑Experts that scales LLM context length to 10 M tokens, supports seamless switching between full and sparse attention, and is now released as a minimal open‑source solution.

AI Architecture · Context Parallel · LLM training

0 likes · 13 min read

Architecture Digest

Feb 24, 2025 · Artificial Intelligence

MoBA: Mixture of Block Attention for Long‑Context Large Language Models

The article introduces MoBA, a Mixture‑of‑Block‑Attention mechanism that applies Mixture‑of‑Experts principles to transformer attention, enabling efficient long‑context processing for large language models while maintaining performance comparable to full attention through sparse, trainable block selection and seamless switching.

LLM · Mixture of Experts · MoBA

0 likes · 12 min read

Architects' Tech Alliance

Feb 24, 2025 · Artificial Intelligence

NSA: Hardware‑Optimized Sparse Attention Mechanism from DeepSeek, Peking University and University of Washington

The NSA mechanism introduces a three‑branch hardware‑optimized sparse attention architecture—token compression, token selection, and sliding window—combined with learnable gating to balance global and local context, dramatically improving inference speed and efficiency for long‑context large language models.

AI Architecture · DeepSeek · Sparse Attention

0 likes · 5 min read

ZhongAn Tech Team

Feb 22, 2025 · Artificial Intelligence

How SkyReels, DeepSeek NSA, Grok‑3, and KG²RAG Are Shaping the Next AI Wave

This issue reviews China's first open‑source short‑film model SkyReels‑V1, DeepSeek's Native Sparse Attention breakthrough, xAI's massive Grok‑3 deployment on 200k H100 GPUs, and a knowledge‑graph‑guided RAG framework, highlighting their performance gains, architectural innovations, and industry impact.

AI · Large Models · RAG

0 likes · 15 min read

AIWalker

Feb 19, 2025 · Artificial Intelligence

DeepSeek’s NSA Attention Cuts Inference Time 11× – CEO Liang Co‑author

DeepSeek introduces the NSA sparse attention mechanism, combining dynamic hierarchical sparsity, coarse token compression and fine token selection to achieve up to 11.6× faster inference, lower pre‑training cost, and superior benchmark performance across general, long‑context, and chain‑of‑thought tasks.

DeepSeek · LLM optimization · NSA

0 likes · 9 min read

IT Architects Alliance

Feb 8, 2025 · Artificial Intelligence

Inside DeepSeek: How Its Innovative Architecture Redefines AI Performance

This article examines DeepSeek's advanced Transformer‑based architecture, dynamic routing, MoE system, multi‑stage training, efficient inference, multimodal capabilities, real‑world applications, technical challenges, and future prospects, providing a comprehensive technical analysis of the model's strengths and limitations.

AI Architecture · DeepSeek · Large Language Model

0 likes · 15 min read

Alibaba Cloud Big Data AI Platform

Jul 11, 2022 · Artificial Intelligence

How Structure-Aware Sparse Attention Boosts Long-Code Transformers

The SASA model, a structure‑aware sparse‑attention Transformer developed by Alibaba Cloud PAI and Prof. Gao Ming’s team, improves long‑code sequence processing by sparsifying self‑attention using top‑k frequency and AST pattern matrices, achieving higher performance and lower memory/computation costs on CodeXGLUE benchmarks.

AST · Code Understanding · Long Sequences

0 likes · 8 min read

Baobao Algorithm Notes

Jan 14, 2022 · Artificial Intelligence

BERT Interview Q&A: Decoding CLS, Masks, Complexity, and More

An in‑depth Q&A breaks down core BERT concepts—from the purpose of the [CLS] token and masking strategies to self‑attention complexity, sparse attention tricks, subword handling of OOV words, warm‑up learning rates, GPT’s unidirectional nature, and ALBERT’s parameter sharing—providing concise explanations for each.

BERT · Masking · Self-Attention

0 likes · 7 min read
