Tag: MLA

Views collected around this technical thread.

IT Services Circle
Feb 27, 2025 · Artificial Intelligence

DeepSeek Announces FlashMLA: An Efficient Multi-Head Latent Attention Decoding Kernel for Hopper GPUs

DeepSeek’s OpenSourceWeek introduced FlashMLA, a GPU-optimized Multi-Head Latent Attention (MLA) decoding kernel for Hopper GPUs that draws on FlashAttention and CUTLASS to dramatically improve large-model inference performance; early adopters report up to 30% higher compute utilization and doubled speed in some scenarios.

DeepSeek · FlashMLA · GPU
3 min read
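
To make the summary concrete, here is a minimal NumPy sketch of the MLA decoding idea that a kernel like FlashMLA accelerates: the cache holds one small latent vector per past token, and keys/values are up-projected from it at attention time. All dimensions, weight names, and shapes below are illustrative assumptions, not DeepSeek's actual configuration or kernel code.

```python
import numpy as np

rng = np.random.default_rng(0)

d_latent, n_heads, d_head = 64, 8, 64
seq_len = 16  # tokens already decoded

# The entire KV cache under MLA: one d_latent vector per past token.
latent_cache = rng.standard_normal((seq_len, d_latent))

# Per-head up-projection matrices, shared across the whole cache.
W_uk = rng.standard_normal((n_heads, d_latent, d_head)) / np.sqrt(d_latent)
W_uv = rng.standard_normal((n_heads, d_latent, d_head)) / np.sqrt(d_latent)

# Query for the current decoding step, one vector per head.
q = rng.standard_normal((n_heads, d_head))

# Reconstruct keys/values from the latent cache: (heads, seq, d_head).
K = np.einsum("sl,hld->hsd", latent_cache, W_uk)
V = np.einsum("sl,hld->hsd", latent_cache, W_uv)

# Standard scaled dot-product attention over the reconstructed K/V.
scores = np.einsum("hd,hsd->hs", q, K) / np.sqrt(d_head)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = np.einsum("hs,hsd->hd", weights, V)

# Cache cost per token: d_latent floats, versus 2 * n_heads * d_head
# for conventional multi-head attention's full K and V.
full_kv = 2 * n_heads * d_head
print(f"cache per token: {d_latent} vs {full_kv} floats "
      f"({full_kv / d_latent:.0f}x smaller)")
```

The fused reconstruct-and-attend step is exactly what a hand-tuned Hopper kernel can do without ever materializing the full K/V tensors in memory.
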
IT Architects Alliance
Feb 26, 2025 · Artificial Intelligence

DeepSeek Large Model: Core Architecture, Key Technologies, and Training Strategies

The article provides an in‑depth overview of DeepSeek’s large language model, detailing its mixture‑of‑experts and Transformer foundations, novel attention mechanisms, load‑balancing, multi‑token prediction, FP8 mixed‑precision training, and various training regimes such as knowledge distillation and reinforcement learning.

DeepSeek · FP8 · Large Language Model
18 min read
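
The mixture-of-experts routing mentioned in the abstract can be sketched in a few lines: a gating network scores experts per token, only the top-k run, and their outputs are combined by renormalized gate weights. Expert count, k, and the linear "experts" below are toy assumptions for illustration, not DeepSeek's published setup.

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, top_k, d_model = 8, 2, 32
tokens = rng.standard_normal((4, d_model))          # small batch of tokens
W_gate = rng.standard_normal((d_model, n_experts))  # gating network
experts = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
           for _ in range(n_experts)]               # one linear map per expert

def moe_forward(x):
    logits = x @ W_gate                              # (tokens, experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # top-k expert indices
    out = np.zeros_like(x)
    for i, tok in enumerate(x):
        sel = top[i]
        gate = np.exp(logits[i, sel])
        gate /= gate.sum()                           # renormalize over top-k
        # Only the selected experts do any work for this token.
        out[i] = sum(g * (tok @ experts[e]) for g, e in zip(gate, sel))
    return out, top

out, chosen = moe_forward(tokens)
print("experts chosen per token:", chosen.tolist())
```

The load-balancing the article discusses exists precisely because nothing in this routing stops the gate from sending every token to the same few experts.
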
Data Thinking Notes
Feb 11, 2025 · Artificial Intelligence

Why DeepSeek V3 and R1 Are Redefining LLM Efficiency and Power

This article analyzes DeepSeek’s V3 and R1 large language models, detailing their low‑cost Mixture‑of‑Experts architecture, Multi‑Head Latent Attention redesign, distributed training optimizations, and reasoning‑focused innovations that together challenge traditional GPU/NPU compute demands.

AI inference · DeepSeek · Large Language Model
15 min read
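
A quick back-of-envelope calculation shows why the Multi-Head Latent Attention redesign reshapes inference compute demands: the KV cache dominates long-context decoding memory. The layer, head, and latent sizes below are illustrative assumptions, not V3's published configuration.

```python
# Hypothetical model dimensions (assumptions for illustration only).
n_layers, n_heads, d_head = 60, 128, 128
d_latent = 512                   # per-token latent cached under MLA
seq_len, bytes_per = 32_768, 2   # context length, fp16/bf16 bytes

# Conventional multi-head attention caches K and V for every head.
mha_bytes = n_layers * seq_len * 2 * n_heads * d_head * bytes_per
# MLA caches one compressed latent per token per layer.
mla_bytes = n_layers * seq_len * d_latent * bytes_per

gib = 1024 ** 3
print(f"MHA cache: {mha_bytes / gib:.1f} GiB")
print(f"MLA cache: {mla_bytes / gib:.1f} GiB "
      f"({mha_bytes // mla_bytes}x smaller)")
```

Under these assumptions the full-head cache would not even fit on a single accelerator, while the latent cache does comfortably.
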
Cognitive Technology Team
Feb 9, 2025 · Artificial Intelligence

A Beginner’s Guide to the History and Key Concepts of Deep Learning

From the perceptron’s inception in 1958 to modern Transformer-based models like GPT, this article traces the evolution of deep learning. It explains foundational architectures such as DNNs, CNNs, RNNs, and LSTMs, along with attention mechanisms and recent innovations like DeepSeek’s MLA, highlighting their principles and impact.

Attention · GPT · MLA
19 min read