Tagged articles

130 articles

Page 2 of 2

Feb 8, 2025 · Artificial Intelligence

Why DeepSeek V3 and R1 Are Redefining Low‑Cost AI: Architecture, Training Tricks, and Industry Impact

This article analyses DeepSeek's V3 and R1 models, explaining how their innovative MoE architecture, Multi‑Head Latent Attention, low‑cost training strategies, and distributed‑training optimizations deliver high‑performance large language models while reducing GPU/NPU demand and sparking industry excitement.

AI inference · DeepSeek · Large Language Models

0 likes · 16 min read

Alibaba Cloud Developer

Feb 7, 2025 · Artificial Intelligence

Why DeepSeek V3 Achieves Low Training Costs: Inside Its AI Innovations

This article provides a comprehensive analysis of DeepSeek's large‑language‑model technology, covering the company's background, model capabilities, remarkably low training and inference costs, and the core architectural and algorithmic innovations such as MoE, MLA attention, FP8 mixed‑precision, and the DualPipe pipeline that enable efficient large‑scale AI deployment.

AI Architecture · DeepSeek · FP8 training

0 likes · 19 min read

Tencent Cloud Developer

Feb 6, 2025 · Artificial Intelligence

DeepSeek V Series: Technical Overview of Scaling Laws, Grouped Query Attention, and Mixture‑of‑Experts

The article reviews DeepSeek’s V‑series papers, explaining how scaling‑law insights, Grouped Query Attention, a depth‑first design, auxiliary‑loss‑free load balancing, multi‑token prediction, and Multi‑Head Latent Attention together enable economical mixture‑of‑experts LLMs that rival closed‑source models while cutting compute and hardware costs.

DeepSeek · Grouped Query Attention · Large Language Models

0 likes · 13 min read

AI2ML AI to Machine Learning

Feb 5, 2025 · Artificial Intelligence

What Optimizations Power DeepSeek’s High‑Efficiency LLMs?

The article enumerates DeepSeek’s technical optimizations, including Grouped Query Attention, Multi‑Head Latent Attention, Mixture‑of‑Experts, 4D parallelism, quantization, and multi‑token prediction, which together enable low‑cost, high‑performance large language models.

4D parallelism · DeepSeek · Grouped Query Attention

0 likes · 8 min read

Baobao Algorithm Notes

Jan 3, 2025 · Artificial Intelligence

How DeepSeek-V3 Achieves Massive Scale with FP8, MoE, and System Optimizations

The article examines DeepSeek‑V3’s architecture and training pipeline, highlighting its use of MLA and a highly granular MoE design, FP8 mixed‑precision training, fine‑grained per‑tile quantization, advanced parallelism strategies, and inference optimizations such as prefill‑decode (PD) separation and NanoFlow that improve efficiency on limited GPU resources.

DeepSeek-V3 · FP8 · Inference Optimization

0 likes · 10 min read

Tencent Cloud Developer

Nov 6, 2024 · Artificial Intelligence

Overview of Tencent Hunyuan Large and 3D Generation Model Open‑Source Release

Tencent has open‑sourced its 389‑billion‑parameter Hunyuan Large Mixture‑of‑Experts model—featuring 52 B active parameters, 256 K token context, novel routing, KV‑cache compression, and advanced training optimizations that beat leading open‑source models—and its first text‑to‑3D/image‑to‑3D Hunyuan 3D Generation model, both downloadable via GitHub, Hugging Face, and Tencent Cloud.

3D generation · AI research · Large Language Model

0 likes · 9 min read

NewBeeNLP

Oct 21, 2024 · Artificial Intelligence

Why Do MoE Experts Collapse? An In‑Depth Look at HOME’s Multi‑Task Architecture

This article analyzes the polarization issues in industrial Mixture‑of‑Experts (MoE) frameworks, explains expert collapse, degradation, and under‑fitting, and details the HOME model’s input types, architectural innovations, normalization, gating mechanisms, and related DICE‑BN insights.

Expert Normalization · Gating Mechanisms · Mixture of Experts

0 likes · 10 min read

Baobao Algorithm Notes

Sep 9, 2024 · Artificial Intelligence

How MoSLoRA Reinvents Low‑Rank Adaptation with Mixer Matrices

This article analyzes the Mixture‑of‑Subspaces in Low‑Rank Adaptation (MoSLoRA) paper, explaining its motivation, design choices that replace LoRA's gate with a mixer matrix, connections to multi‑head attention, experimental findings on LLaMA‑3 fine‑tuning, and theoretical proofs of its re‑parameterization properties.

AI · LoRA · Mixture of Experts

0 likes · 12 min read

Baobao Algorithm Notes

Jul 31, 2024 · Artificial Intelligence

What Makes Mistral’s 7B, Mixtral, and Large 2 Models Stand Out? A Deep Technical Dive

This article compiles key technical details of the Mistral model family—including Mistral 7B, Mixtral 8×7B, Mixtral 8×22B, Mistral Nemo, and Mistral Large 2—covering their architectural innovations such as sliding‑window attention, grouped‑query attention, mixture‑of‑experts design, scaling parameters, performance benchmarks, quantization requirements, and practical deployment commands.

Grouped Query Attention · Large Language Model · Mistral

0 likes · 17 min read

360 Smart Cloud

Jul 4, 2024 · Artificial Intelligence

Optimizing Mixture-of-Experts (MoE) Training with the QLM Framework

This article introduces the background and challenges of large language model training, explains the Mixture-of-Experts (MoE) architecture, and details several optimization techniques implemented in the QLM framework, including fine-grained and shared experts, top‑k gating, token distribution, expert parallelism, and grouped GEMM, that improve training efficiency and performance.

AI · Large Language Models · Mixture of Experts

0 likes · 10 min read

NewBeeNLP

Jun 7, 2024 · Artificial Intelligence

Scaling Laws, Synthetic Data, and New Model Architectures: What’s Next?

In a recent round‑table, experts debated the validity of scaling laws, the role of synthetic and semi‑synthetic data in overcoming data scarcity, explored alternatives to Transformers such as RNN‑based models and MoE, and examined techniques for handling long‑context inference efficiently.

Mixture of Experts · model architecture · scaling laws

0 likes · 12 min read

Baobao Algorithm Notes

May 31, 2024 · Industry Insights

Do Scaling Laws Still Hold? Deep Dive into Synthetic Data, New Model Architectures, and Long‑Context Solutions

In a May 15 round‑table, experts debated the validity of scaling laws, the role of synthetic and semi‑synthetic data in overcoming data bottlenecks, explored alternatives to the Transformer such as RNN‑based and hybrid designs, evaluated the practicality of Mixture‑of‑Experts models, and examined two main strategies—KV‑cache compression and input‑context reduction—to enable truly long‑context processing.

Mixture of Experts · long-context

0 likes · 13 min read

Kuaishou Tech

May 27, 2024 · Artificial Intelligence

What Kuaishou’s Four ACL Papers Reveal About the Future of Large Language Models

The 62nd ACL conference accepted four papers from Kuaishou that explore multi‑turn instruction following, self‑agreement reasoning, fine‑grained reinforcement learning, and dynamic routing in Mixture‑of‑Experts models, each with detailed methods, experimental results, author lists, and public arXiv links.

ACL 2024 · Kuaishou Research · Large Language Models

0 likes · 11 min read

DeWu Technology

May 15, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference: Techniques and Framework Recommendations

Deploying a dedicated inference cluster and applying four key optimizations (FlashAttention‑based attention computation, PagedAttention KV‑cache management, Mixture‑of‑Experts parameter reduction, and tensor parallelism) can accelerate large language model inference by up to 50% for models as large as 70 B parameters while cutting deployment costs.

FlashAttention · Inference Acceleration · Large Language Models

0 likes · 17 min read

Baobao Algorithm Notes

May 6, 2024 · Artificial Intelligence

DeepSeek-V2: 236B MoE LLM Delivers Higher Performance While Cutting Training Cost by 42%

DeepSeek‑V2 is a 236‑billion‑parameter mixture‑of‑experts language model that reduces training cost by 42.5%, cuts KV‑cache usage by 93.3%, and boosts generation throughput 5.76×, while achieving state‑of‑the‑art scores on benchmarks such as MMLU, C‑Eval, BBH, HumanEval, and GSM8K for both base and chat variants.

AI · DeepSeek-V2 · Large Language Model

0 likes · 11 min read

NewBeeNLP

Apr 2, 2024 · Artificial Intelligence

Jamba: How AI21 Labs Merged Mamba and Transformer for 3× Faster 128k Contexts

Jamba, a hybrid Mamba‑Transformer model from AI21 Labs, combines state‑space and attention layers with Mixture‑of‑Experts to deliver up to three times the throughput of comparable 52‑billion‑parameter LLMs on 128k context windows while maintaining high output quality and low memory usage.

Jamba · LLM · Mamba

0 likes · 6 min read

21CTO

Mar 29, 2024 · Artificial Intelligence

Why Databricks’ Open‑Source DBRX LLM Is Outpacing GPT‑3.5 and Llama 2

Databricks unveiled the open‑source DBRX large language model, which leverages a mixture‑of‑experts architecture to deliver faster, more cost‑effective inference and beats leading open‑source and proprietary models like Llama 2, Mixtral‑8x7B, and GPT‑3.5 on multiple benchmarks.

AI · DBRX · Databricks

0 likes · 7 min read

Rare Earth Juejin Tech Community

Mar 20, 2024 · Artificial Intelligence

Elon Musk’s xAI Open‑Sources Grok‑1: A 314‑Billion‑Parameter MoE Large Language Model

Elon Musk’s xAI has open‑sourced Grok‑1, a 314‑billion‑parameter mixture‑of‑experts language model built with Rust and JAX, released under an Apache‑2.0 license, and the announcement includes detailed architecture specs, hardware requirements, and the broader context of Musk’s rivalry with OpenAI.

AI · Grok-1 · Large Language Model

0 likes · 6 min read

DataFunTalk

Mar 14, 2024 · Artificial Intelligence

Efficiency Challenges and Multi‑Layer Optimization for Large AI Models

The article examines how large AI models are moving toward a unified paradigm that reduces task‑algorithm coupling, outlines multi‑layer efficiency challenges—from model compression and sparsity to software and infrastructure optimization—and highlights NVIDIA’s GTC 2024 China AI Day sessions showcasing the latest LLM technologies and registration details.

AI efficiency · Mixture of Experts · NVIDIA GTC

0 likes · 13 min read

Baobao Algorithm Notes

Mar 10, 2024 · Artificial Intelligence

Unlocking Large Model Power: 5 Effective Model Fusion Techniques Explained

This article examines why ensemble methods are crucial for large language models, outlines five core fusion strategies—including model integration, probability integration, graft learning, crowdsourced voting, and Mixture of Experts—provides implementation details, pseudo‑code, and discusses practical challenges and recent research advances.

AI research · Mixture of Experts · Model Fusion

0 likes · 16 min read

Alibaba Cloud Big Data AI Platform

Jan 29, 2024 · Artificial Intelligence

Unlocking Sparse MoE Large Model Training with Megatron-Core on Alibaba Cloud

This article explains how Alibaba Cloud's PAI platform and NVIDIA's Megatron-Core enable efficient training of sparse Mixture-of-Experts (MoE) large language models, covering algorithm basics, the Megatron-Core MoE framework, weight conversion pipelines, and performance results on Mixtral‑8x7B.

Large Language Models · Megatron-Core · Mixture of Experts

0 likes · 18 min read

Baobao Algorithm Notes

Jan 2, 2024 · Artificial Intelligence

Uncovering Mixtral‑8x7B: How MoE Experts Shape Performance and Training

This article analyses the Mixtral‑8x7B Mixture‑of‑Experts LLM, explains its gate‑driven 8‑expert architecture, presents a simplified PyTorch implementation, and reports a series of experiments that probe top‑2 gating during training, individual expert contributions, task‑specific pre‑training, the impact of expert count, and similarity with Mistral‑7B, ultimately offering hypotheses about its training pipeline.

LLM · Mixtral · Mixture of Experts

0 likes · 14 min read

Huawei Cloud Developer Alliance

Nov 3, 2023 · Artificial Intelligence

Can LLMs Master Lifelong Learning? Exploring MoE and Continuous Adaptation

This article explains how large language models can achieve continual lifelong learning, outlines the key properties required, reviews mixture‑of‑experts (MoE) techniques—including sparse MoE, GShard, Switch Transformer, GLaM and PanGu‑Sigma—and discusses the remaining challenges such as model complexity, expert balancing and distributed communication overhead.

Artificial Intelligence · LLM · Lifelong Learning

0 likes · 9 min read

Tencent Advertising Technology

Mar 2, 2023 · Artificial Intelligence

Tencent's HunYuan‑NLP 1T Large‑Scale AI Model: Training Techniques, Optimization, and Real‑World Applications

This article details Tencent's development of the 1‑trillion‑parameter HunYuan‑NLP model, covering its MoE architecture, cost‑effective pre‑training strategies, distributed training framework, model compression toolkit, and successful deployment across advertising, gaming, and other Tencent services.

AI infrastructure · Large Language Model · Mixture of Experts

0 likes · 17 min read

Meituan Technology Team

Dec 8, 2022 · Artificial Intelligence

Contextualized Recommendation in Meituan Takeaway: Segmented & Unified Modeling, Long‑Sequence Retrieval, and Multi‑Expert Networks

Meituan Takeaway’s recommendation system partitions user contexts such as time, location, entry page, and business type, then uses a unified model with long‑sequence retrieval and a multi‑expert Mixture‑of‑Experts network to deliver context‑aware food‑delivery suggestions, achieving notable CTR and conversion gains while maintaining low latency.

Meituan · Mixture of Experts · contextual modeling

0 likes · 32 min read

IEG Growth Platform Technology Team

Nov 28, 2022 · Artificial Intelligence

Bidden-MarfNet: Feature Missing-aware Routing-and-Fusion Network for Customer Lifetime Value Prediction

This paper presents Bidden-MarfNet, a novel architecture that explicitly encodes feature‑missing information and dynamically re‑weights samples to address feature missingness and label sparsity in user‑level LTV prediction for advertising, demonstrating superior performance over existing methods through extensive experiments.

LTV prediction · Mixture of Experts · dynamic weighting

0 likes · 13 min read

DataFunSummit

Apr 19, 2022 · Artificial Intelligence

DeepSpeed‑MoE: End‑to‑End Training and Inference Solutions for Mixture‑of‑Experts Models

This article reviews DeepSpeed‑MoE, an end‑to‑end system that introduces new MoE architectures, model‑compression techniques, and highly optimized inference pipelines, detailing its motivation, design of PR‑MoE (Pyramid‑MoE and Residual‑MoE), distributed parallel strategies, communication and kernel optimizations, and performance gains over dense baselines.

AI · DeepSpeed · Inference Optimization

0 likes · 11 min read

DataFunSummit

Aug 16, 2021 · Artificial Intelligence

Scaling Deep Learning Models: From Depth to Width and Parallelism Strategies

The article reviews how deep learning models have grown deeper and wider, discusses the memory and bandwidth limits of single GPUs, and explains pipeline and sharding techniques—including GPU clusters and TPU pods—to efficiently train large‑scale models in industrial settings.

GPU · Mixture of Experts · Model Parallelism

0 likes · 6 min read

DataFunTalk

Aug 7, 2021 · Artificial Intelligence

Multi-Category Mixture-of-Experts Model for JD Search Ranking

This article presents a multi‑category Mixture‑of‑Experts (MoE) approach for e‑commerce search ranking, addressing category‑specific behavior and small‑category learning by introducing hierarchical soft constraints and adversarial regularization, and demonstrates significant AUC and NDCG gains on Amazon and JD in‑house datasets.

Adversarial Regularization · Hierarchical Soft Constraint · Mixture of Experts

0 likes · 10 min read

ITPUB

Jun 25, 2021 · Artificial Intelligence

How Alibaba’s Low‑Carbon M6 Model Trains a Trillion‑Parameter AI with 80% Less Energy

Alibaba’s DAMO Academy unveiled the low‑carbon M6 multimodal model, a trillion‑parameter AI trained on just 480 V100 GPUs, achieving over 80% energy reduction and 11‑fold speedup compared to prior trillion‑parameter efforts, and already powering e‑commerce and manufacturing design tools.

GPU efficiency · M6 · Mixture of Experts

0 likes · 5 min read
