Tagged articles
5 articles
Lao Guo's Learning Space
May 12, 2026 · Artificial Intelligence

Demystifying the Core Technologies Behind ChatGPT, GPT‑4, and DeepSeek

This article breaks down the key techniques that power large language models (Transformer, Mixture‑of‑Experts, Flash Attention, KV‑Cache, Multi‑Token Prediction, quantization, Chain‑of‑Thought, and Retrieval‑Augmented Generation) and explains how each contributes to the performance of ChatGPT, GPT‑4, and DeepSeek.

Flash Attention · KV Cache · Mixture of Experts
0 likes · 10 min read
Geek Labs
Apr 20, 2026 · Artificial Intelligence

A Complete Open‑Source Guide to LLM Internals: From Tokenization to Inference Optimization

This open‑source tutorial breaks down large language model internals into 11 detailed topics—covering BPE tokenization, attention mathematics, backpropagation, transformer architecture, KV‑Cache, Paged and Flash Attention, and frontier techniques—each with numeric derivations and Python code, making it ideal for developers and interview preparation.

Flash Attention · Inference Optimization · KV Cache
0 likes · 5 min read
NewBeeNLP
Oct 16, 2024 · Artificial Intelligence

Unlocking Long-Sequence LLMs: Position Embeddings, Scaling, and Efficient Attention

This article reviews recent advances in training and inference for long‑sequence large language models, comparing ALiBi and RoPE position embeddings, exploring RoPE scaling techniques, analyzing attention optimizations, and outlining practical data, evaluation, and system frameworks for scalable LLM deployment.

Flash Attention · LLM · RoPE
0 likes · 14 min read
Baobao Algorithm Notes
Oct 19, 2023 · Artificial Intelligence

Efficient LLM Deployment: Low‑Precision, Flash Attention, and Architecture Tricks

This article reviews the main memory and compute challenges of deploying large language models and presents practical solutions—including low‑precision arithmetic, flash attention, advanced positional embeddings, key‑value caching, and quantization techniques—backed by code examples and performance measurements on models such as OctoCoder.

Flash Attention · LLM · Transformers
0 likes · 35 min read