Can CODA Enable LLMs and Beginners to Write Lightning‑Fast Transformer Kernels?

CODA rewrites Transformer blocks as GEMM-epilogue programs built from five primitive building blocks. With these primitives, both human programmers and LLM-based code generators can fuse memory-intensive operations such as RMSNorm, SwiGLU, and RoPE into the GEMM epilogue. This eliminates costly tensor round trips through global memory and yields speed-ups of up to 1.8× on H100 GPUs while preserving numerical accuracy.

Machine Heart

On May 22, Tri Dao retweeted a post by Han Guo stating that, after a mathematical reformulation, all Transformer operations can be expressed as a sequence of GEMMs, each followed by an epilogue, which enables LLMs and novices to write lightning-fast kernels.

The CODA paper ("CODA: Rewriting Transformer Blocks as GEMM‑Epilogue Programs", arXiv:2605.19269) introduces a programming abstraction that systematically absorbs the many small, memory‑intensive ops (RMSNorm, SwiGLU, RoPE, residual adds, etc.) into the GEMM epilogue, eliminating repeated writes to global memory.

Background: when training a 1B-parameter LLaMA-style model on an NVIDIA H100, GEMM and attention dominate compute, but a suite of "tiny ops" repeatedly moves intermediate tensors between registers and DRAM. The result is a memory-bandwidth bottleneck that worsens as lower-precision formats (FP8, FP4) accelerate the GEMM itself.

CODA’s insight is that many of these ops can be algebraically re-parameterized and executed during the short window when GEMM results still reside in on-chip registers. Take the common GEMM-RMSNorm-GEMM pattern. RMSNorm multiplies each row of its input by a single per-row scaling factor r, and scaling a row before multiplying it by a weight matrix gives the same result as multiplying first and scaling the output row afterwards. The factor r therefore commutes with the second GEMM and can be applied in that GEMM’s epilogue. This removes the standalone RMSNorm pass and replaces it with a lightweight partial-RMS reduction.
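The commutation argument can be checked numerically. The following is a minimal pure-Python sketch, not the paper's kernel: the helper names (`matmul`, `rmsnorm_then_gemm`, `gemm_with_scaling_epilogue`), the block size, and the toy matrices are all illustrative assumptions, and the learnable RMSNorm weight is left out to keep the example short. It compares normalizing before the GEMM against scaling the GEMM output rows afterwards, with the sum of squares assembled from per-block partial sums.

```python
import math

def matmul(a, b):
    """Plain triple-loop matrix multiply on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def rmsnorm_then_gemm(h, w, eps=1e-6):
    """Reference path: normalize each row of h, then multiply by w."""
    normed = []
    for row in h:
        r = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        normed.append([v * r for v in row])
    return matmul(normed, w)

def gemm_with_scaling_epilogue(h, w, block=2, eps=1e-6):
    """Fused path: multiply the raw h by w, then scale each output row by r."""
    d = len(h[0])
    # Partial sums of squares per column block (hypothetical tiling), standing
    # in for what a producer GEMM's epilogue could emit tile by tile.
    partial = [[sum(v * v for v in row[c:c + block]) for c in range(0, d, block)]
               for row in h]
    r = [1.0 / math.sqrt(sum(p) / d + eps) for p in partial]
    out = matmul(h, w)  # main loop runs on the un-normalized activations
    # "Epilogue": apply the per-row factor to the accumulated result.
    return [[r[i] * v for v in out_row] for i, out_row in enumerate(out)]

# Toy data, chosen only for illustration.
h = [[1.0, -2.0, 3.0, 0.5], [0.25, 4.0, -1.0, 2.0]]
w = [[0.1, 0.2], [-0.3, 0.4], [0.5, -0.6], [0.7, 0.8]]
ref = rmsnorm_then_gemm(h, w)
fused = gemm_with_scaling_epilogue(h, w)
max_diff = max(abs(a - b) for ra, rb in zip(ref, fused) for a, b in zip(ra, rb))
```

The two paths agree up to floating-point rounding, which is the property that lets the normalization move out of its own memory pass and into an epilogue.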

Similar re-parameterizations apply to SwiGLU, RoPE, cross-entropy loss, and even backward passes. The paper proves that if the forward epilogue is "block-local", the backward pass automatically inherits the same structure.
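To make the idea of a block-local elementwise epilogue concrete, here is a hedged pure-Python sketch for SwiGLU. The article does not spell out how the paper handles SwiGLU, so the column interleaving used here is an assumption: it is one common way to place each gate value next to its matching up-projection value in the same output tile. The function names and toy data are hypothetical.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu_reference(x, w_gate, w_up):
    """Unfused path: two GEMMs, then a separate elementwise pass."""
    d, f = len(w_gate), len(w_gate[0])
    out = []
    for row in x:
        gate = [sum(row[k] * w_gate[k][j] for k in range(d)) for j in range(f)]
        up = [sum(row[k] * w_up[k][j] for k in range(d)) for j in range(f)]
        out.append([silu(g) * u for g, u in zip(gate, up)])
    return out

def swiglu_fused(x, w_gate, w_up):
    """One GEMM on column-interleaved weights; the epilogue pairs adjacent columns."""
    d, f = len(w_gate), len(w_gate[0])
    # Assumed layout: columns 2j and 2j+1 hold gate_j and up_j.
    w = [[w_gate[k][j // 2] if j % 2 == 0 else w_up[k][j // 2]
          for j in range(2 * f)] for k in range(d)]
    out = []
    for row in x:
        # Accumulator for one output row, as it would sit in registers.
        acc = [sum(row[k] * w[k][j] for k in range(d)) for j in range(2 * f)]
        # "Epilogue": combine each gate/up pair before anything is written out.
        out.append([silu(acc[2 * j]) * acc[2 * j + 1] for j in range(f)])
    return out

# Toy data, chosen only for illustration.
x = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
w_gate = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6]]
w_up = [[0.1, 0.7], [-0.2, 0.5], [0.3, -0.4]]
ref = swiglu_reference(x, w_gate, w_up)
fused = swiglu_fused(x, w_gate, w_up)
max_diff = max(abs(a - b) for ra, rb in zip(ref, fused) for a, b in zip(ra, rb))
```

In this layout the fused path writes only the final activated values, so the separate gate and up outputs never need to reach global memory.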

CODA defines five composable primitive types that can be placed in the epilogue:

Elementwise transforms (residual addition, activation, RoPE)

Vector load/store (broadcast RMSNorm weights)

Matrix block load/store (preserve activations for backward)

Block reductions (partial RMS, block log‑sum‑exp)

Stateful transforms (online max and sum‑exp statistics)

Using these primitives, virtually all non‑attention operations in a standard Transformer layer can be expressed without leaving the GEMM epilogue.
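Two of the primitives, block reductions and stateful transforms, can be combined to compute a log-sum-exp without ever holding the full row of logits. The sketch below uses the standard online max and sum-exp recurrence in plain Python. The function names, the block size, and the sample logits are assumptions made for illustration and are not taken from the paper's code.

```python
import math

def block_stats(block):
    """Block reduction: local max and sum of exp(x - max) for one tile of logits."""
    m = max(block)
    return m, sum(math.exp(x - m) for x in block)

def merge(state, stats):
    """Stateful transform: fold one block's statistics into the running (max, sum)."""
    m_run, s_run = state
    m_blk, s_blk = stats
    m_new = max(m_run, m_blk)
    # Rescale both sums to the new max before adding them.
    s_new = s_run * math.exp(m_run - m_new) + s_blk * math.exp(m_blk - m_new)
    return m_new, s_new

def online_logsumexp(logits, block=4):
    """Visit the logits one block at a time, keeping only two scalars of state."""
    state = (float("-inf"), 0.0)
    for start in range(0, len(logits), block):
        state = merge(state, block_stats(logits[start:start + block]))
    m, s = state
    return m + math.log(s)

# Toy logits, chosen only for illustration.
logits = [2.0, -1.0, 0.5, 3.0, 1.5, -2.5, 0.0, 4.0, 1.0]
ref = math.log(sum(math.exp(x) for x in logits))
got = online_logsumexp(logits, block=4)
```

Because each block contributes only a max and a rescaled sum, the running state stays tiny, which is what makes this kind of statistic a plausible fit for an epilogue.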

The authors evaluated two implementation pathways: (1) human-written kernels following the CODA abstraction, and (2) kernels generated by Claude Code (an LLM-based coding agent) given the primitive specifications, example code, and an implementation log. Both pathways achieved performance comparable to or exceeding hand-tuned baselines.

Benchmarks compare CODA kernels against cuBLAS + torch.compile, Liger Kernel, and FlashInfer. For the GEMM-RMSNorm-GEMM pattern at hidden dimensions corresponding to 1B, 7B, and 70B models, CODA outperforms the cuBLAS + PyTorch baseline. Backward kernels see 1.6–1.8× speed-ups, while forward gains on full Transformer layers range from 5% to 20%, with larger models benefiting more.

Numerical accuracy remains on par with PyTorch reference implementations; in some configurations CODA’s accumulation in the GEMM main loop yields even lower error.

Limitations: CODA currently supports only single‑GPU training and is tailored to standard Transformer architectures; applicability to other model families is untested.

In conclusion, CODA demonstrates that the primary optimization opportunity in GPU‑accelerated Transformer training lies not in the compute‑heavy GEMM itself but in eliminating unnecessary data movement by fusing auxiliary ops into the GEMM epilogue, a strategy that can be automated by modern LLMs.
