QuanTaichi: A Compiler for Automatic Quantization of High‑Precision Simulations
QuanTaichi, built on the Taichi language, introduces custom numeric types, bit‑struct adapters, and compiler optimizations that dramatically reduce memory and bandwidth for particle‑based physics simulations, enabling high‑precision GPU rendering on a single card and even on mobile devices.
Advances in computer simulation now allow movies like Frozen to recreate realistic worlds, but achieving such visual fidelity requires ultra‑high‑precision physics that traditionally consumes massive GPU memory and expensive hardware.
Particle‑based representations, common in visual effects, can involve tens of millions of particles; simulating a 300‑meter dam breach may need dozens of gigabytes of VRAM, often requiring multiple high‑end GPUs such as four NVIDIA Quadro P6000 cards costing thousands of dollars each.
Researchers from Kuaishou, MIT, Zhejiang University, and Tsinghua have introduced QuanTaichi, a new language abstraction and compiler system for quantized simulation. By packing low‑precision numeric data tightly in memory, QuanTaichi cuts memory usage and bandwidth, allowing high‑precision physical simulation to run on a single GPU.
QuanTaichi builds on the Taichi programming language and compiler, originally created by Yuanming Hu et al. It lets developers switch easily between full‑precision and quantized simulators to find the best precision/performance trade‑off. The work was accepted at SIGGRAPH 2021 and released as open source on GitHub.
Paper: https://yuanming.taichi.graphics/publication/2021-quantaichi/quantaichi.pdf
Project page: https://yuanming.taichi.graphics/publication/2021-quantaichi/
GitHub: https://github.com/taichi-dev/quantaichi
The accompanying video demonstrates a quantized simulation of two rabbit‑shaped smoke clouds (400 million voxels) whose visual quality rivals full‑precision floating‑point results while using only half the storage.
The same quantization techniques have been applied to mobile devices, delivering up to a 40 % speed‑up for physics simulations on smartphones.
Overall, QuanTaichi improves development efficiency for GPU‑based physical simulation, large‑scale image processing, media codecs, scientific computing, and more, while enhancing storage efficiency across the Taichi ecosystem.
Technical Details – Quantized Numeric Types
QuanTaichi defines several custom numeric types:
Custom Int: user‑specified bit‑width integer, signed or unsigned.
Custom Float: user‑specified bit‑width floating point with three implementations:
- Fixed‑point: a custom integer plus a scaling factor.
- Floating‑point: separate user‑defined mantissa and exponent bits.
- Shared‑exponent: multiple values share a single exponent, exploiting the fact that in many physical datasets a few large values dominate.
[Figure: memory layouts of the fixed‑point, floating‑point, and shared‑exponent formats]
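As a rough illustration, the fixed‑point and shared‑exponent ideas can be sketched in plain Python. This is not the Taichi API; the function names, bit widths, and value ranges below are hypothetical examples.

```python
import math

def fixed_encode(x: float, bits: int, lo: float, hi: float) -> int:
    """Fixed-point: map x in [lo, hi] onto an unsigned integer of `bits` bits."""
    levels = (1 << bits) - 1
    t = min(max((x - lo) / (hi - lo), 0.0), 1.0)  # clamp into the representable range
    return round(t * levels)

def fixed_decode(q: int, bits: int, lo: float, hi: float) -> float:
    levels = (1 << bits) - 1
    return lo + (q / levels) * (hi - lo)

def shared_exp_encode(vals, mant_bits):
    """Shared-exponent: one exponent for the whole vector, one small mantissa per value."""
    _, e = math.frexp(max(abs(v) for v in vals) or 1.0)  # exponent of the largest value
    scale = 2.0 ** (mant_bits - e)                       # chosen so the largest value fits
    return e, [round(v * scale) for v in vals]

def shared_exp_decode(e, mants, mant_bits):
    scale = 2.0 ** (mant_bits - e)
    return [m / scale for m in mants]

# A 12-bit fixed-point value covering [-1, 1]: roughly 5e-4 resolution.
q = fixed_encode(0.25, 12, -1.0, 1.0)
assert abs(fixed_decode(q, 12, -1.0, 1.0) - 0.25) < 1e-3
```

The shared‑exponent format works because the largest component fixes the scale for the whole vector: smaller components lose some absolute precision, but every value needs only a short mantissa.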
Bit Adapter Types
Because current hardware does not support arbitrary bit‑width reads/writes, QuanTaichi provides two adapters:
Bit structs: pack multiple custom types (e.g., a 5‑bit custom int and a 12‑bit custom float) into a native hardware word such as a 32‑bit integer.
Bit arrays: store many instances of the same custom type within a native word.
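The packing a bit struct performs can be sketched in plain Python with shifts and masks. The field layout below (a 5‑bit int at offset 0, a 12‑bit quantized value above it) is a hypothetical example, not the layout QuanTaichi chooses.

```python
INT5_BITS, FLOAT12_BITS = 5, 12  # hypothetical field widths, 15 bits left spare

def pack(i5: int, f12: int) -> int:
    """Pack a 5-bit field and a 12-bit field into one 32-bit word."""
    assert 0 <= i5 < (1 << INT5_BITS) and 0 <= f12 < (1 << FLOAT12_BITS)
    return i5 | (f12 << INT5_BITS)

def unpack(word: int) -> tuple[int, int]:
    """Recover both fields from the packed word."""
    i5 = word & ((1 << INT5_BITS) - 1)
    f12 = (word >> INT5_BITS) & ((1 << FLOAT12_BITS) - 1)
    return i5, f12

word = pack(19, 3000)
assert unpack(word) == (19, 3000)
```

A bit array is the degenerate case where every field has the same width, so a 32‑bit word holds, for example, thirty‑two 1‑bit cells.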
Compiler Optimizations
Bit‑struct store fusion: analyze kernel access patterns to batch writes to bit‑struct members, reducing atomic memory operations.
Thread‑safety inference: detect when operations are inherently thread‑safe so costly atomic writes can be avoided; supports element‑wise accesses and whole‑struct stores.
Bit‑array vectorization: transform per‑bit loops into 32‑wide vector operations, eliminating excessive atomicRMW instructions.
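The payoff of bit‑array vectorization can be shown with a toy example in plain Python: when 32 one‑bit cells live in a single 32‑bit word, a whole‑word bitwise operation replaces 32 separate read‑modify‑write steps. The cell update chosen here (inverting every cell) is an illustrative stand‑in, not an actual QuanTaichi kernel.

```python
MASK32 = 0xFFFFFFFF  # 32 one-bit cells packed into one word

def invert_cells_scalar(word: int) -> int:
    """One bit at a time: 32 separate read-modify-write steps."""
    out = word
    for k in range(32):
        bit = (out >> k) & 1
        out = (out & ~(1 << k)) | ((bit ^ 1) << k)
    return out & MASK32

def invert_cells_vectorized(word: int) -> int:
    """The same update as a single word-wide operation."""
    return ~word & MASK32

w = 0b1011
assert invert_cells_scalar(w) == invert_cells_vectorized(w)
```

On a GPU the scalar version would also need an atomicRMW per bit, since neighboring threads touch the same word; the word‑wide form sidesteps that entirely.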
Experimental Results
Game of Life Test
The classic cellular automaton was used to evaluate storage efficiency. Each cell’s alive/dead state fits in one bit; a typical C implementation spends an 8‑bit char per cell, whereas QuanTaichi stores exactly one bit, an 8× reduction without changes to the simulation code. On an NVIDIA RTX 3080 Ti, the system simulated over 20 billion cells across 70 × 70 OTCA meta‑pixels (2048 × 2048 cells each).
Euler Fluid Simulation Test
A sparse‑grid advection‑reflection smoke solver cut per‑cell storage from 84 bytes to 44 bytes via quantization, enabling over 420 million active sparse‑grid smoke cells on an NVIDIA Tesla V100 (32 GB).
MLS‑MPM Algorithm Test
Using a quantized scheme that shrank per‑particle storage from 68 bytes to 40 bytes, the system simulated over 230 million particles on an NVIDIA RTX 3090.
On an iPhone XS, the quantized MLS‑MPM implementation used a 32‑bit fixed‑point type (ti.quant.fixed(frac=32)) to replace 32‑bit floating‑point atomic adds with native 32‑bit integer atomic adds, yielding a noticeable speed‑up despite the device’s limited compute resources.
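The reason this helps is that a fixed‑point encoding turns floating‑point accumulation into plain integer addition, which mobile GPUs support as a native atomic. A minimal sketch in plain Python (32 fraction bits as in the type above; the clamping/range handling a real implementation needs is omitted):

```python
FRAC_BITS = 32
SCALE = 1 << FRAC_BITS  # values are stored as round(x * 2**32)

def to_fixed(x: float) -> int:
    return round(x * SCALE)

def from_fixed(q: int) -> float:
    return q / SCALE

# Accumulating grid contributions: each integer add below stands in for
# one native integer atomic add on the GPU.
acc = 0
for contrib in (0.125, 0.25, -0.0625):
    acc += to_fixed(contrib)

assert from_fixed(acc) == 0.3125
```

Because addition of fixed‑point values is exactly integer addition of their encodings, no compare‑and‑swap loop is needed to emulate a floating‑point atomic add.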
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.