QuanTaichi: A Compiler for Automatic Quantization of High‑Precision Simulations
QuanTaichi, built on the Taichi language, introduces custom numeric types, bit‑struct adapters, and compiler optimizations that dramatically reduce memory and bandwidth for particle‑based physics simulations, enabling high‑precision GPU rendering on a single card and even on mobile devices.
Advances in computer simulation now allow movies like Frozen to recreate realistic worlds, but achieving such visual fidelity requires ultra‑high‑precision physics that traditionally consumes massive GPU memory and expensive hardware.
Particle‑based representations, common in visual effects, can involve tens of millions of particles; simulating a 300‑meter dam breach may need dozens of gigabytes of VRAM, often requiring multiple high‑end GPUs such as four NVIDIA Quadro P6000 cards costing thousands of dollars each.
Researchers from Kuaishou, MIT, Zhejiang University, and Tsinghua have introduced QuanTaichi, a new language abstraction and compiler system for quantized simulation. By packing low‑precision numeric data tightly in memory, QuanTaichi cuts memory usage and bandwidth, allowing high‑precision physical simulation to run on a single GPU.
QuanTaichi builds on the Taichi programming language and compiler, originally created by Yuanming Hu et al. It lets developers switch easily between full‑precision and quantized simulators to find the best precision/performance trade‑off. The work was accepted at SIGGRAPH 2021 and released as open source on GitHub.
Paper: https://yuanming.taichi.graphics/publication/2021-quantaichi/quantaichi.pdf
Project page: https://yuanming.taichi.graphics/publication/2021-quantaichi/
GitHub: https://github.com/taichi-dev/quantaichi
The accompanying video demonstrates a quantized simulation of two rabbit‑shaped smoke clouds (400 million voxels) whose visual quality rivals full‑precision floating‑point results while using only half the storage.
The same quantization techniques have been applied to mobile devices, delivering up to a 40 % speed‑up for physics simulations on smartphones.
Overall, QuanTaichi improves development efficiency for GPU‑based physical simulation, large‑scale image processing, media codecs, scientific computing, and more, while enhancing storage efficiency across the Taichi ecosystem.
Technical Details – Quantized Numeric Types
QuanTaichi defines several custom numeric types:
Custom Int: user‑specified bit‑width integer, signed or unsigned.
Custom Float: user‑specified bit‑width floating point with three implementations:
- Fixed‑point: a custom integer plus a scaling factor.
- Floating‑point: separate user‑defined mantissa and exponent bits.
- Shared‑exponent: multiple values share a single exponent, exploiting the fact that in many physical datasets a few large values dominate.
[Figure: memory layouts of the fixed‑point, floating‑point, and shared‑exponent formats]
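As a rough illustration, the fixed‑point and shared‑exponent ideas can be sketched in plain Python. This is not the Taichi API; the function names, bit widths, and value ranges below are hypothetical examples.

```python
import math

def fixed_encode(x: float, bits: int, lo: float, hi: float) -> int:
    """Fixed-point: map x in [lo, hi] onto an unsigned integer of `bits` bits."""
    levels = (1 << bits) - 1
    t = min(max((x - lo) / (hi - lo), 0.0), 1.0)  # clamp into the representable range
    return round(t * levels)

def fixed_decode(q: int, bits: int, lo: float, hi: float) -> float:
    levels = (1 << bits) - 1
    return lo + (q / levels) * (hi - lo)

def shared_exp_encode(vals, mant_bits):
    """Shared-exponent: one exponent for the whole vector, one small mantissa per value."""
    _, e = math.frexp(max(abs(v) for v in vals) or 1.0)  # exponent of the largest value
    scale = 2.0 ** (mant_bits - e)                       # chosen so the largest value fits
    return e, [round(v * scale) for v in vals]

def shared_exp_decode(e, mants, mant_bits):
    scale = 2.0 ** (mant_bits - e)
    return [m / scale for m in mants]

# A 12-bit fixed-point value covering [-1, 1]: roughly 5e-4 resolution.
q = fixed_encode(0.25, 12, -1.0, 1.0)
assert abs(fixed_decode(q, 12, -1.0, 1.0) - 0.25) < 1e-3
```

The shared‑exponent format works because the largest component fixes the scale for the whole vector: smaller components lose some absolute precision, but every value needs only a short mantissa.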
Bit Adapter Types
Because current hardware does not support arbitrary bit‑width reads/writes, QuanTaichi provides two adapters:
Bit structs: pack multiple custom types (e.g., a 5‑bit custom int and a 12‑bit custom float) into a native hardware word such as a 32‑bit integer.
Bit arrays: store many instances of the same custom type within a native word.
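The packing a bit struct performs can be sketched in plain Python with shifts and masks. The field layout below (a 5‑bit int at offset 0, a 12‑bit quantized value above it) is a hypothetical example, not the layout QuanTaichi chooses.

```python
INT5_BITS, FLOAT12_BITS = 5, 12  # hypothetical field widths, 15 bits left spare

def pack(i5: int, f12: int) -> int:
    """Pack a 5-bit field and a 12-bit field into one 32-bit word."""
    assert 0 <= i5 < (1 << INT5_BITS) and 0 <= f12 < (1 << FLOAT12_BITS)
    return i5 | (f12 << INT5_BITS)

def unpack(word: int) -> tuple[int, int]:
    """Recover both fields from the packed word."""
    i5 = word & ((1 << INT5_BITS) - 1)
    f12 = (word >> INT5_BITS) & ((1 << FLOAT12_BITS) - 1)
    return i5, f12

word = pack(19, 3000)
assert unpack(word) == (19, 3000)
```

A bit array is the degenerate case where every field has the same width, so a 32‑bit word holds, for example, thirty‑two 1‑bit cells.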
Compiler Optimizations
Bit‑struct store fusion: analyze kernel access patterns to batch writes to bit‑struct members, reducing atomic memory operations.
Thread‑safety inference: detect when operations are inherently thread‑safe so costly atomic writes can be avoided; supports element‑wise accesses and whole‑struct stores.
Bit‑array vectorization: transform per‑bit loops into 32‑wide vector operations, eliminating excessive atomicRMW instructions.
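The payoff of bit‑array vectorization can be shown with a toy example in plain Python: when 32 one‑bit cells live in a single 32‑bit word, a whole‑word bitwise operation replaces 32 separate read‑modify‑write steps. The cell update chosen here (inverting every cell) is an illustrative stand‑in, not an actual QuanTaichi kernel.

```python
MASK32 = 0xFFFFFFFF  # 32 one-bit cells packed into one word

def invert_cells_scalar(word: int) -> int:
    """One bit at a time: 32 separate read-modify-write steps."""
    out = word
    for k in range(32):
        bit = (out >> k) & 1
        out = (out & ~(1 << k)) | ((bit ^ 1) << k)
    return out & MASK32

def invert_cells_vectorized(word: int) -> int:
    """The same update as a single word-wide operation."""
    return ~word & MASK32

w = 0b1011
assert invert_cells_scalar(w) == invert_cells_vectorized(w)
```

On a GPU the scalar version would also need an atomicRMW per bit, since neighboring threads touch the same word; the word‑wide form sidesteps that entirely.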
Experimental Results
Game of Life Test
The classic cellular automaton was used to evaluate storage efficiency. Each cell’s alive/dead state fits in one bit; a typical C implementation spends an 8‑bit char per cell, whereas QuanTaichi stores exactly one bit, an 8× reduction without changes to the simulation code. On an NVIDIA RTX 3080 Ti, the system simulated over 20 billion cells across 70 × 70 OTCA meta‑pixels (2048 × 2048 cells each).
Euler Fluid Simulation Test
A sparse‑grid advection‑reflection smoke solver cut per‑cell storage from 84 bytes to 44 bytes via quantization, enabling over 420 million active sparse‑grid smoke cells on an NVIDIA Tesla V100 (32 GB).
MLS‑MPM Algorithm Test
Using a quantized scheme that shrank per‑particle storage from 68 bytes to 40 bytes, the system simulated over 230 million particles on an NVIDIA RTX 3090.
On an iPhone XS, the quantized MLS‑MPM implementation used a 32‑bit fixed‑point type (ti.quant.fixed(frac=32)) to replace 32‑bit floating‑point atomic adds with native 32‑bit integer atomic adds, yielding a noticeable speed‑up despite the device’s limited compute resources.
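The reason this helps is that a fixed‑point encoding turns floating‑point accumulation into plain integer addition, which mobile GPUs support as a native atomic. A minimal sketch in plain Python (32 fraction bits as in the type above; the clamping/range handling a real implementation needs is omitted):

```python
FRAC_BITS = 32
SCALE = 1 << FRAC_BITS  # values are stored as round(x * 2**32)

def to_fixed(x: float) -> int:
    return round(x * SCALE)

def from_fixed(q: int) -> float:
    return q / SCALE

# Accumulating grid contributions: each integer add below stands in for
# one native integer atomic add on the GPU.
acc = 0
for contrib in (0.125, 0.25, -0.0625):
    acc += to_fixed(contrib)

assert from_fixed(acc) == 0.3125
```

Because addition of fixed‑point values is exactly integer addition of their encodings, no compare‑and‑swap loop is needed to emulate a floating‑point atomic add.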
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.