GPU-Accelerated Model Service and Optimization Practices at Xiaohongshu
This article details Xiaohongshu's end‑to‑end GPU‑based transformation of its recommendation and search models, covering background, model characteristics, training and inference frameworks, system‑level and GPU‑level optimizations, compilation tricks, hardware upgrades, and future directions for large‑scale machine‑learning infrastructure.
In recent years, the compute demands of machine‑learning workloads in video, image, text, and search have grown far faster than CPU performance, prompting a shift to GPU‑accelerated solutions.
Starting in 2021, Xiaohongshu began migrating its recommendation and search models to GPUs to improve inference performance and efficiency, confronting challenges such as migrating smoothly off CPU architectures, aligning with business scenarios, and scaling cost‑effectively.
Background
Xiaohongshu’s app homepage includes recommendation and search pages that use CTR, CVR, and relevance models, many of which now run on GPUs. Per‑request computation was around 400 B FLOPs in early 2021; by the end of 2022, model sizes had reached trillion‑parameter scale.
Model Service
1. Model Characteristics – Before the era of large dense models, most parameters came from sparse feature embeddings, yielding TB‑scale sparse matrices while the dense part stayed under 10 GB, small enough to fit on a single GPU.
2. Training & Inference Frameworks – Two major model families exist: sparse CTR models with custom inference/training stacks, and CV/NLP models built on the PyTorch stack.
For inference, TensorFlow Serving was originally used, but a custom Lambda Service was built on the low‑level CTensor API to eliminate TensorProto overhead and integrate graph scheduling and compilation optimizations.
For training, a self‑developed Worker & PS framework based on TensorFlow was created, later upgraded to address GPU‑related bottlenecks.
GPU Characteristics
Each GPU kernel invocation involves host‑to‑device data transfer, kernel launch, on‑device computation, and device‑to‑host result transfer. When transfer or launch overhead dominates, effective GPU utilization drops.
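The launch‑overhead argument can be made concrete with back‑of‑the‑envelope arithmetic. The per‑launch cost and kernel counts below are illustrative assumptions, not measured Xiaohongshu numbers:

```python
# Illustrative model: total time = num_launches * launch_overhead + compute_time.
# Fusing many small kernels into fewer large ones amortizes launch overhead;
# the device-side compute is the same either way.

LAUNCH_OVERHEAD_US = 10.0   # assumed host-side cost per kernel launch
COMPUTE_US = 4000.0         # total device compute, independent of how work is split

def total_time_us(num_launches: int) -> float:
    """Time for the same workload split into `num_launches` kernels."""
    return num_launches * LAUNCH_OVERHEAD_US + COMPUTE_US

many_small = total_time_us(1000)  # 1000 tiny kernels: launch cost dominates
few_large = total_time_us(10)     # same work fused into 10 kernels

print(f"1000 launches: {many_small:.0f} us, 10 launches: {few_large:.0f} us")
```

Under these assumptions the fused version spends 100 µs on launches instead of 10 ms, which is why kernel fusion and reducing per‑request GPU submissions (discussed below) matter for utilization.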
GPU Optimization Practices
1. System Optimization – Physical Machine – Collaborated with cloud providers to isolate GPU interrupts, upgrade kernel versions, and enable direct instruction pass‑through, gaining 1‑2% performance.
2. Multi‑Card Optimization – Addressed NUMA latency by binding pods to specific NUMA nodes, improving CPU‑GPU data transfer speed.
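On Linux, pinning a serving process to the CPUs local to its GPU's NUMA node can be sketched with the standard library alone. The CPU set below is a placeholder; a real deployment would read the GPU's NUMA node from sysfs or use `numactl`/pod topology hints:

```python
import os

def bind_to_cpus(cpus: set) -> set:
    """Pin the current process to the given CPU set and return the new affinity."""
    os.sched_setaffinity(0, cpus)   # 0 = the current process
    return os.sched_getaffinity(0)

# Hypothetical assumption: CPUs 0-15 sit on the same NUMA node as the GPU.
# Intersect with the CPUs we are actually allowed to use.
local_cpus = set(range(16)) & os.sched_getaffinity(0)
print(bind_to_cpus(local_cpus))
```

Binding both the process (CPU affinity) and its memory allocations to the GPU‑local node avoids cross‑socket hops on the host‑to‑device copy path, which is the latency the text describes.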
3. Compilation Optimization – Performed cross‑compilation for different CPU instruction sets, achieving ~10% performance gains on specific cloud instances.
4. Redundant Computation Optimization – Merged duplicated calculations in CTR models, reduced multiple GPU submissions per request, and applied graph freezing to replace variable ops with constants, cutting overall compute cost by ~12%.
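Graph freezing replaces variable‑read ops with baked‑in constants so the runtime can constant‑fold and skip variable‑lookup machinery. A toy, framework‑free sketch of the idea (the node layout and names are invented for illustration, not Xiaohongshu's actual graph IR):

```python
# Toy graph: a list of node dicts. "variable" nodes read mutable weights at
# run time; "const" nodes carry their value inline and can be constant-folded.

def freeze(graph: list, weights: dict) -> list:
    """Return a copy of `graph` with every variable node replaced by a const."""
    frozen = []
    for node in graph:
        if node["op"] == "variable":
            frozen.append({"op": "const", "name": node["name"],
                           "value": weights[node["name"]]})
        else:
            frozen.append(dict(node))
    return frozen

graph = [
    {"op": "variable", "name": "w"},
    {"op": "matmul", "name": "y", "inputs": ["x", "w"]},
]
frozen = freeze(graph, {"w": [[1.0, 0.0], [0.0, 1.0]]})
print(frozen)
```

After freezing, a compiler sees `w` as a constant operand of the matmul and can fold, fuse, or pre‑pack it, which is the source of the savings the text attributes to replacing variable ops with constants.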
5. Hardware Upgrade – Replaced expensive A100 cards with cost‑effective A10s, achieving 1.5× performance at 1.2× cost, and planned future use of A30s.
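The stated ratios imply the efficiency gain directly, since 1.5× throughput at 1.2× unit cost is a 1.25× improvement in performance per dollar:

```python
# Ratios taken from the text: new card vs. baseline.
perf_ratio, cost_ratio = 1.5, 1.2
perf_per_dollar = perf_ratio / cost_ratio
print(f"{perf_per_dollar:.2f}x performance per unit cost")
```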
6. DL Stack Auto‑Compilation – Integrated Alibaba’s BladeDISC and TensorFlow XLA to compile high‑level graphs into optimized GPU kernels, delivering significant speedups.
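On the TensorFlow XLA side, auto‑clustering can be switched on without touching model code via the standard `TF_XLA_FLAGS` environment variable. This shows the stock TensorFlow mechanism, not a claim about how Lambda Service integrates it:

```python
import os

# Standard TensorFlow switch: auto-jit level 2 lets XLA cluster eligible
# subgraphs and compile each cluster into fused GPU kernels at run time.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=2"

# Alternatively, individual functions can request compilation explicitly
# in TF 2.x with @tf.function(jit_compile=True).
print(os.environ["TF_XLA_FLAGS"])
```

The flag must be set before the TensorFlow runtime initializes; BladeDISC hooks in at a similar level, taking whole subgraphs and lowering them through MLIR to fused kernels.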
Training Optimization
Early‑stage optimizations focused on reducing I/O by converting row‑based TFRecord data to columnar format, prefetching data to hide CPU lookup latency, and decoupling backward updates to enable asynchronous gradient accumulation.
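The prefetching idea above — overlapping the CPU‑side lookup for the next batch with work on the current one — can be sketched with a background thread and a bounded queue. The lookup callable below is a stand‑in, not the actual framework API:

```python
import queue
import threading

def prefetching_loader(batches, lookup, depth=2):
    """Run `lookup` (CPU-bound) in a background thread, up to `depth` batches ahead."""
    q = queue.Queue(maxsize=depth)   # bounded so the producer can't run away
    SENTINEL = object()

    def producer():
        for b in batches:
            q.put(lookup(b))         # lookup latency overlaps consumer work
        q.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not SENTINEL:
        yield item

# Stand-in lookup: pretend each batch id maps to an embedding value id*0.1.
results = list(prefetching_loader(range(3), lambda b: b * 0.1))
print(results)  # [0.0, 0.1, 0.2]
```

The bounded queue gives the same hiding effect as `tf.data`‑style prefetch: while the trainer consumes batch N, the producer thread is already resolving embeddings for batch N+1.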
Additional training‑inference co‑optimizations included selective precision reduction (e.g., FP16/INT8) on GPU‑bound sub‑graphs while preserving overall model accuracy.
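The INT8 end of that precision spectrum can be illustrated with a minimal symmetric quantize/dequantize round trip. This uses the simple max‑abs scaling rule as an assumption; it is not Xiaohongshu's calibration method:

```python
def quantize(xs, num_bits=8):
    """Symmetric linear quantization: floats -> ints in [-127, 127] plus a scale."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for INT8
    scale = max(abs(x) for x in xs) / qmax or 1.0
    return [round(x / scale) for x in xs], scale

def dequantize(qs, scale):
    return [q * scale for q in qs]

weights = [0.5, -1.27, 0.003, 1.0]
qs, scale = quantize(weights)
recovered = dequantize(qs, scale)
print(max(abs(a - b) for a, b in zip(weights, recovered)))  # small quantization error
```

The round‑trip error is bounded by half the scale, which is why quantization is applied selectively to GPU‑bound sub‑graphs that tolerate it, while accuracy‑sensitive parts keep full precision.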
Future Directions
Plans include scaling sparse large‑model training with HPC‑style single‑machine runs for small parameters, A100 clusters with high‑speed interconnects for TB‑scale models, and multi‑GPU plus communication fabrics for even larger models.
Inference will see upgrades to hashing, multi‑level caching, model lightweighting, and continuous adoption of newer Nvidia drivers and hardware.
Long‑term, Xiaohongshu aims to build a drag‑and‑drop, canvas‑based end‑to‑end ML platform with DSL‑driven feature management.
Thank you for reading.
DataFunSummit