NorthStar Large‑Model Training Framework: Architecture, APIs, Pipeline and Multi‑GPU Strategies
The article introduces the NorthStar large‑model training framework developed by DeWu, detailing its background challenges, pipeline architecture, rich API support, multi‑GPU training modes, multi‑level embedding storage, hardware selection considerations, and a brief Q&A on data versus model parallelism.
Background : DeWu's search-and-recommendation business processes massive, sparse, and multi-goal data, with daily data volumes at the terabyte level and sample counts reaching the billions. Per-model training time on a single GPU must be compressed to minutes, and daily model retraining must complete within an hour, placing high demands on cost, efficiency, and capability.
NorthStar Framework Overview : NorthStar is DeWu’s self‑developed large‑scale sparse model training framework built on a pipeline architecture. It aims to provide ultra‑efficient single‑machine GPU training with lower cost and higher performance, supporting diverse business scenarios through flexible APIs and multi‑level storage.
API Support : The framework offers extensive APIs for handling CSV, ORC, Parquet, Kudu, etc. It defines complete training sources (train/eval, finetune, load/save, export) with many configurable parameters. Optimizers such as Adam, Adagrad, and Ftrl are supported, and embedding operators (sum‑concat, lookup, weighted_emb, seq_emb, etc.) have dedicated kernels.
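NorthStar's actual operator API is not public, but the sum-concat embedding operator named above can be illustrated with a small sketch: for each sparse feature field, gather the embedding rows for that field's ids, sum them within the field, then concatenate the per-field results into one dense vector per sample. All names below are illustrative, not the framework's real kernels.

```python
def sum_concat(tables, batch_ids):
    """Sketch of a sum-concat embedding operator (illustrative only).

    tables:    per-field embedding matrices; tables[f] is a list of rows,
               each row a list of floats of that field's embedding dim.
    batch_ids: per sample, one list of sparse ids per field.
    Returns one concatenated dense vector per sample.
    """
    out = []
    for sample in batch_ids:
        vec = []
        for f, ids in enumerate(sample):
            dim = len(tables[f][0])
            # Sum the looked-up rows within this feature field.
            summed = [sum(tables[f][i][d] for i in ids) for d in range(dim)]
            vec.extend(summed)  # concat across fields
        out.append(vec)
    return out

# Two fields: a 3x2 table and a 2x1 table.
tables = [[[1, 0], [0, 1], [1, 1]], [[2], [3]]]
batch = [[[0, 1], [1]],   # sample 1: field-0 ids {0,1}, field-1 id {1}
         [[2], [0, 1]]]   # sample 2: field-0 id {2}, field-1 ids {0,1}
print(sum_concat(tables, batch))  # → [[1, 1, 3], [1, 1, 5]]
```

A production kernel would fuse the gather, segment-sum, and concat into one GPU pass over the batch, which is presumably why these operators get dedicated kernels.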
Pipeline Design : By employing a pipeline, NorthStar keeps GPU compute units busy while overlapping I/O and data loading, thus reducing idle time and improving cost‑effectiveness. Asynchronous channels allow threads to consume and produce data streams without blocking.
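The overlap idea can be sketched with a bounded channel between a loader thread and a consumer: the loader keeps producing while the consumer computes, so neither side idles waiting for the other. This is a minimal stand-in using Python threads and a queue, not NorthStar's actual channel API.

```python
import queue
import threading

def run_pipeline(num_batches):
    """Overlap data loading with compute via a bounded async channel."""
    channel = queue.Queue(maxsize=4)  # bounded buffer provides back-pressure
    results = []

    def loader():
        for i in range(num_batches):
            channel.put([i] * 3)      # stand-in for an I/O-bound batch read
        channel.put(None)             # sentinel: no more data

    t = threading.Thread(target=loader)
    t.start()
    while True:
        batch = channel.get()         # consumer blocks only when starved
        if batch is None:
            break
        results.append(sum(batch))    # stand-in for GPU compute on the batch
    t.join()
    return results

print(run_pipeline(5))  # → [0, 3, 6, 9, 12]
```

The bounded queue is the key design choice: it decouples producer and consumer rates while capping memory, which is what keeps the GPU fed without unbounded buffering.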
Multi‑GPU Training : Two single‑machine multi‑GPU strategies are described. In the first, one GPU performs the dense neural‑network computation and distributes gradients to the other three GPUs, which handle embedding lookups and backward updates; this achieves high utilization for small models. The second follows a traditional all‑reduce approach: each GPU processes its own data shard and generates embeddings, and a single all‑reduce synchronizes gradients across GPUs.
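The all-reduce step in the second strategy can be shown with a pure-Python stand-in (no real GPUs or collective-communication library): each replica contributes its local gradient vector, and every replica receives the element-wise mean.

```python
def all_reduce_mean(per_gpu_grads):
    """Simulate an all-reduce that averages gradients across replicas.

    per_gpu_grads: one gradient vector (list of floats) per GPU.
    Returns the synchronized result every replica would hold afterward.
    """
    n = len(per_gpu_grads)
    dim = len(per_gpu_grads[0])
    mean = [sum(g[j] for g in per_gpu_grads) / n for j in range(dim)]
    # After all-reduce, every replica holds the identical averaged gradient.
    return [list(mean) for _ in range(n)]

# Four GPUs, each with a 2-dim local gradient from its data shard.
print(all_reduce_mean([[1, 2], [3, 4], [5, 6], [7, 8]]))
# → [[4.0, 5.0], [4.0, 5.0], [4.0, 5.0], [4.0, 5.0]]
```

In a real deployment this one collective call replaces any parameter-server round-trip, which is why a single fused all-reduce per step is the standard synchronous data-parallel pattern.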
Embedding Multi‑Level Storage : To handle the massive embedding tables, NorthStar implements a three‑tier storage hierarchy: HBM‑PS (GPU memory) for active NN weights and current batch embeddings, MEM‑PS (CPU memory) for high‑frequency embeddings, and SSD‑PS (local SSD) for long‑tail embeddings and full checkpoints, enabling elastic model scaling.
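The tiering logic can be sketched as a cache hierarchy: a small HBM tier backed by a larger MEM tier, backed by an SSD tier that always holds the full table; lookups promote rows upward and evictions push least-recently-used rows down. This is a hypothetical in-memory model of the idea, not NorthStar's implementation.

```python
from collections import OrderedDict

class TieredEmbeddingStore:
    """Three-tier embedding store sketch: HBM-PS -> MEM-PS -> SSD-PS."""

    def __init__(self, hbm_capacity, mem_capacity):
        self.hbm = OrderedDict()   # "GPU memory": hottest rows, LRU-ordered
        self.mem = OrderedDict()   # "CPU memory": warm rows evicted from HBM
        self.ssd = {}              # "local SSD": full table, incl. long tail
        self.hbm_cap = hbm_capacity
        self.mem_cap = mem_capacity

    def put(self, key, row):
        self.ssd[key] = row        # SSD tier always keeps the full copy

    def lookup(self, key):
        if key in self.hbm:        # hit in the fastest tier
            self.hbm.move_to_end(key)
            return self.hbm[key]
        row = self.mem.pop(key, None)
        if row is None:
            row = self.ssd[key]    # cold read from the slowest tier
        self._insert_hbm(key, row) # promote the row for the current batch
        return row

    def _insert_hbm(self, key, row):
        self.hbm[key] = row
        if len(self.hbm) > self.hbm_cap:
            old_key, old_row = self.hbm.popitem(last=False)  # evict LRU
            self.mem[old_key] = old_row
            if len(self.mem) > self.mem_cap:
                self.mem.popitem(last=False)  # drop; SSD still has the row
```

Because the SSD tier holds everything, evictions from the upper tiers are safe drops rather than writebacks, which is what makes the model elastically larger than GPU memory.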
Hardware and Deployment Choices : CPUs are suitable for I/O‑intensive and scheduling tasks, while GPUs excel at compute‑intensive matrix operations. The framework favors single‑machine deployment to avoid distributed communication overhead; the talk also weighs the trade‑off between synchronous training (higher accuracy) and asynchronous training (higher speed).
Q&A : A brief Q&A clarifies the difference between data parallelism (splitting data across identical model replicas) and model parallelism (splitting the model itself), recommending data parallelism for typical DNNs and model parallelism for very large models such as Transformers.
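The distinction drawn in the Q&A can be made concrete with two tiny sharding helpers (illustrative only): data parallelism slices the batch while every replica keeps the full model, whereas model parallelism slices the model's layers while every device sees the full batch.

```python
def data_parallel_shards(batch, num_replicas):
    """Data parallelism: each replica gets a slice of the batch,
    and (implicitly) a full copy of the model."""
    k = -(-len(batch) // num_replicas)  # ceil division: shard size
    return [batch[i * k:(i + 1) * k] for i in range(num_replicas)]

def model_parallel_shards(layers, num_devices):
    """Model parallelism: each device gets a slice of the layers,
    and (implicitly) sees every sample flowing through."""
    k = -(-len(layers) // num_devices)
    return [layers[i * k:(i + 1) * k] for i in range(num_devices)]

print(data_parallel_shards([1, 2, 3, 4], 2))            # → [[1, 2], [3, 4]]
print(model_parallel_shards(["l1", "l2", "l3"], 2))     # → [['l1', 'l2'], ['l3']]
```

The recommendation follows directly: when the model fits on one device, shard the data (cheap, one all-reduce per step); when it does not, shard the model and pay the activation-transfer cost between devices.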
Overall, the presentation provides a comprehensive look at DeWu’s NorthStar framework, covering its motivation, design principles, API capabilities, performance optimizations, and practical considerations for large‑scale sparse model training.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.