Cost Optimization and Mixed‑Resource Deployment in Tencent Taiji Machine Learning Platform for Large‑Scale AI Models
The article describes how Tencent's Taiji machine learning platform leverages cloud‑native mixed‑resource strategies—including online idle, tidal, and compute resources—to reduce training costs, improve stability, and support large‑scale AI model training for advertising and other services.
Abstract
In recent years, large-model training has become the standard paradigm in AI, especially in advertising, where models with billions of parameters demand massive compute and storage resources.
The Taiji Machine Learning Platform aims to lower costs and increase efficiency through mixed-deployment resources, providing 500,000 low-cost cores daily and cutting offline training costs by 30% while maintaining stability comparable to that of ordinary resources.
Platform Overview
Taiji offers a one-stop solution for feature processing, model training, and model serving, supporting key Tencent businesses such as advertising, search, games, and cloud services. It provides both CPU and GPU training modes, high-performance inference, and end-to-end pipelines.
Cost‑Optimization Implementation
To meet growing resource demand, Taiji integrates with Tencent's internal cloud-native big-data platform "FengLuan", which supplies three types of mixed resources:
Reuse of online idle resources.
Elastic borrowing of offline resources (tidal resources).
Reuse of low‑priority compute resources.
FengLuan abstracts heterogeneous resources via virtual clusters, shielding Taiji from underlying geographic differences.
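The virtual-cluster abstraction can be pictured with a minimal sketch. All names here (`ResourcePool`, `VirtualCluster`, the pool labels) are hypothetical illustrations, not FengLuan's actual API; the point is only that callers request cores from one logical cluster and never see the region or resource type they land on.

```python
from dataclasses import dataclass

@dataclass
class ResourcePool:
    """One underlying pool of a given mixed-resource type in a given region."""
    name: str
    kind: str        # "online_idle", "tidal", or "low_priority_compute"
    region: str
    free_cores: int

class VirtualCluster:
    """Presents heterogeneous regional pools as a single logical cluster,
    shielding the caller from geographic and resource-type differences."""
    def __init__(self, pools):
        self.pools = list(pools)

    def allocate(self, cores):
        # Greedy placement: pick the fitting pool with the most free capacity.
        candidates = [p for p in self.pools if p.free_cores >= cores]
        if not candidates:
            return None
        best = max(candidates, key=lambda p: p.free_cores)
        best.free_cores -= cores
        return best.name

cluster = VirtualCluster([
    ResourcePool("sh-idle", "online_idle", "shanghai", 2000),
    ResourcePool("sz-tidal", "tidal", "shenzhen", 5000),
])
print(cluster.allocate(3000))  # only the tidal pool fits: prints "sz-tidal"
```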
Mixed‑Resource Schemes
Online Idle Resources – Caelus framework mixes online and offline jobs to exploit idle CPU cycles, ensuring online service quality through interference detection and resource isolation.
Tidal Resources – Daytime idle big‑data nodes are lent to Taiji; at night they are reclaimed. A coordinated hand‑off with HDFS ensures storage services remain unaffected.
Compute Resources – Low‑priority CVM instances provide exclusive compute but may be pre‑empted; Taiji mitigates instability via predictive profiling, scheduling optimizations, and hot migration.
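The tidal scheme above hinges on knowing when borrowed big-data nodes must be handed back. A minimal sketch of that window logic follows; the 08:00–20:00 window and the 30-minute drain period are assumed values for illustration, not figures from the talk.

```python
from datetime import datetime, time

# Hypothetical window: big-data nodes are idle (borrowable) from 08:00 to 20:00;
# outside this window they are reclaimed for nightly batch workloads.
TIDAL_START = time(8, 0)
TIDAL_END = time(20, 0)
DRAIN_MINUTES = 30  # stop placing new tasks this long before reclaim

def tidal_state(now: datetime) -> str:
    """Return what the tidal scheduler should do at the given wall-clock time."""
    t = now.time()
    if not (TIDAL_START <= t < TIDAL_END):
        return "reclaimed"      # nodes belong to the big-data platform again
    minutes_left = (TIDAL_END.hour * 60 + TIDAL_END.minute) \
                   - (t.hour * 60 + t.minute)
    if minutes_left <= DRAIN_MINUTES:
        return "draining"       # checkpoint running tasks, place nothing new
    return "borrowable"         # safe to place offline training tasks

print(tidal_state(datetime(2023, 5, 1, 10, 0)))   # borrowable
print(tidal_state(datetime(2023, 5, 1, 19, 45)))  # draining
print(tidal_state(datetime(2023, 5, 1, 23, 0)))   # reclaimed
```

The drain period is what lets the HDFS hand-off stay graceful: tasks checkpoint and exit before the reclaim deadline instead of being killed at it.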
Scheduling Optimizations
City‑level placement to reduce latency and cost.
CPU binding based on resource predictions.
Hierarchical resource tagging for high‑availability jobs.
Dynamic parameter tuning for consistent performance.
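The first two optimizations above can be combined into one placement score: prefer nodes in the job's city, and only consider nodes with enough free cores to bind the job's predicted CPU demand. This is a simplified sketch with assumed field names and weights, not Taiji's actual scheduler.

```python
def score_node(node, job):
    """Rank a candidate node for a task: city-level affinity reduces latency
    and cross-region traffic cost; headroom over the predicted CPU demand
    lets cores be pinned (bound) without contention."""
    headroom = node["free_cores"] - job["predicted_cores"]
    if headroom < 0:
        return float("-inf")        # cannot bind enough cores, disqualify
    score = 0.0
    if node["city"] == job["preferred_city"]:
        score += 100.0              # city-level affinity dominates
    score += min(headroom, 16)      # mild preference for spare capacity
    return score

nodes = [
    {"name": "gz-1", "city": "guangzhou", "free_cores": 8},
    {"name": "sh-1", "city": "shanghai", "free_cores": 32},
]
job = {"preferred_city": "shanghai", "predicted_cores": 16}
best = max(nodes, key=lambda n: score_node(n, job))
print(best["name"])  # sh-1: same city and enough cores for binding
```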
Runtime Fault‑Tolerance
Hot migration before pod eviction.
Task Manager restart to preserve job state.
Full recovery via Job Manager restart on Flink failures.
Checkpoint‑based resume when other strategies fail.
These measures raise task stability on mixed resources from below 90% to 99.5%.
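The four recovery strategies above form an escalation chain: each is tried in order, and the job falls through to the next only when the cheaper one fails. The sketch below illustrates that chain with hypothetical stub strategies; a real platform would call the scheduler and Flink APIs instead.

```python
def recover(task, strategies):
    """Try recovery strategies in escalating order of cost;
    each strategy returns True on success."""
    for name, strategy in strategies:
        if strategy(task):
            return name
    return "failed"

# Hypothetical stubs standing in for the real recovery mechanisms.
def hot_migrate(task):     return task.get("migratable", False)
def tm_restart(task):      return task.get("tm_state_ok", False)
def jm_restart(task):      return task.get("jm_state_ok", False)
def from_checkpoint(task): return task.get("checkpoint") is not None

CHAIN = [
    ("hot_migration", hot_migrate),            # before pod eviction
    ("task_manager_restart", tm_restart),      # preserve job state
    ("job_manager_restart", jm_restart),       # full recovery on Flink failure
    ("checkpoint_resume", from_checkpoint),    # last resort
]

task = {"migratable": False, "tm_state_ok": False, "jm_state_ok": False,
        "checkpoint": "ckpt-42"}
print(recover(task, CHAIN))  # falls through to "checkpoint_resume"
```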
Application‑Layer Optimizations
These include three‑level fault tolerance, tidal scheduling aware of resource availability, and intelligent workload placement, keeping the extra overhead under 10%.
Online Results and Future Outlook
Taiji now delivers 300,000 core‑hours of mixed resources and 200,000 core‑hours of tidal resources daily for ad‑ranking models, at roughly 70% of the cost of ordinary resources. Future work will extend mixed deployment to GPU resources as online services become GPU‑accelerated.
Thank you for reading.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.