Handling Large-Scale Data in the Tencent Advertising Algorithm Competition: Model Choices, Data Splitting, and Feature Engineering
The article shares practical strategies for processing massive advertising data in the Tencent algorithm competition, covering model selection between GBDT and neural networks, efficient data partitioning methods for low‑resource environments, and the importance of feature engineering to achieve top rankings.
Ahead of the 2019 Tencent Advertising Algorithm Competition, contestant Guo Dayi ("Guo Da") shares his experience of handling large-scale data effectively, even on low-configuration hardware.
Model Selection: The competition mainly uses Gradient Boosting Decision Trees (GBDT) and neural networks. GBDT requires loading all data into memory, while neural networks support streaming training, allowing models to be trained with only a few gigabytes of RAM if a GPU is available. Neural networks excel in time and space efficiency, whereas GBDT offers stable performance and is beginner-friendly; the best results often come from combining both.
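The simplest way to combine the two model families is to blend their predicted probabilities with a weighted average, tuning the weight on a validation split. A minimal sketch follows; the prediction arrays and the equal 0.5/0.5 weighting are hypothetical placeholders, not values from the article:

```python
# Hedged sketch: blending GBDT and neural-network predictions by
# weighted averaging. The toy probabilities and the default 0.5/0.5
# weights are illustrative; in practice the weight is tuned on a
# held-out validation set.

def blend(gbdt_preds, nn_preds, w_gbdt=0.5):
    """Weighted average of two aligned lists of predicted probabilities."""
    w_nn = 1.0 - w_gbdt
    return [w_gbdt * g + w_nn * n for g, n in zip(gbdt_preds, nn_preds)]

# Toy predictions from the two (hypothetical) models.
gbdt_preds = [0.9, 0.2, 0.6]
nn_preds = [0.7, 0.4, 0.8]
blended = blend(gbdt_preds, nn_preds)
print([round(p, 6) for p in blended])  # [0.8, 0.3, 0.7] with equal weights
```

More elaborate ensembling (stacking, rank averaging) builds on the same idea, but even a plain weighted average often captures most of the gain from model diversity.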
Useful resources mentioned include the open‑source libraries DeepCTR (https://github.com/shenweichen/DeepCTR) and ctrNet-tool (https://github.com/guoday/ctrNet-tool).
Data Partitioning: For GBDT models (e.g., XGBoost, LightGBM, CatBoost), if only half of the data fits into memory, split the dataset into three parts, train a separate model on each, and then ensemble them. For neural networks (e.g., NFFM, xDeepFM, DIN), divide the data into dozens of chunks stored as pickle files, loading and discarding each chunk in turn; this approach also scales to parallel training across multiple GPUs.
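The chunked-pickle approach for neural networks can be sketched as follows: write the dataset to disk in fixed-size chunks, then train by loading one chunk at a time so memory only ever holds a single chunk. The file naming and the `train_on_chunk` callback below are illustrative assumptions, not details from the article:

```python
# Hedged sketch of streaming training from pickle chunks. Only one
# chunk is resident in memory at a time; the training step is a
# placeholder for e.g. one epoch of mini-batch gradient updates.
import os
import pickle
import tempfile

def write_chunks(rows, chunk_size, out_dir):
    """Split `rows` into pickle files of at most `chunk_size` rows each."""
    paths = []
    for i in range(0, len(rows), chunk_size):
        path = os.path.join(out_dir, f"chunk_{i // chunk_size}.pkl")
        with open(path, "wb") as f:
            pickle.dump(rows[i:i + chunk_size], f)
        paths.append(path)
    return paths

def stream_train(paths, train_on_chunk):
    """Load each pickle chunk, hand it to the training step, discard it."""
    for path in paths:
        with open(path, "rb") as f:
            chunk = pickle.load(f)
        train_on_chunk(chunk)  # e.g. model.train_on_batch per mini-batch
        del chunk              # free the chunk before loading the next one

# Toy usage: count rows seen to confirm every chunk is visited exactly once.
rows = [{"ad_id": i, "clicked": i % 2} for i in range(100)]
seen = []
with tempfile.TemporaryDirectory() as d:
    paths = write_chunks(rows, chunk_size=32, out_dir=d)
    stream_train(paths, lambda chunk: seen.extend(chunk))
print(len(paths), len(seen))  # 4 chunks, 100 rows processed
```

With multiple GPUs, each worker can consume a disjoint subset of the chunk files in parallel, which matches the multi-GPU parallel training mentioned above.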
In 2018, using only neural networks, Guo Da placed 7th; GBDT models achieved higher accuracy but required far longer training times and massive memory (e.g., 256 GB of RAM to train GBDT on the full data). Thanks to streaming, neural networks can handle thousands of feature dimensions, outperforming GBDT when feature richness is crucial.
Conclusion: Success in data-driven competitions ultimately depends on strong feature engineering and a balanced team—one member proficient in neural networks and another in GBDT—to leverage the strengths of both model families.
Participants are encouraged to apply these techniques, experiment with novel features, and register for the upcoming competition.
Tencent Advertising Technology
Official hub of Tencent Advertising Technology, sharing the team's latest cutting-edge achievements and advertising technology applications.