Artificial Intelligence 9 min read

Practical Advice for Feature Engineering and Model Selection in a Social Advertising Algorithm Contest

The authors share their first‑time competition experience, detailing common pitfalls in feature engineering, data leakage, and model choice, and provide concrete suggestions for beginners and mid‑level teams to improve performance in a social advertising algorithm contest.

Tencent Advertising Technology

We are a group of first‑time participants in a social advertising algorithm competition. During the early weeks we struggled with feature engineering, often producing random‑looking features and running into data leakage that made our offline scores look strong while our online results collapsed; nevertheless we managed to reach the top ten and would like to share our lessons.

Advice for teams still near the baseline: Many newcomers rely only on raw data, simple statistics, and one‑hot encoding, which caps performance. After experimenting with various models such as FFM, we recommend XGBoost as a powerful tool: it can filter features, resists noisy features better than LR, and exposes feature importance for pruning. Improving scores is rarely about switching models; understanding a model's principles and the meaning of its parameters is essential.

Feature engineering should go beyond simple one‑hot encoding or single‑feature statistics. For high‑dimensional features, compute basic statistics; for low‑dimensional ones, one‑hot encoding is fine, but both miss feature interactions. Creating combined statistics, e.g., joint counts or conversion rates of positionid and connectiontype, can yield lifts on the order of a thousandth (0.001) in score.
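A minimal pandas sketch of such combined statistics, on toy data that reuses the contest's positionid, connectiontype, and label column names (the values are made up):

```python
import pandas as pd

# Toy frame mimicking a few train.csv columns.
df = pd.DataFrame({
    "positionid":     [1, 1, 1, 2, 2, 2, 2, 3],
    "connectiontype": [0, 0, 1, 0, 1, 1, 1, 0],
    "label":          [1, 0, 0, 0, 1, 1, 0, 1],
})

key = ["positionid", "connectiontype"]
# Joint count and joint conversion rate over the feature pair,
# broadcast back to every row of the group.
df["pos_conn_count"] = df.groupby(key)["label"].transform("size")
df["pos_conn_cvr"] = df.groupby(key)["label"].transform("mean")
```

The same pattern extends to any pair (or triple) of categorical columns; the combinatorial explosion is usually controlled by keeping only pairs whose XGBoost importance survives.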

For the user_app_installed and user_app_actions tables, generate simple historical counts. Because userid has high cardinality, aggregating installation counts per user or per app, and joining the results with user.csv and app_category.csv, can provide additional useful features, typically adding a lift on the order of 0.0005.
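The per-user and per-app counts described above might be built as follows; the frames are toy stand-ins for user_app_installed and train.csv, and the exact column names are assumptions:

```python
import pandas as pd

# Hypothetical mini versions of the contest tables.
installed = pd.DataFrame({
    "userid": [10, 10, 10, 20, 20, 30],
    "appid":  [1, 2, 3, 1, 3, 3],
})
train = pd.DataFrame({"userid": [10, 20, 30, 40]})

# Per-user installation count; users with no install record get 0.
user_install_cnt = (
    installed.groupby("userid").size().rename("user_install_cnt").reset_index()
)
train = train.merge(user_install_cnt, on="userid", how="left")
train["user_install_cnt"] = train["user_install_cnt"].fillna(0).astype(int)

# Per-app popularity (how many users installed each app), joinable the same way.
app_install_cnt = (
    installed.groupby("appid").size().rename("app_install_cnt").reset_index()
)
```

Joining app_install_cnt against app_category.csv would then let the counts be rolled up to the category level as well.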

Beware of data leakage. A practical safeguard is to split the training data into time windows (e.g., first week for statistics, second week for model training) to prevent leakage from inflating offline metrics.
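One way to sketch this window split in pandas; a simplified integer day column stands in for the contest's actual time encoding, and all values are made up:

```python
import pandas as pd

train = pd.DataFrame({
    "day":        [1, 2, 3, 8, 9, 10],
    "positionid": [1, 1, 2, 1, 2, 2],
    "label":      [1, 0, 1, 0, 1, 0],
})

# Statistics are computed ONLY on the first window (days 1-7)...
stat_window = train[train["day"] <= 7]
fit_window = train[train["day"] > 7].copy()

pos_cvr = (
    stat_window.groupby("positionid")["label"].mean().rename("pos_cvr").reset_index()
)

# ...then joined onto the later window used for model fitting, so no row
# ever contributes to a statistic that it is also trained on.
fit_window = fit_window.merge(pos_cvr, on="positionid", how="left")
```

The same split discipline applies at prediction time: statistics fed to the test set should come only from data that precedes it.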

Advice for mid‑range teams: Many have exhausted basic statistical features and see diminishing returns. Learning from winning teams’ shares can reveal new tricks. Duplicate rows in train.csv should be handled carefully; instead of removing them, add features that mark repetition order, allowing the model to learn patterns automatically.
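The repetition-order markers can be generated with a groupby/cumcount pattern; a sketch on toy data, where the grouping key and column names are hypothetical:

```python
import pandas as pd

# Rows duplicated on the feature key (label excluded); instead of dropping
# duplicates, mark each row's position within its duplicate group.
df = pd.DataFrame({
    "userid":   [10, 10, 10, 20, 20],
    "creative": [5, 5, 5, 7, 7],
})

key = ["userid", "creative"]
df["dup_rank"] = df.groupby(key).cumcount() + 1           # 1st, 2nd, 3rd occurrence
df["dup_total"] = df.groupby(key)["userid"].transform("size")
df["dup_first"] = (df["dup_rank"] == 1).astype(int)
df["dup_last"] = (df["dup_rank"] == df["dup_total"]).astype(int)
```

The model can then learn, for example, whether conversions concentrate on the first or last occurrence of a repeated row, rather than having that signal deleted by deduplication.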

Regarding “tricks”: features that look very strong offline but blow up online should not be discarded outright; think instead about how to reconstruct such features for the test set, possibly by extending the duplicate‑handling ideas above.

In closing, we feel a bit embarrassed to share as we are not the top team, but we hope to receive feedback and improve together. We wish everyone success in the competition.

The organizers commend the sharing spirit, emphasizing that the goal of the contest is knowledge exchange, mutual learning, and enjoyment of the competition process, and they encourage participants to like and comment on helpful posts.

For more details, visit the official contest website http://algo.tpai.qq.com and follow the official WeChat account TSA-Contest for updates and gifts.

Written by

Tencent Advertising Technology

Official hub of Tencent Advertising Technology, sharing the team's latest cutting-edge achievements and advertising technology applications.
