Ctrip's Automated Iterative Anti‑Fraud Modeling Framework for Payment Risk
The article describes Ctrip's payment fraud risk characteristics, a comprehensive automated iterative anti‑fraud model framework—including variable system, GAN‑augmented sample generation, RNN behavior encoding, and tree‑based classifiers—and demonstrates how this approach restores recall performance compared with traditional static models.
Payment fraud risk, caused by leaked card or account information, threatens both users and Ctrip's platform; the financial risk control team must accurately identify and block such transactions without hindering legitimate travel.
The fraud scenario exhibits three key traits: high adversarial nature, complex user‑behavior mimicry, and a scarcity of labeled bad samples.
To combat these challenges, Ctrip built an automated iterative anti‑fraud model system that speeds up model updates, reduces manual engineering effort, and employs Generative Adversarial Networks (GANs) to synthesize additional fraud samples, enabling a "see‑and‑counter" capability.
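To make the GAN idea concrete, here is a minimal numpy sketch of the adversarial loop: a small generator learns to mimic a toy 2-D "fraud feature" distribution while a logistic-regression discriminator learns to separate real from synthetic samples. Everything here (the Gaussian toy data, dimensions, learning rate) is an illustrative assumption, not Ctrip's production setup, which would use a full deep-learning framework and real minority-class feature vectors.

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

# Toy "fraud" feature distribution: a 2-D Gaussian standing in for
# real minority-class feature vectors.
def real_batch(n):
    return rng.normal(loc=[2.0, -1.0], scale=0.5, size=(n, 2))

B, K, M, D, LR = 64, 4, 16, 2, 0.05                  # batch, latent, hidden, data dims
W1 = rng.normal(0, 0.3, (K, M)); b1 = np.zeros(M)    # generator layer 1
W2 = rng.normal(0, 0.3, (M, D)); b2 = np.zeros(D)    # generator layer 2
w  = rng.normal(0, 0.3, D);      b  = 0.0            # discriminator (logistic)

def generate(z):
    h = np.tanh(z @ W1 + b1)
    return h, h @ W2 + b2

for step in range(2000):
    # --- discriminator step: push D(real) toward 1, D(fake) toward 0 ---
    xr = real_batch(B)
    _, xf = generate(rng.normal(size=(B, K)))
    pr, pf = sigmoid(xr @ w + b), sigmoid(xf @ w + b)
    ds_r, ds_f = -(1 - pr), pf                       # dLoss/dlogit per sample
    gw = (ds_r[:, None] * xr + ds_f[:, None] * xf).mean(0)
    gb = (ds_r + ds_f).mean()
    w -= LR * gw; b -= LR * gb

    # --- generator step: non-saturating loss -log D(G(z)) ---
    z = rng.normal(size=(B, K))
    h, xf = generate(z)
    pf = sigmoid(xf @ w + b)
    dxf = -(1 - pf)[:, None] * w[None, :]            # dLoss/dx_fake
    gW2 = h.T @ dxf / B; gb2 = dxf.mean(0)
    da = (dxf @ W2.T) * (1 - h ** 2)                 # back through tanh
    gW1 = z.T @ da / B;  gb1 = da.mean(0)
    W2 -= LR * gW2; b2 -= LR * gb2; W1 -= LR * gW1; b1 -= LR * gb1

# Sample synthetic fraud cases to mix into the training set.
_, synthetic = generate(rng.normal(size=(500, K)))
print(synthetic.mean(0))  # mean of the synthetic feature vectors
```

In production the synthetic rows would be appended to the scarce labeled fraud samples before the main model is retrained, which is how the augmentation compensates for the label scarcity noted above.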
The risk‑variable system draws from account, payment, travel, finance, and IP‑location data, combining real‑time computed variables with offline T+1 cleaned variables to form a rich feature pool.
The iterative framework consists of nine stages:
1. Trigger conditions: time-based or performance-driven retraining triggers.
2. Variable library: recent samples and their candidate variables.
3. Variable processing: PSI stability checks, missing-value/abnormal-value filling, and one-hot encoding of categorical features.
4. Algorithm-derived variables: features produced by deep learning models.
5. Sample augmentation: GAN-generated synthetic fraud cases.
6. Main model: typically tree-based classifiers such as Random Forest, XGBoost, or LightGBM.
7. Deployment: outputs PMML models, feature-engineering code, and derived-variable methods.
8. Threshold setting: based on short-term production performance.
9. Monitoring: variable drift, model PSI, and business-effect metrics.
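The PSI stability check in the variable-processing stage can be sketched as follows; the bin count, thresholds, and toy data are illustrative assumptions, not Ctrip's configuration.

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline (training-time)
    distribution and a recent (production) distribution of one variable."""
    # Bin edges from the baseline's quantiles, so each bucket holds ~equal mass.
    edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch out-of-range production values
    p = np.histogram(expected, bins=edges)[0] / len(expected)
    q = np.histogram(actual, bins=edges)[0] / len(actual)
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 20_000)
stable   = rng.normal(0.0, 1.0, 20_000)
shifted  = rng.normal(1.0, 1.0, 20_000)   # population drift of one sigma

print(psi(baseline, stable))    # near 0: the variable passes the check
print(psi(baseline, shifted))   # large: the variable would be flagged as unstable
```

A common rule of thumb (an assumption here, not stated in the article) is to treat PSI below 0.1 as stable and above 0.25 as drifted enough to drop or rebuild the variable.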
For sequential user behavior, an RNN is trained on UBT action and pageview data; its hidden‑layer outputs become additional features for the main model, capturing order‑sensitive patterns that traditional aggregated features miss.
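A minimal numpy sketch of this encoding idea: a vanilla RNN rolls over a user's action sequence, and its final hidden state becomes a fixed-length, order-sensitive feature vector for the main model. The toy action vocabulary, dimensions, and random weights are illustrative assumptions; in production the weights would come from the trained network.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy action vocabulary standing in for UBT action / pageview events.
VOCAB = ["search", "view_hotel", "add_card", "edit_payee", "pay"]
V, H = len(VOCAB), 8                     # vocab size, hidden size

# Random weights for illustration; a trained RNN supplies the real ones.
Wxh = rng.normal(0, 0.3, (V, H))
Whh = rng.normal(0, 0.3, (H, H))
bh  = np.zeros(H)

def encode(actions):
    """Vanilla RNN: h_t = tanh(x_t Wxh + h_{t-1} Whh + b). The final
    hidden state summarizes the whole session, order included."""
    h = np.zeros(H)
    for a in actions:
        x = np.eye(V)[VOCAB.index(a)]    # one-hot encode the action
        h = np.tanh(x @ Wxh + h @ Whh + bh)
    return h

# Same actions in a different order give a different encoding,
# unlike simple count-based aggregated features.
ha = encode(["search", "view_hotel", "pay"])
hb = encode(["pay", "view_hotel", "search"])
print(np.allclose(ha, hb))  # False: order matters

# The hidden vector is concatenated with aggregated risk variables
# (hypothetical examples: amount, card age) before the tree-based main model.
features = np.concatenate([ha, [3.0, 1.0]])
```

This is exactly the property the article highlights: count-based aggregates cannot distinguish "pay then edit payee" from "edit payee then pay", while the hidden state can.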
Empirical results show that the traditional "Tianyan‑I" model’s recall fell from >12% to ~7.9% across OOT datasets, while the automated iterative "Tianyan‑II" framework restored recall to ~11.5% at 80% precision, confirming the benefit of rapid model refresh and synthetic sample augmentation.
The authors conclude that, despite automation, human oversight remains essential for new variable configuration and detailed fraud case analysis to maintain model controllability while achieving high effectiveness.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.