Financial Big Data Risk Control Models: Techniques, Applications, and COVID‑19 Challenges
This article presents a comprehensive overview of financial big‑data risk control models at Du Xiaoman, covering traditional scoring cards, AI‑driven time‑series and text processing, graph‑based networks, model interpretability, probability calibration, stability analysis, and the specific challenges introduced by the COVID‑19 pandemic.
Guest: Yan Cheng, Head of Risk Modeling at Du Xiaoman Financial
Editor: Huang Leping
Introduction
Financial AI is a key avenue for the transformation of traditional industries. This session focuses on the technical methods and practical issues of big‑data risk control models at Du Xiaoman, and discusses how those models evolved under COVID‑19.
1. FinTech in Risk Management
FinTech in risk management consists of two parts:
Traditional financial scoring cards: Application Scorecard (A‑card), Behavior Scorecard (B‑card), Collection Scorecard (C‑card).
Information technology capabilities: AI (algorithmic power), Big Data (digital behavior storage), Cloud (resource sharing).
These technologies enhance the effectiveness of traditional scoring‑card models.
2. Du Xiaoman Credit Risk
Du Xiaoman has accumulated extensive data and modeling experience. Core risk identification relies on three layers:
Base user profile (age, gender, education, income, assets, credit history).
Behavioral demand patterns (recent financial actions correlated with past behavior).
Social activity networks (detecting fraud rings and peer influence).
Combining these layers builds a discriminative risk model.
3. Time‑Series Processing: Pre‑Loan
When a credit application is received, a credit report is pulled; analyzing the temporal sequence of user actions in it (e.g., loan queries, disbursements) reveals cash‑flow needs. A recurrent neural network (an LSTM) ingests a sequence of items, each composed of a timestamp, an action type, and associated features, learning richer representations than hand‑crafted aggregates and improving KS by roughly 2 points.
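The event‑sequence idea above can be sketched as follows. This is a minimal numpy illustration, not Du Xiaoman's actual architecture: the event fields (`days_ago`, `action`, `amount`) and the single hand‑rolled LSTM cell are hypothetical stand‑ins for the production feature schema and trained network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step; W maps [h; x] to the four stacked gate pre-activations."""
    z = W @ np.concatenate([h, x]) + b
    H = h.shape[0]
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c = f * c + i * g          # update cell state
    return o * np.tanh(c), c   # new hidden state, new cell state

# Hypothetical credit-report events: (days_ago, action_type, normalized amount).
# action_type: 0 = credit query, 1 = loan disbursement.
events = [(90, 0, 0.0), (60, 1, 0.4), (5, 0, 0.0)]
N_ACTIONS, H = 2, 8
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * H, H + 2 + N_ACTIONS))  # untrained weights
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for days_ago, action, amount in events:
    onehot = np.eye(N_ACTIONS)[action]
    x = np.concatenate([[days_ago / 365.0, amount], onehot])
    h, c = lstm_step(x, h, c, W, b)

# h is the sequence representation that a risk head would consume.
print(h.shape)  # (8,)
```

In practice the weights are trained end‑to‑end with the risk label, and the final hidden state replaces (or augments) manually aggregated credit‑report statistics.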
4. Time‑Series Processing: In‑Loan
In‑loan behavior feeds B‑card modeling. For each transaction slice, features such as total limit, remaining principal, action type, amount, days to next repayment, etc., are generated and fed into RNNs, significantly boosting B‑card performance.
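To make the slice‑level features concrete, here is a small sketch of turning in‑loan transaction slices into the numeric vectors an RNN would consume. The field names and the borrower records are hypothetical illustrations of the feature types listed above, not the production schema.

```python
from datetime import date

# Hypothetical in-loan transaction slices for one borrower.
slices = [
    {"date": date(2020, 3, 1), "total_limit": 50000, "remaining_principal": 42000,
     "action": "repay", "amount": 2000, "next_due": date(2020, 4, 1)},
    {"date": date(2020, 3, 20), "total_limit": 50000, "remaining_principal": 40000,
     "action": "draw", "amount": 5000, "next_due": date(2020, 4, 1)},
]
ACTIONS = {"repay": 0, "draw": 1}

def slice_features(s):
    """Turn one transaction slice into the numeric vector fed to the RNN."""
    return [
        s["remaining_principal"] / s["total_limit"],  # utilization
        s["amount"] / s["total_limit"],               # normalized amount
        float(ACTIONS[s["action"]]),                  # action type id
        (s["next_due"] - s["date"]).days,             # days to next repayment
    ]

sequence = [slice_features(s) for s in slices]
print(sequence[0])  # [0.84, 0.04, 0.0, 31]
```

Each borrower thus becomes a variable‑length sequence of such vectors, ordered by slice date, which is exactly the input shape a B‑card RNN expects.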
5. Text Data Processing
Unstructured text from internet behavior is handled via an attention‑based framework that scores each information unit independently of order, allowing flexible integration of new data and improving model robustness.
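The order‑independence property described above can be demonstrated with a minimal attention‑pooling sketch. The embeddings and scoring vector here are random placeholders; the point is only the mechanism: each information unit gets its own relevance score, and the softmax‑weighted sum is invariant to the order in which units arrive.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(units, w_score):
    """Order-independent pooling: score each unit, weight by softmax, sum."""
    scores = units @ w_score          # one relevance score per unit
    weights = softmax(scores)
    return weights @ units            # weighted average representation

rng = np.random.default_rng(0)
units = rng.normal(size=(5, 16))      # 5 hypothetical text-unit embeddings
w = rng.normal(size=16)

pooled = attention_pool(units, w)
shuffled = attention_pool(units[[3, 1, 4, 0, 2]], w)
print(np.allclose(pooled, shuffled))  # True: the result ignores unit order
```

Because the pooled representation does not depend on position, a newly available text source can be appended as extra units without retraining the sequence layout, which is the flexibility the framework is after.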
6. Graph Networks
Graph‑based methods construct dense neighbor networks (1‑, 2‑, and 3‑hop) and apply graph convolution over user features and neighbor information, followed by supervised learning; combined with the other models, this further enhances risk identification.
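A minimal sketch of the graph‑convolution step: each user's representation is the (normalized) average of its own features and its neighbors', projected through a weight matrix; stacking a second layer propagates 2‑hop information. The toy adjacency matrix, features, and weights below are illustrative, not the production graph.

```python
import numpy as np

def gcn_layer(features, adj, W):
    """One graph-convolution layer: average self + neighbor features, project, ReLU."""
    a = adj + np.eye(adj.shape[0])          # add self-loops
    a = a / a.sum(axis=1, keepdims=True)    # row-normalize: neighborhood average
    return np.maximum(a @ features @ W, 0.0)

# 4 users; an edge might mean a shared device or frequent transfers (hypothetical).
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))     # per-user base features
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(4, 4))

H1 = gcn_layer(X, adj, W1)      # mixes 1-hop neighbor information
H2 = gcn_layer(H1, adj, W2)     # stacking layers reaches 2-hop neighbors
print(H2.shape)  # (4, 4)
```

The final node embeddings are then fed, together with labels, into a supervised risk classifier; fraud rings show up because ring members share suspicious neighborhoods.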
7. Application‑Layer Issues
Model Interpretability
An interpretable risk model typically has:
Simple functional form (e.g., logistic regression).
Strong correlation between input features X and prediction Y.
A limited number of variables (≤20).
Complex models (e.g., XGBoost) are decomposed into sub‑models; each sub‑model’s score is combined via logistic regression or a simple decision tree, preserving interpretability and easing monitoring.
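The decomposition can be sketched as follows. The sub‑model scores and the logistic coefficients below are made‑up illustrations; the structural point is that the top layer stays a simple logistic regression with one weight per sub‑model, so each sub‑model can be monitored and explained in isolation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sub-model scores per applicant: each sub-model covers one
# feature family (e.g., profile, behavior, network) and may itself be complex.
sub_scores = np.array([
    [0.62, 0.40, 0.55],   # applicant A
    [0.20, 0.15, 0.30],   # applicant B
])

# Top layer: logistic regression over sub-model scores keeps the combination
# interpretable (one coefficient per sub-model). Values are illustrative.
weights = np.array([1.8, 2.1, 0.9])
bias = -2.0

prob_default = sigmoid(sub_scores @ weights + bias)
for name, p in zip("AB", prob_default):
    print(f"applicant {name}: P(default) = {p:.3f}")
```

A shallow decision tree over the same sub‑scores is the other combiner mentioned above; either way, a score shift can be traced to exactly one feature family.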
Probability Calibration
Calibration proceeds in steps:
Segment the model's predictions into bins.
Compute the logit of the true delinquency rate in each bin.
Compute the average logit of the predictions in each bin.
Fit a curve (linear or quadratic) mapping predicted logits to true logits.
Transform the calibrated probability into a credit score (e.g., FICO‑style).
The calibrated model is then independent of the training sample's bad rate.
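The steps above can be sketched in a few lines. The per‑segment rates are fabricated for illustration, and the score scaling (base score 600, 20 points per doubling of the odds, base odds 30:1) is an assumed convention, not Du Xiaoman's actual one.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

# Per-segment averages (hypothetical): model prediction vs. observed bad rate.
pred_bad_rate = np.array([0.02, 0.05, 0.12, 0.25])
true_bad_rate = np.array([0.015, 0.045, 0.13, 0.30])

# Fit logit(true) ~ a * logit(pred) + b  (linear calibration curve).
a, b = np.polyfit(logit(pred_bad_rate), logit(true_bad_rate), deg=1)

def calibrate(p):
    """Map a raw model probability to a calibrated delinquency probability."""
    return 1.0 / (1.0 + np.exp(-(a * logit(p) + b)))

def to_score(p, base=600.0, base_odds=30.0, pdo=20.0):
    """FICO-style scaling: `base` points at good:bad odds of `base_odds`,
    plus `pdo` points for each doubling of the odds (assumed parameters)."""
    factor = pdo / np.log(2)
    odds = (1 - p) / p
    return base + factor * np.log(odds / base_odds)

print(f"score at raw p=0.05: {to_score(calibrate(0.05)):.0f}")
```

Because the curve is fit against observed delinquency rates, the resulting score reads as a real probability regardless of how the training sample was balanced.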
Score Stability
Stability includes distribution stability (monthly score distribution), performance stability (monthly bad‑rate per score), and individual score volatility (sensitive to recent borrowing/repayment behavior).
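Distribution stability of the kind described above is commonly monitored with the Population Stability Index (PSI); the sketch below uses simulated score samples and a conventional 0.1 alert threshold, both of which are illustrative assumptions rather than details from the talk.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a recent score sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # cover out-of-range scores
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(600, 40, 10000)   # last month's scores (simulated)
stable = rng.normal(600, 40, 10000)     # same population
shifted = rng.normal(580, 40, 10000)    # population drift

print(psi(baseline, stable) < 0.1)      # True: distribution unchanged
print(psi(baseline, shifted) > 0.1)     # True: drift flagged
```

Performance stability (bad rate per score band over months) and individual volatility are tracked with analogous month‑over‑month comparisons.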
8. COVID‑19 Impact on Models
COVID‑19 serves as a stress test: while feature X remains unchanged, the associated risk Y rises, especially for multi‑loan variables. Challenges include capturing macro‑economic signals, deciding whether to include pandemic‑era samples in training, and adjusting strategies based on short‑ or long‑term pandemic effects.
Q&A
Q1: Which features reflect macro‑economic conditions under the pandemic? A: Re‑employment indices derived from location migration correlate strongly with user income.
Q2: How does high‑dimensional feature inclusion compare with separate sub‑model scoring? A: KS difference is ~0.5%; high‑dimensional models have more parameters, harder monitoring, and lower interpretability.
Thank you for attending.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.