Causal Inference from Observational Data and Quasi‑Experimental Methods: Theory, Challenges, and Tencent Case Studies
This article introduces the fundamentals of causal inference with observational data, explains confounding and collider structures, compares observational and experimental approaches, discusses challenges such as Simpson’s paradox, and presents Tencent’s quasi‑experimental applications including DID, regression discontinuity, and uplift modeling.
The presentation begins with an overview of causal inference, emphasizing that correlation does not imply causation and that confounding and collider structures can create spurious relationships in observational data.
It explains two classic bias structures: confounding (e.g., "wear shoes while sleeping" correlated with "next‑day headache" due to the hidden variable "drank alcohol last night") and collider (e.g., talent and beauty appearing inversely related in the entertainment industry because selection into the industry acts as a collider).
The speaker then discusses why experiments are needed to break these biases, describing how random assignment removes the influence of parent nodes in a causal graph, allowing the average treatment effect (ATE) to be identified.
A comparison between observational and experimental data is illustrated with the "shoes‑sleeping" example, showing how imbalance in confounders leads to incorrect causal conclusions in observational studies, while randomized experiments reveal the true null effect.
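The contrast between the observational and randomized views of the "shoes‑sleeping" example can be reproduced with a small simulation. This is a minimal sketch, not the speaker's code: the probabilities (30% drinkers, the shoe‑wearing and headache rates) are illustrative assumptions chosen so that shoes have no true effect, yet the hidden "drank alcohol" confounder creates a large spurious gap in the observational comparison that vanishes under random assignment.

```python
import random

random.seed(0)

def simulate(randomized: bool, n: int = 100_000) -> float:
    """Naive difference in headache rates between shoe-wearers and
    non-wearers, under a hidden 'drank alcohol' confounder."""
    treated_headaches = treated = 0
    control_headaches = control = 0
    for _ in range(n):
        drank = random.random() < 0.3              # hidden confounder
        if randomized:
            shoes = random.random() < 0.5          # coin-flip assignment
        else:
            # drinkers are far more likely to fall asleep with shoes on
            shoes = random.random() < (0.8 if drank else 0.1)
        # headaches depend only on drinking; shoes have zero true effect
        headache = random.random() < (0.7 if drank else 0.1)
        if shoes:
            treated += 1
            treated_headaches += headache
        else:
            control += 1
            control_headaches += headache
    return treated_headaches / treated - control_headaches / control

gap_obs = simulate(randomized=False)
gap_rct = simulate(randomized=True)
print(f"observational gap: {gap_obs:+.3f}")  # large spurious 'effect'
print(f"randomized gap:    {gap_rct:+.3f}")  # near zero
```

Randomization works here exactly as described in the talk: it cuts the arrow from the confounder into the treatment, so the naive difference in means recovers the true (null) ATE.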
The limitations of purely experimental approaches are acknowledged (ethical constraints, infeasibility, historical data), motivating the use of causal inference on observational data.
Three major challenges for observational causal inference are outlined: Simpson’s paradox (e.g., gender confounding in smoking‑lung‑cancer studies), unobserved confounders (industrialization, mood, genetics), and collider bias (e.g., conditioning on asthma when studying smoking).
An overall analytical framework is presented, prioritizing quasi‑experimental methods—Difference‑in‑Differences (DID), Instrumental Variables, and Regression Discontinuity—when their assumptions hold, and falling back to Propensity Score Matching (PSM) or confounded‑PSM otherwise.
Two concrete Tencent case studies are described:
DID analysis of extreme‑weather notifications shows a 1.4% causal lift in next‑day retention after accounting for parallel trends.
Regression discontinuity on novel‑reading time identifies a causal effect of completing the first chapter on user retention.
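The DID estimator behind the first case study reduces to a difference of two differences in group means. The numbers below are hypothetical, chosen only to echo the 1.4‑point lift reported in the talk; the real analysis also required checking the parallel‑trends assumption on pre‑period data.

```python
# hypothetical mean next-day retention, before/after the notification rollout
means = {
    ("treated", "pre"):  0.400,   # users who received weather notifications
    ("treated", "post"): 0.434,
    ("control", "pre"):  0.410,   # comparable users who did not
    ("control", "post"): 0.430,
}

def did(m: dict) -> float:
    """Difference-in-Differences: treated change minus control change.
    Under parallel trends, this removes the shared time trend."""
    return ((m[("treated", "post")] - m[("treated", "pre")])
            - (m[("control", "post")] - m[("control", "pre")]))

effect = did(means)
print(f"DID estimate: {effect:+.3f}")  # +0.014, i.e. a 1.4-point lift
```

The control group's pre/post change estimates the common time trend; subtracting it from the treated group's change isolates the causal lift.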
For "startup‑reset" problems (e.g., home‑page reset, splash‑screen ads), a three‑step analysis is proposed: short‑term impact via regression discontinuity, long‑term impact via PSM/confounder control, and user heterogeneity analysis using uplift modeling.
The short‑term effect shows a clear breakpoint around a 40‑minute interval between visits, indicating a causal drop in session length and search time when the reset occurs.
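A regression‑discontinuity estimate of that breakpoint can be sketched as a local comparison of mean session length just below and just above the 40‑minute cutoff. Everything here except the 40‑minute threshold is an illustrative assumption (the smooth trend, the −2‑minute drop, the 2‑minute bandwidth); the talk's actual estimator may differ.

```python
import random

random.seed(1)
CUTOFF = 40.0  # minutes since last visit at which the reset fires (from the talk)

# hypothetical users: session length trends smoothly in the revisit gap,
# with an assumed causal drop of 2 minutes once the startup reset triggers
users = []
for _ in range(50_000):
    gap = random.uniform(20, 60)               # minutes since last visit
    base = 12.0 - 0.05 * gap                   # smooth underlying trend
    drop = -2.0 if gap >= CUTOFF else 0.0      # assumed causal effect
    users.append((gap, base + drop + random.gauss(0, 3)))

def local_mean(lo: float, hi: float) -> float:
    ys = [y for x, y in users if lo <= x < hi]
    return sum(ys) / len(ys)

h = 2.0  # bandwidth in minutes
effect = local_mean(CUTOFF, CUTOFF + h) - local_mean(CUTOFF - h, CUTOFF)
print(f"estimated jump at the 40-minute cutoff: {effect:+.2f} minutes")
```

Because users just below and just above the cutoff are otherwise comparable, the jump in local means identifies the short‑term causal effect of the reset without a randomized experiment.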
Long‑term analysis reveals that naïve PSM can be misleading due to residual confounding; instead, a quasi‑experimental variable is constructed by comparing the frequency of visits falling into the 40‑60 minute window versus the 20‑40 minute window, creating an unbiased long‑term instrument.
Heterogeneity is addressed by building uplift models (CatBoost) on transformed outcomes (Y* and G*). The workflow is:
Step 1: Transform original outcome Y and treatment G into Y* and G*; fit CatBoost models.
Step 2: Extract the top‑15 important features for interpretation.
Step 3: Classify users into four quadrants based on uplift signs for two metrics (total session length, search time) and compare feature means.
Step 4: Perform one‑dimensional searches within each quadrant to obtain quantitative uplift values and confidence intervals.

Model comparisons using Gini scores show that the Transform‑Outcome + CatBoost pipeline achieves the highest uplift prediction accuracy (Gini ≈ 0.1387), outperforming single‑model baselines.
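The transformed‑outcome trick in Step 1 can be illustrated without the CatBoost fit. The simulation below is a minimal sketch under assumed parameters (50/50 assignment, a +0.2 uplift for an "active" segment and zero elsewhere): the transformed outcome Y* has the property that its conditional mean equals the uplift, which is why a standard regressor trained to predict Y* becomes an uplift model.

```python
import random

random.seed(2)
P = 0.5  # treatment propensity (50/50 random assignment assumed)

def transform_outcome(y: float, g: float, p: float = P) -> float:
    """Transformed outcome Y* = Y*(G - p) / (p*(1 - p));
    E[Y* | X] equals the treatment uplift at X."""
    return y * (g - p) / (p * (1 - p))

# hypothetical data: treatment lifts conversion by +0.2 for 'active'
# users only; base conversion is 0.3 for everyone
rows = []
for _ in range(200_000):
    active = random.random() < 0.5
    g = random.random() < P
    y = float(random.random() < 0.3 + (0.2 if (g and active) else 0.0))
    rows.append((active, transform_outcome(y, float(g))))

mean_active = (sum(v for a, v in rows if a)
               / sum(1 for a, _ in rows if a))
mean_inactive = (sum(v for a, v in rows if not a)
                 / sum(1 for a, _ in rows if not a))
print(f"estimated uplift, active users:   {mean_active:+.3f}")   # near +0.20
print(f"estimated uplift, inactive users: {mean_inactive:+.3f}")  # near 0
```

In the talk's pipeline, a CatBoost model is then trained to predict Y* from user features, so its predictions estimate per‑user uplift directly; the same transformation is applied for the second metric via G*.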
Final recommendations include applying the quasi‑experimental framework to other startup‑reset scenarios, customizing strategies based on user heterogeneity (e.g., disabling reset for high‑search‑activity users), and improving UI cues to mitigate negative user experience.
The talk concludes by inviting the audience to join DataFunTalk for further discussions on causal inference.