Privacy-Preserving Gradient Boosting Decision Trees via Multi-Party Computation and the Squirrel Framework
This article introduces a privacy-preserving gradient boosting decision tree (GBDT) solution built on multi-party computation. It covers the background, the training steps, the MPC tools used, and the Squirrel framework's workflow, then discusses performance challenges and experimental results demonstrating scalability to millions of samples.
Background: The work addresses the need to train decision tree models over vertically partitioned data, where banks and financial companies hold different features for the same users and must combine them without exposing raw features, so as to support loan-approval prediction while preserving user privacy.
Gradient Boosting Decision Trees (GBDT): GBDT is an ensemble learning method that iteratively adds weak decision trees to reduce prediction error. Training initializes the model with a constant, computes residuals (the negative gradients of the loss), fits a weak learner to those residuals, adds it to the ensemble with a learning rate, and repeats until convergence.
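The training loop above can be sketched in plaintext (no MPC) using depth-1 stumps as the weak learners and squared loss, where the negative gradient is simply the residual. The helper names (`fit_stump`, `gbdt_fit`, `gbdt_predict`) and all parameter choices are illustrative, not Squirrel's implementation:

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a depth-1 regression tree minimizing squared error on the residuals."""
    best = None
    for thr in np.unique(x):
        left, right = residual[x <= thr], residual[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    _, thr, lval, rval = best
    return lambda q: np.where(q <= thr, lval, rval)

def gbdt_fit(x, y, n_trees=50, lr=0.1):
    base = y.mean()                      # step 1: initialize with a constant model
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - pred              # step 2: residuals = negative gradient of squared loss
        tree = fit_stump(x, residual)    # step 3: fit a weak learner to the residuals
        trees.append(tree)
        pred += lr * tree(x)             # step 4: update the ensemble; repeat
    return base, trees

def gbdt_predict(x, base, trees, lr=0.1):
    return base + lr * sum(t(x) for t in trees)
```

In the two-party setting these same steps run over secret-shared gradients rather than plaintext residuals, which is where the MPC machinery below comes in.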
Secure Multi-Party Computation (MPC): To protect privacy, the solution employs semi-honest two-party computation: additive secret sharing handles the non-linear operations, while linearly homomorphic encryption handles the linear ones. This hybrid approach combines the strengths of both, favoring lattice-based schemes (e.g., BFV) over Paillier for efficiency and for compatibility with the secret-sharing modulus.
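The additive-secret-sharing half of the hybrid can be shown in a few lines. A minimal sketch, assuming a power-of-two ring `Z_{2^64}` (a common but here illustrative choice): each share alone is uniformly random, and linear operations cost nothing because each party just adds its local shares.

```python
import secrets

MOD = 2 ** 64  # shares live in the ring Z_{2^64}; modulus choice is illustrative

def share(x):
    """Split x into two additive shares; either share alone is uniformly random."""
    s0 = secrets.randbelow(MOD)
    s1 = (x - s0) % MOD
    return s0, s1

def reconstruct(s0, s1):
    return (s0 + s1) % MOD

# Addition is local: each party adds the shares it holds,
# and the resulting shares reconstruct to the sum of the secrets.
a0, a1 = share(42)
b0, b1 = share(100)
c0, c1 = (a0 + b0) % MOD, (a1 + b1) % MOD
assert reconstruct(c0, c1) == 142
```

A power-of-two share modulus is one reason a lattice-based scheme such as BFV pairs well with secret sharing, whereas Paillier's plaintext space is tied to an RSA-type modulus.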
Squirrel Framework Workflow: The framework proceeds through five steps: (1) Private Set Intersection (PSI) aligns the two parties' samples; (2) features are bucketized and encoded; (3) gradients are computed in secret-shared form across the parties; (4) encrypted vector multiplication accumulates the gradients; (5) the sigmoid is approximated efficiently with a Fourier-series-based method that requires only a single round of multiplication.
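The source does not give the paper's actual sigmoid coefficients, but the idea behind step (5) can be illustrated in the clear: `sigmoid(x) - 1/2` is an odd function, so a truncated Fourier series on a bounded interval needs only sine terms. The interval half-width `B` and term count `K` below are arbitrary choices, not the paper's parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# sigmoid(x) - 1/2 is odd, so its Fourier series on [-B, B] has only sines.
B, K = 8.0, 16                      # illustrative interval and truncation order
xs = np.linspace(-B, B, 8001)
dx = xs[1] - xs[0]
# b_k = (1/B) * integral over [-B, B] of (sigmoid(x) - 1/2) * sin(k*pi*x/B) dx,
# computed here by a simple Riemann sum.
coeffs = [np.sum((sigmoid(xs) - 0.5) * np.sin(k * np.pi * xs / B)) * dx / B
          for k in range(1, K + 1)]

def sigmoid_fourier(x):
    """Evaluate the truncated sine-series approximation of sigmoid."""
    return 0.5 + sum(b * np.sin(k * np.pi * x / B)
                     for k, b in enumerate(coeffs, start=1))
```

The appeal inside MPC, as the summary describes it, is that this form can be evaluated with a single round of multiplication, unlike iterative or high-degree polynomial approximations.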
Performance Challenges: While MPC enables privacy, it incurs high computational cost and communication overhead, especially on large datasets. The proposed method mitigates this by converting RLWE ciphertexts to LWE ciphertexts for bucket summation and then recombining them to reduce communication, and by encrypting vectors in polynomial coefficients so that vector operations become polynomial multiplications.
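The coefficient-encoding idea behind the vector operations can be demonstrated in plaintext: packing a vector into polynomial coefficients lets one polynomial multiplication compute an inner product, which is why a single RLWE ciphertext product can accumulate a whole gradient vector into a bucket. The vectors below are arbitrary examples:

```python
import numpy as np

a = np.array([3, 1, 4, 1, 5])       # e.g. one party's gradient values
b = np.array([1, 0, 1, 1, 0])       # e.g. a 0/1 bucket-membership indicator
n = len(a)

# Multiplying a(x) = sum_i a_i x^i by the reversed polynomial of b places
# the inner product <a, b> in the coefficient of x^(n-1);
# np.convolve is exactly polynomial multiplication of coefficient arrays.
prod = np.convolve(a, b[::-1])
assert prod[n - 1] == np.dot(a, b)  # 3 + 4 + 1 = 8
```

Done under RLWE encryption, the same trick means one homomorphic multiplication replaces `n` homomorphic scalar operations, and the RLWE-to-LWE conversion then extracts just the needed coefficient instead of shipping the whole ciphertext.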
Experimental Results: Experiments demonstrate that the approach can train secure two-party GBDT models on datasets ranging from hundreds of thousands to tens of millions of samples within minutes per tree, with accuracy comparable to plaintext XGBoost and LightGBM while retaining the interpretability of tree models.
Conclusion: By integrating RLWE↔LWE conversion, indicator vectors for split tracking, and a Fourier-based sigmoid approximation, the system offers a scalable, privacy-preserving GBDT solution. The work has been accepted at USENIX Security 2023.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.