Privacy Computing in Big Data AI: Challenges, Solutions, and PPML Case Studies
This presentation reviews the background and current state of privacy computing and its relevance to big data and AI, discusses SGX and LibOS technologies, introduces the BigDL PPML solution for securing Spark/Flink workloads, and closes with real-world applications and the future outlook.
The talk, presented by Intel senior architect Dr. Gong Qiyuan and organized by DataFunTalk, focuses on the macro issue of privacy computing, especially its applications in the big data and AI domains.
Privacy computing has become a necessity due to growing user concerns and stringent regulations such as GDPR, CCPA, and China’s Personal Information Protection Law, which have led to substantial fines and stricter enforcement.
In this regulatory backdrop, data security is now mandatory, driving rapid development of technologies like differential privacy, trusted execution environments (TEE), homomorphic encryption, secure multi‑party computation, and federated learning, with market forecasts reaching billions of dollars.
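As a concrete illustration of one of these techniques (not taken from the talk), the Laplace mechanism of differential privacy perturbs a query result with noise calibrated to the query's sensitivity. The sketch below privatizes a simple count query; all function names are our own.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via the inverse-CDF transform."""
    u = random.random() - 0.5  # uniform in [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon: float) -> float:
    """Epsilon-differentially-private count: a count query has
    sensitivity 1, so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Example: how many users in this toy dataset are 40 or older?
ages = [23, 35, 41, 29, 52, 61, 38]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=1.0)
```

Smaller epsilon means stronger privacy but noisier answers; the same trade-off governs every differentially private release.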
Big data frameworks are widely deployed, but integrating privacy‑preserving techniques across the entire data pipeline remains challenging, especially when ensuring compatibility, performance, and scalability for AI workloads.
BigDL PPML was introduced to enable standard big‑data and AI solutions to run securely inside Intel SGX enclaves, providing end‑to‑end encryption, remote attestation, and confidentiality for both storage and network traffic.
While Apache Spark offers encrypted communication and storage, its computation remains plaintext; PPML leverages SGX and LibOS to protect Spark’s execution without rewriting core logic.
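The split described above is visible in Spark's own standard configuration properties: the settings below encrypt RPC traffic and local shuffle/spill files, yet executors still operate on plaintext data in memory, which is exactly the gap SGX enclaves close.

```properties
# spark-defaults.conf: protect data in transit and at rest on local disk
spark.authenticate              true   # shared-secret auth, prerequisite for network crypto
spark.network.crypto.enabled    true   # AES-based encryption of RPC traffic
spark.io.encryption.enabled     true   # encrypt shuffle and spill files
```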
SGX, Intel’s hardware‑based TEE, offers a small attack surface, minimal performance impact, and up to 1 TB of enclave memory, ensuring code integrity even if parts of the system are compromised.
LibOS solutions such as Occlum, Gramine, and SGX-LKL act as a compatibility layer between applications and SGX, handling system calls inside the enclave so that existing applications can be moved into enclaves with little or no modification.
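In Gramine, for instance, how an application runs inside the enclave is described by a manifest file. The fragment below is an illustrative sketch only (paths hypothetical, keys abbreviated), not a complete working manifest.

```toml
# Illustrative Gramine manifest sketch for running Python inside an enclave
libos.entrypoint = "/usr/bin/python3"
loader.log_level = "error"

sgx.debug = false
sgx.enclave_size = "8G"   # enclave memory budget
sgx.max_threads = 64

fs.mounts = [
  { path = "/usr/bin/python3", uri = "file:/usr/bin/python3" },
]
```

The point of the manifest is that the application binary itself stays unchanged; only the enclave's resources and trusted files are declared.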
Remote attestation can be implemented either via a full driver‑executor mutual attestation requiring Spark modifications, or through a centralized attestation server that only tweaks launch scripts, eliminating the need for code changes.
The one‑stop PPML solution automates many migration steps, allowing data scientists to work unchanged while cluster administrators handle SGX deployment, thereby reducing compatibility and migration costs.
Real‑world deployments include the Tianchi competition, where more than 4,000 teams used SGX‑protected Spark/Flink inference, and Korea Telecom's SGX‑secured real‑time inference platform, which achieved low latency with less than 5% performance loss.
Performance tests with the TPC‑DS benchmark showed end‑to‑end overhead under 2×, an acceptable cost even for I/O‑intensive workloads.
In summary, the privacy‑computing pain points in big data AI are addressed by combining LibOS for compatibility, SGX for secure execution, and Spark/Flink for data processing, with BigDL PPML tying these together as an integrated, one‑stop solution. Looking ahead, advances such as Intel TDX, confidential containers, micro‑kernel security, and accelerator support are expected to further improve TEE usability and performance.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.