Data Privacy and Differential Privacy Techniques in Machine Learning
This article reviews recent data privacy challenges in machine learning, explains the distinction between privacy and security, presents classic attacks and anonymization methods such as K‑anonymity, L‑diversity and T‑closeness, and details differential privacy techniques and their impact on model performance.
Guest Speaker: Guo Xiawei, Senior Researcher at Fourth Paradigm
Editor: Jiang Ruiyao
Source: Fourth Paradigm | Xianjian
Platform: DataFun
Introduction: In recent years, with the enactment of GDPR and several high‑profile data leakage incidents, data privacy protection has become a hot topic in industry applications. The security of machine‑learning algorithms that rely on data poses a huge challenge. This article introduces privacy‑related work in machine learning and Fourth Paradigm’s efforts to improve differential privacy algorithm performance.
Main topics:
Privacy problems and cases
Data‑centric privacy protection techniques: data anonymization
Privacy protection techniques for machine‑learning model training: differential privacy
1. Information Privacy
Privacy protection refers to the processes and rules that keep sensitive information from being exposed while authorized personnel use the data for tasks such as data analysis or model training.
Data privacy and data security are not equivalent. Data security usually means preventing illegal access to data, whereas data privacy means protecting legally accessed data from being reverse‑engineered to reveal sensitive information.
2. Information Privacy Issues
Many fields that use personal sensitive data face privacy problems. When machine‑learning techniques are applied to personal data, they may expose sensitive information and cause negative impacts.
Strictly speaking, absolute personal privacy protection is impossible.
1977: Statistician Tore Dalenius defined privacy as: an attacker cannot obtain any personal information from the private data that they did not know before accessing the data.
2006: Computer scientist Cynthia Dwork proved that such a strict definition is unattainable. For example, if an attacker knows that Alice is two inches taller than the average Lithuanian female, they can infer Alice’s exact height by obtaining the average height from a dataset, even if Alice is not in that dataset.
3. Harms of Privacy Leakage
Privacy information used for fraud and harassment (e.g., credit‑card theft, phone scams, identity theft)
User safety threatened by leaked information
Illegal organizations may manipulate users
User trust crisis
Violation of relevant laws
Generally, for non‑extreme cases we can still protect data privacy to a large extent during machine‑learning processes.
Case Study: In 1997, the Massachusetts Group Insurance Commission (GIC) released a medical dataset containing each patient's 5‑digit ZIP code, gender, and birthdate. The governor claimed the data was anonymized because direct identifiers had been removed. MIT graduate student Latanya Sweeney combined the release with a voter‑registration list purchased for $20 and re‑identified the governor's own medical records, later showing that the combination of 5‑digit ZIP, gender, and birthdate uniquely identifies about 87% of the U.S. population.
This shows that any dataset with enough information can lead to privacy leakage; simple anonymization is insufficient.
Data Anonymization Techniques
Data anonymization protects privacy at the data layer, often by hashing direct identifiers such as names. However, a simple hash can be undone by a dictionary attack, and the remaining attributes can still be linked to external data, so additional techniques are needed.
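To see why hashing alone is weak protection, consider this minimal sketch: an attacker who can guess candidate names simply re‑hashes them and matches the digests. The records and names below are invented for illustration.

```python
import hashlib

# A "de-identified" table where each name was replaced by its SHA-256 hash.
records = {
    hashlib.sha256(b"Alice").hexdigest(): {"zip": "47677", "diagnosis": "flu"},
}

# An attacker with a list of candidate names re-hashes each guess
# and checks for a match -- a classic dictionary attack.
candidates = ["Alice", "Bob", "Carol"]
for name in candidates:
    digest = hashlib.sha256(name.encode()).hexdigest()
    if digest in records:
        print(name, "->", records[digest])  # the hash is undone by brute force
```

Because the space of real names is small, unsalted hashing of identifiers offers essentially no privacy.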
1. Attribute Types (by privacy level)
Key Attribute (KA): A single column (e.g., ID number, name) that can directly identify an individual.
Quasi‑Identifier (QID): Cannot identify an individual alone, but can when combined with other attributes (e.g., birthday, ZIP code).
Sensitive Attribute (SA): Contains sensitive information such as disease or income.
Only protecting key attributes is insufficient.
2. Attack Methods and Defenses
① Linkage Attack
Linkage Attack: By obtaining external information, an attacker can link records in the target table to specific individuals.
② K‑Anonymity
K‑Anonymity ensures that for each record there are at least K‑1 other records sharing the same quasi‑identifier values, preventing unique linkage.
Example: After applying 3‑anonymity on ZIP, age, and sex, three records share the same masked values (ZIP: 476**, age: 2*, sex: *), satisfying the condition.
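The K‑anonymity condition can be checked mechanically. A small sketch in Python, using toy records and a hypothetical generalization rule (truncate the ZIP to three digits, bucket age by decade, suppress sex):

```python
from collections import Counter

# Toy records: (zip, age, sex, diagnosis); the last column is sensitive.
rows = [
    ("47677", 29, "F", "heart disease"),
    ("47602", 22, "F", "flu"),
    ("47678", 27, "M", "cancer"),
    ("47905", 43, "M", "flu"),
    ("47909", 45, "F", "heart disease"),
    ("47906", 47, "M", "cancer"),
]

def generalize(row):
    """Mask the quasi-identifiers: 3 ZIP digits, age by decade, sex suppressed."""
    zip_code, age, sex, diagnosis = row
    decade = age // 10 * 10
    return (zip_code[:3] + "**", f"{decade}-{decade + 9}", "*")

def is_k_anonymous(rows, k):
    """True if every group of identical masked quasi-identifiers has >= k records."""
    groups = Counter(generalize(r) for r in rows)
    return all(count >= k for count in groups.values())

print(is_k_anonymous(rows, 3))  # True: each masked group contains 3 records
```

In practice the generalization hierarchy is chosen to minimize information loss while still meeting the K threshold.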
③ Homogeneity Attack
Homogeneity Attack can break K‑anonymity when all records in a group share the same sensitive attribute, allowing inference of that attribute.
④ L‑Diversity
L‑Diversity requires that each equivalence class (records sharing the same quasi‑identifier) contain at least L distinct sensitive attribute values, reducing homogeneity risk.
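The L‑diversity check is equally simple to express: count distinct sensitive values per equivalence class. The classes below are toy data for illustration.

```python
def is_l_diverse(classes, l):
    """classes maps each masked quasi-identifier tuple to its sensitive values.
    L-diversity holds if every class has at least l distinct sensitive values."""
    return all(len(set(values)) >= l for values in classes.values())

classes = {
    ("476**", "2*"): ["flu", "cancer", "heart disease"],  # 3 distinct values
    ("479**", "4*"): ["flu", "flu", "flu"],               # homogeneous!
}
print(is_l_diverse(classes, 2))  # False: the second class leaks the diagnosis
```

The second class is 3‑anonymous yet fails 2‑diversity, which is exactly the homogeneity attack described above.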
⑤ Similarity Attack
Similarity Attack can break L‑diversity: if the L distinct sensitive values in an equivalence class are semantically similar (e.g., all salaries are low, or all diagnoses are stomach diseases), an attacker still learns sensitive information even though the values are formally diverse.
⑥ T‑Closeness
T‑Closeness strengthens L‑diversity by requiring that the distribution of the sensitive attribute within each equivalence class be close (within a threshold T, typically measured by Earth Mover's Distance) to its distribution in the whole table.
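A sketch of the closeness check, using total variation distance as a simple stand‑in for Earth Mover's Distance (for categorical attributes with uniform ground distance the two coincide); the data and the threshold `t = 0.3` are illustrative assumptions:

```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def tv_distance(p, q):
    """Total variation distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

overall = ["flu"] * 5 + ["cancer"] * 3 + ["heart disease"] * 2
class_a = ["flu", "cancer", "heart disease"]  # roughly follows the overall mix

t = 0.3  # hypothetical closeness threshold
print(tv_distance(distribution(class_a), distribution(overall)) <= t)  # True
```

A class whose sensitive values all cluster at one end of the distribution would fail this test even if it passes L‑diversity.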
Differential Privacy (DP) Techniques
Beyond improper anonymization, a trained model itself can leak private information about its training data. Differential privacy is therefore widely applied during the machine‑learning modeling process.
1. Privacy Risks of Models
Unprotected models may reveal sensitive information from the training data.
Membership Inference Attack
This attack determines whether a given sample belongs to the training set of a target model.
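One common heuristic behind membership inference is that models tend to be more confident on samples they were trained on. The sketch below illustrates only this thresholding idea; the threshold and probabilities are invented for illustration, not taken from any specific attack paper.

```python
def infer_membership(top_class_probability, threshold=0.95):
    """Guess 'member of the training set' when the target model's
    top-class probability exceeds a (hypothetical) threshold."""
    return top_class_probability > threshold

print(infer_membership(0.99))  # True: high confidence suggests a training sample
print(infer_membership(0.61))  # False: low confidence suggests an unseen sample
```

Real attacks typically train a separate "attack model" on shadow models' confidence vectors rather than using a fixed threshold.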
Model Inversion Attack
The attacker uses the model and auxiliary features to infer a sensitive attribute of a specific individual.
2. Differential Privacy in Model Training
For two datasets D₁ and D₂ differing in a single sample, a mechanism M is ε‑DP if for every set of outputs S, Pr[M(D₁) ∈ S] ≤ e^ε · Pr[M(D₂) ∈ S]. A smaller ε means any single sample has less influence on the model, i.e., stronger privacy.
Adding noise to the objective function, gradients, or model outputs can achieve DP. However, stronger privacy (smaller ε) typically degrades model performance.
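The canonical output‑noise approach is the Laplace mechanism: add Laplace(sensitivity/ε) noise to a numeric query. A minimal sketch for a count query (sensitivity 1, since one sample changes the count by at most 1); the seed and values are illustrative:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random.Random(0)):
    """Release true_value + Laplace(sensitivity / epsilon) noise, which
    satisfies epsilon-DP for a query with the given sensitivity."""
    scale = sensitivity / epsilon
    # Sample Laplace noise via the inverse CDF of a uniform draw.
    u = rng.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_value + noise

true_count = 42
noisy = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
```

Note the trade‑off stated above: halving ε doubles the noise scale, so stronger privacy directly costs accuracy.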
3. Feature‑Split Differential Privacy
Data is split into two halves; one half is further split by features into K subsets to train K sub‑models. These sub‑models are combined on the second half using DP, guaranteeing overall DP while allocating more privacy budget to important features.
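The budget‑allocation idea can be sketched abstractly. The code below is not Fourth Paradigm's actual algorithm: the number of groups, importance scores, combination weights, and sensitivity are all invented assumptions, shown only to illustrate splitting a total budget ε across feature groups so that more important groups receive more budget (and hence less noise).

```python
import math
import random

rng = random.Random(0)

def laplace(scale):
    """Draw one Laplace(scale) sample via the inverse-CDF method."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

epsilon = 1.0                       # total privacy budget
importance = [0.6, 0.3, 0.1]        # assumed importance of K = 3 feature groups
budget = [epsilon * w for w in importance]  # budgets sum back to epsilon

# Combination weights fit on the second data half (toy values), released
# with Laplace noise scaled by each group's share of the budget.
true_weights = [0.5, 0.3, 0.2]
sensitivity = 0.1                   # assumed sensitivity of each weight
noisy_weights = [w + laplace(sensitivity / b)
                 for w, b in zip(true_weights, budget)]
```

Here the least important group gets ε/10 of the budget and so receives the largest noise, concentrating accuracy where it matters most.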
4. DP in Transfer Learning
The feature‑split DP method can be applied to the transfer‑learning stage, protecting the transferred part with DP before merging.
Thank you for listening.