Data Privacy and Differential Privacy Techniques in Machine Learning
This article reviews recent data privacy challenges in machine learning, explains the distinction between privacy and security, presents classic attacks and anonymization methods such as K‑anonymity, L‑diversity and T‑closeness, and details differential privacy techniques and their impact on model performance.
Guest Speaker: Guo Xiawei, Senior Researcher at Fourth Paradigm
Editor: Jiang Ruiyao
Source: Fourth Paradigm | Xianjian
Platform: DataFun
Introduction: In recent years, with the enactment of GDPR and several high‑profile data leakage incidents, data privacy protection has become a hot topic in industry applications. The security of machine‑learning algorithms that rely on data poses a huge challenge. This article introduces privacy‑related work in machine learning and Fourth Paradigm’s efforts to improve differential privacy algorithm performance.
Main topics:
Privacy problems and cases
Data‑centric privacy protection techniques: data anonymization
Privacy protection techniques for machine‑learning model training: differential privacy
1. Information Privacy
Privacy protection refers to the processes and rules that keep sensitive information from being exposed while authorized personnel use the data for tasks such as data analysis or model training.
Data privacy and data security are not equivalent. Data security usually means preventing illegal access to data, whereas data privacy means protecting legally accessed data from being reverse‑engineered to reveal sensitive information.
2. Information Privacy Issues
Many fields that use personal sensitive data face privacy problems. When machine‑learning techniques are applied to personal data, they may expose sensitive information and cause negative impacts.
Strictly speaking, absolute personal privacy protection is impossible.
1977: Statistician Tore Dalenius defined privacy as: an attacker cannot obtain any personal information from the private data that they did not know before accessing the data.
2006: Computer scientist Cynthia Dwork proved that such a strict definition is unattainable. For example, if an attacker knows that Alice is two inches taller than the average Lithuanian female, they can infer Alice’s exact height by obtaining the average height from a dataset, even if Alice is not in that dataset.
3. Harms of Privacy Leakage
Privacy information used for fraud and harassment (e.g., credit‑card theft, phone scams, identity theft)
User safety threatened by leaked information
Illegal organizations may manipulate users
User trust crisis
Violation of relevant laws
Generally, for non‑extreme cases we can still protect data privacy to a large extent during machine‑learning processes.
Case Study: In 1997, the Massachusetts Group Insurance Commission (GIC) released a medical dataset containing each patient's 5‑digit ZIP code, gender, and birthdate. The governor claimed the data was anonymized because direct identifiers had been removed. MIT graduate student Latanya Sweeney combined the release with a voter‑registration list purchased for $20 and re‑identified the governor's own medical records, later showing that the combination of 5‑digit ZIP, gender, and birthdate uniquely identifies about 87% of the U.S. population.
This shows that any dataset with enough information can lead to privacy leakage; simple anonymization is insufficient.
Data Anonymization Techniques
Data anonymization protects privacy at the data layer, often by hashing direct identifiers such as names. However, a simple hash can be undone by a dictionary attack, and the remaining attributes can still be linked to external data, so additional techniques are needed.
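To see why hashing alone is weak protection, consider this minimal sketch: an attacker who can guess candidate names simply re‑hashes them and matches the digests. The records and names below are invented for illustration.

```python
import hashlib

# A "de-identified" table where each name was replaced by its SHA-256 hash.
records = {
    hashlib.sha256(b"Alice").hexdigest(): {"zip": "47677", "diagnosis": "flu"},
}

# An attacker with a list of candidate names re-hashes each guess
# and checks for a match -- a classic dictionary attack.
candidates = ["Alice", "Bob", "Carol"]
for name in candidates:
    digest = hashlib.sha256(name.encode()).hexdigest()
    if digest in records:
        print(name, "->", records[digest])  # the hash is undone by brute force
```

Because the space of real names is small, unsalted hashing of identifiers offers essentially no privacy.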
1. Attribute Types (by privacy level)
Key Attribute (KA): A single column (e.g., ID number, name) that can directly identify an individual.
Quasi‑Identifier (QID): Cannot identify an individual alone, but can when combined with other attributes (e.g., birthday, ZIP code).
Sensitive Attribute (SA): Contains sensitive information such as disease or income.
Only protecting key attributes is insufficient.
2. Attack Methods and Defenses
① Linkage Attack
Linkage Attack: By obtaining external information, an attacker can link records in the target table to specific individuals.
② K‑Anonymity
K‑Anonymity ensures that for each record there are at least K‑1 other records sharing the same quasi‑identifier values, preventing unique linkage.
Example: After applying 3‑anonymity on ZIP, age, and sex, three records share the same masked values (ZIP: 476**, age: 2*, sex: *), satisfying the condition.
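The K‑anonymity condition can be checked mechanically. A small sketch in Python, using toy records and a hypothetical generalization rule (truncate the ZIP to three digits, bucket age by decade, suppress sex):

```python
from collections import Counter

# Toy records: (zip, age, sex, diagnosis); the last column is sensitive.
rows = [
    ("47677", 29, "F", "heart disease"),
    ("47602", 22, "F", "flu"),
    ("47678", 27, "M", "cancer"),
    ("47905", 43, "M", "flu"),
    ("47909", 45, "F", "heart disease"),
    ("47906", 47, "M", "cancer"),
]

def generalize(row):
    """Mask the quasi-identifiers: 3 ZIP digits, age by decade, sex suppressed."""
    zip_code, age, sex, diagnosis = row
    decade = age // 10 * 10
    return (zip_code[:3] + "**", f"{decade}-{decade + 9}", "*")

def is_k_anonymous(rows, k):
    """True if every group of identical masked quasi-identifiers has >= k records."""
    groups = Counter(generalize(r) for r in rows)
    return all(count >= k for count in groups.values())

print(is_k_anonymous(rows, 3))  # True: each masked group contains 3 records
```

In practice the generalization hierarchy is chosen to minimize information loss while still meeting the K threshold.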
③ Homogeneity Attack
Homogeneity Attack can break K‑anonymity when all records in a group share the same sensitive attribute, allowing inference of that attribute.
④ L‑Diversity
L‑Diversity requires that each equivalence class (records sharing the same quasi‑identifier) contain at least L distinct sensitive attribute values, reducing homogeneity risk.
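The L‑diversity check is equally simple to express: count distinct sensitive values per equivalence class. The classes below are toy data for illustration.

```python
def is_l_diverse(classes, l):
    """classes maps each masked quasi-identifier tuple to its sensitive values.
    L-diversity holds if every class has at least l distinct sensitive values."""
    return all(len(set(values)) >= l for values in classes.values())

classes = {
    ("476**", "2*"): ["flu", "cancer", "heart disease"],  # 3 distinct values
    ("479**", "4*"): ["flu", "flu", "flu"],               # homogeneous!
}
print(is_l_diverse(classes, 2))  # False: the second class leaks the diagnosis
```

The second class is 3‑anonymous yet fails 2‑diversity, which is exactly the homogeneity attack described above.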
⑤ Similarity Attack
Similarity Attack can break L‑diversity: if the L distinct sensitive values in an equivalence class are semantically similar (e.g., all salaries are low, or all diagnoses are stomach diseases), an attacker still learns sensitive information even though the values are formally diverse.
⑥ T‑Closeness
T‑Closeness strengthens L‑diversity by requiring that the distribution of the sensitive attribute within each equivalence class be close (within a threshold T, typically measured by Earth Mover's Distance) to its distribution in the whole table.
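A sketch of the closeness check, using total variation distance as a simple stand‑in for Earth Mover's Distance (for categorical attributes with uniform ground distance the two coincide); the data and the threshold `t = 0.3` are illustrative assumptions:

```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def tv_distance(p, q):
    """Total variation distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

overall = ["flu"] * 5 + ["cancer"] * 3 + ["heart disease"] * 2
class_a = ["flu", "cancer", "heart disease"]  # roughly follows the overall mix

t = 0.3  # hypothetical closeness threshold
print(tv_distance(distribution(class_a), distribution(overall)) <= t)  # True
```

A class whose sensitive values all cluster at one end of the distribution would fail this test even if it passes L‑diversity.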
Differential Privacy (DP) Techniques
Beyond improper anonymization, a trained model itself can leak private information about its training data. Differential privacy is therefore widely applied during the machine‑learning modeling process.
1. Privacy Risks of Models
Unprotected models may reveal sensitive information from the training data.
Membership Inference Attack
This attack determines whether a given sample belongs to the training set of a target model.
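One common heuristic behind membership inference is that models tend to be more confident on samples they were trained on. The sketch below illustrates only this thresholding idea; the threshold and probabilities are invented for illustration, not taken from any specific attack paper.

```python
def infer_membership(top_class_probability, threshold=0.95):
    """Guess 'member of the training set' when the target model's
    top-class probability exceeds a (hypothetical) threshold."""
    return top_class_probability > threshold

print(infer_membership(0.99))  # True: high confidence suggests a training sample
print(infer_membership(0.61))  # False: low confidence suggests an unseen sample
```

Real attacks typically train a separate "attack model" on shadow models' confidence vectors rather than using a fixed threshold.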
Model Inversion Attack
The attacker uses the model and auxiliary features to infer a sensitive attribute of a specific individual.
2. Differential Privacy in Model Training
For two datasets D₁ and D₂ differing in a single sample, a mechanism M is ε‑DP if for every set of outputs S, Pr[M(D₁) ∈ S] ≤ e^ε · Pr[M(D₂) ∈ S]. A smaller ε means any single sample has less influence on the model, i.e., stronger privacy.
Adding noise to the objective function, gradients, or model outputs can achieve DP. However, stronger privacy (smaller ε) typically degrades model performance.
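The canonical output‑noise approach is the Laplace mechanism: add Laplace(sensitivity/ε) noise to a numeric query. A minimal sketch for a count query (sensitivity 1, since one sample changes the count by at most 1); the seed and values are illustrative:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random.Random(0)):
    """Release true_value + Laplace(sensitivity / epsilon) noise, which
    satisfies epsilon-DP for a query with the given sensitivity."""
    scale = sensitivity / epsilon
    # Sample Laplace noise via the inverse CDF of a uniform draw.
    u = rng.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_value + noise

true_count = 42
noisy = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
```

Note the trade‑off stated above: halving ε doubles the noise scale, so stronger privacy directly costs accuracy.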
3. Feature‑Split Differential Privacy
Data is split into two halves; one half is further split by features into K subsets to train K sub‑models. These sub‑models are combined on the second half using DP, guaranteeing overall DP while allocating more privacy budget to important features.
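The budget‑allocation idea can be sketched abstractly. The code below is not Fourth Paradigm's actual algorithm: the number of groups, importance scores, combination weights, and sensitivity are all invented assumptions, shown only to illustrate splitting a total budget ε across feature groups so that more important groups receive more budget (and hence less noise).

```python
import math
import random

rng = random.Random(0)

def laplace(scale):
    """Draw one Laplace(scale) sample via the inverse-CDF method."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

epsilon = 1.0                       # total privacy budget
importance = [0.6, 0.3, 0.1]        # assumed importance of K = 3 feature groups
budget = [epsilon * w for w in importance]  # budgets sum back to epsilon

# Combination weights fit on the second data half (toy values), released
# with Laplace noise scaled by each group's share of the budget.
true_weights = [0.5, 0.3, 0.2]
sensitivity = 0.1                   # assumed sensitivity of each weight
noisy_weights = [w + laplace(sensitivity / b)
                 for w, b in zip(true_weights, budget)]
```

Here the least important group gets ε/10 of the budget and so receives the largest noise, concentrating accuracy where it matters most.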
4. DP in Transfer Learning
The feature‑split DP method can be applied to the transfer‑learning stage, protecting the transferred part with DP before merging.
Thank you for listening.