
Confident Learning: Detecting and Cleaning Noisy Labels with cleanlab

This article introduces confident learning, a principled framework for identifying and correcting mislabeled data in machine‑learning datasets. It explains the framework's three‑step process (count, clean, re‑train), demonstrates the open‑source cleanlab library with code examples, and presents experimental results showing its effectiveness on benchmarks such as CIFAR‑10 and ImageNet.

DataFunTalk

01 Definition of Confident Learning

Confident Learning (CL) is a principled framework proposed by MIT and Google to identify label errors, characterize label noise, and enable noisy‑label learning. It estimates the joint distribution of noisy and true labels without requiring strong assumptions about noise.

Note: The identified "noisy" samples are based on uncertainty estimates and may not always be true errors.

Its major advantage is that it directly discovers mislabeled samples.

It requires no iterative rounds of data cleaning; the cleanlab Python package can find label errors in minutes, even on datasets as large as ImageNet.

It provides a theoretically sound estimate of the noisy‑label joint distribution.

It requires no hyper‑parameters; it relies only on out‑of‑sample cross‑validation predictions.

It does not assume uniformly random label noise.

It is model‑agnostic and works with any classifier that outputs predicted probabilities.

02 Confident Learning Open‑Source Tool: cleanlab

Install the library with:

pip install cleanlab

An example (cleanlab 1.x API) of obtaining the indices of likely label errors, sorted by normalized margin:

from cleanlab.pruning import get_noise_indices
ordered_label_errors = get_noise_indices(
    s=numpy_array_of_noisy_labels,
    psx=numpy_array_of_predicted_probabilities,
    sorted_index_method='normalized_margin'
)

Visual inspection on MNIST shows the tool can quickly surface mislabeled samples.
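The psx argument must contain out‑of‑sample predicted probabilities. A common way to produce them is scikit‑learn's cross_val_predict; the sketch below uses synthetic data, and the variable names X and s are placeholders rather than part of cleanlab's API:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy stand-in for a real dataset with (possibly noisy) labels s.
X, s = make_classification(n_samples=200, n_classes=2, random_state=0)

# Out-of-sample probabilities: each sample is predicted by a model
# trained on folds that never saw it, as confident learning requires.
psx = cross_val_predict(LogisticRegression(max_iter=1000), X, s,
                        cv=5, method='predict_proba')
```

The resulting psx has one row per sample and one column per class, and can be passed directly to get_noise_indices.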

03 The Three Steps of Confident Learning

1. Count: Estimate the Joint Distribution

Define the observed noisy label s and the latent true label y*. Using out‑of‑sample cross‑validation, obtain predicted class probabilities psx for each sample. Compute the count matrix C (the confident joint), then calibrate it so its total matches the number of annotated samples, yielding an estimate of the joint distribution Q of noisy and true labels.
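The count‑and‑calibrate step can be sketched in NumPy. This is a simplified reading of the paper's confident‑joint construction, not cleanlab's exact implementation, and estimate_joint is a hypothetical helper name:

```python
import numpy as np

def estimate_joint(s, psx):
    """Simplified confident-joint sketch.
    s   : (n,) observed noisy labels (each class must appear at least once)
    psx : (n, K) out-of-sample predicted probabilities
    """
    n, K = psx.shape
    # Per-class threshold: average self-confidence of samples labeled j.
    t = np.array([psx[s == j, j].mean() for j in range(K)])
    # Count matrix C[i, j]: samples labeled i whose confident prediction is j.
    C = np.zeros((K, K), dtype=int)
    for x in range(n):
        confident = np.where(psx[x] >= t)[0]
        if len(confident):
            j = confident[np.argmax(psx[x, confident])]
            C[s[x], j] += 1
    # Calibrate each row to match the observed label counts, then
    # normalize the total to 1, giving the estimated joint Q.
    row = C.sum(axis=1, keepdims=True)
    counts = np.bincount(s, minlength=K).reshape(-1, 1)
    Q = np.where(row > 0, C / np.maximum(row, 1) * counts, 0) / n
    return C, Q
```

Samples that clear no class threshold are simply not counted, which is why the row calibration step is needed before normalizing.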

2. Clean: Filter Out Noisy Samples

Five pruning methods are provided:

Method 1 – C_confusion: flag samples whose argmax predicted class disagrees with the given label.

Method 2 – Flag the samples counted in the off‑diagonal cells of the confident‑joint count matrix.

Method 3 – Prune by Class (PBC): for each class, flag the estimated number of mislabeled samples (derived from Q) with the lowest self‑confidence.

Method 4 – Prune by Noise Rate (PBNR): for each off‑diagonal cell (i, j), flag the samples with the largest margin psx[:, j] − psx[:, i].

Method 5 – C+NR: combine Method 3 and Method 4, flagging a sample only when both agree.
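Prune by Class can be sketched in a few lines of NumPy. This is an illustrative simplification, not cleanlab's implementation; the function name and the n_to_prune argument (the per‑class error counts that CL derives from Q) are hypothetical:

```python
import numpy as np

def prune_by_class(s, psx, n_to_prune):
    """Flag the lowest self-confidence samples in each class.
    s          : (n,) observed noisy labels
    psx        : (n, K) out-of-sample predicted probabilities
    n_to_prune : per-class count of samples to flag (from Q's off-diagonals)
    """
    flagged = np.zeros(len(s), dtype=bool)
    for i, k in enumerate(n_to_prune):
        idx = np.where(s == i)[0]
        if k > 0 and len(idx):
            # Ascending self-confidence: flag the k least confident.
            order = np.argsort(psx[idx, i])
            flagged[idx[order[:k]]] = True
    return flagged
```

PBNR works analogously but ranks by the margin between the predicted and given class instead of raw self‑confidence.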

Example code:

import cleanlab
baseline_cl_pbc = cleanlab.pruning.get_noise_indices(s, psx, prune_method='prune_by_class', n_jobs=1)
baseline_cl_pbnr = cleanlab.pruning.get_noise_indices(s, psx, prune_method='prune_by_noise_rate', n_jobs=1)
baseline_cl_both = cleanlab.pruning.get_noise_indices(s, psx, prune_method='both', n_jobs=1)

3. Re‑Training: Adjust Class Weights and Retrain

After removing the flagged samples, recompute class‑wise loss weights from the estimated joint distribution Q and retrain; robust schemes such as Co‑Teaching can also be applied at this stage. With cleanlab, the whole pipeline takes a few lines:

from cleanlab.classification import LearningWithNoisyLabels
from sklearn.linear_model import LogisticRegression
lnl = LearningWithNoisyLabels(clf=LogisticRegression())
lnl.fit(X_train_data, train_noisy_labels)
predicted_test_labels = lnl.predict(X_test)
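One simple re‑weighting scheme (a sketch of the idea, not cleanlab's exact internals; class_weights is a hypothetical helper) weights each class by the inverse of the estimated probability that a true member of that class kept its label:

```python
import numpy as np

def class_weights(Q):
    """Q: K x K calibrated joint of (noisy label, true label).
    Weight class i by 1 / p_hat(s = i | y* = i), so classes whose
    labels survive the noise less often count more in the loss."""
    p_true = Q.sum(axis=0)                          # marginal of true labels
    keep = np.diag(Q) / np.maximum(p_true, 1e-12)   # p(s = i | y* = i)
    return 1.0 / np.maximum(keep, 1e-12)
```

The resulting vector can be passed as per‑class weights to most classifiers (e.g. scikit‑learn's class_weight dictionaries).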

04 Experimental Results

On CIFAR‑10 with 40% label noise, confident learning improves accuracy by up to 34% over the previous SOTA (MentorNet). Visualizations show the estimated joint distribution closely matches the true distribution. On ImageNet, the method discovers real labeling errors, multi‑label issues, and ontological inconsistencies.

05 Summary

Confident learning provides a theoretically grounded way to estimate the noisy‑label distribution, detect mislabeled samples, and improve model performance without altering loss functions directly. The open‑source cleanlab package makes the entire pipeline accessible in a few lines of code.

06 References

Northcutt, C. G., Jiang, L., and Chuang, I. L. Confident Learning: Estimating Uncertainty in Dataset Labels.

Han, B., et al. Co‑teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels. NeurIPS 2018.

Original article: https://zhuanlan.zhihu.com/p/146557232

Tags: machine learning, data cleaning, noisy labels, cleanlab, confident learning, label noise
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
