
Open Source Data Quality Software – Curated List

This article presents a curated table of open‑source data‑quality tools. For each project it gives the purpose, programming language, documentation link, and popularity metrics, and it states the inclusion criteria used to select software across diverse data‑processing environments.

Architects Research Society

The table below lists available open‑source data‑quality software releases, covering various aspects of data‑quality assessment.

Inclusion criteria

Any open‑source release that is publicly accessible in at least one repository is eligible; for brevity, only one link is given when a repository contains many tools.

The library/framework does not have to focus solely on data quality, as functionality is often bundled with data cleaning or exploratory data analysis.

Data‑quality assessment is important in a wide range of environments/workflows (from validating Excel sheets to big‑data pipelines, offline/online, etc.), so the list includes diverse collections.

Star/issue/fork counts are used as a rough measure of maturity; use at your own risk.
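As a rough illustration of how these counts might be compared, the sketch below ranks three projects by star count. The figures are copied from the table in this article; the `Project` type and the choice to sort on stars alone are assumptions made for the example, not a recommended scoring method.

```python
from collections import namedtuple

# Hypothetical record type for this example only.
Project = namedtuple("Project", ["name", "stars", "issues", "forks"])

# Counts copied from three rows of the table in this article.
projects = [
    Project("awslabs/deequ", 1328, 90, 256),
    Project("pyeve/cerberus", 2246, 33, 202),
    Project("OpenRefine/OpenRefine", 7735, 595, 1376),
]

# Sort by stars, highest first; this is only a coarse popularity signal.
ranked = sorted(projects, key=lambda p: p.stars, reverse=True)

for p in ranked:
    print(f"{p.name}: {p.stars} stars, {p.issues} issues, {p.forks} forks")
```

The counts are a snapshot and change over time, so a ranking like this is best treated as a starting point for evaluation rather than a verdict.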

Open source data quality software

| Name | Description | Language | Online Docs | URL | Stars | Issues | Forks |
|---|---|---|---|---|---|---|---|
| awslabs/deequ | Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets | Scala | | github | 1328 | 90 | 256 |
| data-cleaning/validate | validate: Data cleaning for statistical purposes | R | docs | github | 236 | 21 | 18 |
| datacleaner/DataCleaner | DataCleaner Community Edition | Java | docs | github | 371 | 172 | 136 |
| daveoncode/pyvaru | pyvaru: Rule based data validation library for python | Python | docs | github | 14 | 1 | 3 |
| great-expectations/great_expectations | Great Expectations helps data teams eliminate pipeline debt, through data testing, documentation, and profiling | Python | docs | github | 3127 | 147 | 348 |
| OpenRefine/OpenRefine | OpenRefine is a tool for working with messy data | Java | docs | github | 7735 | 595 | 1376 |
| pandas-profiling/pandas-profiling | pandas-profiling generates profile reports from a pandas DataFrame | Python | docs | github | 6338 | 44 | 962 |
| pyeve/cerberus | cerberus is a lightweight, extensible data validation library for Python | Python | docs | github | 2246 | 33 | 202 |
| ResidentMario/missingno | missingno is a missing data visualization module for Python | Python | | github | 2540 | 15 | 334 |
| WeBankFinTech/Qualitis | Qualitis is a data quality management platform that supports quality verification, notification, and management for various datasources | Java | docs | github | 208 | 16 | 107 |
| whylabs/whylogs-python | whylogs-python is a Python implementation of whylogs | Python | docs | github | 191 | 10 | 7 |
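Several of the listed tools describe themselves in terms of rules or "unit tests for data". The sketch below illustrates that general idea in plain Python using only the standard library. It does not use or reproduce the API of any tool in the table; the `Check` class, the `run_checks` function, and the sample rows are hypothetical names and data invented for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass
class Check:
    """One named rule applied to every row (hypothetical helper, not a library API)."""
    name: str
    predicate: Callable[[Row], bool]


def run_checks(rows: List[Row], checks: List[Check]) -> Dict[str, List[int]]:
    """Return, for each check, the indices of the rows that violate it."""
    failures: Dict[str, List[int]] = {check.name: [] for check in checks}
    for index, row in enumerate(rows):
        for check in checks:
            if not check.predicate(row):
                failures[check.name].append(index)
    return failures


# Illustrative data: row 1 has a missing id, row 2 has a negative amount.
rows = [
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 5.5},
    {"id": 3, "amount": -2.0},
]

checks = [
    Check("id_is_present", lambda r: r.get("id") is not None),
    Check(
        "amount_is_non_negative",
        lambda r: r.get("amount") is not None and r["amount"] >= 0,
    ),
]

report = run_checks(rows, checks)
for name, bad_rows in report.items():
    status = "ok" if not bad_rows else f"failed on rows {bad_rows}"
    print(f"{name}: {status}")
```

Dedicated tools add what this sketch leaves out, such as scaling to large datasets, profiling, reporting, and documentation, which is why the choice of tool depends on the environment in which the data is processed.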

Source: http://jiagoushi.pro/open-source-data-quality-software

Compiled by Super Engineer, shared across the network.
