Open Source Data Quality Software – Curated List
This article presents a curated table of open‑source data‑quality tools. For each project it gives the purpose, programming language, documentation link, and popularity metrics, and it states the inclusion criteria used to select software across diverse data‑processing environments.
The table below lists available open‑source data‑quality software releases, covering various aspects of data‑quality assessment.
Inclusion criteria
Any open‑source release that is publicly accessible in at least one repository; for brevity, only one link is provided when a repository contains many tools.
The library/framework does not have to focus solely on data quality, as functionality is often bundled with data cleaning or exploratory data analysis.
Data‑quality assessment is important in a wide range of environments/workflows (from validating Excel sheets to big‑data pipelines, offline/online, etc.), so the list includes diverse collections.
Star/issue/fork counts are used as a rough measure of maturity; use at your own risk.
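As a rough illustration of the last criterion, the following Python sketch ranks the listed repositories by stars and computes issues per 100 stars. The counts are copied from the table in this article and are only a snapshot. The `Repo` type, the function names, and the ratio itself are assumptions made for this example, not an established maturity metric.

```python
from typing import List, NamedTuple


class Repo(NamedTuple):
    name: str
    stars: int
    issues: int
    forks: int


# Counts copied from the table in this article; they are a snapshot and will drift.
REPOS = [
    Repo("awslabs/deequ", 1328, 90, 256),
    Repo("data-cleaning/validate", 236, 21, 18),
    Repo("datacleaner/DataCleaner", 371, 172, 136),
    Repo("daveoncode/pyvaru", 14, 1, 3),
    Repo("great-expectations/great_expectations", 3127, 147, 348),
    Repo("OpenRefine/OpenRefine", 7735, 595, 1376),
    Repo("pandas-profiling/pandas-profiling", 6338, 44, 962),
    Repo("pyeve/cerberus", 2246, 33, 202),
    Repo("ResidentMario/missingno", 2540, 15, 334),
    Repo("WeBankFinTech/Qualitis", 208, 16, 107),
    Repo("whylabs/whylogs-python", 191, 10, 7),
]


def rank_by_stars(repos: List[Repo]) -> List[Repo]:
    """Return the repos sorted from most to fewest stars."""
    return sorted(repos, key=lambda r: r.stars, reverse=True)


def issues_per_100_stars(repo: Repo) -> float:
    """Hypothetical heuristic: listed issues per 100 stars, rounded to one decimal."""
    return round(100 * repo.issues / repo.stars, 1)


if __name__ == "__main__":
    for repo in rank_by_stars(REPOS):
        ratio = issues_per_100_stars(repo)
        print(f"{repo.name:40s} {repo.stars:5d} stars  {ratio:5.1f} issues per 100 stars")
```

A high ratio may point to an active user base as much as to an unstable project, so the numbers are best read alongside the documentation and release history.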
Open source data quality software
| Name | Description | Language | Online Docs | URL | Stars | Issues | Forks |
|---|---|---|---|---|---|---|---|
| awslabs/deequ | Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets | Scala | - | github | 1328 | 90 | 256 |
| data-cleaning/validate | validate: Data cleaning for statistical purposes | R | docs | github | 236 | 21 | 18 |
| datacleaner/DataCleaner | DataCleaner Community Edition | Java | docs | github | 371 | 172 | 136 |
| daveoncode/pyvaru | pyvaru: Rule based data validation library for python | Python | docs | github | 14 | 1 | 3 |
| great-expectations/great_expectations | Great Expectations helps data teams eliminate pipeline debt, through data testing, documentation, and profiling | Python | docs | github | 3127 | 147 | 348 |
| OpenRefine/OpenRefine | OpenRefine is a tool for working with messy data | Java | docs | github | 7735 | 595 | 1376 |
| pandas-profiling/pandas-profiling | pandas-profiling generates profile reports from a pandas DataFrame | Python | docs | github | 6338 | 44 | 962 |
| pyeve/cerberus | cerberus is a lightweight, extensible data validation library for Python | Python | docs | github | 2246 | 33 | 202 |
| ResidentMario/missingno | missingno is a missing data visualization module for Python | Python | - | github | 2540 | 15 | 334 |
| WeBankFinTech/Qualitis | Qualitis is a data quality management platform that supports quality verification, notification, and management for various datasources | Java | docs | github | 208 | 16 | 107 |
| whylabs/whylogs-python | whylogs-python is a Python implementation of whylogs | Python | docs | github | 191 | 10 | 7 |
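Several of the listed projects describe themselves in terms of rule-based validation or "unit tests for data". The following is a minimal plain-Python sketch of that general idea. It does not use or reproduce the API of any library in the table; the rule list, field names, and `check_rows` function are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
# Each rule pairs a human-readable description with a predicate over one row.
Rule = Tuple[str, Callable[[Row], bool]]


def has_id(row: Row) -> bool:
    return row.get("id") is not None


def has_valid_age(row: Row) -> bool:
    age = row.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(age, int) and not isinstance(age, bool) and age >= 0


def has_plausible_email(row: Row) -> bool:
    email = row.get("email")
    return isinstance(email, str) and "@" in email


# Hypothetical rules for an assumed record layout (id, age, email).
RULES: List[Rule] = [
    ("id is present", has_id),
    ("age is a non-negative integer", has_valid_age),
    ("email contains '@'", has_plausible_email),
]


def check_rows(rows: List[Row], rules: List[Rule] = RULES) -> List[Tuple[int, str]]:
    """Return (row_index, rule_description) for every failed check."""
    failures = []
    for index, row in enumerate(rows):
        for description, predicate in rules:
            if not predicate(row):
                failures.append((index, description))
    return failures


if __name__ == "__main__":
    sample = [
        {"id": 1, "age": 34, "email": "a@example.com"},
        {"id": None, "age": -2, "email": "not-an-email"},
    ]
    for index, description in check_rows(sample):
        print(f"row {index}: failed '{description}'")
```

The listed tools go well beyond this, for example by running checks on distributed datasets or producing profiling reports, so the sketch only conveys the basic pattern of declaring rules and collecting failures.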
Source: http://jiagoushi.pro/open-source-data-quality-software
Compiled by Super Engineer, shared across the network.
Architects Research Society