Efficient Data Profiling of Large Database Tables Using Python
This article presents a Python-based method for quickly profiling dozens of large database tables by generating and executing column‑wise SQL queries, avoiding full table reads, and exporting the results to an Excel file for easy analysis.
When you take on a new business task, you may need to understand the structure of dozens or hundreds of database tables, each containing millions of rows, and produce a summary quickly.
Reading entire tables into Python is inefficient for such volumes; a single table can take half an hour to load, making the process impractical. The solution is to avoid full table reads and instead query each table’s metadata and column statistics using grouped SQL statements generated via string formatting.
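To illustrate the strategy before walking through the full script, here is a minimal, self-contained sketch using Python's built-in sqlite3 as a stand-in for the MySQL connection (the table `t` and its data are hypothetical). A single grouped query computes a column's statistics inside the database, so only one small result row is transferred instead of the source rows:

```python
import sqlite3

# In-memory stand-in for the real database (hypothetical table and data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (age INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(22,), (38,), (None,), (38,)])

# One grouped query returns min, max, distinct non-null count, and null count;
# only this tiny result row leaves the database, not the table itself.
sql = """
SELECT MIN(age), MAX(age),
       SUM(CASE WHEN age IS NOT NULL THEN 1 END),
       SUM(CASE WHEN age IS NULL THEN num ELSE 0 END)
FROM (SELECT age, COUNT(1) AS num FROM t GROUP BY age) t1
"""
value_min, value_max, distinct_non_null, null_num = conn.execute(sql).fetchone()
print(value_min, value_max, distinct_non_null, null_num)  # 22 38 2 1
conn.close()
```

The same shape of query is what the script below generates for every column of every table.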
First, import the required packages:
import pymysql
import pandas as pd
import datetime as dt
start = dt.datetime.now()  # for measuring execution time

Define a reusable function that executes a given SQL statement and returns the result as a pandas DataFrame:
# Define data query function
def sql_inquiry(sql):
    conn = pymysql.connect(host="localhost", port=****, user='****',
                           passwd='****', db="test_titanic", charset="utf8")
    cur = conn.cursor()
    cur.execute(sql)
    result = cur.fetchall()
    cols = [a[0] for a in cur.description]  # get column names from the cursor metadata
    df = pd.DataFrame(result, columns=cols)
    conn.commit()
    cur.close()
    conn.close()
    return df

Specify the tables you want to profile:
table_name = ['test', 'train']

Generate a list of SQL statements that retrieve each table's column name, comment, type, and row count without loading the full data:
table_sql = []
for i in range(len(table_name)):
    table_sql.append('''
        SELECT t1.table_name AS table_name,
               t1.column_name AS column_name,
               t1.column_type AS column_type,
               t1.column_comment AS column_comment,
               t2.table_rows AS table_rows
        FROM information_schema.columns AS t1,
             information_schema.`tables` AS t2
        WHERE t1.table_schema = t2.table_schema
          AND t1.table_name = t2.table_name
          AND t1.table_name = '{}'
          AND t1.table_schema = 'test_titanic'
    '''.format(table_name[i]))

Execute each SQL statement; then, for every column, compute the minimum, maximum, non-null count, null count, and total row count, and calculate the null-rate percentage:
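Because table and column names cannot be passed as bound query parameters, the statements are assembled with string formatting; before running them against the server it is worth verifying that each name lands where you expect. A small sketch, using a shortened version of the same template:

```python
# Shortened version of the profiling template, for a quick sanity check
table_name = ['test', 'train']
template = ("SELECT t1.column_name FROM information_schema.columns AS t1 "
            "WHERE t1.table_name = '{}' AND t1.table_schema = 'test_titanic'")
table_sql = [template.format(t) for t in table_name]

for t, sql in zip(table_name, table_sql):
    assert "'{}'".format(t) in sql  # each table name is quoted into its WHERE clause
print(len(table_sql))  # 2
```

Note that `.format()` offers no injection protection; if the table list ever comes from untrusted input, validate each name against information_schema first.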
writer = pd.ExcelWriter("table_des.xlsx")  # export to Excel
table_col_name = []
row = 0
for j in range(len(table_sql)):
    table_col_name.append(sql_inquiry(table_sql[j]))
    table_col_stat = []
    col_sql = []
    for s in range(len(table_col_name[j])):
        col_sql.append('''
            SELECT MIN({1}) AS value_min,
                   MAX({1}) AS value_max,
                   SUM(CASE WHEN {1} IS NOT NULL THEN 1 END) AS num_classification,
                   SUM(CASE WHEN {1} IS NULL THEN num ELSE 0 END) AS null_num,
                   SUM(num) AS rows_num
            FROM (
                SELECT {1}, COUNT(1) AS num
                FROM {0} GROUP BY {1}
            ) t1
        '''.format(table_name[j], table_col_name[j].iloc[s, 1]))
        col = sql_inquiry(col_sql[s]).iloc[0, :].tolist()
        table_col_stat.append([table_col_name[j].iloc[s, 1]] + col)
    table_des = pd.DataFrame(table_col_stat,
                             columns=["name", "value_min", "value_max",
                                      "num_classification", "null_num", "rows_num"])
    table_des["null_rate"] = table_des["null_num"] / table_des["rows_num"]
    table_des["null_rate"] = table_des["null_rate"].apply(lambda x: format(x, ".2%"))
    table_des = pd.merge(table_col_name[j], table_des, how="inner",
                         left_on="column_name", right_on="name")
    table_des.drop(["table_rows", "name"], axis=1, inplace=True)
    table_des.to_excel(writer, startcol=1, startrow=row, index=False)
    row += table_des.shape[0] + 4  # leave a gap before the next table's block
writer.close()  # ExcelWriter.save() was removed in pandas 2.0; close() writes the file
elapsed = dt.datetime.now() - start
print("Time used:", elapsed)
print("\a")  # terminal bell to signal completion

After the script finishes, an Excel workbook is generated containing a profiling summary block for each table, letting you quickly review table structures, column statistics, and null-rate percentages without the overhead of loading full datasets.
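The pandas post-processing at the end (null-rate computation, percentage formatting, and the merge back onto the column metadata) can be exercised offline. A minimal sketch with hypothetical metadata and statistics values:

```python
import pandas as pd

# Hypothetical metadata rows, shaped like the information_schema query result
table_col_name = pd.DataFrame({"column_name": ["age", "fare"],
                               "column_type": ["double", "double"],
                               "table_rows": [891, 891]})

# Hypothetical per-column statistics, shaped like the profiling query results
table_des = pd.DataFrame([["age", 0.42, 80.0, 88, 177, 891],
                          ["fare", 0.0, 512.33, 248, 0, 891]],
                         columns=["name", "value_min", "value_max",
                                  "num_classification", "null_num", "rows_num"])

# Null rate as a formatted percentage string
table_des["null_rate"] = (table_des["null_num"] / table_des["rows_num"]).apply(
    lambda x: format(x, ".2%"))

# Join statistics onto metadata, then drop the redundant key columns
table_des = pd.merge(table_col_name, table_des, how="inner",
                     left_on="column_name", right_on="name")
table_des.drop(["table_rows", "name"], axis=1, inplace=True)
print(table_des["null_rate"].tolist())  # ['19.87%', '0.00%']
```

This mirrors the script's logic without a database connection, which makes it easy to check the formatting and the merge keys before running the full job.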
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.