Efficient Data Profiling of Large Database Tables Using Python
This article presents a Python-based method for quickly profiling dozens of large database tables by generating and executing column‑wise SQL queries, avoiding full table reads, and exporting the results to an Excel file for easy analysis.
When you take on a new business task, you may need to understand the structure of dozens or hundreds of database tables, each containing millions of rows, and produce a summary quickly.
Reading entire tables into Python is inefficient for such volumes; a single table can take half an hour to load, making the process impractical. The solution is to avoid full table reads and instead query each table’s metadata and column statistics using grouped SQL statements generated via string formatting.
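To illustrate the strategy before walking through the full script, here is a minimal, self-contained sketch using Python's built-in sqlite3 as a stand-in for the MySQL connection (the table `t` and its data are hypothetical). A single grouped query computes a column's statistics inside the database, so only one small result row is transferred instead of the source rows:

```python
import sqlite3

# In-memory stand-in for the real database (hypothetical table and data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (age INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(22,), (38,), (None,), (38,)])

# One grouped query returns min, max, distinct non-null count, and null count;
# only this tiny result row leaves the database, not the table itself.
sql = """
SELECT MIN(age), MAX(age),
       SUM(CASE WHEN age IS NOT NULL THEN 1 END),
       SUM(CASE WHEN age IS NULL THEN num ELSE 0 END)
FROM (SELECT age, COUNT(1) AS num FROM t GROUP BY age) t1
"""
value_min, value_max, distinct_non_null, null_num = conn.execute(sql).fetchone()
print(value_min, value_max, distinct_non_null, null_num)  # 22 38 2 1
conn.close()
```

The same shape of query is what the script below generates for every column of every table.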
First, import the required packages:
import pymysql
import pandas as pd
import datetime as dt
start = dt.datetime.now()  # for measuring execution time

Define a reusable function that executes a given SQL statement and returns the result as a pandas DataFrame:
# Define data query function
def sql_inquiry(sql):
    conn = pymysql.connect(host="localhost", port=****, user='****',
                           passwd='****', db="test_titanic", charset="utf8")
    cur = conn.cursor()
    cur.execute(sql)
    result = cur.fetchall()
    cols = [a[0] for a in cur.description]  # get column names from the cursor metadata
    df = pd.DataFrame(result, columns=cols)
    conn.commit()
    cur.close()
    conn.close()
    return df

Specify the tables you want to profile:
table_name = ['test', 'train']

Generate a list of SQL statements that retrieve each table's column name, comment, type, and row count without loading the full data:
table_sql = []
for i in range(len(table_name)):
    table_sql.append('''
        SELECT t1.table_name AS table_name,
               t1.column_name AS column_name,
               t1.column_type AS column_type,
               t1.column_comment AS column_comment,
               t2.table_rows AS table_rows
        FROM information_schema.columns AS t1,
             information_schema.`tables` AS t2
        WHERE t1.table_schema = t2.table_schema
          AND t1.table_name = t2.table_name
          AND t1.table_name = '{}'
          AND t1.table_schema = 'test_titanic'
    '''.format(table_name[i]))

Execute each SQL statement; then, for every column, compute the minimum, maximum, non-null count, null count, and total row count, and calculate the null-rate percentage:
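Because table and column names cannot be passed as bound query parameters, the statements are assembled with string formatting; before running them against the server it is worth verifying that each name lands where you expect. A small sketch, using a shortened version of the same template:

```python
# Shortened version of the profiling template, for a quick sanity check
table_name = ['test', 'train']
template = ("SELECT t1.column_name FROM information_schema.columns AS t1 "
            "WHERE t1.table_name = '{}' AND t1.table_schema = 'test_titanic'")
table_sql = [template.format(t) for t in table_name]

for t, sql in zip(table_name, table_sql):
    assert "'{}'".format(t) in sql  # each table name is quoted into its WHERE clause
print(len(table_sql))  # 2
```

Note that `.format()` offers no injection protection; if the table list ever comes from untrusted input, validate each name against information_schema first.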
writer = pd.ExcelWriter("table_des.xlsx")  # export to Excel
table_col_name = []
row = 0
for j in range(len(table_sql)):
    table_col_name.append(sql_inquiry(table_sql[j]))
    table_col_stat = []
    col_sql = []
    for s in range(len(table_col_name[j])):
        col_sql.append('''
            SELECT MIN({1}) AS value_min,
                   MAX({1}) AS value_max,
                   SUM(CASE WHEN {1} IS NOT NULL THEN 1 END) AS num_classification,
                   SUM(CASE WHEN {1} IS NULL THEN num ELSE 0 END) AS null_num,
                   SUM(num) AS rows_num
            FROM (
                SELECT {1}, COUNT(1) AS num
                FROM {0} GROUP BY {1}
            ) t1
        '''.format(table_name[j], table_col_name[j].iloc[s, 1]))
        col = sql_inquiry(col_sql[s]).iloc[0, :].tolist()
        table_col_stat.append([table_col_name[j].iloc[s, 1]] + col)
    table_des = pd.DataFrame(table_col_stat,
                             columns=["name", "value_min", "value_max",
                                      "num_classification", "null_num", "rows_num"])
    table_des["null_rate"] = table_des["null_num"] / table_des["rows_num"]
    table_des["null_rate"] = table_des["null_rate"].apply(lambda x: format(x, ".2%"))
    table_des = pd.merge(table_col_name[j], table_des, how="inner",
                         left_on="column_name", right_on="name")
    table_des.drop(["table_rows", "name"], axis=1, inplace=True)
    table_des.to_excel(writer, startcol=1, startrow=row, index=False)
    row += table_des.shape[0] + 4  # leave a gap before the next table's block
writer.close()  # ExcelWriter.save() was removed in pandas 2.0; close() writes the file
elapsed = dt.datetime.now() - start
print("Time used:", elapsed)
print("\a")  # terminal bell to signal completion

After the script finishes, an Excel workbook is generated containing a profiling summary block for each table, letting you quickly review table structures, column statistics, and null-rate percentages without the overhead of loading full datasets.
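The pandas post-processing at the end (null-rate computation, percentage formatting, and the merge back onto the column metadata) can be exercised offline. A minimal sketch with hypothetical metadata and statistics values:

```python
import pandas as pd

# Hypothetical metadata rows, shaped like the information_schema query result
table_col_name = pd.DataFrame({"column_name": ["age", "fare"],
                               "column_type": ["double", "double"],
                               "table_rows": [891, 891]})

# Hypothetical per-column statistics, shaped like the profiling query results
table_des = pd.DataFrame([["age", 0.42, 80.0, 88, 177, 891],
                          ["fare", 0.0, 512.33, 248, 0, 891]],
                         columns=["name", "value_min", "value_max",
                                  "num_classification", "null_num", "rows_num"])

# Null rate as a formatted percentage string
table_des["null_rate"] = (table_des["null_num"] / table_des["rows_num"]).apply(
    lambda x: format(x, ".2%"))

# Join statistics onto metadata, then drop the redundant key columns
table_des = pd.merge(table_col_name, table_des, how="inner",
                     left_on="column_name", right_on="name")
table_des.drop(["table_rows", "name"], axis=1, inplace=True)
print(table_des["null_rate"].tolist())  # ['19.87%', '0.00%']
```

This mirrors the script's logic without a database connection, which makes it easy to check the formatting and the merge keys before running the full job.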
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.