
Analyzing 1.4 Billion N‑gram Rows with Python, NumPy and PyTubes

This article demonstrates how to download Google’s massive N‑gram dataset, load the 1.4 billion 1‑gram records with Python and the PyTubes library, use NumPy to efficiently compute yearly word frequencies, and reproduce Google Ngram Viewer charts for Python and other programming languages.

Python Programming Learning Circle

The Google Ngram Viewer visualizes word usage over time using a huge corpus of scanned books. The author reproduces its trend chart for the word “Python” by downloading the public n‑gram dataset (covering the 16th century to 2008) and processing it with Python.

Because the 1‑gram files total 27 GB on disk (about 1.43 billion rows across 38 files), the author relies on NumPy’s fast array operations and on PyTubes, a new data‑loading library, to read the tab‑separated data on a 2016 MacBook Pro with 8 GB of RAM.
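To make the row layout concrete, here is a minimal sketch of parsing tab‑separated (word, year, count) lines into NumPy columns. It deliberately uses only the standard library and NumPy rather than PyTubes, whose API is not shown in this summary. The function name `parse_one_gram_lines` and the sample rows are hypothetical.

```python
import io

import numpy as np


def parse_one_gram_lines(lines):
    """Yield (word, year, count) tuples from tab-separated text lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        yield fields[0], int(fields[1]), int(fields[2])


# Hypothetical sample standing in for one of the 38 files.
sample = io.StringIO(
    "Python\t1990\t12\n"
    "python\t1990\t7\n"
    "Python\t1991\t20\n"
)

rows = list(parse_one_gram_lines(sample))

# Split the parsed rows into parallel NumPy columns.
words = np.array([w for w, _, _ in rows])
years = np.array([y for _, y, _ in rows], dtype=np.int64)
counts = np.array([c for _, _, c in rows], dtype=np.int64)
```

Collecting every row in a Python list would not fit in 8 GB for the full 1.43 billion rows. A real run would stream each file and keep only the columns or rows it needs.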

After extracting the three fields (word, year, count) and filtering for the capitalised form “Python”, the script builds a NumPy array of the matching rows, sums the counts of all words for each year, and divides the two to get the percentage of each year's words that are “Python”.
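The per‑year calculation described above can be sketched with `np.bincount`, which sums weights into integer bins. This is one possible approach and not necessarily the author's. The toy arrays are hypothetical, and the sketch assumes the word, year, and count columns are already in memory as parallel arrays.

```python
import numpy as np

# Hypothetical toy data: parallel columns of word, year, and count.
words = np.array(["Python", "the", "Python", "the", "snake"])
years = np.array([1990, 1990, 1991, 1991, 1991])
counts = np.array([5, 95, 20, 170, 10])

# Shift years so the earliest year maps to bin 0.
first_year = years.min()
year_bins = years - first_year

# Total count of all words per year.
totals = np.bincount(year_bins, weights=counts)

# Count of the exact, capitalised form "Python" per year.
is_python = words == "Python"
python_counts = np.bincount(
    year_bins[is_python],
    weights=counts[is_python],
    minlength=totals.size,
)

# Percentage per year; years with no words at all stay at 0.
percent = np.divide(
    100.0 * python_counts,
    totals,
    out=np.zeros_like(totals),
    where=totals > 0,
)
```

In this toy data, “Python” accounts for 5 of 100 words in 1990 and 20 of 200 words in 1991.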

Using these percentages, the author recreates the Google Ngram chart for Python, then extends the analysis to compare three programming languages (Python, Pascal, and Perl) by normalising each word's percentages to a common baseline, its 1800‑1960 average, and plotting the relative trends.
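The baseline normalisation can be sketched as dividing each series by its own mean over the baseline years, so that a value of 1.0 means "same as the 1800‑1960 average". The helper name `normalise_to_baseline`, the inclusive year bounds, and the toy percentages are assumptions made for illustration.

```python
import numpy as np


def normalise_to_baseline(years, percentages, start=1800, end=1960):
    """Scale a series by its mean over the baseline years (inclusive)."""
    in_baseline = (years >= start) & (years <= end)
    baseline = percentages[in_baseline].mean()
    return percentages / baseline


# Hypothetical yearly percentages for two words.
years = np.array([1800, 1900, 1960, 2000])
series = {
    "Python": np.array([2.0, 4.0, 6.0, 12.0]),
    "Perl": np.array([1.0, 1.0, 1.0, 5.0]),
}

relative = {
    name: normalise_to_baseline(years, values)
    for name, values in series.items()
}
```

Dividing by each word's own baseline puts words with very different absolute frequencies on the same scale, which is what makes the relative trends comparable on one plot.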

The author notes performance differences (Google’s chart renders in ~1 s versus the script’s ~8 min) and suggests future improvements to PyTubes, such as supporting smaller integer dtypes, richer filtering combinators, and enhanced string‑matching utilities.

Tags: big data, data analysis, visualization, ngram, PyTubes
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
