Analyzing 1.4 Billion N‑gram Rows with Python, NumPy and PyTubes

This article demonstrates how to download Google’s massive N‑gram dataset, load the 1.4 billion 1‑gram records with Python and the PyTubes library, use NumPy to efficiently compute yearly word frequencies, and reproduce Google Ngram Viewer charts for Python and other programming languages.

Python Programming Learning Circle

Mar 7, 2022

The Google Ngram Viewer charts word usage over time from a huge corpus of scanned books. The author reproduces its trend line for the word “Python” by downloading the public n‑gram dataset (covering the 16th century to 2008) and processing it with Python.

Because the 1‑gram files total 27 GB on disk (about 1.43 billion rows across 38 files), the author relies on NumPy’s fast array operations and a new data‑loading library called PyTubes to read the tab‑separated data on a modest 2016 MacBook Pro with 8 GB of RAM.
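As a rough illustration of the loading step, the sketch below parses tab‑separated rows with plain Python rather than PyTubes, whose API is not shown in this summary. The function names `parse_rows` and `rows_to_array`, the column order (word, year, count first, any further columns ignored), and the 1/0 encoding of the target word are all assumptions made for the example.

```python
import numpy as np


def parse_rows(lines):
    """Yield (word, year, count) from tab-separated lines.

    Assumption: the first three columns are word, year, count;
    any extra columns are ignored.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        yield fields[0], int(fields[1]), int(fields[2])


def rows_to_array(lines, target="Python"):
    """Build an (n, 3) int64 array with columns: is_target, year, count.

    The word itself is replaced by 1 when it equals `target` and 0
    otherwise, so the whole table fits in a single numeric array.
    """
    rows = [
        (int(word == target), year, count)
        for word, year, count in parse_rows(lines)
    ]
    return np.array(rows, dtype=np.int64).reshape(-1, 3)


# Hypothetical sample rows, not taken from the real dataset.
sample = [
    "Python\t1950\t10\t3\n",
    "snake\t1950\t90\t20\n",
    "Python\t1951\t5\t2\n",
]
table = rows_to_array(sample)
```

Any iterable of lines works as input, including an open file handle. Collecting rows in a Python list first is memory‑hungry at the scale described here, which is part of why a streaming loader is attractive for the full 1.43 billion rows.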

After extracting the three fields (word, year, count) and filtering for the capitalised form “Python”, the script builds a NumPy array of the relevant rows, computes yearly total word counts, and derives the percentage of “Python” occurrences per year.
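The per‑year calculation described above can be sketched with `np.bincount`, which sums weights into integer buckets. This is one possible approach, not necessarily the author’s; the function name `yearly_percentage` and the three parallel input arrays are assumptions for illustration, and the input is assumed to be non‑empty.

```python
import numpy as np


def yearly_percentage(is_target, years, counts):
    """Return (year_axis, pct) where pct[i] is the share of all word
    occurrences in year_axis[i] that belong to the target word."""
    is_target = np.asarray(is_target, dtype=bool)
    years = np.asarray(years, dtype=np.int64)
    counts = np.asarray(counts, dtype=np.int64)

    first = years.min()
    offset = years - first          # bucket index per row, starting at 0
    n_years = int(offset.max()) + 1

    # Sum counts per year for all words, then for the target word only.
    totals = np.bincount(offset, weights=counts, minlength=n_years)
    hits = np.bincount(
        offset[is_target], weights=counts[is_target], minlength=n_years
    )

    # Years with no rows at all get 0 % instead of a division error.
    with np.errstate(divide="ignore", invalid="ignore"):
        pct = np.where(totals > 0, 100.0 * hits / totals, 0.0)

    return np.arange(first, first + n_years), pct


# Hypothetical data: the target word makes up 10 % of each year's words.
year_axis, pct = yearly_percentage(
    is_target=[1, 0, 1, 0],
    years=[1950, 1950, 1951, 1951],
    counts=[10, 90, 5, 45],
)
```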

Using these percentages, the author recreates the Google Ngram chart for Python, then extends the analysis to compare three programming languages—Python, Pascal, and Perl—by normalising counts to a common baseline (1800‑1960 average) and plotting their relative trends.
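One way to read the baseline normalisation is to divide each language’s yearly percentage by its own mean over 1800‑1960, so that every series sits near 1.0 in the baseline period and later growth shows up as a multiple of it. The sketch below follows that reading; the function name `normalise_to_baseline` and the inclusive year range are assumptions.

```python
import numpy as np


def normalise_to_baseline(years, pct, start=1800, end=1960):
    """Scale a yearly series by its mean over [start, end] inclusive."""
    years = np.asarray(years)
    pct = np.asarray(pct, dtype=float)

    in_baseline = (years >= start) & (years <= end)
    if not in_baseline.any():
        raise ValueError("no data points fall inside the baseline period")

    baseline = pct[in_baseline].mean()
    if baseline == 0:
        raise ValueError("baseline mean is zero; cannot normalise")

    return pct / baseline


# Hypothetical series: baseline mean is (1 + 2 + 3) / 3 = 2.
relative = normalise_to_baseline(
    years=[1800, 1900, 1960, 2000],
    pct=[1.0, 2.0, 3.0, 8.0],
)
```

Applying the same function to each word’s series puts them on a shared scale, which is what makes words with very different absolute frequencies comparable on one chart.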

The author notes the performance gap (Google’s chart renders in ~1 s, while the script takes ~8 min) and suggests future improvements to PyTubes, such as support for smaller integer dtypes, richer filtering combinators, and better string‑matching utilities.
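To see why smaller integer dtypes matter at this scale, the NumPy‑only sketch below picks the narrowest unsigned dtype that can hold a column’s maximum value and compares memory use. It says nothing about how PyTubes would implement this; the helper name `smallest_uint_dtype` is hypothetical.

```python
import numpy as np


def smallest_uint_dtype(max_value):
    """Return the narrowest unsigned integer dtype that can store max_value."""
    for candidate in (np.uint8, np.uint16, np.uint32, np.uint64):
        if max_value <= np.iinfo(candidate).max:
            return np.dtype(candidate)
    raise ValueError("value does not fit in any unsigned 64-bit integer")


# A year column (max 2008) fits in 16 bits; a 0/1 flag fits in 8 bits.
years = np.full(1000, 2008, dtype=np.int64)
compact_years = years.astype(smallest_uint_dtype(years.max()))

bytes_before = years.nbytes          # 8 bytes per value
bytes_after = compact_years.nbytes   # 2 bytes per value
```

For a year column the saving is a factor of four over 64‑bit integers, which adds up quickly across more than a billion rows.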
