
Analyzing Google Ngram Data with Python and PyTubes

This article demonstrates how to download the Google Ngram 1‑gram dataset, load its roughly 1.4 billion rows with Python and the PyTubes library, use NumPy to compute yearly word‑frequency percentages, and filter and plot the trend for the word “Python” alongside other programming languages.

Google Ngram Viewer is a useful tool that visualizes the frequency of words over time by scanning a massive collection of books digitized by Google.

The article uses the word “Python” (case‑sensitive) as an example, showing the original Google Ngram chart and explaining that the underlying data spans from the 16th century to 2008 and can be downloaded for free.

The 1‑gram dataset expands to about 27 GB on disk, containing 1,430,727,243 rows across 38 source files and roughly 24 million distinct word‑POS pairs, which poses challenges for loading and processing in Python.

Using a 2016 MacBook Pro with 8 GB RAM, the author loads the data with Python and the new PyTubes library, noting that better hardware would improve performance.

Each line in the dataset is tab‑separated and includes fields such as word, year, match count, volume count, and part‑of‑speech; only a subset of these fields is needed for the analysis.
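Each record can be parsed with plain Python. A minimal sketch, assuming the field layout described above (the word field may carry a part‑of‑speech suffix such as `Python_NOUN`; the sample values are illustrative):

```python
def parse_1gram_line(line):
    """Parse one tab-separated 1-gram record into (word, year, match_count, volume_count).

    Only word, year, and match_count are needed for the frequency analysis;
    volume_count (how many distinct books the word appeared in) is discarded later.
    """
    word, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return word, int(year), int(match_count), int(volume_count)

record = parse_1gram_line("Python_NOUN\t2007\t3412\t891\n")
```

In practice PyTubes performs this splitting and type conversion in native code, which is what makes loading 1.4 billion rows feasible at all.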

NumPy is employed to efficiently compute the total number of words per year, enabling the calculation of the percentage of occurrences for “Python” each year, which is more informative than raw counts.
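The per‑year normalization can be sketched with NumPy on a toy array; the column layout and numbers here are illustrative, not the article's exact code:

```python
import numpy as np

# Toy data: one row per (word, year) record — columns: year and match_count.
# is_python flags the rows whose word field matched "Python".
years        = np.array([1990, 1990, 1991, 1991, 1991])
match_counts = np.array([ 100,  900,  200,  300, 1500])
is_python    = np.array([True, False, True, False, False])

all_years = np.arange(years.min(), years.max() + 1)

# Total words per year: sum match_count grouped by year.
totals = np.array([match_counts[years == y].sum() for y in all_years])

# Counts for "Python" per year, then the percentage of all words that year.
python_counts = np.array([match_counts[(years == y) & is_python].sum()
                          for y in all_years])
python_pct = 100.0 * python_counts / totals
```

Dividing by each year's total is what makes the curve meaningful: raw counts would mostly track the growth in the number of books published, not the word's popularity.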

After plotting the yearly total word counts, the author discards data before 1800 to avoid distortion caused by the rapid decline in total volume in earlier centuries, resulting in about 1.3 billion rows for analysis.
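Discarding the early years amounts to a single boolean mask over the year column; a sketch with toy arrays:

```python
import numpy as np

years  = np.array([1750, 1799, 1800, 1900, 2008])
counts = np.array([  10,   20,   30,   40,   50])

# Keep only rows from 1800 onward; sparse early data would otherwise
# produce huge, noisy percentage swings.
keep = years >= 1800
years, counts = years[keep], counts[keep]
```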

Performance is compared: Google’s own chart renders in roughly 1 second, versus about 8 minutes for the Python script, and speed‑ups such as pre‑computing yearly totals and indexing are discussed.
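One of the suggested speed‑ups, pre‑computing the yearly totals, can be combined with a disk cache so repeated runs skip the expensive aggregation entirely. A sketch (the cache file name and helper are hypothetical, not from the article):

```python
import numpy as np

def load_year_totals(cache_path, compute_fn):
    """Return per-year totals, computing them once and caching to a .npy file."""
    try:
        return np.load(cache_path)          # fast path: reuse the cached array
    except OSError:                         # cache missing: do the 8-minute pass once
        totals = compute_fn()
        np.save(cache_path, totals)
        return totals
```

Since the totals array is tiny (one entry per year), loading it is effectively instant compared with re‑scanning 1.4 billion rows.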

To illustrate a more complex example, the author compares the mentions of three programming languages—Python, Pascal, and Perl—adjusting for case sensitivity and normalizing percentages from 1800 onward.
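Adjusting for case means summing the per‑year counts of each spelling before normalizing; a sketch with toy per‑year arrays (the numbers are invented for illustration):

```python
import numpy as np

# Toy per-year counts for two spellings of the same language name.
python_title = np.array([100, 200, 300])     # rows matching "Python"
python_lower = np.array([ 10,  20,  30])     # rows matching "python"
totals       = np.array([1000, 2000, 3000])  # all words per year

# Combine case variants, then express as a percentage of each year's total.
combined = python_title + python_lower
pct = 100.0 * combined / totals
```

The same combine‑then‑normalize step is applied to each language before the three curves are plotted together.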

Finally, future improvements to PyTubes are outlined, including support for smaller integer types, enhanced filtering logic, and better string‑matching utilities.

Tags: big data, data analysis, NumPy, Google Ngram, PyTubes
Written by Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
