Reproducing Google Ngram Viewer Trends with Python, NumPy, and PyTubes
This article demonstrates how to download the Google 1‑gram dataset, load its ~1.4 billion rows with Python and NumPy (via the PyTubes library), compute yearly word frequencies, visualize the rise of "Python", compare it with Pascal and Perl, and discuss performance challenges and future improvements.
Google Ngram Viewer is a tool that visualizes word usage over time using a massive corpus of scanned books. The author explains how to replicate the "Python" usage chart by downloading the Google n‑gram dataset (covering the 16th century to 2008) and processing it with Python, NumPy, and the new data‑loading library PyTubes.
The 1‑gram files are tab‑separated and total about 27 GB when expanded. They contain over 1.4 billion rows across 38 source files, representing roughly 24 million distinct word‑POS pairs. Loading such a volume requires careful handling; NumPy’s efficient array operations make the task feasible on a machine with 8 GB RAM.
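The article streams the 38 tab-separated files through PyTubes; as a minimal sketch of the same extraction step, the snippet below parses a few synthetic rows in the 1-gram layout (word_POS, year, match count, volume count) into a NumPy array of (Is_Word, Year, Count) triples. The sample data and the `TARGET` name are illustrative, not taken from the real files.

```python
import io
import numpy as np

# Synthetic rows in the 1-gram TSV layout: word_POS, year, match_count, volume_count.
# (The real data is downloaded from Google's n-gram archive.)
sample = io.StringIO(
    "Python_NOUN\t1998\t12\t8\n"
    "Python_NOUN\t2005\t341\t120\n"
    "Pascal_NOUN\t1998\t95\t40\n"
)

TARGET = "Python"
rows = []
for line in sample:
    word, year, count, _volumes = line.rstrip("\n").split("\t")
    # Keep only what the analysis needs: a match flag, the year, and the count.
    rows.append((word.split("_")[0] == TARGET, int(year), int(count)))

data = np.array(rows, dtype=np.int64)  # columns: Is_Word, Year, Count
print(data)
```

On the full dataset the same three columns are what PyTubes hands to NumPy, so all later steps operate on a single integer array rather than on Python objects.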
╒═══════════╤════════╤═════════╕
│  Is_Word  │  Year  │  Count  │
╞═══════════╪════════╪═════════╡
│     0     │  1799  │      2  │
├───────────┼────────┼─────────┤
│     0     │  1804  │      1  │
├───────────┼────────┼─────────┤
│     0     │  1805  │      1  │
├───────────┼────────┼─────────┤
│     0     │  1811  │      1  │
├───────────┼────────┼─────────┤
│     0     │  1820  │    ...  │
╘═══════════╧════════╧═════════╛
After loading, the script computes the total word count per year to obtain percentages (word count / total words for that year). This normalization mirrors Google’s approach and allows direct comparison of the relative popularity of "Python" across centuries.
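The normalization step can be sketched with `np.bincount`, which sums counts per year in one vectorized pass; the rows below are synthetic, and the column layout (Is_Word, Year, Count) follows the table above.

```python
import numpy as np

# Synthetic (Is_Word, Year, Count) rows.
data = np.array([
    [1, 2000, 10],
    [0, 2000, 990],
    [1, 2001, 30],
    [0, 2001, 970],
])

years = data[:, 1]
counts = data[:, 2]
word_counts = counts * data[:, 0]  # zero out rows that aren't the target word

# Total words per year, then the target word's share of that total.
year_min = years.min()
totals = np.bincount(years - year_min, weights=counts)
word_totals = np.bincount(years - year_min, weights=word_counts)
pct = 100.0 * word_totals / totals

print({int(y): float(p) for y, p in zip(range(year_min, years.max() + 1), pct)})
```

Dividing by the per-year total rather than a global total is what lets centuries with very different publishing volumes be compared on one axis.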
To avoid distortion from the sparse data before 1800, the analysis discards earlier years and focuses on 1800 onward, which still retains about 96% of the data. The resulting plot shows a clear rise of "Python" after the early 2000s.
The author also extends the analysis to compare three programming languages—Python, Pascal, and Perl—by filtering case‑sensitive capitalized forms and normalizing counts to percentages from 1800 to 1960. The comparative chart highlights differing adoption curves.
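The multi-language comparison reduces to normalizing each word's per-year counts by the same yearly totals; the sketch below uses invented counts purely to show the shape of the computation, not real Ngram figures.

```python
import numpy as np

# Hypothetical per-year match counts for each capitalized language name,
# already aggregated by year (synthetic numbers for illustration only).
years = np.array([1990, 1995, 2000, 2005])
counts = {
    "Python": np.array([1, 5, 40, 120]),
    "Pascal": np.array([80, 60, 30, 10]),
    "Perl":   np.array([5, 50, 90, 70]),
}
totals = np.array([10_000, 10_000, 10_000, 10_000])  # total words per year

# Normalize every series to a percentage of that year's total word count,
# mirroring the article's comparison chart.
pct = {word: 100.0 * c / totals for word, c in counts.items()}
for word, series in pct.items():
    print(word, np.round(series, 3))
```

Each normalized series can then be handed to a plotting library to reproduce the adoption-curve chart.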
Performance observations note that Google’s own rendering takes about one second, whereas the Python script takes roughly eight minutes on the same hardware. Suggested speedups include pre‑computing yearly totals, indexing, and using smaller integer dtypes (e.g., 1‑, 2‑, or 4‑byte integers) to reduce memory usage.
Future improvements for PyTubes are outlined: adding support for lower‑bit integer types, richer filtering logic (combining conditions with AND/OR/NOT), and enhanced string‑matching utilities such as startswith, endswith, contains, and is_one_of.