How to Crawl Static Web Pages and Retrieve Historical Weather Data with Python
This tutorial explains the fundamentals of web crawling by distinguishing static and dynamic pages, outlining a four‑step process for scraping static sites, and providing a complete Python example that extracts historical weather data, parses HTML with BeautifulSoup, and stores results in CSV files.
Data acquisition is the first step of empirical research, and with the exponential growth of internet data, web crawling has become an essential method for gathering information. This guide introduces the basics of crawling static web pages using Python.
Static vs. Dynamic Pages: A static page contains its data directly in the HTML source, while a dynamic page loads data through background requests to a remote server, so the data never appears in the source code. You can identify a static page by checking whether the data appears in the page source and whether the URL changes when you page through results.
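As a minimal illustration (the HTML snippets below are invented, not taken from any real site), the test is simply whether the value you want already appears in the raw source:

```python
# A static page ships its data inside the HTML source; a dynamic page
# ships an empty container that JavaScript fills in after the page loads.
static_html = '<ul class="temps"><li>High: 12</li><li>Low: 0</li></ul>'
dynamic_html = '<div id="temps"></div><script src="/load_temps.js"></script>'

def data_in_source(html, needle):
    """Crude static-vs-dynamic check: is the data already in the raw HTML?"""
    return needle in html

print(data_in_source(static_html, "High: 12"))   # True  -> static page
print(data_in_source(dynamic_html, "High: 12"))  # False -> likely dynamic
```

In practice you would fetch the page once, search the response text for a value you can see in the browser, and fall back to inspecting the browser's network tab if it is missing.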
Four-step workflow for static page crawling:
Analyze the page structure.
Request the page content.
Parse the retrieved HTML.
Store the extracted data.
If the target spans multiple pages, you also need to discover the pagination pattern and loop over the generated URLs.
Example: Scraping historical weather data
We will scrape the historical weather records for Beijing in March 2022 from https://lishi.tianqi.com/beijing/202203.html. The required fields are date, high temperature, low temperature, weather description, wind direction, and the source URL.
1. Request the page
```python
# Import modules
import requests

url = "https://lishi.tianqi.com/beijing/202203.html"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
}
response = requests.get(url, headers=headers)
print(response)  # <Response [200]> if successful
```

2. Parse the HTML
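Before parsing, note that a bare requests.get call fails silently on HTTP errors. A slightly hardened variant (fetch is just an illustrative name here; timeout and raise_for_status are standard requests features) might look like:

```python
import requests

def fetch(url, timeout=10.0):
    """Fetch a page, raising on HTTP errors instead of returning bad HTML."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    resp = requests.get(url, headers=headers, timeout=timeout)
    resp.raise_for_status()                  # raise on 4xx/5xx responses
    resp.encoding = resp.apparent_encoding   # guard against mis-detected charsets
    return resp.text
```

The timeout prevents a single slow page from hanging a multi-page crawl, and setting the encoding explicitly avoids garbled Chinese text on sites that omit a charset header.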
```python
# Import modules
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")
# Locate the <ul class="thrui"> that contains the rows
data_table = soup.find('ul', class_="thrui").find_all('li')

weather_list = []
for li in data_table[1:]:  # skip header row
    th_list = li.find_all('div')
    weather = {
        'date': th_list[0].get_text(),
        'temp_high': th_list[1].get_text(),
        'temp_low': th_list[2].get_text(),
        'weather': th_list[3].get_text(),
        'wind': th_list[4].get_text(),
        'url': response.url
    }
    weather_list.append(weather)
```

3. (Optional) Use NumPy for table-style data
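Whichever representation you choose, the row-parsing pattern (find_all('div') on each <li>) can be checked offline against a made-up HTML fragment before hitting the live site:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page's <ul class="thrui"> markup.
sample = """
<ul class="thrui">
  <li><div>date</div><div>high</div><div>low</div><div>weather</div><div>wind</div></li>
  <li><div>2022-03-01</div><div>12</div><div>0</div><div>Sunny</div><div>N</div></li>
</ul>
"""
soup = BeautifulSoup(sample, "html.parser")
rows = soup.find("ul", class_="thrui").find_all("li")
cells = [d.get_text() for d in rows[1].find_all("div")]
print(cells)  # ['2022-03-01', '12', '0', 'Sunny', 'N']
```

Testing the selector logic on a small fragment like this makes it much easier to debug than re-fetching the live page on every change.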
```python
import numpy as np

weather_list = []
for li in data_table[1:]:
    th_list = li.find_all('div')
    for th in th_list:
        s = th.get_text()
        weather_list.append("".join(s.split()))  # strip all whitespace inside each cell
result = np.array(weather_list).reshape(-1, 5)  # 5 columns: date, high, low, weather, wind
```

4. Save the results to CSV
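Before saving, the flatten-and-reshape step from the previous section can be sanity-checked with a tiny in-memory example:

```python
import numpy as np

# Ten flat cells -> two rows of five columns, mirroring the reshape above.
cells = ["2022-03-01", "12", "0", "Sunny", "N",
         "2022-03-02", "10", "-1", "Cloudy", "NE"]
table = np.array(cells).reshape(-1, 5)
print(table.shape)  # (2, 5)
print(table[1][3])  # Cloudy
```

The -1 tells NumPy to infer the row count from the data length, so the reshape works for any whole number of 5-cell rows.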
```python
import csv

save_path = 'weather.csv'
with open(save_path, 'a', newline='', encoding='utf-8') as fp:
    csv_header = ['date', 'temp_high', 'temp_low', 'weather', 'wind', 'url']
    csv_writer = csv.DictWriter(fp, csv_header)
    if fp.tell() == 0:  # write the header only if the file is still empty
        csv_writer.writeheader()
    csv_writer.writerows(weather_list)
```

5. Crawl multiple pages
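Before looping over many pages, it is worth confirming that the writer round-trips cleanly; the same DictWriter logic can be exercised in memory with io.StringIO, no file needed:

```python
import csv
import io

header = ['date', 'temp_high', 'temp_low', 'weather', 'wind', 'url']
rows = [{'date': '2022-03-01', 'temp_high': '12', 'temp_low': '0',
         'weather': 'Sunny', 'wind': 'N', 'url': 'https://example.com'}]

buf = io.StringIO()
writer = csv.DictWriter(buf, header)
writer.writeheader()       # header first, as the fp.tell() == 0 branch would do
writer.writerows(rows)

buf.seek(0)
back = list(csv.DictReader(buf))
print(back[0]['weather'])  # Sunny
```

DictReader takes its field names from the header row, so a successful round trip also confirms the header was written exactly once.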
```python
# Build URL list
url_pattern = 'https://lishi.tianqi.com/{}/{}.html'
city_list = ['beijing', 'shanghai']
years = [x for x in range(2020, 2022)]
months = [str(x).zfill(2) for x in range(1, 13)]
month_list = [str(year) + month for year in years for month in months]
url_list = []
for c in city_list:
    for m in month_list:
        url_list.append(url_pattern.format(c, m))
```

Finally, we wrap all steps into reusable functions and run the crawler with a short delay between requests to avoid overloading the server.
```python
# Full script (functions omitted for brevity)
import time

if __name__ == '__main__':
    urls = generate_urls()
    save_file = 'weather.csv'
    for u in urls:
        time.sleep(2)  # be polite: pause between requests
        crawler(u, save_file)
```

By following this workflow you can efficiently collect structured data from static websites, transform it into a tabular format, and store it for further analysis.
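For reference, here is one possible shape of the omitted generate_urls and crawler helpers, assembled from the steps above; this is a sketch, not the article's original code:

```python
import csv
import time

import requests
from bs4 import BeautifulSoup

def generate_urls(cities=('beijing', 'shanghai'), years=(2020, 2021)):
    """Build one URL per city per month, following the site's URL pattern."""
    pattern = 'https://lishi.tianqi.com/{}/{}{:02d}.html'
    return [pattern.format(c, y, m)
            for c in cities for y in years for m in range(1, 13)]

def crawler(url, save_path):
    """Fetch one month page, parse its rows, and append them to the CSV."""
    headers = {'User-Agent': 'Mozilla/5.0'}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    rows = soup.find('ul', class_='thrui').find_all('li')
    fields = ['date', 'temp_high', 'temp_low', 'weather', 'wind', 'url']
    records = []
    for li in rows[1:]:  # skip header row
        cells = [d.get_text() for d in li.find_all('div')]
        records.append(dict(zip(fields, cells[:5] + [url])))
    with open(save_path, 'a', newline='', encoding='utf-8') as fp:
        writer = csv.DictWriter(fp, fields)
        if fp.tell() == 0:
            writer.writeheader()
        writer.writerows(records)
```

With these two helpers defined, the driver script above runs as written; the two-second sleep between calls keeps the request rate low.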