How to Crawl Static Web Pages and Retrieve Historical Weather Data with Python
This tutorial explains the fundamentals of web crawling by distinguishing static and dynamic pages, outlining a four‑step process for scraping static sites, and providing a complete Python example that extracts historical weather data, parses HTML with BeautifulSoup, and stores results in CSV files.
Data acquisition is the first step of empirical research, and with the exponential growth of internet data, web crawling has become an essential method for gathering information. This guide introduces the basics of crawling static web pages using Python.
Static vs. Dynamic Pages: A static page contains its data directly in the HTML source, while a dynamic page loads data through background requests to a remote server, so the data never appears in the source code. You can identify a static page by checking whether the data appears in the page source and whether the URL changes when you page through results.
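As a minimal illustration (the HTML snippets below are invented, not taken from any real site), the test is simply whether the value you want already appears in the raw source:

```python
# A static page ships its data inside the HTML source; a dynamic page
# ships an empty container that JavaScript fills in after the page loads.
static_html = '<ul class="temps"><li>High: 12</li><li>Low: 0</li></ul>'
dynamic_html = '<div id="temps"></div><script src="/load_temps.js"></script>'

def data_in_source(html, needle):
    """Crude static-vs-dynamic check: is the data already in the raw HTML?"""
    return needle in html

print(data_in_source(static_html, "High: 12"))   # True  -> static page
print(data_in_source(dynamic_html, "High: 12"))  # False -> likely dynamic
```

In practice you would fetch the page once, search the response text for a value you can see in the browser, and fall back to inspecting the browser's network tab if it is missing.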
Four-step workflow for static page crawling:
Analyze the page structure.
Request the page content.
Parse the retrieved HTML.
Store the extracted data.
If the target spans multiple pages, you also need to discover the pagination pattern and loop over the generated URLs.
Example: Scraping historical weather data
We will scrape the historical weather records for Beijing in March 2022 from https://lishi.tianqi.com/beijing/202203.html. The required fields are date, high temperature, low temperature, weather description, wind direction, and the source URL.
1. Request the page
```python
# Import modules
import requests

url = "https://lishi.tianqi.com/beijing/202203.html"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
}
response = requests.get(url, headers=headers)
print(response)  # <Response [200]> if successful
```

2. Parse the HTML
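Before parsing, note that a bare requests.get call fails silently on HTTP errors. A slightly hardened variant (fetch is just an illustrative name here; timeout and raise_for_status are standard requests features) might look like:

```python
import requests

def fetch(url, timeout=10.0):
    """Fetch a page, raising on HTTP errors instead of returning bad HTML."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    resp = requests.get(url, headers=headers, timeout=timeout)
    resp.raise_for_status()                  # raise on 4xx/5xx responses
    resp.encoding = resp.apparent_encoding   # guard against mis-detected charsets
    return resp.text
```

The timeout prevents a single slow page from hanging a multi-page crawl, and setting the encoding explicitly avoids garbled Chinese text on sites that omit a charset header.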
```python
# Import modules
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")
# Locate the <ul class="thrui"> that contains the rows
data_table = soup.find('ul', class_="thrui").find_all('li')

weather_list = []
for li in data_table[1:]:  # skip header row
    th_list = li.find_all('div')
    weather = {
        'date': th_list[0].get_text(),
        'temp_high': th_list[1].get_text(),
        'temp_low': th_list[2].get_text(),
        'weather': th_list[3].get_text(),
        'wind': th_list[4].get_text(),
        'url': response.url
    }
    weather_list.append(weather)
```

3. (Optional) Use NumPy for table-style data
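Whichever representation you choose, the row-parsing pattern (find_all('div') on each <li>) can be checked offline against a made-up HTML fragment before hitting the live site:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page's <ul class="thrui"> markup.
sample = """
<ul class="thrui">
  <li><div>date</div><div>high</div><div>low</div><div>weather</div><div>wind</div></li>
  <li><div>2022-03-01</div><div>12</div><div>0</div><div>Sunny</div><div>N</div></li>
</ul>
"""
soup = BeautifulSoup(sample, "html.parser")
rows = soup.find("ul", class_="thrui").find_all("li")
cells = [d.get_text() for d in rows[1].find_all("div")]
print(cells)  # ['2022-03-01', '12', '0', 'Sunny', 'N']
```

Testing the selector logic on a small fragment like this makes it much easier to debug than re-fetching the live page on every change.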
```python
import numpy as np

weather_list = []
for li in data_table[1:]:
    th_list = li.find_all('div')
    for th in th_list:
        s = th.get_text()
        weather_list.append("".join(s.split()))  # strip all whitespace inside each cell
result = np.array(weather_list).reshape(-1, 5)  # 5 columns: date, high, low, weather, wind
```

4. Save the results to CSV
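Before saving, the flatten-and-reshape step from the previous section can be sanity-checked with a tiny in-memory example:

```python
import numpy as np

# Ten flat cells -> two rows of five columns, mirroring the reshape above.
cells = ["2022-03-01", "12", "0", "Sunny", "N",
         "2022-03-02", "10", "-1", "Cloudy", "NE"]
table = np.array(cells).reshape(-1, 5)
print(table.shape)  # (2, 5)
print(table[1][3])  # Cloudy
```

The -1 tells NumPy to infer the row count from the data length, so the reshape works for any whole number of 5-cell rows.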
```python
import csv

save_path = 'weather.csv'
with open(save_path, 'a', newline='', encoding='utf-8') as fp:
    csv_header = ['date', 'temp_high', 'temp_low', 'weather', 'wind', 'url']
    csv_writer = csv.DictWriter(fp, csv_header)
    if fp.tell() == 0:  # write the header only if the file is still empty
        csv_writer.writeheader()
    csv_writer.writerows(weather_list)
```

5. Crawl multiple pages
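Before looping over many pages, it is worth confirming that the writer round-trips cleanly; the same DictWriter logic can be exercised in memory with io.StringIO, no file needed:

```python
import csv
import io

header = ['date', 'temp_high', 'temp_low', 'weather', 'wind', 'url']
rows = [{'date': '2022-03-01', 'temp_high': '12', 'temp_low': '0',
         'weather': 'Sunny', 'wind': 'N', 'url': 'https://example.com'}]

buf = io.StringIO()
writer = csv.DictWriter(buf, header)
writer.writeheader()       # header first, as the fp.tell() == 0 branch would do
writer.writerows(rows)

buf.seek(0)
back = list(csv.DictReader(buf))
print(back[0]['weather'])  # Sunny
```

DictReader takes its field names from the header row, so a successful round trip also confirms the header was written exactly once.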
```python
# Build URL list
url_pattern = 'https://lishi.tianqi.com/{}/{}.html'
city_list = ['beijing', 'shanghai']
years = [x for x in range(2020, 2022)]
months = [str(x).zfill(2) for x in range(1, 13)]
month_list = [str(year) + month for year in years for month in months]
url_list = []
for c in city_list:
    for m in month_list:
        url_list.append(url_pattern.format(c, m))
```

Finally, we wrap all steps into reusable functions and run the crawler with a short delay between requests to avoid overloading the server.
```python
# Full script (functions omitted for brevity)
import time

if __name__ == '__main__':
    urls = generate_urls()
    save_file = 'weather.csv'
    for u in urls:
        time.sleep(2)  # be polite: pause between requests
        crawler(u, save_file)
```

By following this workflow you can efficiently collect structured data from static websites, transform it into a tabular format, and store it for further analysis.
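For reference, here is one possible shape of the omitted generate_urls and crawler helpers, assembled from the steps above; this is a sketch, not the article's original code:

```python
import csv
import time

import requests
from bs4 import BeautifulSoup

def generate_urls(cities=('beijing', 'shanghai'), years=(2020, 2021)):
    """Build one URL per city per month, following the site's URL pattern."""
    pattern = 'https://lishi.tianqi.com/{}/{}{:02d}.html'
    return [pattern.format(c, y, m)
            for c in cities for y in years for m in range(1, 13)]

def crawler(url, save_path):
    """Fetch one month page, parse its rows, and append them to the CSV."""
    headers = {'User-Agent': 'Mozilla/5.0'}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    rows = soup.find('ul', class_='thrui').find_all('li')
    fields = ['date', 'temp_high', 'temp_low', 'weather', 'wind', 'url']
    records = []
    for li in rows[1:]:  # skip header row
        cells = [d.get_text() for d in li.find_all('div')]
        records.append(dict(zip(fields, cells[:5] + [url])))
    with open(save_path, 'a', newline='', encoding='utf-8') as fp:
        writer = csv.DictWriter(fp, fields)
        if fp.tell() == 0:
            writer.writeheader()
        writer.writerows(records)
```

With these two helpers defined, the driver script above runs as written; the two-second sleep between calls keeps the request rate low.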