Python Web Scraping Tutorial: Using requests and BeautifulSoup to Extract Weather Data
This article shows how to use Python's requests library and BeautifulSoup to scrape weather data: inspecting a page's source, setting request headers, fetching the weather page HTML, parsing it with CSS selectors, extracting daytime and nighttime temperatures, and extending the script to handle multiple cities. Complete code examples are included.
This guide introduces three essential web-scraping techniques: inspecting page source and elements, fetching pages with the requests library, and parsing HTML with BeautifulSoup. It walks through retrieving the weather page for Beijing, extracting the daytime and nighttime temperatures, and then generalizing the script to several Chinese cities.
First, the script sets a custom User-Agent header, sends a GET request to the weather URL, forces UTF-8 encoding, and obtains the raw HTML. The HTML is parsed with the lxml parser into a BeautifulSoup object whose tree mirrors what the browser's "Inspect Element" view shows.
Using the CSS selector p.tem span, the script selects the temperature elements, extracts their text, and prints the results.
Code example for a single city (Beijing):
# -*- coding: utf-8 -*-
__author__ = 'duohappy'

import requests                  # HTTP requests
from bs4 import BeautifulSoup    # HTML parsing

# Set request headers with a common User-Agent
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36"}

url = "http://www.weather.com.cn/weather1d/101010100.shtml"  # Beijing's city code
web_data = requests.get(url, headers=headers)
web_data.encoding = 'utf-8'      # force UTF-8 to avoid garbled Chinese text
content = web_data.text

soup = BeautifulSoup(content, 'lxml')
tag_list = soup.select('p.tem span')   # the temperature <span> elements

day_temp = tag_list[0].text
night_temp = tag_list[1].text
# "Daytime temperature is {0}°C / Nighttime temperature is {1}°C"
print('白天温度为{0}℃\n晚上温度为{1}℃'.format(day_temp, night_temp))
To scrape multiple cities, a dictionary maps city names to their weather codes. The user inputs a city name, the URL is formatted accordingly, and the same extraction logic is applied.
Code example for multiple cities:
# -*- coding: utf-8 -*-
__author__ = 'duohappy'

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36"}

# Map city names (Beijing, Shanghai, Shenzhen, Guangzhou, Hangzhou) to their weather codes
weather_code = {'北京': '101010100', '上海': '101020100', '深圳': '101280601',
                '广州': '101280101', '杭州': '101210101'}

city = input('请输入城市名:')  # "Enter a city name" — only accepts the listed cities
url = "http://www.weather.com.cn/weather1d/{}.shtml".format(weather_code[city])
web_data = requests.get(url, headers=headers)
web_data.encoding = 'utf-8'
content = web_data.text

soup = BeautifulSoup(content, 'lxml')
tag_list = soup.select('p.tem span')

day_temp = tag_list[0].text
night_temp = tag_list[1].text
print('白天温度为{0}℃\n晚上温度为{1}℃'.format(day_temp, night_temp))
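As written, the lookup weather_code[city] raises a KeyError for any city not in the dictionary. The sketch below is not part of the original script; it shows one way to validate the input first, using a hypothetical build_url helper:

```python
# -*- coding: utf-8 -*-
# Hypothetical helper: validate a city name against the code table
# before building the URL, instead of letting weather_code[city] raise KeyError.

weather_code = {'北京': '101010100', '上海': '101020100', '深圳': '101280601',
                '广州': '101280101', '杭州': '101210101'}

def build_url(city):
    """Return the weather URL for a known city, or None if unlisted."""
    code = weather_code.get(city)   # .get() returns None instead of raising
    if code is None:
        return None
    return "http://www.weather.com.cn/weather1d/{}.shtml".format(code)

print(build_url('北京'))  # the Beijing URL
print(build_url('成都'))  # Chengdu is not in the table → None
```

The caller can then re-prompt the user when build_url returns None rather than crashing mid-run.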
The article also briefly covers using BeautifulSoup methods like find and find_all with regular expressions to extract text from specific tags, emphasizing that many web‑page contents are directly embedded in the HTML source.
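The original find/find_all snippet is not reproduced in this summary; the sketch below uses a made-up HTML fragment (and the stdlib-backed 'html.parser' instead of lxml, to stay self-contained) to illustrate the pattern described:

```python
import re
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a fetched page source
html = '''
<div>
  <p class="tem"><span>25</span><i>℃</i></p>
  <p class="tem"><span>18</span><i>℃</i></p>
  <a href="/weather1d/101010100.shtml">北京</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# find() returns only the first matching tag
first = soup.find('p', class_='tem')
print(first.span.text)   # 25

# find_all() accepts a compiled regex to match attribute values by pattern
links = soup.find_all('a', href=re.compile(r'weather1d/\d+\.shtml'))
print([a.text for a in links])   # ['北京']
```

Because the temperatures and links sit directly in the served HTML, no JavaScript rendering is needed; plain requests plus BeautifulSoup is enough.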