
Simple Python Web Scraping with urllib and Beautiful Soup

This tutorial demonstrates how to use Python's urllib module to simulate browser requests, parse HTML with Beautiful Soup, extract text and image URLs, and store the scraped data locally using file I/O and the with statement. Complete code examples are included throughout.


Python's urllib.request module is introduced as a fundamental tool for sending HTTP requests, handling URLs, and retrieving web page content, with a brief description of its functions and classes.

An example shows how to fetch the Baidu homepage by creating a Request object with a custom User-Agent header and reading the response:

from urllib import request

url = 'http://www.baidu.com'
# A desktop Chrome User-Agent string so the request looks like it comes from a real browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
page = request.Request(url, headers=headers)
# urlopen returns a response object; read() yields raw bytes, which we decode to text
page_info = request.urlopen(page).read().decode('utf-8')
print(page_info)

The article explains that many websites check request headers to block crawlers, so mimicking a real browser's headers helps bypass basic anti‑scraping measures.
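Before sending anything over the network, the headers that will actually go out can be checked on the Request object itself via get_header. One quirk worth knowing: urllib normalizes header names so that only the first letter is capitalized, so the stored key is 'User-agent'. The URL below is a placeholder:

```python
from urllib import request

url = 'http://www.example.com'  # placeholder URL for illustration
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
req = request.Request(url, headers=headers)

# urllib capitalizes only the first letter of header names internally,
# so the lookup key is 'User-agent', not 'User-Agent'
print(req.get_header('User-agent'))
```

This makes it easy to verify that a custom User-Agent really replaced urllib's default ('Python-urllib/3.x'), which is the value header-checking sites tend to block.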

Beautiful Soup is presented as a Python library for parsing HTML/XML, enabling easy navigation and extraction of elements. A complete example extracts article titles from the Jianshu homepage:

# -*- coding:utf-8 -*-
from urllib import request
from bs4 import BeautifulSoup

url = r'http://www.jianshu.com'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
page = request.Request(url, headers=headers)
page_info = request.urlopen(page).read().decode('utf-8')
# html.parser is the parser that ships with the standard library
soup = BeautifulSoup(page_info, 'html.parser')
# The second positional argument filters <a> tags by class="title"
titles = soup.find_all('a', 'title')
for title in titles:
    print(title.string)
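The class filter used by find_all above can be seen in isolation on a small inline snippet, with no network involved; the HTML below is made up for illustration:

```python
from bs4 import BeautifulSoup

html = '<a class="title" href="/p/1">First</a><a class="other" href="/p/2">Second</a>'
soup = BeautifulSoup(html, 'html.parser')

# The second positional argument to find_all matches against the class
# attribute, so only the first <a> tag is returned here
for a in soup.find_all('a', 'title'):
    print(a.string, a['href'])  # → First /p/1
```

Attribute access with a['href'] and text access with a.string are the two extraction patterns the rest of the tutorial relies on.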

To persist scraped data, the guide covers writing to a text file using both the traditional open / close pattern and the more Pythonic with statement, emphasizing the importance of closing files.

# Using try/finally to guarantee the file is closed even if a write fails
file = open(r'E:\titles.txt', 'w')
try:
    for title in titles:
        file.write(title.string + '\n')
finally:
    file.close()

# Using with statement
with open(r'E:\titles.txt', 'w') as file:
    for title in titles:
        file.write(title.string + '\n')
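One caveat when writing the results: tag.string is None for any tag whose content is not a single string, and concatenating None raises a TypeError. A minimal sketch that guards against that and sets an explicit encoding (the list here is a hypothetical stand-in for the scraped titles):

```python
# Hypothetical stand-in for the title strings scraped above
titles = ['First post', None, 'Second post']

with open('titles.txt', 'w', encoding='utf-8') as f:
    for t in titles:
        if t is not None:  # .string is None for tags with nested markup
            f.write(t + '\n')
```

Passing encoding='utf-8' explicitly also avoids surprises on platforms whose default locale encoding is not UTF-8.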

The tutorial also shows how to download images by locating img tags with specific classes and .jpg extensions, then saving them with urllib.request.urlretrieve while avoiding filename collisions using timestamps.

import re
import time
from urllib import request
from bs4 import BeautifulSoup

url = r'https://www.zhihu.com/question/22918070'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
page = request.Request(url, headers=headers)
page_info = request.urlopen(page).read().decode('utf-8')
soup = BeautifulSoup(page_info, 'html.parser')
# Match <img> tags with class "origin_image zh-lightbox-thumb" whose src
# ends in ".jpg" (the dot must be escaped in the regex)
links = soup.find_all('img', 'origin_image zh-lightbox-thumb', src=re.compile(r'\.jpg$'))
local_path = r'E:\Pic'
for link in links:
    # A timestamp in the filename avoids overwriting earlier downloads
    request.urlretrieve(link.attrs['src'], local_path + r'\%s.jpg' % time.time())
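Timestamps have limited resolution, so two downloads completing in the same tick could still collide. An index-based naming scheme sidesteps that entirely; the helper and URLs below are made up for illustration:

```python
import os

def unique_names(srcs, local_path='pics'):
    # Zero-padded indexes give unique, sortable filenames, unlike
    # time.time(), which can repeat within a fast loop
    return [os.path.join(local_path, '%04d.jpg' % i) for i, _ in enumerate(srcs)]

srcs = ['https://example.com/a.jpg', 'https://example.com/b.jpg']  # hypothetical
print(unique_names(srcs))
```

Each generated path can then be passed as the second argument to urllib.request.urlretrieve in place of the timestamp-based name.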

Overall, the article provides a step‑by‑step guide for building simple web crawlers in Python, covering request handling, HTML parsing, data extraction, and local storage of both text and images.

Tags: file-io, image-downloading, urllib, BeautifulSoup, web-scraping
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
