Master Web Scraping with Python: Requests + BeautifulSoup Step‑by‑Step
This tutorial walks you through using Python's requests library to fetch a web page and BeautifulSoup4 to parse HTML, covering object creation, common attributes, tag properties, and the find() / find_all() methods for extracting specific content.
After fetching an HTML page with requests, you need to parse it to extract useful information.
BeautifulSoup4 (bs4) is a library for parsing HTML and XML.
1. Creating a BeautifulSoup object
Import the class and instantiate it with the page text.
<code>import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.baidu.com/")
r.encoding = "utf-8"
soup = BeautifulSoup(r.text, "html.parser")  # name a parser explicitly to avoid a warning
print(type(soup))
</code>
This prints the object's type, &lt;class 'bs4.BeautifulSoup'&gt;.
2. Common attributes
The BeautifulSoup object represents the whole HTML tree; you can access the first tag of a given name directly as an attribute, e.g. soup.title.
head: the <head> tag and its contents.
title: the page title inside <title>.
body: the <body> tag and its contents.
p: the first <p> (paragraph) tag.
strings: a generator over all text strings in the document, including whitespace-only ones.
stripped_strings: the same strings with surrounding whitespace stripped and whitespace-only entries skipped.
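As a self-contained illustration of these attributes, here is a minimal sketch against an inline HTML snippet (the snippet and its text are made up for this example, so no network request is needed):

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for a fetched page.
html = """
<html>
  <head><title>Demo Page</title></head>
  <body>
    <p> first paragraph </p>
    <p>second paragraph</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.title)                   # the <title> tag
print(soup.p)                       # only the FIRST <p> tag
print(list(soup.stripped_strings))  # all text, whitespace stripped
```

Note that soup.p silently ignores the second paragraph; to collect every paragraph you would use find_all(), covered below.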
Example: extracting the slogan “百度一下,你就知道” (roughly, “Baidu it, and you'll know”). First fetch the page, then access soup.title.
<code>import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.baidu.com/")
r.encoding = "utf-8"
soup = BeautifulSoup(r.text, "html.parser")
title = soup.title
print(title)
</code>
3. Tag object attributes
Each tag is a Tag object with attributes: name, attrs, contents, string. The string property depends on nesting: it returns the tag's text only when the tag has a single string child (or a single child tag whose own string is defined); if the tag has multiple children, string is None.
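A short offline sketch of these Tag attributes, using a hypothetical inline snippet (the class and id values are invented for illustration):

```python
from bs4 import BeautifulSoup

# Invented snippet: a <p> with two children, one of them a <b> tag.
html = '<p class="intro" id="first">Hello, <b>world</b></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(tag.name)       # 'p'
print(tag.attrs)      # {'class': ['intro'], 'id': 'first'}
print(tag.contents)   # ['Hello, ', <b>world</b>]
print(tag.string)     # None: <p> has two children, so .string is ambiguous
print(soup.b.string)  # 'world': exactly one string child, so .string returns it
```

Note that class is multi-valued in HTML, so attrs maps it to a list rather than a plain string.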
Example: get the string of the first <a> tag.
<code>import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.baidu.com/")
r.encoding = "utf-8"
soup = BeautifulSoup(r.text, "html.parser")
print(soup.a.string)
</code>
4. Using find() and find_all()
find() returns the first match (or None if nothing matches), while find_all() returns a list of all matches. Both accept name, attrs, recursive, and string parameters; find_all() additionally accepts limit to cap the number of results.
Example: find every text string containing “百度” (Baidu) by passing a regular expression as the string argument.
<code>import requests
import re
from bs4 import BeautifulSoup
r = requests.get("https://www.baidu.com/")
r.encoding = "utf-8"
soup = BeautifulSoup(r.text, "html.parser")
w = soup.find_all(string=re.compile("百度"))
print(w)
</code>