Master Web Scraping with Python: Requests + BeautifulSoup Step‑by‑Step
This tutorial walks you through using Python's requests library to fetch a web page and BeautifulSoup4 to parse HTML, covering object creation, common attributes, tag properties, and the find() / find_all() methods for extracting specific content.
After fetching an HTML page with requests, you need to parse it to extract useful information.
BeautifulSoup4 (bs4) is a library for parsing HTML and XML.
1. Creating a BeautifulSoup object
Import the class and instantiate it with the page text.
<code>import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.baidu.com/")
r.encoding = "utf-8"
soup = BeautifulSoup(r.text, "html.parser")  # name a parser explicitly to avoid a warning
print(type(soup))
</code>
This prints the object's type, &lt;class 'bs4.BeautifulSoup'&gt;.
2. Common attributes
The BeautifulSoup object represents the whole HTML tree; you can access the first tag of a given name directly as an attribute, e.g. soup.title.
head: the <head> tag and its contents.
title: the page title inside <title>.
body: the <body> tag and its contents.
p: the first <p> (paragraph) tag.
strings: a generator over all text strings in the document, including whitespace-only ones.
stripped_strings: the same strings with surrounding whitespace stripped and whitespace-only entries skipped.
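As a self-contained illustration of these attributes, here is a minimal sketch against an inline HTML snippet (the snippet and its text are made up for this example, so no network request is needed):

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for a fetched page.
html = """
<html>
  <head><title>Demo Page</title></head>
  <body>
    <p> first paragraph </p>
    <p>second paragraph</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.title)                   # the <title> tag
print(soup.p)                       # only the FIRST <p> tag
print(list(soup.stripped_strings))  # all text, whitespace stripped
```

Note that soup.p silently ignores the second paragraph; to collect every paragraph you would use find_all(), covered below.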
Example: extracting the slogan “百度一下,你就知道” (roughly, “Baidu it, and you'll know”). First fetch the page, then access soup.title.
<code>import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.baidu.com/")
r.encoding = "utf-8"
soup = BeautifulSoup(r.text, "html.parser")
title = soup.title
print(title)
</code>
3. Tag object attributes
Each tag is a Tag object with attributes: name, attrs, contents, string. The string property depends on nesting: it returns the tag's text only when the tag has a single string child (or a single child tag whose own string is defined); if the tag has multiple children, string is None.
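A short offline sketch of these Tag attributes, using a hypothetical inline snippet (the class and id values are invented for illustration):

```python
from bs4 import BeautifulSoup

# Invented snippet: a <p> with two children, one of them a <b> tag.
html = '<p class="intro" id="first">Hello, <b>world</b></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(tag.name)       # 'p'
print(tag.attrs)      # {'class': ['intro'], 'id': 'first'}
print(tag.contents)   # ['Hello, ', <b>world</b>]
print(tag.string)     # None: <p> has two children, so .string is ambiguous
print(soup.b.string)  # 'world': exactly one string child, so .string returns it
```

Note that class is multi-valued in HTML, so attrs maps it to a list rather than a plain string.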
Example: get the string of the first <a> tag.
<code>import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.baidu.com/")
r.encoding = "utf-8"
soup = BeautifulSoup(r.text, "html.parser")
print(soup.a.string)
</code>
4. Using find() and find_all()
find() returns the first match (or None if nothing matches), while find_all() returns a list of all matches. Both accept name, attrs, recursive, and string parameters; find_all() additionally accepts limit to cap the number of results.
Example: find every text string containing “百度” (Baidu) by passing a regular expression as the string argument.
<code>import requests
import re
from bs4 import BeautifulSoup
r = requests.get("https://www.baidu.com/")
r.encoding = "utf-8"
soup = BeautifulSoup(r.text, "html.parser")
w = soup.find_all(string=re.compile("百度"))
print(w)
</code>