Introduction to BeautifulSoup (bs4) for HTML/XML Parsing in Python
This article introduces BeautifulSoup, a Python library for parsing HTML and XML. It explains how to import the library and choose a parser, then demonstrates access to tags by name, searching with find/find_all, CSS selection, and tree traversal, with code examples throughout.
What is BeautifulSoup
BeautifulSoup is a Python library that extracts data from HTML or XML files, providing a convenient way to navigate, search, and modify the parse tree.
Using bs4
1. Import the module
<code>from bs4 import BeautifulSoup</code>
2. Create a soup object with a chosen parser
Typical usage: soup = BeautifulSoup(content, parser). Common parsers include html.parser, lxml, xml, and html5lib. html.parser ships with Python; the others need to be installed first, e.g., pip3 install lxml (which also provides the xml parser) or pip3 install html5lib.
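As a minimal sketch of this step, the snippet below builds a soup object with the built-in html.parser. The content string is a made-up sample, not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; a real script would read this from a file or HTTP response.
content = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# "html.parser" is part of the standard library, so nothing extra has to be installed.
soup = BeautifulSoup(content, "html.parser")

print(type(soup).__name__)  # BeautifulSoup
print(soup.title.string)    # Demo
```

Passing the parser name explicitly keeps the result the same on machines that have different parser libraries installed.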
Parser differences
Different parsers produce slightly different trees. For example, the lxml HTML parser wraps a fragment in <html> and <body> elements and expands the empty <b/> tag into a tag pair:
<code>BeautifulSoup("<a><b/></a>", "lxml")
# <html><body><a><b></b></a></body></html>
</code>
Using the XML parser preserves the empty <b/> tag and adds an XML declaration:
<code>BeautifulSoup("<a><b/></a>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>
</code>
When the source HTML is malformed, the parsers diverge further. Given the input <a></p>, lxml drops the unmatched </p> tag, html5lib pairs it with an opening <p> and also adds a <head> element, and html.parser drops it without adding any <html> or <body> wrapper.
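One way to see these differences locally is to feed the same broken fragment to each parser. This is a sketch under the assumption that lxml and html5lib may or may not be installed; parsers that are missing are skipped rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # a closing </p> with no matching opening tag

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser library is not installed.
        results[parser] = None

for parser, rendered in results.items():
    print(f"{parser:12} -> {rendered if rendered is not None else 'not installed'}")
```

Since none of the three results is more correct than the others, scripts that are shared with other people are usually safer when they name the parser explicitly.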
3. Basic operations
Assuming soup = BeautifulSoup(content, parser), the variable soup represents the parsed document.
3.1 Accessing tags by name
soup.tag_name – get the first occurrence of a tag.
soup.tag_name.name – obtain the tag’s name.
soup.tag_name.attrs – dictionary of all attributes.
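soup.tag_name["attr"] or soup.tag_name.get("attr") – specific attribute value; the first form raises KeyError if the attribute is missing, the second returns None.
soup.tag_name.text, soup.tag_name.string, soup.tag_name.get_text() – retrieve text content. text and get_text() join all text inside the tag, while string returns None when the tag has more than one child.
Nested access works as well, e.g., print(soup.p.a) prints the first a tag inside the first p tag.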
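The accessors listed above can be tried on a small fragment. The markup here is a hypothetical sample, and html.parser is used only because it needs no installation.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a <p> with two attributes, one child tag and trailing text.
html = '<p class="news" id="p1"><a href="/x">link</a> and more</p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p                # first <p> in the document
print(p.name)             # p
print(p.attrs)            # {'class': ['news'], 'id': 'p1'}
print(p["id"])            # p1
print(p.get("title"))     # None, because the attribute is absent
print(p.a.string)         # link
print(p.string)           # None, because <p> has more than one child
print(p.get_text())       # link and more
```

Note that class comes back as a list, because an element can carry several classes at once.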
3.2 Searching with find and find_all
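find(name, attrs, recursive, string, **kwargs) returns the first matching tag, or None if nothing matches; find_all(...) returns a list-like ResultSet of all matches.
Common parameters:
name: tag name.
attrs: dictionary of attribute filters. Attributes can also be passed as keyword arguments; use class_ for the class attribute, because class is a reserved word in Python.
string: filter by text content; a plain string must match exactly. This argument was called text before bs4 4.4.0.
recursive: if False, only direct children are searched.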
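A short sketch of how find and find_all differ in what they return, using a made-up fragment in which one paragraph is nested one level deeper than the other:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second <p> sits inside a <section>, not directly in the <div>.
html = '<div><p class="news">first</p><section><p class="news">second</p></section></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p", class_="news")                  # a single Tag
everything = soup.find_all("p", class_="news")         # list-like ResultSet
missing = soup.find("table")                           # None when nothing matches
direct_only = soup.div.find_all("p", recursive=False)  # direct children of <div> only
by_text = soup.find_all("p", string="second")          # exact text match

print(first.get_text())   # first
print(len(everything))    # 2
print(missing)            # None
print(len(direct_only))   # 1
print(by_text)            # [<p class="news">second</p>]
```

Because find can return None, it is worth checking the result before chaining attribute access onto it.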
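<code>from bs4 import BeautifulSoup

html = """
<html lang="en">
<head><meta charset="UTF-8"><title>Title</title></head>
<body>
<p class="news"><a>123456</a></p>
<a id='i2'>78910</a>
<p class="contents" id="i1"></p>
<a href="http://www.baidu.com" rel="external nofollow">advertisements</a>
<span class="span1" id='i4'>aspan</span>
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
# All three <a> tags, wherever they are nested
print(soup.find_all('a'))
# Match on any attribute through the attrs dictionary
print(soup.find_all(attrs={'id': 'i1'}))
# class_ avoids a clash with Python's reserved word class
print(soup.find_all(class_='news'))
# <a> tags whose text is exactly '123456' (string was named text in older bs4)
print(soup.find_all('a', string='123456'))
# [] because the tag with id='i2' is not a direct child of soup
print(soup.find_all(id='i2', recursive=False))
</code>
3.3 Selecting with CSS selectors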
soup.select(selector) returns a list of tags matching a CSS selector. IDs are selected with #id, classes with .class, and selectors can be combined (e.g., #id .class matches elements with that class nested anywhere inside the element with that id).
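The following self-contained sketch shows how these selector forms behave on a small made-up fragment. It also uses the child combinator (>) and select_one, which returns only the first match or None.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two .news elements inside #box (one nested deeper) and one outside.
html = (
    '<div id="box"><p class="news">a</p><span><p class="news">b</p></span></div>'
    '<p class="news">c</p>'
)
soup = BeautifulSoup(html, "html.parser")

all_news = soup.select(".news")              # every element with class "news"
inside_box = soup.select("#box .news")       # descendants of #box at any depth
direct_in_box = soup.select("#box > .news")  # direct children of #box only
box = soup.select_one("#box")                # first match as a Tag
nothing = soup.select_one("#nope")           # None when nothing matches

print([tag.get_text() for tag in all_news])       # ['a', 'b', 'c']
print([tag.get_text() for tag in inside_box])     # ['a', 'b']
print([tag.get_text() for tag in direct_in_box])  # ['a']
print(box.name, nothing)                          # div None
```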
<code>soup.select('#i2')
# list containing the element with id="i2"
soup.select('.news')
# list of all elements whose class includes "news"
soup.select('#i2 .news')
# .news elements nested inside the element with id="i2" (an empty list for the HTML in section 3.2)
</code>
3.4 Tree navigation
soup.tag.contents – list of direct children (including newline strings).
soup.tag.children – iterator over direct children.
soup.tag.descendants – iterator over all descendants.
soup.tag.parent – immediate parent.
soup.tag.parents – iterator over all ancestors.
soup.tag.next_sibling / previous_sibling – adjacent siblings (often a newline string rather than a tag when the source is indented).
soup.tag.next_siblings / previous_siblings – iterators over subsequent or preceding siblings.
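The navigation attributes above can be sketched on a tiny made-up tree. The markup is written without whitespace between tags so that no newline strings appear among the children.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with no whitespace between tags.
html = "<div><p>one</p><p>two</p></div>"
soup = BeautifulSoup(html, "html.parser")

div = soup.div
first = div.p

child_names = [child.name for child in div.children]
descendant_count = len(list(div.descendants))
ancestor_names = [parent.name for parent in first.parents]

print(div.contents)            # [<p>one</p>, <p>two</p>]
print(child_names)             # ['p', 'p']
print(descendant_count)        # 4: two <p> tags plus their two strings
print(first.parent.name)       # div
print(ancestor_names)          # ['div', '[document]']
print(first.next_sibling)      # <p>two</p>
print(first.previous_sibling)  # None
```

The last ancestor, named [document], is the BeautifulSoup object itself.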
3.5 Pretty‑printing
The parser repairs incomplete markup when the document is parsed, for example by closing tags that were left open; soup.prettify() then returns the repaired tree as an indented string with one tag per line.
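A small sketch, on a made-up fragment with two unclosed tags, suggests that the closing tags are already present in the parsed tree and that prettify() mainly affects layout:

```python
from bs4 import BeautifulSoup

# Hypothetical input: neither <b> nor <p> is closed.
soup = BeautifulSoup("<p>one<b>two", "html.parser")

compact = str(soup)       # closing tags are present without calling prettify()
pretty = soup.prettify()  # same tree, spread over indented lines

print(compact)  # <p>one<b>two</b></p>
print(pretty)
```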
<code>from bs4 import BeautifulSoup

# Note: this markup has no closing </html> tag.
html = """
<html lang="en"><head><meta charset="UTF-8"><title>Title</title></head>
<body><p class="news"><a>123456</a></p>
<a id='i2'>78910</a>
</body>
"""
soup = BeautifulSoup(html, 'lxml')
# The printed document ends with the restored </html> tag
print(soup.prettify())
</code>
For more detailed information, refer to the official documentation, which includes a Chinese translation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.