Introduction to BeautifulSoup (bs4) for HTML/XML Parsing in Python

This article introduces BeautifulSoup, a Python library for parsing HTML and XML. It covers importing the library, choosing a parser, accessing tags by name, searching with find/find_all, selecting with CSS selectors, and traversing the tree, with code examples throughout.

Python Programming Learning Circle

Apr 10, 2020

What is BeautifulSoup

BeautifulSoup is a Python library that extracts data from HTML or XML files, providing a convenient way to navigate, search, and modify the parse tree.

Using bs4

1. Import the module

from bs4 import BeautifulSoup

2. Create a soup object with a chosen parser

Typical usage: soup = BeautifulSoup(content, parser). Common parser names are html.parser (bundled with Python), lxml, xml (the XML parser provided by lxml), and html5lib. lxml and html5lib must be installed separately, e.g., pip3 install lxml.
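As a minimal sketch of the step above, the following creates a soup object with the bundled html.parser and then tries lxml, falling back when it is not installed. The sample markup and the variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

content = "<p>Hello, <b>world</b></p>"  # hypothetical sample markup

# html.parser ships with Python, so this call needs no extra install.
soup = BeautifulSoup(content, "html.parser")
print(soup.b)             # <b>world</b>
print(soup.p.get_text())  # Hello, world

# lxml is optional; bs4 raises FeatureNotFound when the requested parser is missing.
try:
    soup_lxml = BeautifulSoup(content, "lxml")
except FeatureNotFound:
    soup_lxml = None  # fall back to the html.parser soup above
```

Naming the parser explicitly keeps results reproducible across machines, since the parser bs4 picks by default depends on what is installed.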

Parser differences

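Different parsers produce slightly different trees. For example, an HTML parser such as lxml expands the empty <b/> tag into an opening and closing pair and wraps the fragment in a full document:

BeautifulSoup("<a><b/></a>", "lxml")
# <html><body><a><b></b></a></body></html>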

Using the XML parser preserves the empty <b/> tag and adds an XML declaration:

BeautifulSoup("<a><b/></a>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>

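When the source HTML is malformed, parsers behave differently. Given the fragment "<a></p>", the lxml parser drops the unmatched </p> tag and wraps the result in <html> and <body>, html5lib pairs the </p> with an opening <p> and also adds a <head> element, and html.parser drops the </p> without adding any wrapper elements.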
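One way to see these differences is to parse the same broken fragment with each parser and print the result. The sketch below assumes nothing about which optional parsers are installed: it records None for any that are missing, so the printed output will vary by environment.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # an <a> that is never closed, plus a stray </p>

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        results[parser] = None  # this parser is not installed

for parser, tree in results.items():
    print(f"{parser:12} {tree}")
```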

3. Basic operations

Assuming soup = BeautifulSoup(content, parser), the variable soup represents the parsed document.

3.1 Accessing tags by name

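soup.tag_name : the first occurrence of the tag.

soup.tag_name.name : the tag's name.

soup.tag_name.attrs : dictionary of all attributes.

soup.tag_name["attr"] or soup.tag_name.get("attr") : a specific attribute value. get returns None when the attribute is missing, while the bracket form raises KeyError.

soup.tag_name.text or soup.tag_name.get_text() : all text inside the tag, including text in nested tags.

soup.tag_name.string : the tag's single string, or None when the tag has more than one child.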

Nested access works as well, e.g., print(soup.p.a) prints the first a tag inside the first p tag.
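The accessors above can be tried on a small fragment. This is an illustrative sketch: the markup is made up for the example, and html.parser is used only so that it runs without extra installs.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a <p> with two attributes, a nested <a>, and trailing text.
html = '<p class="news" id="p1"><a href="/x">123456</a> more</p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(p.name)            # p
print(p.attrs)           # {'class': ['news'], 'id': 'p1'}
print(p["id"])           # p1
print(p.get("missing"))  # None (p["missing"] would raise KeyError)
print(p.get_text())      # 123456 more
print(p.string)          # None, because <p> has two children
print(p.a.string)        # 123456
```

Note that class comes back as a list, because bs4 treats it as a multi-valued attribute.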

3.2 Searching with find and find_all

find(name, attrs, recursive, text, **kwargs) returns the first matching tag, or None if nothing matches; find_all(...) accepts the same filters and returns a list of all matches.

Common parameters:

name : tag name.

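attrs : dictionary of attribute names and values. Attributes can also be passed as keyword arguments, e.g., id='i1'; use class_ for the class attribute, because class is a reserved word in Python.

text : match on a tag's string content. Newer versions of bs4 call this argument string and keep text as an older alias.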

recursive : if False, only direct children are searched.

html = """
<html lang="en">
<head><meta charset="UTF-8"><title>Title</title></head>
<body>
<p class="news"><a>123456</a></p>
<a id='i2'>78910</a>
<p class="contents" id="i1"></p>
<a href="http://www.baidu.com" rel="external nofollow">advertisements</a>
<span class="span1" id='i4'>aspan</span>
</body>
</html>
"""

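soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('a'))                  # all three <a> tags
print(soup.find_all(attrs={'id': 'i1'}))   # the <p> with id="i1"
print(soup.find_all(class_='news'))        # the <p> with class="news"
print(soup.find_all('a', text='123456'))   # the <a> whose string is exactly "123456"
# Empty list: <a id='i2'> sits inside <body>, so it is not a direct child of soup.
print(soup.find_all(id='i2', recursive=False))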
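To complement the find_all calls above, here is a small sketch of find, including the None it returns when nothing matches, and of recursive=False applied to a tag rather than to the whole document. The shortened markup and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = """<body>
<p class="news"><a>123456</a></p>
<a id="i2">78910</a>
<p class="contents" id="i1"></p>
</body>"""
soup = BeautifulSoup(html, "html.parser")

first_a = soup.find("a")      # first <a> in document order
all_a = soup.find_all("a")    # every <a>, at any depth
missing = soup.find("table")  # None: there is no <table>
by_id = soup.find(id="i1")    # keyword argument filters on an attribute
# Only <a> tags that are direct children of <body>.
direct = soup.body.find_all("a", recursive=False)

print(first_a)         # <a>123456</a>
print(len(all_a))      # 2
print(missing)         # None
print(by_id["class"])  # ['contents']
print(direct)          # [<a id="i2">78910</a>]
```

Because find can return None, checking the result before accessing attributes avoids an AttributeError on pages that lack the expected tag.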

3.3 Selecting with CSS selectors

soup.select('selector') returns a list of tags matching the CSS selector. IDs are selected with #id, classes with .class, and selectors can be combined (e.g., #id .class matches elements with that class inside the element with that id).

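soup.select('#i2')
# returns a list containing the element with id="i2"

soup.select('.news')
# returns a list of all elements with class="news"

soup.select('#i2 .news')
# returns .news elements inside the element with id="i2"
# (an empty list for the sample HTML above, where <a id='i2'> has no such descendants)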
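The selectors above can be checked against a fragment in which the combined form does match something. This is a sketch on made-up markup; select_one, which returns only the first match, is shown alongside select.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two .news paragraphs inside #box and one outside it.
html = """<div id="box">
<p class="news">first</p>
<p class="news extra">second</p>
</div>
<p class="news">outside</p>"""
soup = BeautifulSoup(html, "html.parser")

all_news = [p.get_text() for p in soup.select(".news")]
boxed_news = [p.get_text() for p in soup.select("#box .news")]
extra = soup.select_one("#box > p.extra")  # first match only, or None
nothing = soup.select("#nope")             # no match gives an empty list

print(all_news)          # ['first', 'second', 'outside']
print(boxed_news)        # ['first', 'second']
print(extra.get_text())  # second
print(nothing)           # []
```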

3.4 Tree navigation

soup.tag.contents : list of direct children (including newline strings).

soup.tag.children : iterator over direct children.

soup.tag.descendants : iterator over all descendants.

soup.tag.parent : immediate parent.

soup.tag.parents : iterator over all ancestors.

soup.tag.next_sibling / previous_sibling : adjacent siblings.

soup.tag.next_siblings / previous_siblings : iterators over subsequent or preceding siblings.
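A short sketch of these navigation attributes on a compact fragment follows. The markup is an assumption chosen to contain no whitespace between tags, so that no newline strings show up among the children.

```python
from bs4 import BeautifulSoup

html = "<div><p>one</p><p>two</p><span>three</span></div>"
soup = BeautifulSoup(html, "html.parser")

div = soup.div
first_p = div.p

child_names = [child.name for child in div.children]
descendant_count = len(list(div.descendants))  # three tags plus their three strings
ancestor_names = [a.name for a in first_p.parents]
later_siblings = [s.name for s in first_p.next_siblings]

print(child_names)               # ['p', 'p', 'span']
print(descendant_count)          # 6
print(first_p.parent.name)       # div
print(ancestor_names)            # ['div', '[document]']
print(first_p.next_sibling)      # <p>two</p>
print(first_p.previous_sibling)  # None
print(later_siblings)            # ['p', 'span']
```

The last ancestor is the BeautifulSoup object itself, whose name is '[document]'.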

3.5 Pretty‑printing

If the original markup is incomplete, the parser closes the missing tags while building the tree; soup.prettify() then returns the repaired document as an indented string.

html = """
<html lang="en"><head><meta charset="UTF-8"><title>Title</title></head>
<body><p class="news"><a>123456</a></p>
<a id='i2'>78910</a>
</body>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
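To separate the two effects, the sketch below compares str(soup) with soup.prettify() on a fragment whose tags are never closed. The fragment is an assumption for illustration, and html.parser is used so that no wrapper elements are added.

```python
from bs4 import BeautifulSoup

# Neither <p> nor <b> is closed in the input.
soup = BeautifulSoup("<p>one<b>two", "html.parser")

compact = str(soup)       # tags are already closed in the parsed tree
pretty = soup.prettify()  # same tree, one node per line with indentation

print(compact)  # <p>one<b>two</b></p>
print(pretty)
```

The closing tags appear in both strings, which suggests the repair happens at parse time and prettify only changes the layout.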

For more detailed information, refer to the official documentation (including a Chinese translation) at https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html .
