Master HTML Parsing in Python: BeautifulSoup, lxml, and html.parser Compared

Learn why HTML parsing is essential for web scraping, explore three popular Python libraries—BeautifulSoup, lxml, and the built‑in html.parser—covering installation, core usage, advanced techniques, and a comparative analysis to help you choose the right tool for your project.

Code Mala Tang

Apr 19, 2025

This article explains how to parse HTML using three popular Python tools—BeautifulSoup, lxml, and html.parser—each offering distinct advantages.

Why parse HTML?

When you visit a webpage, the content is structured with HTML tags that define headings, paragraphs, images, links, and more. Extracting specific information such as titles, prices, or comments requires navigating this structure. Manual inspection is tedious, especially for large or multiple pages, so parsing tools automate locating and extracting the needed data.

Python tools for parsing HTML

Python provides several libraries for HTML parsing, each suited to different scenarios. Below are three widely used options.

BeautifulSoup

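BeautifulSoup is one of the most popular Python libraries for working with HTML and XML. It does not parse markup itself; it wraps an underlying parser (the built-in html.parser, lxml, or html5lib) and exposes the result as a tree that is easy to navigate and search.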

Installation

pip install beautifulsoup4

BeautifulSoup is often used together with requests to fetch page content.

pip install requests

How to use BeautifulSoup

Example that extracts a page title:

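import requests
from bs4 import BeautifulSoup

url = "https://example.com"
# A timeout keeps the script from hanging on an unresponsive server.
response = requests.get(url, timeout=10)
# Raise an exception for 4xx/5xx responses instead of parsing an error page.
response.raise_for_status()

# "html.parser" selects Python's built-in parser as the backend.
soup = BeautifulSoup(response.text, "html.parser")

# soup.title is the first <title> element in the document.
title = soup.title.text
print("Page title:", title)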

In this code:

We use requests.get to retrieve the HTML.

BeautifulSoup parses the HTML content.

soup.title extracts the page title.
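The snippet above assumes the page actually has a <title> element. A minimal sketch of a more defensive version follows; it parses inline HTML strings so it runs without network access, and the helper name extract_title is hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def extract_title(html_text):
    """Return the stripped <title> text, or None if the page has no title."""
    soup = BeautifulSoup(html_text, "html.parser")
    # soup.title is None when the document has no <title> element,
    # so guard before reading its text.
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)


with_title = "<html><head><title> Example Domain </title></head><body></body></html>"
without_title = "<html><body><p>No title here</p></body></html>"

print(extract_title(with_title))
print(extract_title(without_title))
```

Returning None rather than raising leaves it to the caller to decide whether a missing title is an error.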

Traversing the HTML structure

After parsing, you can navigate the tree with methods such as:

soup.find finds the first occurrence of a tag.

soup.find_all returns all occurrences of a specific tag.

Example that extracts all links:

links = soup.find_all('a')
for link in links:
    print(link.get('href'))

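This prints the href value of every <a> tag on the page. Anchors without an href attribute print None, and relative URLs are printed as written rather than resolved against the page address.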
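As a sketch of how find and find_all differ in practice, the example below runs against an inline HTML string. The markup and the base_url value are made up for illustration; urljoin from the standard library turns relative links into absolute ones.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

html_text = """
<html><body>
  <a href="/about">About</a>
  <a href="https://example.org/docs">Docs</a>
  <a name="top">Anchor without href</a>
</body></html>
"""
base_url = "https://example.com/"  # assumed address the page was fetched from

soup = BeautifulSoup(html_text, "html.parser")

# find returns only the first match (or None if nothing matches).
first_link = soup.find("a")

# href=True keeps only <a> tags that actually carry an href attribute,
# which avoids collecting None for anchors without one.
absolute_urls = [
    urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)
]

print(first_link.get_text())
print(absolute_urls)
```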

lxml

lxml is a powerful library for parsing HTML and XML, known for its speed and accuracy. It is ideal when performance is a priority.

Installation

pip install lxml

How to use lxml

Example that extracts a page title with findtext, which accepts a simple XPath-like path:

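from lxml import html
import requests

url = "https://example.com"
response = requests.get(url, timeout=10)
response.raise_for_status()

# Pass bytes (response.content) so lxml can detect the encoding itself.
tree = html.fromstring(response.content)

# findtext returns the text of the first matching element, or None.
title = tree.findtext('.//title')
print("Page title:", title)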

In this example:

We parse the page with lxml.html.

The findtext method returns the text of the first element matching the path, here the <title> tag, or None if nothing matches.
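To make the difference between findtext and a full XPath query concrete, here is a small sketch on an inline HTML string (the markup is invented for illustration). findtext takes the limited ElementPath syntax and returns a single string or None, while the xpath method evaluates XPath 1.0 and returns a list.

```python
from lxml import html

html_text = "<html><head><title>Example Domain</title></head><body><p>Hi</p></body></html>"
tree = html.fromstring(html_text)

# findtext takes an ElementPath expression and returns None if nothing matches.
title_via_findtext = tree.findtext(".//title")

# xpath evaluates full XPath 1.0 and returns a list of matches.
title_via_xpath = tree.xpath("//title/text()")

# There is no <h1> in this document, so the result is None.
missing = tree.findtext(".//h1")

print(title_via_findtext, title_via_xpath, missing)
```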

Using XPath

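XPath is a query language that selects nodes by path, attribute, and position. lxml exposes it through the xpath method, which makes it a flexible way to query HTML documents.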

Example that extracts all link URLs:

links = tree.xpath('//a/@href')
for link in links:
    print(link)

The XPath expression //a/@href selects the href attribute of every <a> tag.
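XPath predicates can narrow a query further. The sketch below, run against made-up markup, restricts links by their container and by the start of their href value, then pairs each link's text with its URL.

```python
from lxml import html

html_text = """
<html><body>
  <div class="nav"><a href="/home">Home</a></div>
  <div class="content">
    <a href="https://example.org/a">First</a>
    <a href="/local">Second</a>
  </div>
</body></html>
"""
tree = html.fromstring(html_text)

# Only links inside the div whose class attribute is exactly "content".
content_links = tree.xpath('//div[@class="content"]//a/@href')

# Only links whose href starts with "https".
external_links = tree.xpath('//a[starts-with(@href, "https")]/@href')

# Pair each link's visible text with its href attribute.
pairs = [(a.text_content(), a.get("href")) for a in tree.xpath("//a")]

print(content_links)
print(external_links)
print(pairs)
```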

html.parser

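The built‑in html.parser module is the standard‑library option. Its HTMLParser class is event‑driven: it calls handler methods as it encounters tags and text rather than building a tree. It may be slower than lxml and offers fewer conveniences than the other two options, but it requires no extra installation.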

How to use html.parser

Example of a custom parser that reports start tags, end tags, and data:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called for each opening tag, e.g. <p>.
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
    # Called for each closing tag, e.g. </p>.
    def handle_endtag(self, tag):
        print("End tag:", tag)
    # Called for text between tags, including whitespace-only runs.
    def handle_data(self, data):
        print("Data:", data)

html_content = """
<html>
<head><title>Example</title></head>
<body><p>Hello, world!</p></body>
</html>
"""

parser = MyHTMLParser()
parser.feed(html_content)

In this example we subclass HTMLParser and override methods to handle different parts of the document.
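Printing from the handlers is useful for seeing the event flow, but extraction usually means storing what the handlers see. A minimal sketch, using only the standard library: the class name LinkCollector and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased regardless of how the page wrote them.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<p><a href="/one">One</a> and <A HREF="/two">Two</A></p>')
collector.close()
print(collector.links)
```

Because the parser keeps no tree, any context you need (for example, whether you are currently inside a particular tag) has to be tracked in instance attributes like links here.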

Comparing these libraries

BeautifulSoup

Ease of use: Very beginner‑friendly.

Flexibility: Handles simple and complex parsing tasks.

Performance: Slower on large documents compared to lxml.

lxml

Speed: Extremely fast HTML parsing.

Accuracy: Handles malformed HTML gracefully.

XPath support: Enables powerful queries.

html.parser

Built‑in: No external dependencies.

Basic parsing: Suitable for simple tasks but less flexible than the other two.
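Performance differences depend on the document and the machine, so it is worth measuring on your own data. The following is a rough timing sketch, not a benchmark: the synthetic document, the function names, and the repeat count are all assumptions chosen for illustration.

```python
import timeit

from bs4 import BeautifulSoup
from lxml import html

# Synthetic document: 2,000 paragraphs, each containing one link.
rows = "".join(f'<p><a href="/item/{i}">Item {i}</a></p>' for i in range(2000))
document = f"<html><body>{rows}</body></html>"


def count_links_bs4():
    # BeautifulSoup with the built-in html.parser backend.
    soup = BeautifulSoup(document, "html.parser")
    return len(soup.find_all("a"))


def count_links_lxml():
    # lxml parsing plus an XPath query for every <a> element.
    tree = html.fromstring(document)
    return len(tree.xpath("//a"))


# Each function parses the document from scratch three times.
bs4_seconds = timeit.timeit(count_links_bs4, number=3)
lxml_seconds = timeit.timeit(count_links_lxml, number=3)
print(f"BeautifulSoup + html.parser: {bs4_seconds:.3f} s")
print(f"lxml: {lxml_seconds:.3f} s")
```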

Choosing the right tool

If you need a quick, simple solution without extra installations, html.parser is a good choice.

For large, complex documents or high performance, lxml is optimal.

If you prefer an easy‑to‑use, feature‑rich library with strong community support, BeautifulSoup is ideal.
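The choice is not strictly either/or: BeautifulSoup can delegate parsing to lxml by passing "lxml" as the parser name, which keeps the BeautifulSoup API while using the faster backend. A minimal sketch, assuming lxml may or may not be installed; the helper name make_soup is hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html_text = "<html><head><title>Example</title></head><body><p>Hello</p></body></html>"


def make_soup(markup):
    """Parse with lxml when it is installed, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is unavailable.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(html_text)
title_text = soup.title.get_text()
paragraph_text = soup.find("p").get_text()
print(title_text, paragraph_text)
```

Different backends can repair malformed markup differently, so it is safest to settle on one parser for a given project.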

Advanced parsing techniques

For more complex scenarios you may combine these libraries with other tools. For example, use BeautifulSoup with requests for static pages, or employ Selenium or Playwright to render JavaScript‑heavy sites before parsing.

Example with Selenium and BeautifulSoup

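from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://example.com"
driver = webdriver.Chrome()  # requires a local Chrome installation
try:
    driver.get(url)
    # page_source is the DOM serialized at the moment of the call.
    html_content = driver.page_source
finally:
    # Always close the browser, even if loading the page fails.
    driver.quit()

soup = BeautifulSoup(html_content, "html.parser")
title = soup.title.text
print("Page title:", title)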

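This script opens the page in a real browser, which executes the page's JavaScript, then extracts the title using BeautifulSoup. Note that page_source reflects the DOM at the moment it is read, so content that loads after the initial page load may need an explicit wait (for example, Selenium's WebDriverWait) before parsing.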

Conclusion

HTML parsing is a crucial skill for web crawling, data extraction, and automation. Python offers powerful libraries—BeautifulSoup, lxml, and the built‑in html.parser—that make the task straightforward. Choose BeautifulSoup for ease of use, lxml for speed and XPath support, or html.parser for simple, dependency‑free parsing.

Understanding each tool’s strengths enables efficient HTML parsing and data extraction tailored to your project’s needs.
