Getting Started with requests-html: Installation, Basic Usage, Advanced Features, and Web Scraping Examples
This article introduces the Python requests-html library: installation, basic operations (fetching pages, extracting links and elements), advanced features (JavaScript rendering, smart pagination, custom requests), and practical web-scraping examples for sites such as Jianshu and Tianya.
Python has a popular HTTP library called requests, and its author released a new library named requests-html that combines HTTP requests with HTML parsing, offering a convenient way to scrape web pages.
Installation
Install requests-html with a single command; it requires Python 3.6+ because it uses type annotations.
<code>pip install requests-html</code>
Basic Usage
Fetching a Web Page
requests-html automatically downloads the page and returns a response object whose html attribute is an HTML instance for parsing.
<code>from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.qiushibaike.com/text/')
print(r.html.html)</code>
Getting Links
The links attribute returns all URLs found in the page as written (possibly relative), while absolute_links resolves every link to an absolute URL.
<code>print(r.html.links)
print(r.html.absolute_links)</code>
Getting Elements
Use CSS selectors with find or XPath with xpath to locate elements. The find method accepts parameters such as selector, clean (strip embedded &lt;script&gt; and &lt;style&gt; content), containing (keep only elements whose text contains a given string), first (return only the first match instead of a list), and _encoding.
<code># CSS selector example
print(r.html.find('div#menu', first=True).text)
# XPath example
print(r.html.xpath("//div[@id='menu']", first=True).text)</code>
Advanced Usage
JavaScript Support
For pages rendered by JavaScript, call r.html.render(). On first use this downloads Chromium via pyppeteer (a one-time download), then executes the page's scripts.
<code>r = session.get('http://python-requests.org/')
r.html.render()
print(r.html.search('Python 2 will retire in only {months} months!')['months'])</code>
The render method accepts parameters such as retries, script, wait, scrolldown, sleep, reload, and keep_page for fine‑grained control.
Smart Pagination
Iterate over r.html objects to follow pagination links automatically.
<code>for html in r.html:
    print(html)

next_url = r.html.next()</code>
Direct HTML Usage
You can create an HTML object from a string without making a network request.
<code>from requests_html import HTML
doc = """<a href='https://httpbin.org'>Link</a>"""
html = HTML(html=doc)
print(html.links)
</code>
Custom Requests
All HTTP methods support additional **kwargs to pass custom headers, cookies, etc. Example of changing the User‑Agent:
<code>ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0'
r = session.get('http://httpbin.org/get', headers={'User-Agent': ua})
print(r.html.html)
</code>
Form Login Simulation
Use session.post with form data to simulate a login.
<code>r = session.post('http://httpbin.org/post', data={'username': 'yitian', 'passwd': '123456'})
print(r.html.html)</code>
Web‑Scraping Examples
Scraping Jianshu User Articles
Render the page, scroll down to load all articles, then extract titles and URLs.
<code>r = session.get('https://www.jianshu.com/u/7753478e1554')
r.html.render(scrolldown=50, sleep=2)
titles = r.html.find('a.title')
for i, title in enumerate(titles):
    print(f"{i+1} [{title.text}](https://www.jianshu.com{title.attrs['href']})")</code>
Scraping Tianya Forum Threads
Iterate through paginated forum pages, collect the author’s posts, and write them to a text file.
<code>import io

url = 'http://bbs.tianya.cn/post-culture-488321-1.shtml'
r = session.get(url)
# Use the page <title> as the output filename (one reasonable choice)
title = r.html.find('title', first=True).text
author = r.html.find('div.atl-info span a', first=True).text
# Determine total pages
div = r.html.find('div.atl-pages', first=True)
links = div.find('a')
total_page = 1 if not links else int(links[-2].text)
# Loop through pages and save the author's posts
with io.open(f"{title}.txt", 'x', encoding='utf-8') as f:
    for i in range(1, total_page + 1):
        page_url = f"{url.rsplit('-', 1)[0]}-{i}.shtml"
        r = session.get(page_url)
        items = r.html.find(f'div.atl-item[_host="{author}"]')
        for item in items:
            content = item.find('div.bbs-content', first=True).text
            if not content.startswith('@'):
                f.write(content + "\n")</code>
These examples demonstrate how requests-html simplifies web scraping compared to using raw requests plus BeautifulSoup or a full‑featured framework like Scrapy.