Getting Started with requests-html: Installation, Basic Usage, and Advanced Features
This article introduces the Python requests-html library: installation, basic operations such as fetching pages and extracting links and elements with CSS and XPath selectors, advanced capabilities such as JavaScript rendering, pagination handling, and custom request options, and two practical web‑scraping examples.
The requests-html library extends the popular requests HTTP client with built‑in HTML parsing, allowing developers to download a page and immediately work with its DOM without a separate parser.
Installation
Install the library with a single pip command (Python 3.6+ is required):
pip install requests-html

Basic usage
First create a session and request a page. The response object exposes an html attribute of type HTML, which provides the parsing API.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.qiushibaike.com/text/')
print(r.html.html)  # raw HTML

To obtain all links on the page:
print(r.html.links) # set of relative URLs
print(r.html.absolute_links)  # set of absolute URLs

To select elements you can use CSS selectors via find or XPath via xpath. The find method accepts the parameters selector, clean, containing, first, and _encoding.
# CSS example – get the menu text
print(r.html.find('div#menu', first=True).text)
# Get all anchor tags inside the menu
print(r.html.find('div#menu a'))

XPath works similarly:
# XPath example – get the menu text
print(r.html.xpath("//div[@id='menu']", first=True).text)
# Get all <a> elements inside the menu
print(r.html.xpath("//div[@id='menu']/a"))

To read an element’s text, attributes or raw HTML, use .text, .attrs and .html respectively:
e = r.html.find('div#hd_logo', first=True)
print(e.text) # element text
print(e.attrs) # attribute dict
print(e.html)  # raw element HTML

Advanced usage
Some sites render content with JavaScript. Calling r.html.render() launches a headless Chromium instance (downloaded on first use) to execute the page’s scripts.
r = session.get('http://python-requests.org/')
r.html.render()
print(r.html.html)  # fully rendered HTML

The render method accepts parameters such as retries, script, wait, scrolldown, sleep, reload, and keep_page to control the rendering process.
Pagination can be handled by iterating over r.html, which yields one HTML object per page, or by calling r.html.next() to get the next page’s URL and requesting it manually.
r = session.get('https://reddit.com')
for html in r.html:
    print(html)  # each page object
next_url = r.html.next()
print(next_url)

Custom request options are passed through **kwargs. For example, you can set a custom User‑Agent header:
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0'
r = session.get('http://httpbin.org/get', headers={'User-Agent': ua})
print(r.html.html)

Form submission works like the underlying requests library:
r = session.post('http://httpbin.org/post', data={'username': 'yitian', 'passwd': '123456'})
print(r.html.html)

Practical examples
• Scraping a Jianshu user’s article list (requires JavaScript rendering and scrolling):
r = session.get('https://www.jianshu.com/u/7753478e1554')
r.html.render(scrolldown=50, sleep=2)
titles = r.html.find('a.title')
for i, title in enumerate(titles):
    print(f"{i+1} [{title.text}](https://www.jianshu.com{title.attrs['href']})")

• Downloading a multi‑page Tianya forum thread, extracting only the original author’s posts:
url = 'http://bbs.tianya.cn/post-culture-488321-1.shtml'
session = HTMLSession()
r = session.get(url)
author = r.html.find('div.atl-info span a', first=True).text
# Determine total pages
div = r.html.find('div.atl-pages', first=True)
links = div.find('a')
total_page = 1 if not links else int(links[-2].text)
title = r.html.find('span.s_title span', first=True).text
with open(f'{title}.txt', 'x', encoding='utf-8') as f:
    for i in range(1, total_page + 1):
        page_url = f"{url.rsplit('-', 1)[0]}-{i}.shtml"
        page = session.get(page_url)
        # Quote the attribute value so author names with special characters still match
        items = page.html.find(f'div.atl-item[_host="{author}"]')
        for item in items:
            content = item.find('div.bbs-content', first=True).text
            if not content.startswith('@'):
                f.write(content + "\n")

These examples demonstrate how requests-html bridges the gap between simple HTTP requests and full‑featured web scraping, offering a concise API for both static and dynamic pages.
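The manual page-URL construction used in the Tianya example can be factored into a small helper. A minimal sketch using only the standard library, assuming the numbered .shtml pattern shown in the example URLs:

```python
def next_page_url(url: str) -> str:
    """Increment the trailing page number in URLs like
    'http://bbs.tianya.cn/post-culture-488321-1.shtml' (pattern assumed)."""
    base, tail = url.rsplit('-', 1)       # split off '1.shtml'
    page = int(tail.split('.', 1)[0])     # current page number
    return f"{base}-{page + 1}.shtml"

print(next_page_url('http://bbs.tianya.cn/post-culture-488321-1.shtml'))
# -> http://bbs.tianya.cn/post-culture-488321-2.shtml
```

A helper like this keeps the scraping loop readable and makes the URL scheme a single place to change if the site ever renumbers its pages.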