Getting Started with requests-html: Installation, Basic Usage, and Advanced Features
This article introduces the Python requests-html library: installation, basic operations such as fetching pages and extracting links and elements with CSS and XPath selectors, advanced capabilities such as JavaScript rendering, pagination handling, and custom request options, and two practical web‑scraping examples.
The requests-html library extends the popular requests HTTP client with built‑in HTML parsing, allowing developers to download a page and immediately work with its DOM without a separate parser.
Installation
Install the library with a single pip command (Python 3.6+ is required):
pip install requests-html

Basic usage
First create a session and request a page. The response object exposes an html attribute of type HTML, which provides the parsing API.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.qiushibaike.com/text/')
print(r.html.html)  # raw HTML

To obtain all links on the page:
print(r.html.links) # set of relative URLs
print(r.html.absolute_links)  # set of absolute URLs

To select elements you can use CSS selectors via find or XPath via xpath. The find method accepts the parameters selector, clean, containing, first, and _encoding.
# CSS example – get the menu text
print(r.html.find('div#menu', first=True).text)
# Get all anchor tags inside the menu
print(r.html.find('div#menu a'))

XPath works similarly:
# XPath example – get the menu text
print(r.html.xpath("//div[@id='menu']", first=True).text)
# Get all <a> elements inside the menu
print(r.html.xpath("//div[@id='menu']/a"))

To read an element’s text, attributes or raw HTML, use .text, .attrs and .html respectively:
e = r.html.find('div#hd_logo', first=True)
print(e.text) # element text
print(e.attrs) # attribute dict
print(e.html)  # raw element HTML

Advanced usage
Some sites render content with JavaScript. Calling r.html.render() launches a headless Chromium instance (downloaded on first use) to execute the page’s scripts.
r = session.get('http://python-requests.org/')
r.html.render()
print(r.html.html)  # fully rendered HTML

The render method accepts parameters such as retries, script, wait, scrolldown, sleep, reload, and keep_page to control the rendering process.
Pagination can be handled by iterating over r.html, which yields one HTML object per page, or by calling r.html.next() to get the next page’s URL and requesting it manually.
r = session.get('https://reddit.com')
for html in r.html:
    print(html)  # each page object
next_url = r.html.next()
print(next_url)

Custom request options are passed through **kwargs. For example, you can set a custom User‑Agent header:
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0'
r = session.get('http://httpbin.org/get', headers={'User-Agent': ua})
print(r.html.html)

Form submission works like the underlying requests library:
r = session.post('http://httpbin.org/post', data={'username': 'yitian', 'passwd': '123456'})
print(r.html.html)

Practical examples
• Scraping a Jianshu user’s article list (requires JavaScript rendering and scrolling):
r = session.get('https://www.jianshu.com/u/7753478e1554')
r.html.render(scrolldown=50, sleep=2)
titles = r.html.find('a.title')
for i, title in enumerate(titles):
    print(f"{i+1} [{title.text}](https://www.jianshu.com{title.attrs['href']})")

• Downloading a multi‑page Tianya forum thread, extracting only the original author’s posts:
url = 'http://bbs.tianya.cn/post-culture-488321-1.shtml'
session = HTMLSession()
r = session.get(url)
author = r.html.find('div.atl-info span a', first=True).text
# Determine total pages
div = r.html.find('div.atl-pages', first=True)
links = div.find('a')
total_page = 1 if not links else int(links[-2].text)
title = r.html.find('span.s_title span', first=True).text
with open(f'{title}.txt', 'x', encoding='utf-8') as f:
    for i in range(1, total_page + 1):
        page_url = f"{url.rsplit('-', 1)[0]}-{i}.shtml"
        page = session.get(page_url)
        # Quote the attribute value so author names with special characters still match
        items = page.html.find(f'div.atl-item[_host="{author}"]')
        for item in items:
            content = item.find('div.bbs-content', first=True).text
            if not content.startswith('@'):
                f.write(content + "\n")

These examples demonstrate how requests-html bridges the gap between simple HTTP requests and full‑featured web scraping, offering a concise API for both static and dynamic pages.
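The manual page-URL construction used in the Tianya example can be factored into a small helper. A minimal sketch using only the standard library, assuming the numbered .shtml pattern shown in the example URLs:

```python
def next_page_url(url: str) -> str:
    """Increment the trailing page number in URLs like
    'http://bbs.tianya.cn/post-culture-488321-1.shtml' (pattern assumed)."""
    base, tail = url.rsplit('-', 1)       # split off '1.shtml'
    page = int(tail.split('.', 1)[0])     # current page number
    return f"{base}-{page + 1}.shtml"

print(next_page_url('http://bbs.tianya.cn/post-culture-488321-1.shtml'))
# -> http://bbs.tianya.cn/post-culture-488321-2.shtml
```

A helper like this keeps the scraping loop readable and makes the URL scheme a single place to change if the site ever renumbers its pages.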