Getting Started with requests-html: Installation, Basic Usage, Advanced Features, and Web Scraping Examples
This article introduces the Python requests-html library: installation, basic operations (fetching pages, extracting links and elements), advanced features (JavaScript rendering, smart pagination, custom requests), and practical web-scraping examples for sites such as Jianshu and Tianya.
Python has a popular HTTP library called requests, and its author released a new library named requests-html that combines HTTP requests with HTML parsing, offering a convenient way to scrape web pages.
Installation
Install requests-html with a single command; it requires Python 3.6+ because it uses type annotations.
<code>pip install requests-html</code>
Basic Usage
Fetching a Web Page
requests-html automatically downloads the page and returns a response object whose html attribute is an HTML instance for parsing.
<code>from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.qiushibaike.com/text/')
print(r.html.html)</code>
Getting Links
The links attribute returns all URLs found in the page as written (possibly relative), while absolute_links resolves every link to an absolute URL.
<code>print(r.html.links)
print(r.html.absolute_links)</code>
Getting Elements
Use CSS selectors with find or XPath with xpath to locate elements. The find method accepts parameters such as selector, clean (strip embedded &lt;script&gt; and &lt;style&gt; content), containing (keep only elements whose text contains a given string), first (return only the first match instead of a list), and _encoding.
<code># CSS selector example
print(r.html.find('div#menu', first=True).text)
# XPath example
print(r.html.xpath("//div[@id='menu']", first=True).text)</code>
Advanced Usage
JavaScript Support
For pages rendered by JavaScript, call r.html.render(). On first use this downloads Chromium via pyppeteer (a one-time download), then executes the page's scripts.
<code>r = session.get('http://python-requests.org/')
r.html.render()
print(r.html.search('Python 2 will retire in only {months} months!')['months'])</code>
The render method accepts parameters such as retries, script, wait, scrolldown, sleep, reload, and keep_page for fine‑grained control.
Smart Pagination
Iterate over r.html objects to follow pagination links automatically.
<code>for html in r.html:
    print(html)

next_url = r.html.next()</code>
Direct HTML Usage
You can create an HTML object from a string without making a network request.
<code>from requests_html import HTML
doc = """<a href='https://httpbin.org'>Link</a>"""
html = HTML(html=doc)
print(html.links)
</code>
Custom Requests
All HTTP methods support additional **kwargs to pass custom headers, cookies, etc. Example of changing the User‑Agent:
<code>ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0'
r = session.get('http://httpbin.org/get', headers={'User-Agent': ua})
print(r.html.html)
</code>
Form Login Simulation
Use session.post with form data to simulate a login.
<code>r = session.post('http://httpbin.org/post', data={'username': 'yitian', 'passwd': '123456'})
print(r.html.html)</code>
Web‑Scraping Examples
Scraping Jianshu User Articles
Render the page, scroll down to load all articles, then extract titles and URLs.
<code>r = session.get('https://www.jianshu.com/u/7753478e1554')
r.html.render(scrolldown=50, sleep=2)
titles = r.html.find('a.title')
for i, title in enumerate(titles):
    print(f"{i+1} [{title.text}](https://www.jianshu.com{title.attrs['href']})")</code>
Scraping Tianya Forum Threads
Iterate through paginated forum pages, collect the author’s posts, and write them to a text file.
<code>import io

url = 'http://bbs.tianya.cn/post-culture-488321-1.shtml'
r = session.get(url)
# Use the page <title> as the output filename (one reasonable choice)
title = r.html.find('title', first=True).text
author = r.html.find('div.atl-info span a', first=True).text
# Determine total pages
div = r.html.find('div.atl-pages', first=True)
links = div.find('a')
total_page = 1 if not links else int(links[-2].text)
# Loop through pages and save the author's posts
with io.open(f"{title}.txt", 'x', encoding='utf-8') as f:
    for i in range(1, total_page + 1):
        page_url = f"{url.rsplit('-', 1)[0]}-{i}.shtml"
        r = session.get(page_url)
        items = r.html.find(f'div.atl-item[_host="{author}"]')
        for item in items:
            content = item.find('div.bbs-content', first=True).text
            if not content.startswith('@'):
                f.write(content + "\n")</code>
These examples demonstrate how requests-html simplifies web scraping compared to using raw requests plus BeautifulSoup or a full‑featured framework like Scrapy.