
Common Regular Expressions and Methods for Python Web Scraping

This article presents a practical collection of Python regular‑expression techniques for extracting HTML elements such as table rows, links, titles, images, and scripts, showing how to filter tags and handle URL parameters during web crawling.

Python Programming Learning Circle

This article introduces frequently used regular‑expression patterns and Python code for web scraping, aiming to solve common crawling problems and help readers extract information from HTML pages.

1. Extract content between <tr> and </tr> tags

res_tr = r'<tr>(.*?)</tr>'
m_tr = re.findall(res_tr, language, re.S|re.M)

Example:

# coding=utf-8
import re
language = '''<tr><th>性別:</th><td>男</td></tr><tr>'''
# Non-greedy match of everything inside each complete <tr>...</tr> pair
res_tr = r'<tr>(.*?)</tr>'
m_tr = re.findall(res_tr, language, re.S|re.M)
for line in m_tr:
    print line
    # Header cells in this row
    res_th = r'<th>(.*?)</th>'
    m_th = re.findall(res_th, line, re.S|re.M)
    for mm in m_th:
        # The trailing comma keeps header and value on one output line
        print unicode(mm, 'utf-8'),
    # Data cells in this row
    res_td = r'<td>(.*?)</td>'
    m_td = re.findall(res_td, line, re.S|re.M)
    for nn in m_td:
        print unicode(nn, 'utf-8')

Output:

>> <th>性別:</th><td>男</td>
性別: 男
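The row-by-row extraction above can be wrapped in a small reusable helper. The following is a minimal sketch written for Python 3 (an assumption, since the example above targets Python 2); the name `parse_rows` is hypothetical and not part of the original article.

```python
import re

# Patterns compiled once; re.S lets '.' span newlines inside a row.
TR_RE = re.compile(r'<tr>(.*?)</tr>', re.S)
TH_RE = re.compile(r'<th>(.*?)</th>', re.S)
TD_RE = re.compile(r'<td>(.*?)</td>', re.S)


def parse_rows(html):
    """Return one (headers, cells) tuple per complete <tr>...</tr> pair."""
    rows = []
    for row in TR_RE.findall(html):
        rows.append((TH_RE.findall(row), TD_RE.findall(row)))
    return rows


language = '<tr><th>性別:</th><td>男</td></tr><tr>'
for headers, cells in parse_rows(language):
    print(headers, cells)
```

The unclosed trailing `<tr>` in the sample is skipped, because the pattern only matches rows that have a closing tag.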

2. Extract text between <a href=..> and </a> tags

res = r'<a .*?>(.*?)</a>'
mm = re.findall(res, content, re.S|re.M)
urls = re.findall(r"<a.*?href=.*?</a>", content, re.I|re.S|re.M)

Example:

# coding=utf-8
import re
content = '''<td><a href="https://www.baidu.com/articles/zj.html" title="浙江省">浙江省主题介绍</a><a href="https://www.baidu.com//articles/gz.html" title="贵州省">贵州省主题介绍</a></td>'''
# Text between <a ...> and </a>
print u'Link text:'
res = r'<a .*?>(.*?)</a>'
mm = re.findall(res, content, re.S|re.M)
for value in mm:
    print value
# Complete anchor elements
print u'\nFull anchor tags:'
urls = re.findall(r"<a.*?href=.*?</a>", content, re.I|re.S|re.M)
for i in urls:
    print i
# href values, whether double- or single-quoted
print u'\nURLs in the links:'
res_url = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
link = re.findall(res_url, content, re.I|re.S|re.M)
for url in link:
    print url

Output:

Link text:
浙江省主题介绍
贵州省主题介绍
Full anchor tags:
<a href="https://www.baidu.com/articles/zj.html" title="浙江省">浙江省主题介绍</a>
<a href="https://www.baidu.com//articles/gz.html" title="贵州省">贵州省主题介绍</a>
URLs in the links:
https://www.baidu.com/articles/zj.html
https://www.baidu.com//articles/gz.html
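When both the URL and the link text are needed, a single pattern with capture groups can return them as pairs. This is a minimal Python 3 sketch (an assumption about the target version); `extract_links` is a hypothetical name, and the pattern assumes quoted `href` values and no nested anchors.

```python
import re

# Group 1: the quote character, group 2: the href value, group 3: the link text.
A_RE = re.compile(r'''<a\s[^>]*?href=(["'])(.*?)\1[^>]*>(.*?)</a>''', re.I | re.S)


def extract_links(html):
    """Return (url, text) pairs for every anchor with a quoted href."""
    return [(m.group(2), m.group(3)) for m in A_RE.finditer(html)]


content = ('<td><a href="https://www.baidu.com/articles/zj.html" title="浙江省">浙江省主题介绍</a>'
           '<a href="https://www.baidu.com//articles/gz.html" title="贵州省">贵州省主题介绍</a></td>')
for url, text in extract_links(content):
    print(url, text)
```

The backreference `\1` makes the closing quote match the opening one, so single- and double-quoted attributes are handled by the same pattern.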

3. Obtain the last segment of a URL for naming images or parameters

urls = "http://i1.hoopchina.com.cn/blogfile/201411/11/BbsImg141568417848931_640*640.jpg"
values = urls.split('/')[-1]
print values

Output:

BbsImg141568417848931_640*640.jpg

For query strings:

url = 'http://localhost/test.py?a=hello&b=world'
values = url.split('?')[-1]
print values
for key_value in values.split('&'):
    print key_value.split('=')

Output:

a=hello&b=world
['a', 'hello']
['b', 'world']
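String splitting works for simple URLs, but it returns the whole URL when there is no query string and keeps the query attached to the file name when there is one. As an alternative, the standard library's URL parser can do the same job. This is a minimal Python 3 sketch; the helper names `last_segment` and `query_params` are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit, parse_qs


def last_segment(url):
    """Return the final path component, ignoring any query string or fragment."""
    return posixpath.basename(urlsplit(url).path)


def query_params(url):
    """Return the query parameters as a dict mapping each key to a list of values."""
    return parse_qs(urlsplit(url).query)


print(last_segment("http://i1.hoopchina.com.cn/blogfile/201411/11/BbsImg141568417848931_640*640.jpg"))
print(query_params("http://localhost/test.py?a=hello&b=world"))
```

`parse_qs` returns lists because a key may legally appear more than once in a query string.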

4. Crawl all URL links from a page

# coding=utf-8
import re, urllib
url = "http://www.csdn.net/"
content = urllib.urlopen(url).read()
# Complete anchor elements
urls = re.findall(r"<a.*?href=.*?</a>", content, re.I)
for url in urls:
    print unicode(url, 'utf-8')
# href values only, whether double- or single-quoted
link_list = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", content)
for url in link_list:
    print url

Sample output shows several anchor tags and raw URLs.
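Pages often contain relative and repeated links, so a crawler usually needs to resolve each `href` against the page URL and drop duplicates. The sketch below shows one way to do that in Python 3 (an assumption about the target version). It works on an HTML string rather than fetching a page; `absolute_links` and the sample markup are hypothetical.

```python
import re
from urllib.parse import urljoin

# Group 1: the quote character, group 2: the href value.
HREF_RE = re.compile(r'''href=(["'])(.*?)\1''', re.I | re.S)


def absolute_links(html, base_url):
    """Return unique absolute URLs found in href attributes, in page order."""
    seen = []
    for _quote, href in HREF_RE.findall(html):
        full = urljoin(base_url, href)
        if full not in seen:
            seen.append(full)
    return seen


# Hypothetical markup standing in for a downloaded page.
sample = '<a href="/a.html">A</a><a href=\'b.html\'>B</a><a href="/a.html">A again</a>'
for link in absolute_links(sample, "http://www.csdn.net/news/"):
    print(link)
```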

5. Extract the page <title> using two methods

# coding=utf-8
import re, urllib
url = "http://www.csdn.net/"
content = urllib.urlopen(url).read()
# Method 1: lookbehind and lookahead, so the match is the title text itself
print u'Method 1:'
title_pat = r'(?<=<title>).*?(?=</title>)'
title_ex = re.compile(title_pat, re.M|re.S)
title_obj = re.search(title_ex, content)
title = title_obj.group()
print title
# Method 2: a capture group returned by findall
print u'Method 2:'
title = re.findall(r'<title>(.*?)</title>', content)
print title[0]

Both methods return the same CSDN title string.
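Both methods raise an exception when the page has no title: `re.search` returns `None`, and `findall` returns an empty list. A defensive variant might look like the following Python 3 sketch, where `get_title` is a hypothetical helper that also tolerates uppercase tags and surrounding whitespace.

```python
import re

# re.I accepts <TITLE>; [^>]* tolerates attributes on the tag; re.S spans newlines.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.I | re.S)


def get_title(html):
    """Return the page title with surrounding whitespace removed, or None."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


print(get_title('<html><head><TITLE>\n Example page \n</TITLE></head></html>'))
print(get_title('<p>no title here</p>'))
```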

6. Locate a table and extract attribute‑value pairs

# coding=utf-8
import re
s = '''<table> <tr> <td>序列号</td><td>DEIN3-39CD3-2093J3</td> <td>日期</td><td>2013年1月22日</td> <td>售价</td><td>392.70 元</td> <td>说明</td><td>仅限5用户使用</td> </tr> </table>'''
# Each match is a (name, value) tuple taken from two adjacent cells
res = r'<td>(.*?)</td><td>(.*?)</td>'
m = re.findall(res, s, re.S|re.M)
for line in m:
    print unicode(line[0], 'utf-8'), unicode(line[1], 'utf-8')

Output shows each attribute name with its value.
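Because each match is a name and value pair, the result converts naturally into a dictionary. This is a minimal Python 3 sketch under the same assumption as the pattern above, namely that the two cells are adjacent with no whitespace between them; `table_to_dict` is a hypothetical name.

```python
import re

PAIR_RE = re.compile(r'<td>(.*?)</td><td>(.*?)</td>', re.S)


def table_to_dict(html):
    """Map each attribute name to its value, trimming surrounding whitespace."""
    return {name.strip(): value.strip() for name, value in PAIR_RE.findall(html)}


s = ('<table> <tr> <td>序列号</td><td>DEIN3-39CD3-2093J3</td> '
     '<td>日期</td><td>2013年1月22日</td> <td>售价</td><td>392.70 元</td> '
     '<td>说明</td><td>仅限5用户使用</td> </tr> </table>')
info = table_to_dict(s)
print(info['序列号'], info['售价'])
```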

7. Filter <span> and similar tags

This fragment is one branch of a larger if/elif chain; nn is the cell string being examined.

elif "span" in nn:
    res_value = r'<span .*?>(.*?)</span>'
    m_value = re.findall(res_value, nn, re.S|re.M)
    for value in m_value:
        print unicode(value, 'utf-8'),

Example input: <td><span class="nickname">(字) 翔宇</span></td> produces (字) 翔宇.
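The pattern above requires a space after `span`, so a bare `<span>` with no attributes is not matched. A slightly more tolerant version could look like this Python 3 sketch, where `span_texts` is a hypothetical helper name.

```python
import re

# \b ends the tag name; [^>]* accepts zero or more attributes.
SPAN_RE = re.compile(r'<span\b[^>]*>(.*?)</span>', re.S)


def span_texts(cell):
    """Return the inner text of every <span> element in the given string."""
    return SPAN_RE.findall(cell)


print(span_texts('<td><span class="nickname">(字) 翔宇</span></td>'))
print(span_texts('<td><span>plain</span></td>'))
```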

8. Extract content inside <script> tags (e.g., image URLs)

# coding=utf-8
import re, os, urllib
content = '''<script>var images = [{ "big":"...", "thumb":"...", "original":"http://example.com/img1.jpg" }, { "original":"http://example.com/img2.jpg" }];</script>'''
# Body of each <script> element
html_script = r'<script>(.*?)</script>'
m_script = re.findall(html_script, content, re.S|re.M)
for script in m_script:
    # Value of every "original" key inside the script text
    res_original = r'"original":"(.*?)"'
    m_original = re.findall(res_original, script)
    for pic_url in m_original:
        print pic_url
        # Save the image under its own file name
        filename = os.path.basename(pic_url)
        urllib.urlretrieve(pic_url, 'E:\\' + filename)

The script prints each original image URL and downloads the file.
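When the embedded array happens to be valid JSON, as it is in the sample above, the `json` module can parse it instead of a second regular expression. This Python 3 sketch assumes the variable is named `images` as in the sample and that the array literal is valid JSON; `original_urls` is a hypothetical name, and the download step is left out.

```python
import json
import re

content = ('<script>var images = [{ "big":"...", "thumb":"...", '
           '"original":"http://example.com/img1.jpg" }, '
           '{ "original":"http://example.com/img2.jpg" }];</script>')

# Capture the array literal assigned to `images`, up to the closing "];".
ARRAY_RE = re.compile(r'var\s+images\s*=\s*(\[.*?\])\s*;', re.S)


def original_urls(html):
    """Return the "original" URL of every image object in the embedded array."""
    match = ARRAY_RE.search(html)
    if match is None:
        return []
    images = json.loads(match.group(1))
    return [item["original"] for item in images if "original" in item]


for pic_url in original_urls(content):
    print(pic_url)
```

Real pages often embed JavaScript object literals that are not strict JSON (unquoted keys, trailing commas), in which case `json.loads` raises an error and the regex approach above remains the fallback.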

9. Remove <br /> tags using replace

if '<br />' in value:
    value = value.replace('<br />', '')
    value = value.replace('\n', ' ')

This turns a string such as 達洪阿 異名:(字) 厚菴<br /> (諡) 武壯<br /> (勇號) 阿克達春巴圖魯 into a single clean line.
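`str.replace` only removes the exact spelling `<br />`. To cover the variants `<br>`, `<br/>`, and `<BR>` as well, a regular expression can be used instead. This is a minimal Python 3 sketch; `remove_breaks` is a hypothetical name, and unlike the snippet above it collapses every run of whitespace to one space.

```python
import re

# Matches <br>, <br/>, <br /> in any letter case.
BR_RE = re.compile(r'<br\s*/?>', re.I)


def remove_breaks(value):
    """Drop line-break tags, then collapse whitespace runs to single spaces."""
    value = BR_RE.sub('', value)
    return ' '.join(value.split())


print(remove_breaks('厚菴<br /> (諡) 武壯<BR>\n(勇號)'))
```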

10. Extract src from <img> tags and filter the tags

value = re.sub('<[^>]+>', '', value)
test = '''<img alt="中國國民黨" src="../images/Kuomintang.png" width="19" height="19" border="0" />'''
print re.findall('src="(.*?)"', test)

Output: ['../images/Kuomintang.png'].
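The two operations in this section, stripping tags and reading `src`, can be kept as separate helpers. The Python 3 sketch below is one possible arrangement; `image_sources` and `strip_tags` are hypothetical names, and the `src` pattern is limited to `<img>` elements with quoted attribute values.

```python
import re

# Group 1: the quote character, group 2: the src value of an <img> element.
IMG_SRC_RE = re.compile(r'''<img\b[^>]*?\bsrc=(["'])(.*?)\1''', re.I | re.S)
TAG_RE = re.compile(r'<[^>]+>')


def image_sources(html):
    """Return the src value of every <img> element."""
    return [src for _quote, src in IMG_SRC_RE.findall(html)]


def strip_tags(html):
    """Remove every tag and keep only the text between tags."""
    return TAG_RE.sub('', html)


test = '<img alt="中國國民黨" src="../images/Kuomintang.png" width="19" height="19" border="0" />'
print(image_sources(test))
print(strip_tags('<td>黨派: ' + test + '國民黨</td>'))
```

Restricting the match to `<img>` avoids picking up `src` attributes from `<script>` or `<iframe>` elements on the same page.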

The article concludes by encouraging readers to apply these regex patterns for efficient web data extraction.

Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
