
Python Web Scraper for Downloading Emoji Images from DouTuBa

This article demonstrates how to use Python, urllib, BeautifulSoup, regular expressions, and multithreading to crawl the DouTuBa website, extract the URLs of emoji images, and download over a hundred thousand pictures automatically.


This tutorial walks through building a Python web scraper that collects emoji images from the site “斗图吧” (DouTuBa), motivated by the author’s personal need for a large stash of memes to share.

Opening the page in Chrome DevTools (F12) reveals the img element and its attributes, as shown in the screenshots.

The script begins with a function, def askURL(url):, that sets a User-Agent header, sends the request with urllib.request, and returns the raw HTML.

<code>import urllib.request

def askURL(url):
    # Pretend to be a regular browser so the site does not reject the request
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36"
    }
    req = urllib.request.Request(url=url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(req)
        html = response.read().decode("utf-8")
    except Exception as result:
        print(result)
    return html</code>

Next, a regular expression is compiled to match the relevant img tags, and def getimgsrcs(url): uses BeautifulSoup to parse the HTML, iterates over all img elements, extracts each image’s name and URL, and returns them as two lists.

<code>import re
from bs4 import BeautifulSoup

# Regex that captures the image name (alt) and its real URL (data-original)
imglink = re.compile(
    r'<img alt="(.*?)" class="img-responsive lazy image_dta" data-backup=".*?" data-original="(.*?)" referrerpolicy="no-referrer" src=".*?"/>',
    re.S)

def getimgsrcs(url):
    html = askURL(url)
    bs = BeautifulSoup(html, "html.parser")
    names = []
    srcs = []
    # Find all img tags on the page
    for item in bs.find_all('img'):
        item = str(item)
        # Pull the image URL and name out of the tag using the regex above
        imgsrc = re.findall(imglink, item)
        # Not every img tag on the page is an emoji image, so the regex may
        # return an empty list; skip those tags
        if len(imgsrc) != 0:
            imgname = ""
            if imgsrc[0][0] != '':
                imgname = imgsrc[0][0] + '.' + getFileType(imgsrc[0][1])
            else:
                imgname = getFileName(imgsrc[0][1])
            names.append(imgname)
            srcs.append(imgsrc[0][1])
    return names, srcs</code>

After obtaining the lists of filenames and URLs, the article shows a multithreaded download example using ThreadPoolExecutor(max_workers=50) and calling FileDownload.downloadFile(url, filelocation) for each item.

<code>from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=50)
for j in range(len(names)):
    # srcs comes from getimgsrcs; filelocation holds each file's save path
    pool.submit(FileDownload.downloadFile, srcs[j], filelocation[j])
# Wait for all queued downloads to finish
pool.shutdown(wait=True)</code>
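The article does not show FileDownload.downloadFile itself. A minimal standalone sketch of such a helper, assuming it takes the image URL and a destination file path (the error handling and User-Agent header are assumptions, mirroring askURL), might be:

```python
import os
import urllib.request

def downloadFile(url, filelocation):
    # Download one image and write it to disk; swallow errors so a single
    # failed URL does not kill its worker thread
    try:
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(req) as response:
            data = response.read()
        # Make sure the destination directory exists before writing
        os.makedirs(os.path.dirname(filelocation) or ".", exist_ok=True)
        with open(filelocation, "wb") as f:
            f.write(data)
    except Exception as e:
        print(url, e)
```

Catching exceptions per file matters here: with 50 worker threads hammering the site, occasional timeouts are expected, and one failure should not abort the rest of the batch.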

The final result is a collection of more than one hundred thousand emoji images, confirming the scraper’s success.

Tags: multithreading, regex, web scraping, image-downloading, BeautifulSoup
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
