
Python Web Scraper for Downloading Emoji Images from DouTuBa

This article demonstrates how to use Python, urllib, BeautifulSoup, regular expressions, and multithreading to crawl the DouTuBa website, extract the URLs of emoji images, and download over a hundred thousand pictures automatically.


This tutorial walks through building a Python web scraper that collects emoji images from the site “斗图吧” (DouTuBa), motivated by the author’s personal need for a large stash of memes to share.

Opening the page in Chrome DevTools (F12) reveals the img element and its attributes, as shown in the screenshots.

The script begins with a function, def askURL(url):, that sets a User-Agent header, sends the request with urllib.request, and returns the raw HTML.

<code>import urllib.request

def askURL(url):
    # Pretend to be a regular browser so the site does not reject the request
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36"
    }
    req = urllib.request.Request(url=url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(req)
        html = response.read().decode("utf-8")
    except Exception as result:
        print(result)
    return html</code>

Next, a regular expression is compiled to match the relevant img tags, and def getimgsrcs(url): uses BeautifulSoup to parse the HTML, iterates over all img elements, extracts each image’s name and URL, and returns them as two lists.

<code>import re
from bs4 import BeautifulSoup

# Regex that captures the image name (alt) and its real URL (data-original)
imglink = re.compile(
    r'<img alt="(.*?)" class="img-responsive lazy image_dta" data-backup=".*?" data-original="(.*?)" referrerpolicy="no-referrer" src=".*?"/>',
    re.S)

def getimgsrcs(url):
    html = askURL(url)
    bs = BeautifulSoup(html, "html.parser")
    names = []
    srcs = []
    # Find all img tags on the page
    for item in bs.find_all('img'):
        item = str(item)
        # Pull the image URL and name out of the tag using the regex above
        imgsrc = re.findall(imglink, item)
        # Not every img tag on the page is an emoji image, so the regex may
        # return an empty list; skip those tags
        if len(imgsrc) != 0:
            imgname = ""
            if imgsrc[0][0] != '':
                imgname = imgsrc[0][0] + '.' + getFileType(imgsrc[0][1])
            else:
                imgname = getFileName(imgsrc[0][1])
            names.append(imgname)
            srcs.append(imgsrc[0][1])
    return names, srcs</code>

After obtaining the lists of filenames and URLs, the article shows a multithreaded download example using ThreadPoolExecutor(max_workers=50) and calling FileDownload.downloadFile(url, filelocation) for each item.

<code>from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=50)
for j in range(len(names)):
    # srcs comes from getimgsrcs; filelocation holds each file's save path
    pool.submit(FileDownload.downloadFile, srcs[j], filelocation[j])
# Wait for all queued downloads to finish
pool.shutdown(wait=True)</code>
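The article does not show FileDownload.downloadFile itself. A minimal standalone sketch of such a helper, assuming it takes the image URL and a destination file path (the error handling and User-Agent header are assumptions, mirroring askURL), might be:

```python
import os
import urllib.request

def downloadFile(url, filelocation):
    # Download one image and write it to disk; swallow errors so a single
    # failed URL does not kill its worker thread
    try:
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(req) as response:
            data = response.read()
        # Make sure the destination directory exists before writing
        os.makedirs(os.path.dirname(filelocation) or ".", exist_ok=True)
        with open(filelocation, "wb") as f:
            f.write(data)
    except Exception as e:
        print(url, e)
```

Catching exceptions per file matters here: with 50 worker threads hammering the site, occasional timeouts are expected, and one failure should not abort the rest of the batch.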

The final result is a collection of more than one hundred thousand emoji images, confirming the scraper’s success.

Tags: multithreading, regex, web scraping, image-downloading, BeautifulSoup
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
