Python Web Scraping Tutorial: Downloading Emoji Images from DouTuBa with Multithreading
This tutorial demonstrates how to crawl the DouTuBa emoji website using Python, extract image URLs with regular expressions and BeautifulSoup, and download tens of thousands of images efficiently through a multithreaded downloader.
Preface – The author describes a situation where a friend needed emoji images to lighten a chat, discovered that the local collection was insufficient, and decided to build a web crawler to fetch emojis from the DouTuBa website.
Page Analysis – The target site contains a massive number of emoji images. By opening the browser's developer tools (F12), the author shows how to inspect an img tag and find the attribute that holds the actual image URL; because the page lazy-loads its images, the real URL sits in the data-original attribute (which the regex below captures) while src points at a placeholder.
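To illustrate the pattern being targeted, the snippet below matches a fabricated sample tag (not copied from the site) written in the same style, pulling the alt text and the data-original URL with a simplified version of the regex used later in the tutorial:

```python
import re

# Fabricated sample of a lazy-loaded <img> tag in the style the tutorial targets
sample = ('<img alt="funny cat" class="img-responsive lazy image_dta" '
          'data-backup="http://example.com/backup.jpg" '
          'data-original="http://example.com/real.jpg" '
          'referrerpolicy="no-referrer" src="placeholder.gif"/>')

# Capture the alt text and the data-original URL, as the article's regex does
pattern = re.compile(r'<img alt="(.*?)" .*?data-original="(.*?)".*?/>', re.S)
match = pattern.search(sample)
print(match.groups())  # ('funny cat', 'http://example.com/real.jpg')
```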
Implementation – Fetching Page Content
<code>import urllib.request

def askURL(url):
    # Spoof a browser User-Agent so the site does not reject the request
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36"
    }
    req = urllib.request.Request(url=url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(req)
        html = response.read().decode("utf-8")
    except Exception as result:
        print(result)
    return html</code>
Implementation – Parsing HTML
<code>import re
from bs4 import BeautifulSoup

# Regular expression that captures the image name and its real src
imglink = re.compile(
    r'<img alt="(.*?)" class="img-responsive lazy image_dta" data-backup=".*?" data-original="(.*?)" referrerpolicy="no-referrer" src=".*?"/>',
    re.S)

def getimgsrcs(url):
    html = askURL(url)
    bs = BeautifulSoup(html, "html.parser")
    names = []
    srcs = []
    # Find every img tag on the page
    for item in bs.find_all('img'):
        item = str(item)
        # Use the regex above to pull out the image's src and name
        imgsrc = re.findall(imglink, item)
        # Not every img tag is one we want, so the regex may return an
        # empty list; skip those tags
        if len(imgsrc) != 0:
            if imgsrc[0][0] != '':
                imgname = imgsrc[0][0] + '.' + getFileType(imgsrc[0][1])
            else:
                imgname = getFileName(imgsrc[0][1])
            names.append(imgname)
            srcs.append(imgsrc[0][1])
    return names, srcs</code>
After obtaining the image URLs and filenames, the author proceeds to download the files.
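The helpers getFileType and getFileName are referenced but not shown in the article. Assuming they simply slice the URL on its last dot and last slash respectively, a minimal sketch could look like this:

```python
# Hypothetical sketches of the two helpers referenced above; the article
# does not show their bodies, so these are only plausible guesses.

def getFileType(url):
    # Return the extension after the last dot, e.g. "jpg" or "gif"
    return url.rsplit('.', 1)[-1]

def getFileName(url):
    # Return the last path segment of the URL, e.g. "abc123.jpg"
    return url.rsplit('/', 1)[-1]
```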
File Download – Multithreaded Approach
<code>from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=50)
for j in range(len(names)):
    pool.submit(FileDownload.downloadFile, urls[j], filelocation[j])
# Wait for all queued downloads to finish before the script exits
pool.shutdown(wait=True)</code>
Result – The script successfully scraped and saved over one hundred thousand emoji images, making the author a major collector of emoji resources.
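The FileDownload module is not shown in the article. A minimal sketch of a downloadFile function, assuming it fetches the URL with urllib (using the same browser spoof as askURL) and writes the bytes to the given path, might be:

```python
import os
import urllib.request

# Hypothetical sketch of FileDownload.downloadFile; the article does not
# show its body, so this is only one plausible implementation.
def downloadFile(url, filepath):
    try:
        req = urllib.request.Request(
            url, headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(req, timeout=10) as response:
            data = response.read()
        # Make sure the destination directory exists before writing
        os.makedirs(os.path.dirname(filepath) or ".", exist_ok=True)
        with open(filepath, "wb") as f:
            f.write(data)
    except Exception as err:
        # A failed download should not kill the worker thread
        print(f"failed to download {url}: {err}")
```

Swallowing exceptions per file, as above, matters in a pool of 50 workers: one dead link out of a hundred thousand should be logged and skipped, not abort the run.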
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.