Python Web Scraping Tutorial: Downloading Emoji Images from DouTuBa with Multithreading
This tutorial demonstrates how to crawl the DouTuBa emoji website using Python, extract image URLs with regular expressions and BeautifulSoup, and download tens of thousands of images efficiently through a multithreaded downloader.
Preface – The author describes a situation where a friend needed emoji images to lighten a chat, discovered that the local collection was insufficient, and decided to build a web crawler to fetch emojis from the DouTuBa website.
Page Analysis – The target site contains a massive number of emoji images. By opening the browser's developer tools (F12), the author shows how to inspect an img tag and find the attribute that holds the actual image URL; because the page lazy-loads its images, the real URL sits in the data-original attribute (which the regex below captures) while src points at a placeholder.
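To illustrate the pattern being targeted, the snippet below matches a fabricated sample tag (not copied from the site) written in the same style, pulling the alt text and the data-original URL with a simplified version of the regex used later in the tutorial:

```python
import re

# Fabricated sample of a lazy-loaded <img> tag in the style the tutorial targets
sample = ('<img alt="funny cat" class="img-responsive lazy image_dta" '
          'data-backup="http://example.com/backup.jpg" '
          'data-original="http://example.com/real.jpg" '
          'referrerpolicy="no-referrer" src="placeholder.gif"/>')

# Capture the alt text and the data-original URL, as the article's regex does
pattern = re.compile(r'<img alt="(.*?)" .*?data-original="(.*?)".*?/>', re.S)
match = pattern.search(sample)
print(match.groups())  # ('funny cat', 'http://example.com/real.jpg')
```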
Implementation – Fetching Page Content
<code>import urllib.request

def askURL(url):
    # Spoof a browser User-Agent so the site does not reject the request
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36"
    }
    req = urllib.request.Request(url=url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(req)
        html = response.read().decode("utf-8")
    except Exception as result:
        print(result)
    return html</code>
Implementation – Parsing HTML
<code>import re
from bs4 import BeautifulSoup

# Regular expression that captures the image name and its real src
imglink = re.compile(
    r'<img alt="(.*?)" class="img-responsive lazy image_dta" data-backup=".*?" data-original="(.*?)" referrerpolicy="no-referrer" src=".*?"/>',
    re.S)

def getimgsrcs(url):
    html = askURL(url)
    bs = BeautifulSoup(html, "html.parser")
    names = []
    srcs = []
    # Find every img tag on the page
    for item in bs.find_all('img'):
        item = str(item)
        # Use the regex above to pull out the image's src and name
        imgsrc = re.findall(imglink, item)
        # Not every img tag is one we want, so the regex may return an
        # empty list; skip those tags
        if len(imgsrc) != 0:
            if imgsrc[0][0] != '':
                imgname = imgsrc[0][0] + '.' + getFileType(imgsrc[0][1])
            else:
                imgname = getFileName(imgsrc[0][1])
            names.append(imgname)
            srcs.append(imgsrc[0][1])
    return names, srcs</code>
After obtaining the image URLs and filenames, the author proceeds to download the files.
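The helpers getFileType and getFileName are referenced but not shown in the article. Assuming they simply slice the URL on its last dot and last slash respectively, a minimal sketch could look like this:

```python
# Hypothetical sketches of the two helpers referenced above; the article
# does not show their bodies, so these are only plausible guesses.

def getFileType(url):
    # Return the extension after the last dot, e.g. "jpg" or "gif"
    return url.rsplit('.', 1)[-1]

def getFileName(url):
    # Return the last path segment of the URL, e.g. "abc123.jpg"
    return url.rsplit('/', 1)[-1]
```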
File Download – Multithreaded Approach
<code>from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=50)
for j in range(len(names)):
    pool.submit(FileDownload.downloadFile, urls[j], filelocation[j])
# Wait for all queued downloads to finish before the script exits
pool.shutdown(wait=True)</code>
Result – The script successfully scraped and saved over one hundred thousand emoji images, making the author a major collector of emoji resources.
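The FileDownload module is not shown in the article. A minimal sketch of a downloadFile function, assuming it fetches the URL with urllib (using the same browser spoof as askURL) and writes the bytes to the given path, might be:

```python
import os
import urllib.request

# Hypothetical sketch of FileDownload.downloadFile; the article does not
# show its body, so this is only one plausible implementation.
def downloadFile(url, filepath):
    try:
        req = urllib.request.Request(
            url, headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(req, timeout=10) as response:
            data = response.read()
        # Make sure the destination directory exists before writing
        os.makedirs(os.path.dirname(filepath) or ".", exist_ok=True)
        with open(filepath, "wb") as f:
            f.write(data)
    except Exception as err:
        # A failed download should not kill the worker thread
        print(f"failed to download {url}: {err}")
```

Swallowing exceptions per file, as above, matters in a pool of 50 workers: one dead link out of a hundred thousand should be logged and skipped, not abort the run.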
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.