Scraping Zhihu "Beauty" Topic Images with Python and Baidu AI Face Detection
This article explains how to collect images from Zhihu's "beauty" topic using Python's Requests and lxml libraries, filter them with Baidu AI's AipFace face detection service, and save the images that pass the filter locally. It covers the required environment, the program logic, and the preparation steps.
1. Data Source – The target data are all images appearing in answers to questions under Zhihu's "beauty" topic.
2. Scraping Tool – Implemented in Python 3 with the third‑party libraries requests and lxml plus Baidu's AipFace SDK; the script is about 100 lines of code.
3. Required Environment
Operating system: macOS, Linux (theoretically), or Windows (with filename character restrictions handled by regex).
No Zhihu login needed.
A Baidu Cloud account is required for the face‑detection service.
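The Windows filename restriction mentioned above can be handled with a single regular expression. The following is a minimal sketch rather than the article's own code: the function name safe_filename, the underscore separator, the field order, and the default .jpg extension are all assumptions made for illustration.

```python
import re

# Characters Windows does not allow in filenames, plus ASCII control characters.
_WINDOWS_FORBIDDEN = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(beauty, author, title, index, ext=".jpg"):
    """Build a filename from beauty score, author, question title, and index.

    The separator and field order are illustrative assumptions; the article
    only states which fields the filename is composed of.
    """
    raw = f"{beauty}_{author}_{title}_{index}"
    # Replace every forbidden character with an underscore.
    return _WINDOWS_FORBIDDEN.sub("_", raw) + ext
```

Replacing rather than deleting the forbidden characters keeps the fields visually separated, and the same cleaned name is also valid on macOS and Linux.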
4. Face‑Detection Library – AipFace is Baidu AI's Python SDK for face detection, accessible via HTTP and free to use.
5. Filtering Conditions
Discard images without any detected face (e.g., landscapes, non‑portrait photos).
Keep only female faces; male images are mostly celebrities and are ignored.
Exclude images that do not show a real person, such as anime characters (AipFace face confidence below 0.6).
Discard images with a low beauty score (beauty < 45) to save storage.
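The four rules above can be expressed as one small predicate. This is a sketch under stated assumptions: it works on a simplified list of face dicts with the hypothetical keys confidence, gender, and beauty, not on the raw AipFace response, whose exact field names depend on the API version and should be checked against Baidu's documentation. The article also does not say how images with several faces are treated; keeping an image when any one face passes is an assumption here.

```python
MIN_FACE_CONFIDENCE = 0.6  # below this, the face is treated as not a real person
MIN_BEAUTY = 45            # below this, the image is discarded to save storage


def face_passes(face):
    """Return True if a single face dict satisfies every filtering rule.

    `face` uses hypothetical keys: 'confidence', 'gender', 'beauty'.
    """
    if face["confidence"] < MIN_FACE_CONFIDENCE:
        return False
    if face["gender"] != "female":
        return False
    if face["beauty"] < MIN_BEAUTY:
        return False
    return True


def keep_image(faces):
    """Return True if the image should be saved.

    An empty list means no face was detected, so the image is discarded.
    Keeping the image when any one face passes is an assumption.
    """
    return any(face_passes(face) for face in faces)
```

Keeping the thresholds as named constants makes it easy to tighten or relax the filter without touching the logic.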
6. Implementation Logic
Use requests to fetch a list of discussions under the "beauty" topic.
Parse each discussion's HTML with lxml to extract all img tag src URLs.
Download each image via requests (ignoring animated GIFs).
Send the image to AipFace for face detection.
Apply the filtering rules from section 5.
Save the remaining images to the local file system with filenames composed of beauty score, author, question title, and an index.
Return to the first step and repeat with the next batch of discussions.
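The parsing step in the logic above, together with the rule about ignoring GIFs, might look like the sketch below. It assumes the discussion HTML has already been fetched (for example with requests.get(...).text) and that image URLs sit in the src attribute, as the article describes. The function name extract_image_urls, the de-duplication, and the skipping of non-HTTP src values such as inline data: placeholders are assumptions added for illustration.

```python
from urllib.parse import urlparse

from lxml import html


def extract_image_urls(page_html):
    """Return the src URLs of all img tags in the page, skipping GIFs.

    Order is preserved and duplicates are dropped. Only http(s) URLs are
    kept; that restriction is an assumption, not something the article states.
    """
    tree = html.fromstring(page_html)
    seen = set()
    urls = []
    for src in tree.xpath("//img/@src"):
        src = str(src)
        parsed = urlparse(src)
        if parsed.scheme not in ("http", "https"):
            continue  # e.g. inline data: placeholders
        if parsed.path.lower().endswith(".gif"):
            continue  # GIFs are ignored
        if src not in seen:
            seen.add(src)
            urls.append(src)
    return urls
```

Checking the extension on the parsed path rather than on the whole URL means a query string after the filename does not hide a .gif suffix.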
7. Scraping Results – Images are stored in a folder; the highest beauty score observed (aside from a celebrity) is 88. The author notes personal disagreement with the ranking order.
8. Preparation for Running
Install Python 3.
Install the required libraries with a single command: pip install requests lxml baidu-aip
Apply for a free Baidu Cloud face‑detection service (Baidu AI – Face Recognition).