
Scraping Zhihu "Beauty" Topic Images with Python and Baidu AI Face Detection

This tutorial explains how to use Python 3 with Requests, lxml, and Baidu's AipFace SDK to crawl images from Zhihu's "beauty" topic, filter them by face detection and gender criteria, and store the qualified pictures locally.

Python Programming Learning Circle

The article presents a Python 3 script that crawls images from Zhihu's "美女" (beauty) topic, extracts image URLs from answer pages, and uses Baidu AI's AipFace service to detect faces and evaluate beauty scores.

Data source : All questions under the Zhihu topic "美女" and the images appearing in their answers.

Tools : Python 3 with the third-party libraries requests and lxml, plus Baidu's AipFace SDK. The whole script is approximately 100 lines of code.

Required environment : Runs on macOS, Linux (in theory), or Windows (characters that Windows does not allow in filenames are handled with a regex). No Zhihu login is needed. A Baidu Cloud account is required for the face-detection service.
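The regex handling of Windows filename restrictions mentioned above could look roughly like the sketch below. The helper name sanitize_filename and the choice of an underscore as the replacement character are assumptions for illustration; the article does not show the original pattern.

```python
import re

# Characters Windows rejects in file names, plus ASCII control characters.
_INVALID_FILENAME_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def sanitize_filename(name, replacement="_"):
    """Replace characters that are invalid in Windows file names.

    Hypothetical helper, not taken from the original script.
    """
    cleaned = _INVALID_FILENAME_CHARS.sub(replacement, name)
    # Windows also rejects names that end with a space or a dot.
    return cleaned.rstrip(" .")


print(sanitize_filename('Who is the "prettiest" actress? A/B'))
```

Question titles on Zhihu often contain question marks and quotation marks, so applying such a helper to the title and author before building the filename keeps the same script usable on all three platforms.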

Face-detection library : AipFace, part of the free Python SDK provided by the Baidu AI Open Platform. The SDK calls the face-detection service over HTTP.

Filtering conditions :
1. Discard images in which no face is detected (e.g., landscapes and other non-portrait photos).
2. Keep only female faces (images of men are mostly celebrities and are ignored).
3. Exclude non-real persons such as anime characters (human confidence < 0.6).
4. Exclude images with a low beauty score (beauty attribute < 45) to save storage.
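The four filtering conditions above can be expressed as one small predicate. This is a minimal sketch, not the article's code: it assumes the AipFace response has already been mapped into a list of plain dicts with the hypothetical keys "gender", "human_confidence", and "beauty", and it assumes the image is judged by its first detected face. The real response uses its own field names, which depend on the SDK version.

```python
MIN_HUMAN_CONFIDENCE = 0.6  # below this, treat the face as a non-real person
MIN_BEAUTY_SCORE = 45       # below this, discard the image to save storage


def should_keep(faces):
    """Return True if an image passes all four filtering conditions.

    `faces` is assumed to be a list of dicts, one per detected face, using
    the hypothetical keys "gender", "human_confidence" and "beauty".
    """
    if not faces:
        return False  # no face detected
    face = faces[0]   # assumption: judge the image by its first face
    if face["gender"] != "female":
        return False  # keep only female faces
    if face["human_confidence"] < MIN_HUMAN_CONFIDENCE:
        return False  # likely an anime character or other non-real person
    if face["beauty"] < MIN_BEAUTY_SCORE:
        return False  # beauty score too low
    return True
```

Keeping the thresholds as named constants makes it easy to tighten or relax the filter without touching the logic.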

Implementation logic :
1. Use requests to fetch the list of discussions under the "美女" topic.
2. Parse each discussion's HTML with lxml and extract the src attribute of every <img> tag.
3. Download each image (static images only) with requests.
4. Send the image to AipFace for face detection.
5. Apply the filtering conditions listed above.
6. Save the remaining images to the local file system, with filenames composed of the beauty score, author, question title, and an index.
7. Return to step 1 and repeat.
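The parsing step above (extracting the src attribute of every <img> tag with lxml) might look like the sketch below. The function name is hypothetical. Treating "static images only" as "skip URLs ending in .gif" is an assumption, and so is the skipping of non-HTTP sources such as inline data URIs. The article does not say how the original script makes these decisions.

```python
from lxml import html


def extract_static_image_urls(page_html):
    """Collect unique, non-GIF image URLs from an HTML string.

    Hypothetical helper; the .gif check is a simple suffix test and would
    miss animated images served under other extensions.
    """
    tree = html.fromstring(page_html)
    urls = []
    for src in tree.xpath("//img/@src"):
        if not src.startswith("http"):
            continue  # skip inline data URIs and relative placeholders
        if src.lower().endswith(".gif"):
            continue  # assumption: GIFs are the non-static images to skip
        if src not in urls:
            urls.append(src)  # keep first occurrence, preserve page order
    return urls
```

Each returned URL can then be downloaded with requests.get(url).content and the resulting bytes passed on to face detection.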

Scraping results : Images are stored in a folder (e.g., "angelababy"); the highest beauty score observed is 88, with most images ranked lower.

Running preparation :
1. Install Python 3.
2. Install the required packages with a single pip command: pip install requests lxml baidu-aip
3. Apply for a free Baidu Cloud face-detection service (Baidu AI – Face Recognition).

The original article includes screenshots of the code and sample output images.
