
Scraping iQiyi Bullet Comments and Generating a Word Cloud with Python

This article demonstrates how to scrape bullet comments from iQiyi for the first episode of a popular mystery series, decode the binary files, extract the text, and use Python's jieba and wordcloud libraries to clean the data and generate a visual word cloud of audience sentiments.

Python Programming Learning Circle

The subject of this analysis is the popular mystery drama "The Hidden Corner" (Douban rating 9.0). The author crawled the bullet comments of its first episode from iQiyi and built a word cloud from them to visualize audience feedback.

The article is divided into two parts: (1) crawling the bullet comments from iQiyi, and (2) processing the comments and generating a word cloud.

iQiyi bullet files are harder to crawl than most because the downloaded files are compressed and look like garbled binary data when opened directly. To find them, the author opens the browser's Network panel, searches for "bullet", and locates the binary files. The player requests a new bullet file for every 5 minutes of an episode.

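The URL pattern for bullet files is: https://cmts.iqiyi.com/bullet/{a}/{b}/{tvid}_300_{x}.z. Judging from the example URL in the scraping code, where the tvid 9000000005439200 gives the path 92/00, {a} is the third- and fourth-from-last digits of the tvid and {b} is its last two digits. The index x runs from 1 to the ceiling of the total duration divided by 300 seconds (one file per 5-minute interval). For the first episode (77 minutes, or 4620 seconds) this gives 16 files.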
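The URL arithmetic can be sketched as a small helper. This is a minimal sketch, not the author's code: the function name bullet_urls is hypothetical, and the rule for deriving the two path segments from the tvid is an assumption inferred from the single example URL used in the scraping code.

```python
import math


def bullet_urls(tvid: str, duration_seconds: int) -> list:
    """Build the bullet-file URLs for one episode (hypothetical helper)."""
    # One file per 300-second slice, numbered from 1.
    file_count = math.ceil(duration_seconds / 300)
    # Assumption: the two path segments are the last four digits of the
    # tvid split in half, as in .../bullet/92/00/9000000005439200_300_1.z
    seg_a, seg_b = tvid[-4:-2], tvid[-2:]
    base = f"https://cmts.iqiyi.com/bullet/{seg_a}/{seg_b}/{tvid}_300_"
    return [f"{base}{x}.z" for x in range(1, file_count + 1)]


# First episode: 77 minutes.
urls = bullet_urls("9000000005439200", 77 * 60)
print(len(urls))
print(urls[0])
```

For a 77-minute episode, 4620 / 300 = 15.4, so the ceiling gives 16 files.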

Scraping code (Python):

    import zlib
    import requests

    # The first episode has 16 bullet files, numbered 1 to 16.
    for x in range(1, 17):
        url = 'https://cmts.iqiyi.com/bullet/92/00/9000000005439200_300_' + str(x) + '.z'
        # Compressed bytes; they look garbled if opened directly.
        compressed = requests.get(url).content
        # wbits=15+32 lets zlib auto-detect a zlib or gzip header.
        xml = zlib.decompress(compressed, 15 + 32).decode('utf-8')
        # 'w' overwrites on a re-run instead of appending a second copy.
        with open('./iqiyi' + str(x) + '.xml', 'w', encoding='utf-8') as f:
            f.write(xml)
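The decompression step can be checked offline, without contacting iQiyi. The sketch below wraps the zlib call in a hypothetical helper, decode_bullet, and feeds it locally compressed sample data; the sample XML is made up for illustration and is not real iQiyi output.

```python
import gzip
import zlib


def decode_bullet(raw: bytes) -> str:
    """Decompress one bullet file and return its XML text (hypothetical helper)."""
    # wbits = 15 + 32: maximum window size, with automatic detection of
    # either a zlib or a gzip header.
    return zlib.decompress(raw, 15 + 32).decode("utf-8")


# Made-up sample standing in for a downloaded .z file.
sample_xml = "<danmu><entry><content>真实</content></entry></danmu>"
from_zlib = decode_bullet(zlib.compress(sample_xml.encode("utf-8")))
from_gzip = decode_bullet(gzip.compress(sample_xml.encode("utf-8")))
print(from_zlib == sample_xml, from_gzip == sample_xml)
```

Because of the header auto-detection, the same call handles both zlib-wrapped and gzip-wrapped payloads.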

The resulting XML files contain content fields that hold the actual comments. To extract these, the following code is used:

    from xml.dom.minidom import parse

    # Open the output file once instead of reopening it for every comment.
    with open("dan_mu.txt", mode="w", encoding="utf-8") as f:
        for x in range(1, 17):
            # Relative path matching the files written by the scraping step;
            # the original used an absolute path on the author's machine.
            dom_tree = parse("./iqiyi" + str(x) + ".xml")
            collection = dom_tree.documentElement
            entries = collection.getElementsByTagName("entry")
            for entry in entries:
                content = entry.getElementsByTagName("content")[0]
                # Skip empty <content> elements, which have no child nodes.
                if content.childNodes:
                    f.write(content.childNodes[0].data)
                    f.write("\n")
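The same extraction can be tried on an in-memory string before running it over the downloaded files. The sketch below swaps in xml.etree.ElementTree from the standard library rather than minidom. The helper name extract_comments and the sample document (including its root element name) are assumptions for illustration; only the entry and content tag names come from the article.

```python
import xml.etree.ElementTree as ET


def extract_comments(xml_text: str) -> list:
    """Return the first <content> text found under each <entry> (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    comments = []
    for entry in root.iter("entry"):
        # Like the article's code, take only the first <content>
        # element under each entry.
        content = entry.find(".//content")
        # Skip entries whose <content> is missing or empty.
        if content is not None and content.text:
            comments.append(content.text.strip())
    return comments


# Made-up sample; the real files may nest these elements differently.
sample = """<danmu>
  <entry><content>演技真好</content></entry>
  <entry><content></content></entry>
  <entry><content>太真实了</content></entry>
</danmu>"""
comments = extract_comments(sample)
print(comments)
```

Writing "\n".join(comments) to dan_mu.txt would then reproduce the one-comment-per-line layout used in the rest of the article.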

The extracted dan_mu.txt file contains all bullet comments, which are then processed for word‑cloud generation.

Word‑cloud creation uses the wordcloud and jieba libraries. The code performs Chinese word segmentation, removes stop words, and generates the cloud:

    from wordcloud import WordCloud
    import jieba
    import matplotlib.pyplot as plt

    with open('./dan_mu.txt', encoding='utf-8', mode='r') as f:
        myText = f.read()

    # Segment the Chinese text, then drop tokens shorter than two
    # characters and a few common filler words.
    stop_words = {"这个", "不是", "这么", "怎么", "这是", "还是"}
    words = [w for w in jieba.cut(myText) if len(w) > 1 and w not in stop_words]
    myText = " ".join(words)

    # font_path must point to a font that contains Chinese glyphs.
    wordcloud = WordCloud(background_color="white", font_path="simsun.ttf",
                          height=300, width=400).generate(myText)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
    wordcloud.to_file("wordCloudMo.png")
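The token-cleaning rule can be tested in isolation, without installing jieba or wordcloud. In this minimal sketch the helper filter_tokens is hypothetical and the token list is written by hand to stand in for the output of jieba.cut(); the stop-word list is the one from the article.

```python
from collections import Counter

STOP_WORDS = {"这个", "不是", "这么", "怎么", "这是", "还是"}


def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Drop tokens shorter than two characters and stop words, keeping order."""
    return [t for t in tokens if len(t) > 1 and t not in stop_words]


# Hand-written tokens standing in for jieba.cut() output.
tokens = ["这个", "孩子", "的", "演技", "真实", "孩子", "\n", "还是"]
kept = filter_tokens(tokens)
# The most frequent term is what the word cloud would draw largest.
top_term = Counter(kept).most_common(1)
print(kept)
print(top_term)
```

Counting the filtered tokens this way is a quick sanity check on which words will dominate the cloud before rendering it.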

The author notes that installing wordcloud may produce various errors; a linked CSDN article provides troubleshooting steps.

The final word cloud highlights frequent terms such as "真实" (real), "孩子" (child), "演技" (acting), indicating positive audience sentiment toward the drama.
