Analyzing Bilibili Comment API and Building a Search Tool
This article describes how to discover Bilibili's comment XML APIs, decode the embedded CRC32‑hashed user IDs, choose an appropriate database schema, and implement a Python‑PHP tool that retrieves, filters, and displays comments based on video CID and keyword.
It is well known that Bilibili does not expose the sender of a danmaku (bullet comment), so an abusive user cannot be blocked directly from the comment itself. Yet the platform does let viewers block a specific user's comments, which implies the sender's identity is present somewhere in the data interface. The author, while learning web crawling, set out to locate the comment API.
After inspecting a video with the browser’s developer tools, two XML‑based endpoints were found:
https://comment.bilibili.com/ + cid + .xml
https://api.bilibili.com/x/v1/dm/list.so?oid= + cid
The cid is a unique numeric identifier for each video part (P), usually an 8‑9 digit number that can be found by searching the page source. A further endpoint maps an aid (video ID) to its cids: https://www.bilibili.com/widget/getPageList?aid= + aid .
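The getPageList endpoint returns a JSON array with one entry per part. A minimal sketch of extracting the cids from such a response — the field names (`page`, `pagename`, `cid`) and the sample payload below are illustrative assumptions, not verbatim output:

```python
import json

def cids_from_pagelist(payload: str) -> dict:
    """Map each part number to its cid, given a getPageList JSON body."""
    return {part["page"]: part["cid"] for part in json.loads(payload)}

# Invented sample shaped like a getPageList?aid=... response
sample = ('[{"page": 1, "pagename": "P1", "cid": 12345678},'
          ' {"page": 2, "pagename": "P2", "cid": 12345679}]')
print(cids_from_pagelist(sample))  # {1: 12345678, 2: 12345679}
```

In practice the payload would come from a `requests.get` call against the URL above.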
Once the comment XML is fetched, the author examined the data fields. Each comment is a `<d>` element whose `p` attribute carries eight comma‑separated fields, among them the Unix send timestamp and a CRC32 hash of the sender's UID rendered in hexadecimal. CRC32 cannot be reversed directly, but the UID space is small enough that a rainbow table makes lookup feasible.
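The hashing scheme — CRC32 over the decimal UID string, printed as lowercase hex — can be reproduced with the standard library. A quick sketch (the UIDs below are made up):

```python
import zlib

def uid_hash(uid: int) -> str:
    """CRC32 of the decimal UID string, in the lowercase hex form
    embedded in the comment XML."""
    return format(zlib.crc32(str(uid).encode()) & 0xFFFFFFFF, "x")

# CRC32 is not invertible in closed form, so the practical way back to a
# UID is a precomputed table (or brute force) over the UID space.
table = {uid_hash(uid): uid for uid in range(1, 1000)}
h = uid_hash(42)
print(h, "->", table[h])  # the hash resolves back to UID 42
```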
For storing the hash, the author weighed VARCHAR against BIGINT . The 8‑character hex string is just a 32‑bit value (0‑0xffffffff), so an unsigned INT suffices even against Bilibili's ~600 million users. The decision was to use an unsigned INT as the primary key and build a rainbow table on the server.
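As a sketch of that schema decision, here is the same design in SQLite (the author's server presumably runs MySQL, where the key would be declared `INT UNSIGNED`; SQLite's INTEGER covers the 32‑bit range either way, and the table and column names are illustrative):

```python
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
# Store the 32-bit CRC32 value itself as an integer primary key (4 bytes)
# rather than the 8-character hex string.
conn.execute("CREATE TABLE rainbow (hash INTEGER PRIMARY KEY, uid INTEGER NOT NULL)")
conn.executemany(
    "INSERT OR IGNORE INTO rainbow (hash, uid) VALUES (?, ?)",
    ((zlib.crc32(str(uid).encode()), uid) for uid in range(1, 1000)),
)

# Resolving a hex hash from the comment XML means converting it back to
# an integer before the indexed lookup.
hex_hash = format(zlib.crc32(b"432"), "x")
row = conn.execute(
    "SELECT uid FROM rainbow WHERE hash = ?", (int(hex_hash, 16),)
).fetchone()
print(row[0])  # 432
```

`INSERT OR IGNORE` papers over the rare CRC32 collision; a real table would need a policy for colliding UIDs.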
Estimating storage, 600 million records would require roughly 27 GB, which fits within the author’s 40 GB server.
To expose the data, a Python script was written that accepts two arguments (video cid and a keyword) and outputs the matching comments, the CRC32‑hashed UID, and the timestamp. The script is invoked from PHP via exec , the results are looked up in the database to retrieve the original UID, and a JSON response is sent to the front‑end.
The Python script is:
<code>import requests
from bs4 import BeautifulSoup
import re
import io
import sys

# Force UTF-8 output so Chinese text survives the pipe back to PHP
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

req = requests.get('https://comment.bilibili.com/' + sys.argv[1] + '.xml')
req.encoding = req.apparent_encoding
soup = BeautifulSoup(req.text, 'html.parser').find_all(name='d')

result = ""
for i in soup:
    # Strip the <d ...> tag markup to leave only the comment text
    s = re.sub('<(.*?)>', '', str(i))
    index = 0
    if len(sys.argv[2]) > 0:
        index = s.find(sys.argv[2])
    if index != -1:
        # Split the raw tag on commas: [6] is the user hash, [4] the timestamp
        result += str(i).split(",")[6] + "," + s + "," + str(i).split(",")[4] + ","
print(result)
</code>
The front‑end code is minimal but functional, displaying the retrieved comments. The author notes that the rainbow table is still being populated (an estimated four days) and that a brute‑force lookup fallback has been added, which slows queries.
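The script emits a flat comma‑joined sequence of (hash, text, timestamp) triples, which the PHP side must split back apart before the database lookup. A sketch of that parsing (in Python for consistency; note the format breaks if a comment itself contains a comma, a limitation worth keeping in mind):

```python
def parse_output(raw: str):
    """Split the script's 'hash,text,timestamp,...' output into triples."""
    parts = raw.split(",")[:-1]  # drop the empty piece after the trailing comma
    return [tuple(parts[i:i + 3]) for i in range(0, len(parts), 3)]

# Invented sample in the script's output format
sample = "abcdef12,hello world,1500000000,0badc0de,another comment,1500000001,"
print(parse_output(sample))
```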
Overall, the article provides a step‑by‑step walkthrough of discovering undocumented APIs, decoding hashed identifiers, designing a suitable storage schema, and implementing a full‑stack tool for searching Bilibili comments.