Analyzing Bilibili Comment API and Building a Search Tool
This article describes how to discover Bilibili's comment XML APIs, decode the embedded CRC32‑hashed user IDs, choose an appropriate database schema, and implement a Python‑PHP tool that retrieves, filters, and displays comments based on video CID and keyword.
It is well known that Bilibili does not expose the sender of a danmaku (bullet comment), so an abusive user cannot be blocked directly from the comment itself. Yet the platform does let viewers block a specific user's comments, which implies the sender's identity is present somewhere in the data interface. The author, while learning web crawling, set out to locate the comment API.
After inspecting a video with the browser’s developer tools, two XML‑based endpoints were found:
https://comment.bilibili.com/ + cid + .xml
https://api.bilibili.com/x/v1/dm/list.so?oid= + cid
The cid is a unique numeric identifier for each video part (P), usually an 8‑9 digit number that can be found by searching the page source. A further endpoint maps an aid (video ID) to its cids: https://www.bilibili.com/widget/getPageList?aid= + aid .
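The getPageList endpoint returns a JSON array with one entry per part. A minimal sketch of extracting the cids from such a response — the field names (`page`, `pagename`, `cid`) and the sample payload below are illustrative assumptions, not verbatim output:

```python
import json

def cids_from_pagelist(payload: str) -> dict:
    """Map each part number to its cid, given a getPageList JSON body."""
    return {part["page"]: part["cid"] for part in json.loads(payload)}

# Invented sample shaped like a getPageList?aid=... response
sample = ('[{"page": 1, "pagename": "P1", "cid": 12345678},'
          ' {"page": 2, "pagename": "P2", "cid": 12345679}]')
print(cids_from_pagelist(sample))  # {1: 12345678, 2: 12345679}
```

In practice the payload would come from a `requests.get` call against the URL above.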
Once the comment XML is fetched, the author examined the data fields. Each comment is a `<d>` element whose `p` attribute carries eight comma‑separated fields, among them the Unix send timestamp and a CRC32 hash of the sender's UID rendered in hexadecimal. CRC32 cannot be reversed directly, but the UID space is small enough that a rainbow table makes lookup feasible.
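The hashing scheme — CRC32 over the decimal UID string, printed as lowercase hex — can be reproduced with the standard library. A quick sketch (the UIDs below are made up):

```python
import zlib

def uid_hash(uid: int) -> str:
    """CRC32 of the decimal UID string, in the lowercase hex form
    embedded in the comment XML."""
    return format(zlib.crc32(str(uid).encode()) & 0xFFFFFFFF, "x")

# CRC32 is not invertible in closed form, so the practical way back to a
# UID is a precomputed table (or brute force) over the UID space.
table = {uid_hash(uid): uid for uid in range(1, 1000)}
h = uid_hash(42)
print(h, "->", table[h])  # the hash resolves back to UID 42
```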
For storing the hash, the author weighed VARCHAR against BIGINT . The 8‑character hex string is just a 32‑bit value (0‑0xffffffff), so an unsigned INT suffices even against Bilibili's ~600 million users. The decision was to use an unsigned INT as the primary key and build a rainbow table on the server.
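As a sketch of that schema decision, here is the same design in SQLite (the author's server presumably runs MySQL, where the key would be declared `INT UNSIGNED`; SQLite's INTEGER covers the 32‑bit range either way, and the table and column names are illustrative):

```python
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
# Store the 32-bit CRC32 value itself as an integer primary key (4 bytes)
# rather than the 8-character hex string.
conn.execute("CREATE TABLE rainbow (hash INTEGER PRIMARY KEY, uid INTEGER NOT NULL)")
conn.executemany(
    "INSERT OR IGNORE INTO rainbow (hash, uid) VALUES (?, ?)",
    ((zlib.crc32(str(uid).encode()), uid) for uid in range(1, 1000)),
)

# Resolving a hex hash from the comment XML means converting it back to
# an integer before the indexed lookup.
hex_hash = format(zlib.crc32(b"432"), "x")
row = conn.execute(
    "SELECT uid FROM rainbow WHERE hash = ?", (int(hex_hash, 16),)
).fetchone()
print(row[0])  # 432
```

`INSERT OR IGNORE` papers over the rare CRC32 collision; a real table would need a policy for colliding UIDs.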
Estimating storage, 600 million records would require roughly 27 GB, which fits within the author’s 40 GB server.
To expose the data, a Python script was written that accepts two arguments (video cid and a keyword) and outputs the matching comments, the CRC32‑hashed UID, and the timestamp. The script is invoked from PHP via exec , the results are looked up in the database to retrieve the original UID, and a JSON response is sent to the front‑end.
The Python script is:
<code>import requests
from bs4 import BeautifulSoup
import re
import io
import sys

# Force UTF-8 output so Chinese text survives the pipe back to PHP
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

req = requests.get('https://comment.bilibili.com/' + sys.argv[1] + '.xml')
req.encoding = req.apparent_encoding
soup = BeautifulSoup(req.text, 'html.parser').find_all(name='d')

result = ""
for i in soup:
    # Strip the <d ...> tag markup to leave only the comment text
    s = re.sub('<(.*?)>', '', str(i))
    index = 0
    if len(sys.argv[2]) > 0:
        index = s.find(sys.argv[2])
    if index != -1:
        # Split the raw tag on commas: [6] is the user hash, [4] the timestamp
        result += str(i).split(",")[6] + "," + s + "," + str(i).split(",")[4] + ","
print(result)
</code>
The front‑end code is minimal but functional, displaying the retrieved comments. The author notes that the rainbow table is still being populated (an estimated four days) and that a brute‑force lookup fallback has been added, which slows queries.
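The script emits a flat comma‑joined sequence of (hash, text, timestamp) triples, which the PHP side must split back apart before the database lookup. A sketch of that parsing (in Python for consistency; note the format breaks if a comment itself contains a comma, a limitation worth keeping in mind):

```python
def parse_output(raw: str):
    """Split the script's 'hash,text,timestamp,...' output into triples."""
    parts = raw.split(",")[:-1]  # drop the empty piece after the trailing comma
    return [tuple(parts[i:i + 3]) for i in range(0, len(parts), 3)]

# Invented sample in the script's output format
sample = "abcdef12,hello world,1500000000,0badc0de,another comment,1500000001,"
print(parse_output(sample))
```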
Overall, the article provides a step‑by‑step walkthrough of discovering undocumented APIs, decoding hashed identifiers, designing a suitable storage schema, and implementing a full‑stack tool for searching Bilibili comments.