Python‑Based Scraping, Cleaning, Sentiment Analysis and Visualization of Douban Movie Reviews
The article walks through a full Python workflow that scrapes up to 500 Douban movie reviews for "Dying to Survive" and "Hidden Blade," cleans and stores them in pandas, performs SnowNLP sentiment analysis, and visualizes city distribution, rating trends, and word clouds with pyecharts.
This article demonstrates a complete data‑analysis workflow on Chinese movie reviews from Douban, using the films "Dying to Survive" (《我不是药神》) and "Hidden Blade" (《邪不压正》) as case studies.
0. Requirement Analysis
Obtain review data via web scraping.
Clean and store the data.
Analyze city distribution, sentiment, and rating trends.
Practice pandas, web‑scraping and visualization skills.
1. Preparation
1.1 Web‑page analysis
Douban limits crawling: only 500 comments per film are publicly accessible, and requests are throttled to roughly 40 per minute during the day and 60 at night. The start parameter in the URL controls pagination; each click of the "next page" button increases start by 20, but manually incrementing it by 10 also works.
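The pagination scheme described above can be sketched as a URL generator; the subject ID used here is a placeholder, not a real Douban ID:

```python
# Douban exposes at most 500 comments; the `start` query parameter pages
# through them, here in steps of 10 as the article suggests.
def comment_urls(subject_id, step=10, limit=500):
    base = ("https://movie.douban.com/subject/{}/comments"
            "?start={}&limit=20&sort=new_score&status=P")
    return [base.format(subject_id, start) for start in range(0, limit, step)]

urls = comment_urls("1234567")  # placeholder subject ID
```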
1.2 Layout analysis
Key fields to extract:
User ID
Comment content
Score
Comment date
User city (requires visiting the user’s profile page)
2. Data acquisition – crawling
2.1 Get cookies
Douban requires authentication cookies. The cookies can be copied from Chrome’s developer tools.
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
cookies = {
    'cookie': 'bid=GOOb4vXwNcc; douban-fav-remind=1; viewed="27611266_26886337"; ps=y; ue="citpys原创分享@163.com"; push_noty_num=0; push_doumail_num=0; ap=1; loc-last-index-location-id="108288"; ll="108288"; dbcl2="187285881:N/y1wyPpmA8"; ck=4wlL'
}
```
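Instead of pasting the whole Cookie header as a single value, it can be split into a proper dict; this small helper is not part of the original script, just a convenience sketch:

```python
def cookie_str_to_dict(raw):
    # Split "k1=v1; k2=v2" into {"k1": "v1", "k2": "v2"};
    # split on the first "=" only, since values may themselves contain "=".
    return dict(pair.split("=", 1) for pair in raw.split("; "))

cookies = cookie_str_to_dict("bid=GOOb4vXwNcc; ck=4wlL")
```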
2.2 Requesting comment pages
```python
url = "https://movie.douban.com/subject/" + str(id) + "/comments?start=" + str(page * 10) + "&limit=20&sort=new_score&status=P"
res = requests.get(url, headers=headers, cookies=cookies)
res.encoding = "utf-8"
if res.status_code == 200:
    print("\n第{}页短评爬取成功!".format(page + 1))   # "page {} of short comments scraped successfully!"
    print(url)
else:
    print("\n第{}页爬取失败!".format(page + 1))       # "failed to scrape page {}!"
```
2.3 Anti‑scraping delay
```python
time.sleep(round(random.uniform(1, 2), 2))   # random 1–2 s pause between requests
```
2.4 Parsing logic
Because some comments carry no score, the XPath that targets the score column may actually return the date. The script checks the value's format and swaps the two fields when necessary.
```python
name = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a/text()'.format(i))
score = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/span[2]/@title'.format(i))
date = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/span[3]/@title'.format(i))
# Unrated comments shift the columns: the "score" slot then holds the date.
if not re.compile(r'\d{4}-\d{2}-\d{2}').match(score[0]):
    date = score
    score = ["null"]
content = x.xpath('//*[@id="comments"]/div[{}]/div[2]/p/span/text()'.format(i))
```
2.5 Movie name extraction
```python
# NOTE: the HTML tag literals originally embedded in this pattern were lost
# when the article was extracted; only the expression's skeleton survives.
# It captures the movie title that precedes "短评" ("short comments") on the page.
pattern = re.compile('.*?.*?(.*?) 短评', re.S)
global movie_name
movie_name = re.findall(pattern, res.text)[0]
```
3. Data storage
The collected fields are stored in a pandas DataFrame and saved as a CSV file.
```python
infos = {'name': name_list, 'city': city_list, 'content': content_list, 'score': score_list, 'date': date_list}
data = pd.DataFrame(infos, columns=['name', 'city', 'content', 'score', 'date'])
data.to_csv(str(ID) + "_comments.csv")
```
4. Data cleaning
City information is noisy (empty, overseas, malformed). The script filters Chinese characters, removes punctuation, and matches the remaining strings against the city list provided by pyecharts.
```python
line = raw_city.strip()              # raw_city: scraped city string (variable name assumed)
p2 = re.compile('[^\u4e00-\u9fa5]')  # drop everything that is not a Chinese character
zh = " ".join(p2.split(line)).strip()
zh = ",".join(zh.split())
line = re.sub('[A-Za-z0-9!!,%\[\],。]', "", zh)
```
After cleaning, the script builds a dictionary, result, that counts the occurrences of each city.
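The counting step can be sketched with collections.Counter; the helper name and the stand-in city set are illustrative, with the real script matching against the city list shipped with pyecharts:

```python
from collections import Counter

# Tally cleaned city names, keeping only those pyecharts can place on a map.
def count_cities(cleaned, known):
    return Counter(c for c in cleaned if c in known)

result = count_cities(["北京", "上海", "北京", "Mars"], {"北京", "上海"})
```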
5. Sentiment analysis with SnowNLP
SnowNLP provides Chinese word segmentation, POS tagging, sentiment scoring, text classification, keyword extraction, summarization, etc. The sentiment score ranges from 0 (negative) to 1 (positive); scores below 0.5 are treated as negative.
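The 0-to-1 score described above can be aggregated into the curve's data points by bucketing. A minimal sketch of that aggregation follows; with SnowNLP installed each score would come from SnowNLP(text).sentiments, but here the scores are precomputed so the sketch runs standalone, and the function name is illustrative:

```python
# Bucket each comment's sentiment score (0 = negative, 1 = positive)
# into 0.1-wide bins and count comments per bin.
def sentiment_histogram(scores, width=0.1):
    info = {}
    for s in scores:
        bucket = round(int(s / width) * width, 1)
        info[bucket] = info.get(bucket, 0) + 1
    return sorted(info.items())   # (bucket, count) pairs, as the Line chart expects

hist = sentiment_histogram([0.92, 0.88, 0.95, 0.41, 0.67])
```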
```python
attr, val = [], []
info = count_sentiment(csv_file)
info = sorted(info.items(), key=lambda x: x[0], reverse=False)
for each in info[:-1]:
    attr.append(each[0])
    val.append(each[1])
line = Line(csv_file + ":影评情感分析")               # chart title: "movie review sentiment analysis"
line.add("", attr, val, is_smooth=True, is_more_utils=True)
line.render(csv_file + "_情感分析曲线图.html")        # "sentiment analysis curve"
```
6. Visualization and interpretation
Using pyecharts, the following charts are generated:
Geo map (dot map) of comment‑origin cities.
Geo heatmap of comment density.
Bar chart ranking cities by comment count.
Pie chart of city distribution.
Line chart of daily rating trends.
Word‑clouds for each film.
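The daily rating trend requires converting Douban's star titles into numbers first; a sketch of that step follows, where the label-to-score mapping reflects Douban's five star titles (力荐=5 down to 很差=1) and the function name and row layout are illustrative:

```python
# Map Douban star titles to 1–5 and average the scores per day.
STAR = {"力荐": 5, "推荐": 4, "还行": 3, "较差": 2, "很差": 1}

def daily_average(rows):
    # rows: (date, star_title) pairs; unrated ("null") comments are skipped
    totals = {}
    for date, star in rows:
        if star in STAR:
            totals.setdefault(date, []).append(STAR[star])
    return {d: sum(v) / len(v) for d, v in sorted(totals.items())}

trend = daily_average([("2018-07-05", "力荐"), ("2018-07-05", "推荐"), ("2018-07-06", "null")])
```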
Key observations:
Top 10 cities for "Dying to Survive": Beijing, Shanghai, Nanjing, Hangzhou, Shenzhen, Guangzhou, Chengdu, Changsha, Chongqing, Xi’an.
Top 10 cities for "Hidden Blade": Beijing, Shanghai, Guangzhou, Chengdu, Hangzhou, Nanjing, Xi’an, Shenzhen, Changsha, Harbin.
Sentiment distribution shows a strong positive bias (most scores >0.5).
Rating spikes occur within the first week of release, with a small pre‑release “preview” segment.
Word‑clouds highlight themes such as "China", "reality", "social", "hope" for "Dying to Survive" and frequent mentions of director Jiang Wen for "Hidden Blade".
7. Conclusion
The project reinforces pandas manipulation and web‑scraping techniques.
Building a domain‑specific sentiment corpus would improve analysis accuracy.
Pyecharts provides an attractive way to present geographic and statistical results.
All source code is available on GitHub: https://github.com/Ctipsy/DA_projects/tree/master/我不是药神
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.