
Weibo Analysis of 'The Legend of the Year' Using Python: Scraping, Word Cloud, and Network Graph

This article demonstrates how to use Python to scrape Weibo comments about the TV series 'The Legend of the Year', generate word clouds, compute character frequencies and co‑occurrences, and visualize the resulting character relationship network with Gephi.

Python Programming Learning Circle

In this tutorial, the author analyzes the popularity and discussion of the Chinese TV series "The Legend of the Year" on Weibo by first scraping comments from the series' official super‑topic page using a custom Python spider.

The spider is built with argparse for command-line arguments and a weibo class that handles login and comment retrieval; the argument-parsing entry point of the script is shown below.

import argparse

parser = argparse.ArgumentParser(description="weibo comments spider")
parser.add_argument('-u', dest='username', help='weibo username', default='')  # your Weibo username
parser.add_argument('-p', dest='password', help='weibo password', default='')  # your Weibo password
parser.add_argument('-m', dest='max_page', help='max number of comment pages to crawl (an int larger than 0, or "all")', default='all')  # number of comment pages to crawl
parser.add_argument('-l', dest='link', help='weibo comment link', default='')  # the Weibo post URL to crawl from
parser.add_argument('-t', dest='url_type', help='weibo comment link type (pc or phone)', default='pc')
args = parser.parse_args()

wb = weibo()
username = args.username
password = args.password
if args.max_page == 'all':
    max_page = 'all'
else:
    max_page = int(float(args.max_page))
url = args.link
url_type = args.url_type
if not username or not password or not max_page or not url or not url_type:
    raise ValueError('argument error')
wb.login(username, password)
wb.getComments(url, url_type, max_page)
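The weibo class itself is not reproduced in the article; only its interface is known from the two calls above. A minimal skeleton of that assumed interface (the method bodies here are placeholders, not the author's implementation) might look like:

```python
class weibo:
    """Hypothetical skeleton of the spider class the script above assumes.
    Only login() and getComments() are known from the calling code; the
    bodies below are placeholders, not the author's actual implementation."""

    def __init__(self):
        self.logged_in = False
        self.comments = []  # collected comment strings

    def login(self, username, password):
        # Placeholder: a real implementation would authenticate against
        # Weibo and keep the session cookies for later requests.
        self.logged_in = bool(username and password)

    def getComments(self, url, url_type, max_page):
        # Placeholder: a real implementation would page through the comment
        # listing ('pc' and 'phone' URLs differ) until max_page pages are read.
        if not self.logged_in:
            raise RuntimeError('call login() first')
        return self.comments
```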

After obtaining the raw comments, the text is cleaned and tokenized. The author first attempts Chinese word segmentation with jieba, but finds it insufficient for proper-name extraction, so a predefined dictionary of character names and aliases is created.

import jieba

test = 'temp.txt'  # path of the text to analyse
with open(test, 'r', encoding='utf-8') as f:
    text = f.read()
seg_list = jieba.cut(text, cut_all=True, HMM=False)
print("Full Mode: " + "/ ".join(seg_list))  # full mode
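Once the name dictionary exists, character mentions can be counted directly instead of relying on the segmenter. A minimal sketch of that step, using only the standard library (the alias map and names here are placeholders, not the article's actual data):

```python
from collections import Counter

# Illustrative alias -> canonical-name map; in the article this is loaded
# from a space-separated file. The names below are placeholders.
synonymous = {'alias1': 'CharacterA', 'alias2': 'CharacterB'}
names = set(synonymous.values())

def count_mentions(comments):
    """Count mentions of each canonical character name across comments."""
    counter = Counter()
    for comment in comments:
        # normalize aliases to canonical names first
        for alias, name in synonymous.items():
            comment = comment.replace(alias, name)
        for name in names:
            counter[name] += comment.count(name)
    return counter

comments = ['alias1 meets CharacterB', 'CharacterA again']
print(count_mentions(comments))  # Counter({'CharacterA': 2, 'CharacterB': 1})
```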

The cleaned comments are then processed to count the frequency of each character and the co‑occurrence of character pairs. The results are saved as CSV files suitable for network visualization.

import codecs

def synonymous_names(synonymous_dict_path):
    """Load an 'alias canonical_name' mapping, one space-separated pair per line."""
    synonymous_dict = {}
    with codecs.open(synonymous_dict_path, 'r', 'utf-8') as f:
        lines = f.read().split('\n')
    for l in lines:
        if l:
            synonymous_dict[l.split(' ')[0]] = l.split(' ')[1]
    return synonymous_dict


def clean_text(text):
    """Read the raw comment file and return a list of cleaned comment strings."""
    with open(text, encoding='gb18030') as f:
        para = f.read().split('\r\n')[0].split('\u3000')
    return [p.replace('\n', '').replace(' ', '') for p in para if p != '']

from pandas import DataFrame

# person_counter maps each character name to its mention count (built in the counting step)
text_node = [[name, name, str(times)] for name, times in person_counter.items()]
node_data = DataFrame(text_node, columns=['Id', 'Label', 'Weight'])
node_data.to_csv('node.csv', encoding='gbk')
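The edge list for Gephi can be produced in the same spirit: count how often two character names co-occur in the same comment and write the pairs as weighted undirected edges. A minimal sketch (the function and output file names are assumptions, not the author's exact code):

```python
from collections import Counter
from itertools import combinations
import csv

def cooccurrence_edges(comments, names, out_path='edge.csv'):
    """Count pairs of names appearing in the same comment and write a
    Gephi-style edge list (Source, Target, Type, Weight)."""
    pair_counter = Counter()
    for comment in comments:
        present = sorted(n for n in names if n in comment)
        for a, b in combinations(present, 2):
            pair_counter[(a, b)] += 1
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Source', 'Target', 'Type', 'Weight'])
        for (a, b), weight in pair_counter.items():
            writer.writerow([a, b, 'Undirected', weight])
    return pair_counter
```

Gephi imports this CSV directly as an undirected edge table, matching the Id column of node.csv.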

Word clouds are generated to illustrate the most frequent terms associated with each main character, revealing distinct discussion patterns. Subsequently, the author uses Gephi to construct and visualize a character relationship graph, showing that the protagonist appears more often than all other characters combined and that the network spans up to five layers of retweets, predominantly among female users in first‑ and second‑tier cities.
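The per-character word clouds are driven by term frequencies: tokens from comments that mention a given character are tallied, and the resulting counts are fed to a rendering library. A minimal sketch of the frequency step (function name and filtering rule are assumptions; the jieba token lists are taken as precomputed):

```python
from collections import Counter

def top_terms(comments, character, tokens_per_comment, k=5):
    """Tally the most frequent tokens from comments mentioning a character.
    tokens_per_comment holds the jieba segmentation of each comment."""
    counter = Counter()
    for comment, tokens in zip(comments, tokens_per_comment):
        if character in comment:
            counter.update(t for t in tokens if len(t) > 1)  # drop single characters
    return counter.most_common(k)
```

These frequencies can then be passed to the third-party wordcloud package's WordCloud.generate_from_frequencies (with a Chinese font supplied via font_path) to render clouds like those in the article.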

Figures in the original article display sample word clouds, the character relationship network, and statistical charts of user demographics. The analysis is based on the methodology described in Ren et al., “WeiboEvents: A Crowd Sourcing Weibo Visual Analytic System” (IEEE PacificVis 2014).

Tags: Python · data analysis · social media · web scraping · Weibo · network visualization · word cloud
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
