Scraping Maoyan Real-Time Box Office Data with Selenium and Visualizing the Results
Using Python's Selenium library, this tutorial demonstrates how to scrape real-time box office data from Maoyan's regular page, extract movie names, total and incremental earnings, process the data with pandas, export to Excel, and create visual analyses of top‑10 films' revenues and market shares.
Each summer the movie market sees a surge in releases, and Maoyan (a Meituan subsidiary) holds over 40% of the online ticketing market. This article shows how to use Selenium to crawl Maoyan's real‑time box‑office page, collect total box‑office, box‑office share, schedule share, and seat‑share data, and visualize the results.
The target URL is https://piaofang.maoyan.com/box-office?ver=normal (the regular version). The regular version is chosen because the professional version hides the box‑office numbers inside invisible div elements, making it harder to scrape.
First, import the required libraries and open the page:
<code>from selenium import webdriver</code>
<code>from selenium.webdriver.common.by import By</code>
<code></code>
<code># Launch Chrome and load the regular (non-professional) box-office page</code>
<code>driver = webdriver.Chrome()</code>
<code>driver.get('https://piaofang.maoyan.com/box-office?ver=normal')</code>
Inspecting the page reveals that movie names are stored in p elements with class movie-name. Similar XPath expressions locate the total box‑office (sumBox), incremental box‑office (boxDesc-wrap red-color), box‑office share (boxRate-wrap), schedule share (countRate-wrap), and seat‑share (seatRate-wrap) elements.
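To see how an exact `@class='...'` predicate behaves without launching a browser, the sketch below runs the same style of XPath against a small hand-written fragment using Python's standard `xml.etree.ElementTree`. The markup and film names are hypothetical and only loosely modeled on the class names above; the real Maoyan page is more complex and may change.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment that imitates two table rows.
SAMPLE = """
<div>
  <p class="movie-name">Film A</p>
  <div class="sumBox">1.2亿</div>
  <div class="boxDesc-wrap red-color">350.5</div>
  <p class="movie-name">Film B</p>
  <div class="sumBox">8000万</div>
  <div class="boxDesc-wrap red-color">120.0</div>
</div>
"""

root = ET.fromstring(SAMPLE.strip())

# The predicate compares the entire class attribute string.
names = [el.text for el in root.findall(".//*[@class='movie-name']")]
totals = [el.text for el in root.findall(".//*[@class='sumBox']")]
boxes = [el.text for el in root.findall(".//*[@class='boxDesc-wrap red-color']")]

# Only one of the two class tokens: no element has exactly this attribute value.
partial = root.findall(".//*[@class='red-color']")

print(names, totals, boxes, len(partial))
```

Because the predicate compares the whole attribute string, `red-color` on its own matches nothing, which is why the lookup for the incremental box‑office spells out the full `boxDesc-wrap red-color` value. The lookups against the live page follow.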
<code># Each call returns a list of WebElement objects</code>
<code>name = driver.find_elements(By.XPATH, "//*[@class='movie-name']")</code>
<code>sumBox = driver.find_elements(By.XPATH, "//*[@class='sumBox']")</code>
<code>box = driver.find_elements(By.XPATH, "//*[@class='boxDesc-wrap red-color']")</code>
<code>boxRate = driver.find_elements(By.XPATH, "//*[@class='boxRate-wrap']")</code>
<code>showRate = driver.find_elements(By.XPATH, "//*[@class='countRate-wrap']")</code>
<code>seatRate = driver.find_elements(By.XPATH, "//*[@class='seatRate-wrap']")</code>
Since find_elements returns element objects rather than strings, the text must be read from each element's .text attribute. A helper function transfer_to_text converts a list of elements into a list of strings:
<code># Helper: convert a list of WebElements into a list of their visible text</code>
<code>def transfer_to_text(input_list):</code>
<code>    return [element.text for element in input_list]</code>
<code></code>
<code># Extract the text content of every column</code>
<code>new_name = transfer_to_text(name)</code>
<code>new_sumBox = transfer_to_text(sumBox)</code>
<code>new_box = transfer_to_text(box)</code>
<code>new_boxRate = transfer_to_text(boxRate)</code>
<code>new_showRate = transfer_to_text(showRate)</code>
<code>new_seatRate = transfer_to_text(seatRate)</code>
The extracted lists are combined row by row into a single list of tuples:
<code># zip pairs the i-th entry of every list and stops at the shortest list</code>
<code>file_info = list(zip(new_name, new_sumBox, new_box, new_boxRate, new_showRate, new_seatRate))</code>
<code>print(file_info)</code>
Optionally, the data can be written to an Excel file. The list of tuples is first turned into a dictionary of columns, then a pandas.DataFrame is created and saved:
<code># Export the scraped rows to a spreadsheet</code>
<code>import pandas as pd</code>
<code></code>
<code>def export_to_excel(file_info):</code>
<code>    # One list per column, filled row by row</code>
<code>    data = {</code>
<code>        'new_name': [],</code>
<code>        'new_sumBox': [],</code>
<code>        'new_box': [],</code>
<code>        'new_boxRate': [],</code>
<code>        'new_showRate': [],</code>
<code>        'new_seatRate': []</code>
<code>    }</code>
<code></code>
<code>    for item in file_info:</code>
<code>        new_name, new_sumBox, new_box, new_boxRate, new_showRate, new_seatRate = item</code>
<code>        data['new_name'].append(new_name)</code>
<code>        data['new_sumBox'].append(new_sumBox)</code>
<code>        data['new_box'].append(new_box)</code>
<code>        data['new_boxRate'].append(new_boxRate)</code>
<code>        data['new_showRate'].append(new_showRate)</code>
<code>        data['new_seatRate'].append(new_seatRate)</code>
<code></code>
<code>    df = pd.DataFrame(data)</code>
<code>    df.to_excel('output.xlsx', index=False)</code>
<code></code>
<code>export_to_excel(file_info)</code>
Before visualizing, note that the total box‑office column (new_sumBox) contains strings with the units “万” (ten thousand) or “亿” (hundred million). A conversion function normalizes all values to “万”:
<code>import re</code>
<code></code>
<code># Load the exported table (assumes export_to_excel(file_info) has been run)</code>
<code>file_excel = pd.read_excel('output.xlsx')</code>
<code></code>
<code># Normalize a box-office string to units of 万 (ten thousand)</code>
<code>def convert_to_wan(value):</code>
<code>    pattern = r'(\d+(\.\d+)?)(亿|万)?'</code>
<code>    match = re.match(pattern, str(value))</code>
<code>    if not match:</code>
<code>        return 0</code>
<code>    number = float(match.group(1))</code>
<code>    unit = match.group(3)</code>
<code>    if unit == '亿':</code>
<code>        number *= 10000  # 1 亿 = 10,000 万</code>
<code>    elif unit is None:</code>
<code>        number /= 10000  # no unit suffix: the value is assumed to be in yuan</code>
<code>    return number</code>
<code></code>
<code>file_excel['new_sumBox'] = file_excel['new_sumBox'].apply(convert_to_wan)</code>
<code>print(file_excel)</code>
The top 10 movies by total and by incremental box‑office are then selected for plotting:
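<code># new_box is scraped as text; convert it so the sort is numeric (assumes values carry no unit suffix)</code>
<code>file_excel['new_box'] = pd.to_numeric(file_excel['new_box'], errors='coerce')</code>
<code># Select the top 10 by each metric</code>
<code>sumBox = file_excel.sort_values(by='new_sumBox', ascending=False).head(10)</code>
<code>box = file_excel.sort_values(by='new_box', ascending=False).head(10)</code>
Bar charts for total box‑office and incremental box‑office are created with matplotlib. A colormap supplies randomly chosen shades, and a Chinese font is loaded so that the labels render correctly: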
<code>import random</code>
<code>import matplotlib.pyplot as plt</code>
<code>from matplotlib.font_manager import FontProperties</code>
<code></code>
<code># Data for the two charts</code>
<code>new_name_values1 = sumBox['new_name']</code>
<code>new_sumBox_values1 = sumBox['new_sumBox']</code>
<code>new_name_values2 = box['new_name']</code>
<code>new_box_values2 = box['new_box']</code>
<code></code>
<code># Colors and font: 20 shades of blue, and a local font file with Chinese glyphs</code>
<code>color_palette = plt.get_cmap('Blues', 20)</code>
<code>colors = [color_palette(random.randint(0, 19)) for _ in new_name_values1]</code>
<code>font = FontProperties(fname='Songti.ttc', size=12)</code>
<code></code>
<code># Bar chart: total box office</code>
<code>plt.figure(figsize=(10, 6))</code>
<code>bars = plt.bar(new_name_values1, new_sumBox_values1, color=colors)</code>
<code>plt.xlabel('电影名称', fontproperties=font)</code>
<code>plt.ylabel('总票房,单位:万', fontproperties=font)</code>
<code>plt.title('总票房排名', fontproperties=font, size=18)</code>
<code>plt.xticks(rotation=30, fontproperties=font)</code>
<code># Label each bar with its height</code>
<code>for bar in bars:</code>
<code>    height = bar.get_height()</code>
<code>    plt.text(bar.get_x() + bar.get_width() / 2, height, f'{height:.0f}', ha='center', va='bottom', fontproperties=font)</code>
<code>plt.tight_layout()</code>
<code>plt.show()</code>
The same approach is used for the incremental box‑office chart, with a different colormap:
<code># Bar chart: incremental box office (labelled 综合票房 on the chart)</code>
<code>color_palette = plt.get_cmap('Greens', 20)</code>
<code>colors = [color_palette(random.randint(0, 19)) for _ in new_name_values2]</code>
<code>plt.figure(figsize=(10, 6))</code>
<code>bars = plt.bar(new_name_values2, new_box_values2, color=colors)</code>
<code>plt.xlabel('电影名称', fontproperties=font)</code>
<code>plt.ylabel('综合票房,单位:万', fontproperties=font)</code>
<code>plt.title('综合票房排名', fontproperties=font, size=18)</code>
<code>plt.xticks(rotation=30, fontproperties=font)</code>
<code># Label each bar with its height</code>
<code>for bar in bars:</code>
<code>    height = bar.get_height()</code>
<code>    plt.text(bar.get_x() + bar.get_width() / 2, height, f'{height}', ha='center', va='bottom', fontproperties=font)</code>
<code>plt.tight_layout()</code>
<code>plt.show()</code>
A multi‑series bar chart visualizes box‑office share, schedule share, and seat share for the top 10 movies, while a secondary y‑axis plots the incremental box‑office as a line:
<code># Grouped bar chart: three percentage series per film</code>
<code># reset_index makes the positions 0..9 valid labels after sorting</code>
<code>new_name_values = box['new_name'].reset_index(drop=True)</code>
<code>new_box_values = box['new_box'].reset_index(drop=True)</code>
<code>new_boxRate_values = box['new_boxRate'].str.rstrip('%').astype(float).reset_index(drop=True)</code>
<code>new_showRate_values = box['new_showRate'].str.rstrip('%').astype(float).reset_index(drop=True)</code>
<code>new_seatRate_values = box['new_seatRate'].str.rstrip('%').astype(float).reset_index(drop=True)</code>
<code>font = FontProperties(fname='Songti.ttc', size=12)</code>
<code>color_palette = plt.get_cmap('Set3', 3)</code>
<code>colors = [color_palette(i) for i in range(3)]</code>
<code>plt.figure(figsize=(10, 6))</code>
<code>bar_width = 0.2</code>
<code>index = range(len(new_name_values))</code>
<code>plt.bar(index, new_boxRate_values, color=colors[0], width=bar_width, label='综合票房占比')</code>
<code>plt.bar([i + bar_width for i in index], new_showRate_values, color=colors[1], width=bar_width, label='排片占比')</code>
<code>plt.bar([i + 2 * bar_width for i in index], new_seatRate_values, color=colors[2], width=bar_width, label='上座率')</code>
<code>plt.xlabel('电影名称', fontproperties=font)</code>
<code>plt.ylabel('百分比', fontproperties=font)</code>
<code>plt.title('综合票房top10数据', fontproperties=font, size=18)</code>
<code>plt.xticks([i + bar_width for i in index], new_name_values, rotation=30, fontproperties=font)</code>
<code>plt.legend(loc='upper right', prop=font)</code>
<code># Annotate each bar with its value</code>
<code>for i in index:</code>
<code>    plt.text(i, new_boxRate_values[i], f'{new_boxRate_values[i]:.1f}', ha='center', va='bottom', fontproperties=font)</code>
<code>    plt.text(i + bar_width, new_showRate_values[i], f'{new_showRate_values[i]:.1f}', ha='center', va='bottom', fontproperties=font)</code>
<code>    plt.text(i + 2 * bar_width, new_seatRate_values[i], f'{new_seatRate_values[i]:.1f}', ha='center', va='bottom', fontproperties=font)</code>
<code># Line chart on a secondary y-axis: incremental box office</code>
<code>ax2 = plt.twinx()</code>
<code>ax2.plot(index, new_box_values, color='green', marker='o', label='综合票房')</code>
<code>ax2.set_ylabel('综合票房,单位:万', fontproperties=font)</code>
<code>ax2.legend(loc='upper right', bbox_to_anchor=(0.96, 0.8), prop=font)</code>
<code>plt.tight_layout()</code>
<code>plt.show()</code>
In conclusion, Selenium works well for the regular page, though other frameworks such as Scrapy could also be used. The professional version presents additional challenges such as dynamic font encryption.
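As a closing sanity check on the cleaning step, the sketch below exercises the two text conversions the charts depend on: normalizing box‑office strings to 万 and stripping the percent sign from the share columns. It runs on hypothetical sample strings rather than live Maoyan data. The helper names `to_wan` and `to_percent` are illustrative and do not appear in the code above, and treating a bare number as yuan simply mirrors the fallback branch of convert_to_wan.

```python
import re

# A number, optionally followed by one of the two units used on the page.
_AMOUNT = re.compile(r'(\d+(?:\.\d+)?)(亿|万)?')

def to_wan(value):
    """Convert strings such as '1.5亿' or '8000万' to a float in units of 万."""
    match = _AMOUNT.match(str(value).strip())
    if match is None:
        return 0.0              # unparseable input falls back to 0
    number = float(match.group(1))
    unit = match.group(2)
    if unit == '亿':
        return number * 10000   # 1 亿 = 10,000 万
    if unit == '万':
        return number
    return number / 10000       # assumption: a bare number is in yuan

def to_percent(value):
    """Convert a string such as '12.5%' to the float 12.5."""
    return float(str(value).strip().rstrip('%'))

# Hypothetical inputs covering each branch.
samples = ['1.5亿', '8000万', '25000', 'n/a']
converted = [to_wan(s) for s in samples]
shares = [to_percent(s) for s in ['12.5%', '3%']]

print(converted)
print(shares)
```

Running such a check on a few scraped values before plotting makes it easier to spot rows where the page format differs from what the pattern expects.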
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.