Scraping and Visualizing China’s Tourist Spot Data: From Web Crawl to Insights
This article walks through a complete workflow: extracting nationwide tourist attraction data from Qunar, cleaning it and enriching it with geographic coordinates, and running multi-level statistical analysis and visualizations (sales rankings, popularity metrics, heatmaps, and word clouds) to reveal regional tourism patterns across China.
Data Scraping
Qunar provides extensive tourism information covering almost all attractions in China. The author crawled listings for 32 provincial-level regions (excluding Hong Kong and Macau) and extracted each attraction's name, level, location, description, price, sales volume, and popularity. Every field is wrapped in its own try/except so that a missing element falls back to a default value instead of aborting the crawl. The resulting script collected 41,611 records.
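The per-field fallback described above can also be condensed into a small helper. The sketch below is illustrative rather than the author's code: `safe_text`, `split_area`, and the stub classes `FakeTag` and `FakeText` are hypothetical names, and the stubs only stand in for parsed HTML elements so the example runs without network access.

```python
def safe_text(extract, default=''):
    """Run an extraction callable; return `default` if the field is missing."""
    try:
        return extract()
    except (AttributeError, IndexError, TypeError):
        return default


def split_area(area):
    """Split 'province·city'; reuse the single value when there is no separator."""
    parts = area.split('·')
    return parts[0], parts[1] if len(parts) > 1 else parts[0]


class FakeText:
    """Hypothetical stand-in for an element that carries text."""
    def __init__(self, text):
        self.text = text


class FakeTag:
    """Hypothetical stand-in for a parsed element: find() returns None when absent."""
    def __init__(self, children):
        self.children = children

    def find(self, name):
        return self.children.get(name)


item = FakeTag({'name': FakeText('Example Park')})
record = {
    'name': safe_text(lambda: item.find('name').text),
    'price': safe_text(lambda: item.find('price').text),           # missing, falls back to ''
    'level': safe_text(lambda: item.find('level').text[0], '0'),   # missing, falls back to '0'
}
```

Catching only the exceptions a missing element can raise keeps genuine bugs visible, which a blanket `except Exception` would hide.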
<code>for i in s:  # s is presumably the list of attraction elements parsed from one results page
    inf = {}
    try:
        inf['level'] = i.find('span', class_='level').text[0]
    except Exception:
        inf['level'] = '0'  # no level shown for this attraction
    try:
        inf['price'] = i.find('span', class_='sight_item_price').find('em').text
    except Exception:
        inf['price'] = ''
    try:
        inf['name'] = i.find('a', class_='name').text
    except Exception:
        inf['name'] = ''
    try:
        inf['num'] = i.find('span', class_='hot_num').text  # sales volume
    except Exception:
        inf['num'] = ''
    try:
        area = i.find('span', class_='area').find('a').text
    except Exception:
        area = ''
    # The area reads 'province·city'; without a separator, use the same value for both fields
    parts = area.split('·')
    inf['add_pro'] = parts[0]
    inf['add_city'] = parts[1] if len(parts) > 1 else parts[0]
    try:
        inf['hot'] = i.find('span', class_='product_star_level').find('em').get('title').split(':')[1]
    except Exception:
        inf['hot'] = ''
    try:
        inf['descri'] = i.find('div', class_='intro color999').text
    except Exception:
        inf['descri'] = ''
</code>
Data Analysis
5A Scenic Spots
In the 5A sales ranking, the Terracotta Army leads by a wide margin, with roughly 1.67 times the sales of second-placed Guangzhou Chimelong Paradise. Six amusement parks appear in the top 20, which suggests that building theme parks can be a viable strategy for cities that lack natural or historic attractions.
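The ranking and the lead ratio quoted above come down to a sort and a division. A minimal sketch, assuming sales arrive as floats with NaN for missing values; the names and numbers below are placeholders, not the scraped data, and `top_n` is a hypothetical helper.

```python
import math


def top_n(records, n=20):
    """Return the n best-selling (name, sales) pairs, skipping missing sales."""
    valid = [(name, sales) for name, sales in records if not math.isnan(sales)]
    return sorted(valid, key=lambda pair: pair[1], reverse=True)[:n]


# Placeholder data: names and values are illustrative only.
records = [
    ('Spot A', 500.0),
    ('Spot B', 300.0),
    ('Spot C', float('nan')),   # no sales figure scraped
    ('Spot D', 120.0),
]
ranking = top_n(records, n=3)
lead_ratio = ranking[0][1] / ranking[1][1]   # how many times first place outsells second
```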
Jiangsu has the most 5A spots (41), followed by Zhejiang and Guangdong (21 each). Eastern provinces dominate the 5A distribution, while western regions lag, which the author attributes to weaker economic support.
For travelers seeking less‑crowded yet beautiful sites, places like Wuhan’s East Lake Mo Shan, Hongqi Canal, and Yansanpo are recommended despite lower sales.
<code>import numpy
from pyecharts import Bar, Map, Page  # assumes the legacy pyecharts 0.5.x API


def huati(name, num, k):
    """Chart the 20 best- and 20 worst-selling level-k attractions."""
    # Keep only attractions that have a sales figure
    kk = [[name[i], num[i]] for i in range(len(name)) if not numpy.isnan(num[i])]
    hh = sorted(kk, key=lambda i: i[1], reverse=True)
    page = Page()
    att, val = [], []
    for i in hh[:20]:  # top 20 by sales
        att.append(i[0])
        val.append(i[1])
    # Subtitle reads "<k>A scenic spot sales ranking"
    bar1 = Bar("", k+"A景区销量排行", title_pos="center", width=1200, height=600)
    bar1.add("", att, val, is_visualmap=True, visual_text_color='#fff', mark_point=["average"],
             mark_line=["average"], is_more_utils=True, is_label_show=True, is_datazoom_show=True, xaxis_rotate=45)
    page.add_chart(bar1)
    att, val = [], []
    for i in hh[-20:]:  # bottom 20 by sales
        att.append(i[0])
        val.append(i[1])
    bar2 = Bar("", k+"A景区销量排行", title_pos="center", width=1200, height=600)
    bar2.add("", att, val, is_visualmap=True, visual_text_color='#fff', mark_point=["average"],
             mark_line=["average"], is_more_utils=True, is_label_show=True, is_datazoom_show=True, xaxis_rotate=45)
    page.add_chart(bar2)
    page.render(k+"A景区销量bar.html")


def sum_pro(pro, k):
    """Map the number of level-k attractions in each province."""
    p = []
    c = []
    for i in set(pro):
        p.append(i)
        c.append(pro.count(i))
    # Title reads "Distribution of <k>A spots by province"
    chart = Map('各省'+k+'A景点分布', width=1200, height=600)
    chart.add("", p, c, is_visualmap=True, visual_range=[min(c), max(c)],
              visual_text_color='#000', is_map_symbol_show=True, is_label_show=True)
    chart.render('各省'+k+'A景点分布.html')
</code>
4A Scenic Spots
Chengdu Panda Base tops 4A sales. As in the 5A ranking, amusement parks feature prominently, making up about 40% of the top 20, which the author takes as a sign that cities such as Nanjing could benefit from more large-scale parks.
Shandong leads with 167 4A spots; Zhejiang, Jiangsu, Guangdong, Hebei, Sichuan, and Anhui each exceed 100. Tibet has the fewest (6).
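Per-province totals like those quoted above can be tallied in a single pass with `collections.Counter`. This is a sketch with placeholder input, not the scraped data; `count_by_province` is a hypothetical helper name.

```python
from collections import Counter


def count_by_province(provinces):
    """Count attractions per province in a single pass over the list."""
    return Counter(provinces)


# Placeholder input: one province label per attraction.
provinces = ['Shandong', 'Jiangsu', 'Shandong', 'Tibet', 'Shandong', 'Jiangsu']
counts = count_by_province(provinces)
leaders = counts.most_common(2)                      # provinces with the most attractions
fewest = min(counts.items(), key=lambda kv: kv[1])   # province with the fewest
```

Calling `list.count` once per distinct province rescans the whole list each time, so a counter is the cheaper choice when the list holds tens of thousands of records.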
3A Scenic Spots
Zhu Lin Changshou Mountain ranks highest in 3A sales (1,326), a figure that would place it in the upper tier of 4A spots.
Shandong again tops the count with 211 3A spots; Henan, Anhui, Liaoning, Heilongjiang, and Xinjiang each have over 100.
Comprehensive Comparison
Popularity scores show that nearly 30% of 5A spots have a score of 1, while virtually no 4A or 3A spots reach that score. About 60% of 3A spots score 0, indicating very low appeal.
<code>from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, FactorRange
from bokeh.plotting import figure


def hottt(fivhot, fouhot, thrhot):
    """Plot the share (%) of each popularity score within 5A, 4A and 3A spots."""
    atts = ['0', '0.7', '0.8', '0.9', '1']
    # Round each list on its own; zipping the three lists would truncate them to the shortest
    fiv = [round(v, 1) for v in fivhot]
    fou = [round(v, 1) for v in fouhot]
    th = [round(v, 1) for v in thrhot]
    levels = ['5A', '4A', '3A']
    data = {'att': atts, '5A': [], '4A': [], '3A': []}
    for a in atts:
        data['5A'].append(round(fiv.count(float(a)) / len(fiv) * 100, 3))
        data['4A'].append(round(fou.count(float(a)) / len(fou) * 100, 3))
        data['3A'].append(round(th.count(float(a)) / len(th) * 100, 3))
    print(data)
    output_file("bars.html")  # output file name
    x = [(att, level) for att in atts for level in levels]
    counts = sum(zip(data['5A'], data['4A'], data['3A']), ())
    source = ColumnDataSource(data=dict(x=x, counts=counts))
    # plot_height is the pre-3.0 Bokeh keyword; newer releases call it height
    # Title reads "Share of popularity scores by scenic spot level"
    p = figure(x_range=FactorRange(*x), plot_height=250, title="各等级景区人气值占比",
               toolbar_location=None, tools="")
    p.vbar(x='x', top='counts', width=0.9, source=source)
    show(p)
</code>
<code>import plotly.offline
import plotly.graph_objs as go


def box(q, w, e, l):
    """Draw box plots comparing 5A, 4A, 3A and all scenic spots."""
    a = go.Box(y=q, name='5A景区')   # 5A spots
    b = go.Box(y=w, name='4A景区')   # 4A spots
    c = go.Box(y=e, name='3A景区')   # 3A spots
    g = go.Box(y=l, name='所有景区')  # all spots
    data = [a, b, c, g]
    layout = go.Layout(legend=dict(font=dict(size=16)), orientation=270)
    fig = go.Figure(data=data, layout=layout)
    plotly.offline.plot(fig)  # render the figure so the layout settings are applied
</code>
A word cloud generated with R shows common terms such as "location", "culture", "leisure", "tourism", "experience", "park", "history", and "entertainment", reflecting the vocabulary tourism operators typically use in their descriptions.
Gaode Map Visualization
Gaode Map’s geocoding API converts scraped address strings into latitude and longitude. Example request format:
https://restapi.amap.com/v3/geocode/geo?address=ADDRESS&output=XML&key=<YOUR_KEY>&city=CITY
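To make the request format concrete, the sketch below assembles such a URL with the standard library and splits a returned coordinate string into numbers. It makes no network call. `build_geocode_url` and `parse_location` are hypothetical helpers, the key is a placeholder, and the coordinate string is an illustrative value in the "longitude,latitude" form that the API is assumed to return in its `location` field.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = 'https://restapi.amap.com/v3/geocode/geo'


def build_geocode_url(address, city, key):
    """Assemble the request URL; urlencode percent-encodes non-ASCII text."""
    query = urlencode({'address': address, 'city': city, 'key': key})
    return GEOCODE_ENDPOINT + '?' + query


def parse_location(location):
    """Split a 'longitude,latitude' string into two floats."""
    lng, lat = location.split(',')
    return float(lng), float(lat)


url = build_geocode_url('故宫', '北京', 'YOUR_KEY')   # placeholder key
lng, lat = parse_location('116.397,39.918')           # illustrative value, not an API response
```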
<code>import pandas
import requests


def trans(city, name, pro, level):
    """Geocode each attraction by name and append one row per attraction to a CSV file."""
    for i in range(len(name)):
        t = {}
        # 'key' must be filled in with your own Gaode (AMap) API key
        parameters = {'address': name[i], 'key': '', 'city': city[i]}
        html = requests.get('https://restapi.amap.com/v3/geocode/geo',
                            params=parameters).json()
        try:
            t['jingwei'] = html['geocodes'][0]['location']
        except (IndexError, KeyError):
            t['jingwei'] = '0,0'  # placeholder when no match is returned
        t['n'] = name[i]
        t['level'] = level[i]
        t['pro'] = pro[i]
        t['city'] = city[i]
        # DataFrame.append was removed in pandas 2.0, so build the one-row frame directly
        x = pandas.DataFrame([t])
        x.to_csv('55543.csv', encoding='utf-8', index=False, mode='a', header=False)
</code>
National Distribution Maps
Heatmaps and hexagonal density maps illustrate that Beijing has the richest tourism resources, while cities such as Chongqing, Guangzhou, Tianjin, and Suzhou also rank highly.
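The counting behind a density map can be sketched by binning coordinates into cells. The example below uses square one-degree cells rather than hexagons as a simplification, skips the `0,0` placeholder that the geocoding step writes for failed lookups, and runs on made-up points; `grid_density` is a hypothetical helper, not the author's plotting code.

```python
from collections import Counter
import math


def grid_density(points, cell_deg=1.0):
    """Count points per square cell of `cell_deg` degrees, skipping (0, 0) placeholders."""
    cells = Counter()
    for lng, lat in points:
        if lng == 0 and lat == 0:   # failed geocode, no real position
            continue
        cell = (math.floor(lng / cell_deg), math.floor(lat / cell_deg))
        cells[cell] += 1
    return cells


# Made-up coordinates as (longitude, latitude) pairs.
points = [(116.4, 39.9), (116.3, 39.95), (113.3, 23.1), (0.0, 0.0)]
density = grid_density(points)
```

Cells with the highest counts correspond to the hot areas of a heatmap; a smaller `cell_deg` gives finer resolution at the cost of sparser cells.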
The author also created an animated trajectory of tourism spot distribution across China, accessible via a public link.
For travelers interested in Hunan, the author recommends visiting Changsha, Zhangjiajie, Yongzhou, Huaihua, and Chenzhou.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.