
Scraping and Visualizing China’s Tourist Spot Data: From Web Crawl to Insights

This article details a complete workflow for extracting nationwide tourist attraction data from Qunar, cleaning and enriching it with geographic coordinates, and performing multi‑level statistical analysis and visualizations—including sales rankings, popularity metrics, heatmaps, and word clouds—to reveal regional tourism patterns across China.

Efficient Ops

Data Scraping

Qunar lists tourism information for almost every attraction in China. The author crawled listings for 32 provincial-level regions (excluding Hong Kong and Macau) and extracted each attraction's name, level, location, description, price, sales volume, and popularity. Because some listings lack fields, every lookup is wrapped in try/except with a default value, which let the script run to completion and collect 41,611 records.

<code># `s` is assumed to be the list of BeautifulSoup tags, one per attraction listing
for i in s:
    inf = {}
    try:
        # First character of the level label, e.g. '5' for a 5A spot
        inf['level'] = i.find('span', class_='level').text[0]
    except Exception:
        inf['level'] = '0'  # no level shown
    try:
        inf['price'] = i.find('span', class_='sight_item_price').find('em').text
    except Exception:
        inf['price'] = ''
    try:
        inf['name'] = i.find('a', class_='name').text
    except Exception:
        inf['name'] = ''
    try:
        inf['num'] = i.find('span', class_='hot_num').text
    except Exception:
        inf['num'] = ''
    try:
        area = i.find('span', class_='area').find('a').text
    except Exception:
        area = ''  # area missing entirely
    # Area reads 'province·city'; without a separator, use the value for both fields
    parts = area.split('·')
    inf['add_pro'] = parts[0]
    inf['add_city'] = parts[1] if len(parts) > 1 else parts[0]
    try:
        inf['hot'] = i.find('span', class_='product_star_level').find('em').get('title').split(':')[1]
    except Exception:
        inf['hot'] = ''
    try:
        inf['descri'] = i.find('div', class_='intro color999').text
    except Exception:
        inf['descri'] = ''</code>
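
The repeated try/except blocks above could be folded into one small helper. The following is a minimal sketch rather than the author's code: `safe_get` and `split_area` are hypothetical names, and the dictionary lookups stand in for the BeautifulSoup calls so the example runs without any third-party package.

```python
def safe_get(getter, default=''):
    """Call getter() and return its result, or default if the lookup fails."""
    try:
        return getter()
    except (AttributeError, IndexError, KeyError, TypeError):
        return default


def split_area(area):
    """Split 'province·city' into (province, city).

    When there is no separator, the single value is used for both
    fields, mirroring the fallback in the loop above.
    """
    parts = area.split('·')
    if len(parts) > 1:
        return parts[0], parts[1]
    return parts[0], parts[0]


# Stand-in for a parsed listing; with BeautifulSoup the getter would be
# something like: lambda: tag.find('span', class_='hot_num').text
row = {'name': 'Example Park', 'area': '江苏·南京'}
not_found = None  # what find() returns when a tag is absent

name = safe_get(lambda: row['name'])
sales = safe_get(lambda: row['hot_num'])            # missing key -> ''
level = safe_get(lambda: not_found.text[0], '0')    # failed find() -> '0'
province, city = split_area(safe_get(lambda: row['area']))
```

Catching only the exceptions a failed lookup can raise, instead of a bare `Exception`, keeps genuine bugs visible while still tolerating missing fields.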

Data Analysis

5A Scenic Spots

In the 5A sales ranking, the Terracotta Army leads by a wide margin, with about 1.67 times the sales of second-placed Guangzhou Chimelong Paradise. Six amusement parks appear in the top 20, suggesting that theme parks can be a viable strategy for cities that lack natural or historic attractions.

Jiangsu has the most 5A spots (41), followed by Zhejiang and Guangdong (21 each). Eastern provinces dominate the 5A distribution, while western regions lag, which the author attributes to weaker economic support.

For travelers seeking beautiful but less crowded sites, the author recommends places such as Moshan in Wuhan’s East Lake, the Hongqi Canal, and Yesanpo, all of which have relatively low sales.

<code>import numpy
from pyecharts import Bar, Map, Page  # pyecharts 0.5.x (pre-1.0) API

def huati(name, num, k):
    """Bar charts of the 20 best- and 20 worst-selling spots of level k."""
    # Keep only spots with a known sales figure
    kk = [[name[i], num[i]] for i in range(len(name)) if not numpy.isnan(num[i])]
    hh = sorted(kk, key=lambda i: i[1], reverse=True)
    page = Page()
    # Top 20 by sales
    att = [i[0] for i in hh[:20]]
    val = [i[1] for i in hh[:20]]
    bar1 = Bar("", k+"A景区销量排行", title_pos="center", width=1200, height=600)
    bar1.add("", att, val, is_visualmap=True, visual_text_color='#fff', mark_point=["average"],
             mark_line=["average"], is_more_utils=True, is_label_show=True, is_datazoom_show=True, xaxis_rotate=45)
    page.add_chart(bar1)
    # Bottom 20 by sales (the chart title is the same as for the top 20)
    att = [i[0] for i in hh[-20:]]
    val = [i[1] for i in hh[-20:]]
    bar2 = Bar("", k+"A景区销量排行", title_pos="center", width=1200, height=600)
    bar2.add("", att, val, is_visualmap=True, visual_text_color='#fff', mark_point=["average"],
             mark_line=["average"], is_more_utils=True, is_label_show=True, is_datazoom_show=True, xaxis_rotate=45)
    page.add_chart(bar2)
    page.render(k+"A景区销量bar.html")

def sum_pro(pro, k):
    """Map of the number of level-k spots per province; pro must be a list."""
    p = []
    c = []
    for i in set(pro):
        p.append(i)
        c.append(pro.count(i))
    # Named province_map to avoid shadowing the built-in map()
    province_map = Map('各省'+k+'A景点分布', width=1200, height=600)
    province_map.add("", p, c, is_visualmap=True, visual_range=[min(c), max(c)],
                     visual_text_color='#000', is_map_symbol_show=True, is_label_show=True)
    province_map.render('各省'+k+'A景点分布.html')
</code>
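
The ranking and per-province counts behind these charts can also be reproduced without a plotting library. The sketch below uses only the standard library; `rank_by_sales` and `count_by_province` are hypothetical names and the sample values are made up.

```python
import math
from collections import Counter


def rank_by_sales(names, sales, n=20):
    """Return (top_n, bottom_n) lists of (name, sales), skipping NaN sales."""
    pairs = [(name, value) for name, value in zip(names, sales)
             if not math.isnan(value)]
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return ranked[:n], ranked[-n:]


def count_by_province(provinces):
    """Count attractions per province in a single pass over the list."""
    return Counter(provinces)


# Made-up sample values, for illustration only
names = ['A', 'B', 'C', 'D']
sales = [120.0, float('nan'), 950.0, 30.0]
top, bottom = rank_by_sales(names, sales, n=2)
counts = count_by_province(['江苏', '浙江', '江苏'])
```

`Counter` walks the list once, whereas calling `list.count` for every distinct province rescans the whole list each time; for tens of thousands of records the difference can be noticeable.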

4A Scenic Spots

Chengdu Panda Base tops 4A sales. As in the 5A ranking, amusement parks are prominent, taking about 40% of the top 20, which suggests that cities such as Nanjing could benefit from more large-scale parks.

Shandong leads with 167 4A spots; Zhejiang, Jiangsu, Guangdong, Hebei, Sichuan, and Anhui each exceed 100. Tibet has the fewest (6).

3A Scenic Spots

Zhu Lin Changshou Mountain ranks highest in 3A sales (1,326), a figure that would place it in the upper tier of 4A spots.

Shandong again tops the count with 211 3A spots; Henan, Anhui, Liaoning, Heilongjiang, and Xinjiang each have over 100.

Comprehensive Comparison

Popularity scores show that nearly 30% of 5A spots score 1, while virtually no 4A or 3A spots do. About 60% of 3A spots score 0, indicating very low appeal.

<code>from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, FactorRange
from bokeh.plotting import figure

def hottt(fivhot, fouhot, thrhot):
    """Grouped bar chart: share of each popularity score within 5A, 4A and 3A spots."""
    atts = ['0', '0.7', '0.8', '0.9', '1']
    # Round each list on its own; zipping the three lists would truncate
    # them to the length of the shortest one and skew the percentages
    fiv = [round(v, 1) for v in fivhot]
    fou = [round(v, 1) for v in fouhot]
    th = [round(v, 1) for v in thrhot]
    levels = ['5A', '4A', '3A']
    data = {'att': atts, '5A': [], '4A': [], '3A': []}
    for att in atts:
        data['5A'].append(round(fiv.count(float(att)) / len(fiv) * 100, 3))
        data['4A'].append(round(fou.count(float(att)) / len(fou) * 100, 3))
        data['3A'].append(round(th.count(float(att)) / len(th) * 100, 3))
    print(data)
    output_file("bars.html")  # output file name
    # One (score, level) factor per bar, e.g. ('0.7', '5A')
    x = [(att, level) for att in atts for level in levels]
    counts = sum(zip(data['5A'], data['4A'], data['3A']), ())  # flatten in the same order as x
    source = ColumnDataSource(data=dict(x=x, counts=counts))
    # plot_height applies to Bokeh < 3.0; newer releases use height instead
    p = figure(x_range=FactorRange(*x), plot_height=250, title="各等级景区人气值占比",
               toolbar_location=None, tools="")
    p.vbar(x='x', top='counts', width=0.9, source=source)
    show(p)
</code>
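
As a dependency-free cross-check of the percentages fed into the chart, the share of each score can be computed per level, independently of the other levels. This is a minimal sketch: `score_shares` is a hypothetical name and the sample scores are made up.

```python
from collections import Counter


def score_shares(scores, buckets=(0.0, 0.7, 0.8, 0.9, 1.0)):
    """Return the percentage of scores that land on each bucket value.

    Scores are rounded to one decimal first, as in the function above.
    """
    if not scores:
        return {bucket: 0.0 for bucket in buckets}
    rounded = Counter(round(score, 1) for score in scores)
    total = len(scores)
    return {bucket: round(rounded[bucket] / total * 100, 3)
            for bucket in buckets}


# Made-up popularity scores; the two levels deliberately differ in length
five_a = [1.0, 0.9, 1.0, 0.8]
three_a = [0.0, 0.0, 0.7, 0.0, 0.9, 0.0, 0.0, 0.8, 0.0, 0.0]

shares_5a = score_shares(five_a)
shares_3a = score_shares(three_a)
```

Because each level is processed on its own, the denominator is always that level's full record count, however many spots the other levels contain.
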
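<code>import plotly.offline
import plotly.graph_objs as go

def box(q, w, e, l):
    """Side-by-side box plots for 5A, 4A, 3A and all spots."""
    a = go.Box(y=q, name='5A景区')
    b = go.Box(y=w, name='4A景区')
    c = go.Box(y=e, name='3A景区')
    g = go.Box(y=l, name='所有景区')
    data = [a, b, c, g]
    layout = go.Layout(legend=dict(font=dict(size=16)), orientation=270)
    fig = go.Figure(data=data, layout=layout)
    plotly.offline.plot(fig)  # plot the figure, not the raw traces, so the layout is applied
</code>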
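
The numbers a box plot summarizes can be inspected directly. The sketch below computes a five-number summary with the standard library (Python 3.8+); `five_number_summary` is a hypothetical name, the sample is made up, and quartile conventions vary, so the values may differ slightly from what Plotly draws.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of at least two numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')
    return min(values), q1, median, q3, max(values)


# Made-up scores, for illustration only
sample = [0.0, 0.7, 0.8, 0.8, 0.9, 1.0, 1.0]
summary = five_number_summary(sample)
```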

A word cloud generated with R shows common terms such as "location", "culture", "leisure", "tourism", "experience", "park", "history", and "entertainment", reflecting typical descriptions used by tourism operators.

Gaode Map Visualization

Gaode Map’s geocoding API converts scraped address strings into latitude and longitude. Example request format:

https://restapi.amap.com/v3/geocode/geo?address=ADDRESS&output=XML&key=<YOUR_KEY>&city=CITY
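<code>import pandas
import requests

def trans(city, name, pro, level):
    """Geocode each spot by name and city, appending one CSV row per spot."""
    for i in range(len(name)):
        # 'key' is left empty here; fill in your own AMap Web Service key
        parameters = {'address': name[i], 'key': '', 'city': city[i]}
        html = requests.get('https://restapi.amap.com/v3/geocode/geo',
                            params=parameters).json()
        t = {}
        try:
            t['jingwei'] = html['geocodes'][0]['location']
        except (IndexError, KeyError):
            t['jingwei'] = '0,0'  # no match, or a response without 'geocodes'
        t['n'] = name[i]
        t['level'] = level[i]
        t['pro'] = pro[i]
        t['city'] = city[i]
        # DataFrame.append was removed in pandas 2.0, so build the one-row frame directly
        x = pandas.DataFrame([t])
        # Columns are written in insertion order: jingwei, n, level, pro, city
        x.to_csv('55543.csv', encoding='utf-8', index=False, mode='a', header=False)
</code>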
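
Two small pieces around the geocoding call can be sketched with the standard library alone: building the request URL and turning the returned `location` string into numbers. This assumes the string is ordered longitude,latitude, as the AMap geocoding response is generally documented; `build_geocode_url` and `parse_location` are hypothetical names, `YOUR_KEY` is a placeholder, and the coordinates are made up.

```python
from urllib.parse import urlencode

GEOCODE_URL = 'https://restapi.amap.com/v3/geocode/geo'


def build_geocode_url(address, city, key):
    """Build the geocoding request URL; urlencode escapes non-ASCII text."""
    query = urlencode({'address': address, 'city': city, 'key': key})
    return f'{GEOCODE_URL}?{query}'


def parse_location(location):
    """Turn a 'lng,lat' string into floats, or None for the '0,0' placeholder."""
    lng_text, lat_text = location.split(',')
    lng, lat = float(lng_text), float(lat_text)
    if lng == 0.0 and lat == 0.0:
        return None  # geocoding failed for this row
    return lng, lat


url = build_geocode_url('故宫', '北京', 'YOUR_KEY')  # YOUR_KEY is a placeholder
point = parse_location('116.397,39.918')            # made-up coordinates
missing = parse_location('0,0')
```

Returning `None` for the `'0,0'` placeholder makes it easy to drop failed lookups before plotting, so they do not show up as a point at the origin.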

National Distribution Maps

Heatmaps and hexagonal density maps illustrate that Beijing has the richest tourism resources, while cities such as Chongqing, Guangzhou, Tianjin, and Suzhou also rank highly.
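
The idea behind a density map can be illustrated by counting points per grid cell. The sketch below uses square cells as a simpler stand-in for hexagonal binning and is not the author's mapping code; `grid_density` is a hypothetical name and the coordinates are made up.

```python
import math
from collections import Counter


def grid_density(points, cell_size=1.0):
    """Count (lng, lat) points per square grid cell of cell_size degrees.

    Each key is the (lng, lat) of the cell's south-west corner.
    """
    counts = Counter()
    for lng, lat in points:
        cell = (math.floor(lng / cell_size) * cell_size,
                math.floor(lat / cell_size) * cell_size)
        counts[cell] += 1
    return counts


# Made-up coordinates, for illustration only
points = [(116.39, 39.91), (116.41, 39.92), (113.26, 23.13)]
density = grid_density(points, cell_size=1.0)
busiest_cell, busiest_count = density.most_common(1)[0]
```

Cells with the highest counts correspond to the brightest areas of a heatmap; a smaller `cell_size` gives finer detail at the cost of sparser cells.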

The author also created an animated trajectory of tourism spot distribution across China, accessible via a public link.

For travelers interested in Hunan, the author recommends visiting Changsha, Zhangjiajie, Yongzhou, Huaihua, and Chenzhou.

Python, data visualization, web scraping, geocoding, tourism data
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
