
Web Scraping Anjuke Real Estate Data with Python: A Step‑by‑Step Guide

This article provides a comprehensive Python tutorial for scraping second‑hand housing community data from Anjuke, covering city selection, URL collection, HTML parsing with lxml, data cleaning, CSV export, and full‑city crawling strategies, complete with runnable code examples.

Python Programming Learning Circle

Web scraping is one of the most practical ways to start working with real-world data in Python. This article demonstrates the technique end to end, using a concrete project: crawling second-hand housing community data from the Anjuke website.

1. Introduction

The article focuses on extracting real‑estate data for Shijiazhuang city from the Anjuke website, aiming to collect detailed information for 16 fields such as community name, address, average price, property type, building area, and more.

2. City Selection

We choose Shijiazhuang as the target city and plan to crawl the first 500 communities (20 pages, 25 listings per page) to illustrate the process.

3. Community URL Collection

# Home page URL
url = 'https://sjz.anjuke.com/community/p1'
# Multi-page crawling: as an example we crawl the first 500 communities,
# 25 per page, 20 pages in total (pages are numbered from 1)
for i in range(1, 21):
    url = 'https://sjz.anjuke.com/community/p{}'.format(i)
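Equivalently, all 20 page URLs can be collected up front before crawling (a minimal sketch; `page_urls` is an illustrative name, not from the original script):

```python
# Build the 20 listing-page URLs for Shijiazhuang (pages are numbered from 1)
BASE = 'https://sjz.anjuke.com/community/p{}'
page_urls = [BASE.format(i) for i in range(1, 21)]
```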

4. Parsing HTML to Locate Fields

import requests
from lxml import etree

## First-level (listing) page parser
def get_link(url):
    # headers and get_proxy() are assumed to be defined elsewhere in the script
    text = requests.get(url=url, headers=headers, proxies={"http": "http://{}".format(get_proxy())}).text
    html = etree.HTML(text)
    link = html.xpath('.//div[@class="list-cell"]/a/@href')
    price = html.xpath('.//div[@class="list-cell"]/a/div[3]/div/strong/text()')
    return zip(link, price)  # pairs of (detail-page URL, average price)
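`get_link` returns an iterator of `(detail URL, price)` pairs built with `zip`. With illustrative stand-in values (the real links and prices come from the XPath queries above), the pairing works like this:

```python
# Illustrative values only -- in the script these come from the listing-page XPath queries
links = ['https://sjz.anjuke.com/community/view/1', 'https://sjz.anjuke.com/community/view/2']
prices = ['12000', '9800']
pairs = list(zip(links, prices))  # each element is (detail URL, price)
```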

For the second‑level detail pages, the following function extracts the required fields:

## Second-level (detail) page parser
def parse_message(url, price):
    dict_result = {'小区名称':'-','价格':'-','小区地址':'-','物业类型':'-','物业费':'-','总建面积':'-','总户数':'-','建造年代':'-','停车位':'-','容积率':'-','绿化率':'-','开发商':'-','物业公司':'-','所属商圈':'-','二手房源数':'-','租房房源数':'-'}
    text = requests.get(url=url, headers=headers, proxies={"http": "http://{}".format(get_proxy())}).text
    html = etree.HTML(text)
    # xpath() returns a list of matches; keep the first one, or '-' when empty
    def first(nodes):
        return nodes[0].strip() if nodes else '-'
    dict_result['小区名称'] = first(html.xpath('.//div[@class="comm-title"]/h1/text()'))
    dict_result['小区地址'] = first(html.xpath('.//div[@class="comm-title"]/h1/span/text()'))
    dict_result['物业类型'] = first(html.xpath('.//div[@class="comm-basic-mod  "]/div[2]/dl/dd[1]/text()|.//div[@class="comm-basic-mod has-pano-box "]/div[2]/dl/dd[1]/text()'))
    # ... (other fields omitted for brevity) ...
    dict_result['价格'] = price
    return dict_result
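The XPath extraction can be checked offline against a static snippet. The HTML below is a made-up stand-in mimicking the title block queried above, not Anjuke's real markup:

```python
from lxml import etree

# Made-up fragment mimicking the comm-title block parsed above
sample = '<div class="comm-title"><h1>示例小区 <span>示例地址</span></h1></div>'
html = etree.HTML(sample)

# Both queries return lists, which is why the parser needs a first-or-'-' fallback
name = html.xpath('.//div[@class="comm-title"]/h1/text()')
addr = html.xpath('.//div[@class="comm-title"]/h1/span/text()')
```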

5. Saving Data

## Write the results to the CSV file
def save_csv(result):
    # csv_write is assumed to be a csv.DictWriter opened at module level
    for row in result:
        csv_write.writerow(row)
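`save_csv` relies on a module-level `csv_write`. One way to set it up is with `csv.DictWriter` keyed on the same field names as `dict_result` (a sketch with a shortened field list and an in-memory buffer standing in for the output file; the real script would use all 16 Chinese field names and an actual `open()` call):

```python
import csv
import io

fieldnames = ['小区名称', '价格', '小区地址']  # shortened; the full script has 16 fields
buf = io.StringIO()  # stand-in for open('anjuke.csv', 'w', newline='', encoding='utf-8-sig')
csv_write = csv.DictWriter(buf, fieldnames=fieldnames)
csv_write.writeheader()
csv_write.writerow({'小区名称': '示例小区', '价格': '12000', '小区地址': '某区某街'})
```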

6. Main Crawling Loop (single‑page example)

# Main routine (single page)
k = 1
print("************************ Crawling page 1 ************************")
url = 'https://sjz.anjuke.com/community/p1'
link = get_link(url)
list_result = []
for j in link:
    try:
        result = parse_message(j[0], j[1])
        list_result.append(result)
        print("Crawled {} records".format(k))
        k += 1
        # random 5-10 second pause between requests to reduce the risk of blocking
        time.sleep(random.randint(5, 10))
    except Exception as err:
        print(err)
save_csv(list_result)
print("************************ Page 1 crawled successfully ************************")

7. Full‑City Crawling

The script can be extended to iterate over pages 1‑20, collecting up to 500 community records, with random delays and optional proxy usage to avoid anti‑scraping mechanisms.
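One way to organize the full-city run is a small driver that loops over the page URLs and takes the fetch, parse, and save steps as parameters. This is a sketch, not the original script: the function name `crawl_pages` is an assumption, and the demo below wires in stub callables (standing in for `get_link`, `parse_message`, and `save_csv`) so it runs offline:

```python
import random
import time

def crawl_pages(page_urls, fetch_links, parse, save, delay=False):
    """Crawl each listing page, parse every community on it, and save per page."""
    total = 0
    for url in page_urls:
        rows = []
        for link, price in fetch_links(url):
            try:
                rows.append(parse(link, price))
            except Exception as err:
                print(err)  # skip a failed detail page, keep crawling
        save(rows)
        total += len(rows)
        if delay:
            time.sleep(random.randint(5, 10))  # polite pause between pages
    return total

# Offline demo with stubs in place of get_link / parse_message / save_csv
demo_total = crawl_pages(
    ['p1', 'p2'],
    fetch_links=lambda url: [(url + '/a', '12000')],
    parse=lambda link, price: {'价格': price},
    save=lambda rows: None,
)
```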

8. Result

After execution, the data is stored in a CSV file containing fields such as community name, price, address, property type, construction year, parking spaces, floor‑area ratio, green ratio, developer, property company, business district, and counts of second‑hand and rental listings.

By following this guide, readers can gain practical experience in web scraping, HTML parsing, data cleaning, and CSV handling using Python.

Tags: Data Extraction, CSV, Real Estate, Web Scraping, anjuke, lxml
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
