Web Scraping and Data Analysis of Pet Cat Breeds Using Python
This article shows how to scrape cat breed information from a dedicated website, store the data in Excel, and analyze and visualize it with Python libraries such as requests, lxml, pandas, pyecharts, and stylecloud. The visualizations cover relationship graphs, geographic distribution, size ratios, price extremes, and word clouds.
The article begins with a brief introduction to the Juejin "Use Code to Attract Cats" activity, posing two questions about cat ownership and curiosity, and explains the author's motivation to learn about various pet cat breeds through coding.
Data is collected by crawling the cat breed website www.maomijiaoyi.com. The following Python code fetches the breed index page, extracts each breed's name, price, and detail URL, and prints the results:
from lxml import etree
import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}
url_base = "http://www.maomijiaoyi.com"
session = requests.Session()
# Access the breed index page and collect detail links
url = url_base + "/index.php?/pinzhongdaquan_5.html"
res = session.get(url, headers=headers)
html = etree.HTML(res.text)
main_data = []
for a_tag in html.xpath("//div[@class='pinzhong_left']/a"):
    # Build the absolute URL of the breed's detail page
    url = url_base + a_tag.xpath("./@href")[0]
    pet_name, pet_price = None, None
    pet_name_tag = a_tag.xpath("./div[@class='pet_name']/text()")
    if pet_name_tag:
        pet_name = pet_name_tag[0].strip()
    pet_price_tag = a_tag.xpath("./div[@class='pet_price']/span/text()")
    if pet_price_tag:
        pet_price = pet_price_tag[0].strip()
    print(pet_name, pet_price, url)
    main_data.append((pet_name, pet_price, url))
After obtaining the links, the script visits each detail page, parses basic attributes, appearance attributes, detailed descriptions, and image URLs, then downloads the images. The extracted data is saved to an Excel file named 猫咪.xlsx. The original post includes sample screenshots of the scraped data and the downloaded images.
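The detail-page step is not reproduced in this summary, so the following is only a minimal sketch of how such a page could be parsed with lxml. The HTML sample, the class names (`attr`, `key`, `value`, `big_img`), the sample values, and the helper `parse_detail_page` are all hypothetical; the real site's markup is not shown in the source and would need different XPath expressions.

```python
from lxml import etree

# Hypothetical detail-page markup with made-up values. The real site's
# structure is not shown in the article, so these class names are assumptions.
SAMPLE_DETAIL_HTML = """
<html><body>
  <div class="details">
    <div class="attr"><span class="key">中文学名</span><span class="value">布偶猫</span></div>
    <div class="attr"><span class="key">原产地</span><span class="value">美国</span></div>
    <div class="attr"><span class="key">参考价格</span><span class="value">3000-20000</span></div>
  </div>
  <div class="big_img"><img src="/uploads/a.jpg"/><img src="/uploads/b.jpg"/></div>
</body></html>
"""


def parse_detail_page(html_text, url_base="http://www.maomijiaoyi.com"):
    """Return (attribute dict, list of absolute image URLs) for one page."""
    html = etree.HTML(html_text)
    record = {}
    # Each attribute row holds a key span and a value span (assumed layout)
    for row in html.xpath("//div[@class='attr']"):
        key = row.xpath("./span[@class='key']/text()")
        value = row.xpath("./span[@class='value']/text()")
        if key and value:
            record[key[0].strip()] = value[0].strip()
    # Turn relative image paths into absolute URLs
    image_urls = [
        url_base + src
        for src in html.xpath("//div[@class='big_img']/img/@src")
    ]
    return record, image_urls


record, image_urls = parse_detail_page(SAMPLE_DETAIL_HTML)
print(record)
print(image_urls)
```

In a real crawl, the HTML would come from `session.get(detail_url, headers=headers).text` for each URL in `main_data`, and the collected records could then be written out with `pd.DataFrame(records).to_excel("猫咪.xlsx", index=False)`, which requires an Excel engine such as openpyxl.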
Data analysis starts by loading the Excel file with pandas:
import pandas as pd
df = pd.read_excel("猫咪.xlsx")
Various visualizations are created using the pyecharts library:
A relationship graph shows each breed and its aliases.
A bar chart displays the geographic distribution of breeds.
A treemap visualizes the distribution of breeds across countries.
A pie chart illustrates the proportion of different body sizes.
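Since the chart code is omitted here, the sketch below shows one plausible way to prepare the counts behind the bar and pie charts and hand them to pyecharts. The column names "原产地" (origin) and "体型" (body size) are assumptions, as is the small sample DataFrame; only "中文学名" and "参考价格" are confirmed by the source. The helper `build_charts` is hypothetical and not the author's code.

```python
import pandas as pd

# Illustrative sample rows. "原产地" and "体型" are assumed column names;
# the real spreadsheet may label these columns differently.
df = pd.DataFrame({
    "中文学名": ["布偶猫", "缅因猫", "暹罗猫", "英国短毛猫"],
    "原产地": ["美国", "美国", "泰国", "英国"],
    "体型": ["大型", "大型", "中型", "中型"],
})

# Number of breeds per country (bar chart) and per body size (pie chart)
country_counts = df["原产地"].value_counts()
size_counts = df["体型"].value_counts()


def build_charts(country_counts, size_counts):
    """Build a Bar and a Pie chart from the two count Series."""
    # Imported here so the aggregation above runs without pyecharts installed
    from pyecharts import options as opts
    from pyecharts.charts import Bar, Pie

    bar = (
        Bar()
        .add_xaxis(country_counts.index.tolist())
        .add_yaxis("Breeds", country_counts.tolist())
        .set_global_opts(title_opts=opts.TitleOpts(title="Breeds by origin"))
    )
    pie = (
        Pie()
        .add("", [(name, int(n)) for name, n in size_counts.items()])
        .set_global_opts(title_opts=opts.TitleOpts(title="Body size share"))
    )
    return bar, pie


print(country_counts.to_dict())
print(size_counts.to_dict())
```

Calling `bar.render("origin.html")` or `pie.render("size.html")` on the returned charts would write standalone HTML files; in a notebook, `render_notebook()` displays them inline.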
from pyecharts import options as opts
from pyecharts.charts import Graph, Bar, TreeMap, Pie
# (code omitted for brevity; the full snippets are present in the source)
Price analysis splits the "参考价格" (reference price) column, identifies the cheapest and most expensive breeds, and prints the results:
# Split "low-high" price ranges into two columns
tmp = df["参考价格"].str.split("-", expand=True)
tmp.columns = ["最低价格", "最高价格"]
# Drop breeds without a complete price range, then convert to integers
tmp = tmp.dropna().astype("int")
# Look up the breed names at the lowest minimum price and the highest maximum price
cheap_cat = df.loc[tmp.index[tmp["最低价格"] == tmp["最低价格"].min()], "中文学名"].to_list()
costly_cat = df.loc[tmp.index[tmp["最高价格"] == tmp["最高价格"].max()], "中文学名"].to_list()
print("Cheapest breeds:", cheap_cat)
print("Most expensive breeds:", costly_cat)
Word clouds are generated for descriptive columns using the stylecloud library. Example code for creating a general word cloud and separate clouds for personality traits and living habits is provided:
import stylecloud, jieba
from IPython.display import Image
# (code omitted for brevity; the full snippets are present in the source)
Finally, a mind-map style diagram groups breeds by body size, producing a hierarchical view of the cat taxonomy.
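The grouping behind such a diagram can be sketched as a nested structure of name/children dictionaries, which is the shape pyecharts' `Tree` chart accepts. This is an illustration under assumptions rather than the author's code: the "体型" (body size) column name, the sample rows, and the helpers `build_size_tree` and `render_tree` are hypothetical, and the original may have used a different chart type.

```python
import pandas as pd

# Illustrative sample rows; "体型" (body size) is an assumed column name.
df = pd.DataFrame({
    "中文学名": ["布偶猫", "缅因猫", "暹罗猫", "英国短毛猫"],
    "体型": ["大型", "大型", "中型", "中型"],
})


def build_size_tree(df, root_name="猫"):
    """Group breeds by body size into a root -> size -> breed hierarchy."""
    children = [
        {
            "name": size,
            "children": [{"name": breed} for breed in group["中文学名"]],
        }
        # sort=False keeps sizes in order of first appearance
        for size, group in df.groupby("体型", sort=False)
    ]
    return [{"name": root_name, "children": children}]


def render_tree(tree_data, path="cat_tree.html"):
    """Render the hierarchy with pyecharts (imported lazily)."""
    from pyecharts.charts import Tree

    return Tree().add("", tree_data).render(path)


tree_data = build_size_tree(df)
print(tree_data)
```

`render_tree(tree_data)` would write the diagram to an HTML file; it is kept separate so the grouping logic can be inspected without pyecharts installed.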
References:
https://juejin.cn/post/7024369534119182367
http://www.maomijiaoyi.com/