Top 10 Essential Python Libraries for Data Analysis and Machine Learning
This tutorial introduces ten practical Python libraries (Pandas, NumPy, Matplotlib, Seaborn, Plotly, Scikit-learn, Dask, PySpark, Bokeh, and Prophet) with code examples that cover data cleaning, visualization, and predictive modeling.
In Python data analysis, a working knowledge of the core libraries saves considerable time. Each of the ten libraries below is shown with a short code example, and together they cover the workflow from data processing to machine learning.
1. Pandas: The All‑Rounder for Structured Data
Pandas excels at handling tabular data, offering efficient data cleaning and transformation capabilities.
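As a self-contained illustration of cleaning and transformation, the sketch below builds a small table in memory, so it runs without any input file. The table, its column names (`city`, `age`, `register_date`), and its values are invented for illustration only.

```python
import pandas as pd

# Hypothetical customer table; columns and values are invented for this sketch.
customers = pd.DataFrame({
    "city": ["Austin", "Boston", "Austin", "Denver"],
    "age": [34.0, None, 28.0, 45.0],
    "register_date": ["2023-01-15", "2023-02-03", "2023-02-20", "2023-03-08"],
})

# Cleaning: fill the missing age with the median of the known ages.
customers["age"] = customers["age"].fillna(customers["age"].median())

# Transformation: parse the date strings so date-based accessors become available.
customers["register_date"] = pd.to_datetime(customers["register_date"])

# Aggregation: average age per city and number of sign-ups per calendar month.
mean_age_by_city = customers.groupby("city")["age"].mean()
signups_per_month = customers["register_date"].dt.month.value_counts().sort_index()

print(mean_age_by_city)
print(signups_per_month)
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on a column selection, keeps the code working under pandas' newer copy-on-write behavior.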
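<code># Read an Excel file and handle missing values
import pandas as pd
df = pd.read_excel('customer_data.xlsx')
# Fill missing ages with the median; assigning back avoids the chained inplace call that newer pandas versions warn about
df['age'] = df['age'].fillna(df['age'].median())
# Data transformation: convert date strings to datetime values
df['register_date'] = pd.to_datetime(df['register_date'])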
</code>2. NumPy: The Accelerated Engine for Multidimensional Arrays
NumPy provides high‑performance numerical computation, ideal for large‑scale data operations.
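To show what loop-free numerical computation looks like in practice, here is a minimal sketch of axis-wise aggregation, broadcasting, and boolean masking on a two-dimensional array. The figures and the commission rates are made up for illustration.

```python
import numpy as np

# Hypothetical sales figures: rows are regions, columns are quarters.
quarterly_sales = np.array([
    [1200, 1500,  800, 2000],
    [ 900, 1100, 1300,  700],
])

# Aggregate along an axis without writing a Python loop.
total_per_region = quarterly_sales.sum(axis=1)    # one total per row
mean_per_quarter = quarterly_sales.mean(axis=0)   # one mean per column

# Broadcasting: a different commission rate per region, applied to every quarter.
rates = np.array([[0.05], [0.08]])                # shape (2, 1)
commission = quarterly_sales * rates              # shape (2, 4)

# Boolean mask: keep only the values above 1000.
strong_quarters = quarterly_sales[quarterly_sales > 1000]

print(total_per_region)
print(mean_per_quarter)
print(strong_quarters)
```

Each of these operations runs in compiled code over the whole array, which is where the speed advantage over element-by-element Python loops comes from.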
<code>import numpy as np
# Create an array and perform vectorized operations
sales = np.array([1200, 1500, 800, 2000])
commission = sales * 0.05 # 5% commission on each sale
total = np.sum(sales) # Total sales: 5500
</code>3. Matplotlib: The Swiss Army Knife for Basic Plotting
Matplotlib can quickly generate line charts, scatter plots, and other fundamental visualizations.
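For the line chart and scatter plot mentioned above, a minimal sketch using Matplotlib's object-oriented interface might look like this. The monthly figures and the output file name are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly figures, invented for this sketch.
months = [1, 2, 3, 4, 5, 6]
revenue = [120, 135, 128, 150, 162, 171]
ad_spend = [10, 12, 11, 15, 16, 18]

# Two side-by-side axes on one figure.
fig, (ax_line, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Line chart: revenue over time.
ax_line.plot(months, revenue, marker="o")
ax_line.set_xlabel("Month")
ax_line.set_ylabel("Revenue")
ax_line.set_title("Monthly Revenue")

# Scatter plot: relationship between ad spend and revenue.
ax_scatter.scatter(ad_spend, revenue)
ax_scatter.set_xlabel("Ad Spend")
ax_scatter.set_ylabel("Revenue")
ax_scatter.set_title("Revenue vs. Ad Spend")

fig.tight_layout()
fig.savefig("revenue_overview.png")  # hypothetical file name
```

Working with explicit `Axes` objects instead of the global `plt` state makes it easier to control each panel once a figure has more than one plot.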
<code>import matplotlib.pyplot as plt
# Draw a bar chart comparing sales across products
products = ['A', 'B', 'C']
sales = [120, 150, 90]
plt.bar(products, sales, color=['#1f77b4', '#ff7f0e', '#2ca02c'])
plt.title('Product Sales Comparison')
plt.show()
</code>4. Seaborn: The Aesthetic Companion for Statistical Visualization
Built on Matplotlib, Seaborn produces more attractive statistical charts.
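As one example of a statistical chart, the sketch below draws a box plot per category from a tidy DataFrame. The data is synthetic and generated only for this illustration; the column names and the output file name are assumptions.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic data for this sketch: two segments with different price levels.
rng = np.random.default_rng(seed=0)
data = pd.DataFrame({
    "segment": ["retail"] * 50 + ["wholesale"] * 50,
    "price": np.concatenate([rng.normal(20, 3, 50), rng.normal(14, 2, 50)]),
})

# A box plot summarizes the median, quartiles, and outliers of each category in one call.
ax = sns.boxplot(data=data, x="segment", y="price")
ax.set_title("Price Distribution by Segment")

plt.savefig("price_by_segment.png")  # hypothetical file name
```

Seaborn functions return ordinary Matplotlib `Axes` objects, so titles, labels, and saving work exactly as they do in plain Matplotlib.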
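<code>import seaborn as sns
import matplotlib.pyplot as plt
# Draw a heatmap to analyze correlations between the numeric columns
corr_matrix = df.corr(numeric_only=True)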
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()
</code>5. Plotly: The Interactive Visualization Expert
Plotly supports interactive charts, suitable for dynamic reports.
<code>import plotly.express as px
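# Generate an interactive choropleth map
# Assumes 'state' holds two-letter US state abbreviations; adjust locationmode and scope for other regions
fig = px.choropleth(df, locations='state', color='sales',
                    locationmode='USA-states', scope='usa',
                    hover_data=['city', 'revenue'],
                    color_continuous_scale='Viridis')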
fig.show()
</code>6. Scikit‑learn: The Swiss Army Knife for Machine‑Learning Pre‑processing
Scikit‑learn offers tools for data preprocessing and model training.
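To show preprocessing and model training working together, here is a minimal sketch that chains a scaler and a linear regression in one pipeline. The data is synthetic, and the linear relationship between price, advertising, and sales is an assumption made purely for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for this sketch: sales depend linearly on price and advertising, plus noise.
rng = np.random.default_rng(seed=42)
price = rng.uniform(5, 50, size=200)
advertising = rng.uniform(0, 1000, size=200)
X = np.column_stack([price, advertising])
y = 500 - 4.0 * price + 0.3 * advertising + rng.normal(0, 5, size=200)

# Hold out a quarter of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fitting the scaler inside the pipeline means it only ever sees the training data.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# Coefficient of determination (R^2) on the held-out rows.
r2 = model.score(X_test, y_test)
print(f"R^2 on test data: {r2:.3f}")
```

Keeping the scaler in the pipeline avoids leaking test-set statistics into training, which can happen when `fit_transform` is applied to the full dataset before splitting.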
<code>from sklearn.preprocessing import StandardScaler
# Standardize features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[['price', 'advertising']])
</code>7. Dask: The Parallel Pioneer for Distributed Computing
Dask handles massive datasets and supports distributed computation.
<code>import dask.dataframe as dd
# Read a CSV file in partitions; compute() below triggers the actual work
ddf = dd.read_csv('large_sales.csv')
average = ddf.groupby('category')['sales'].mean().compute()
</code>8. PySpark: The Distributed Engine for Big‑Data Analytics
PySpark is suited for processing huge volumes of data with distributed computing.
<code>from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()
df_spark = spark.read.csv('sales_data.csv', header=True, inferSchema=True)
# Distributed computation: show the top 5 rows by sales
df_spark.orderBy(df_spark['sales'].desc()).show(5)
</code>9. Bokeh: The Lightweight Choice for Interactive Visualization
Bokeh creates interactive charts that integrate easily into web applications.
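One way to embed a Bokeh chart in a web page is `bokeh.embed.components`, which returns HTML fragments for a template. The sketch below is a minimal illustration; the data and column names are invented.

```python
from bokeh.embed import components
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Hypothetical product data, invented for this sketch.
source = ColumnDataSource(data={
    "product": ["A", "B", "C", "D", "E"],
    "price": [9.5, 12.0, 15.5, 18.0, 22.5],
    "sales": [410, 350, 290, 240, 180],
})

p = figure(title="Sales vs. Price", x_axis_label="Price", y_axis_label="Sales")
p.scatter("price", "sales", source=source, size=10, alpha=0.6)

# Tooltips shown on hover; "@name" refers to a column of the data source.
p.add_tools(HoverTool(tooltips=[
    ("Product", "@product"),
    ("Price", "@price"),
    ("Sales", "@sales"),
]))

# components() returns a <script> block and a <div> placeholder for an HTML template.
script, div = components(p)
```

The page that receives `script` and `div` also needs to load the BokehJS resources; the Bokeh documentation describes the available options.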
<code>from bokeh.plotting import figure, show
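# Create an interactive scatter plot
p = figure(title="Sales vs. Price", x_axis_label='Price', y_axis_label='Sales')
# scatter() replaces circle(size=...), which recent Bokeh releases deprecate
p.scatter(df['price'], df['sales'], size=10, color='blue', alpha=0.5)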
show(p)
</code>10. Prophet: The Time‑Series Forecasting Power‑Tool
Prophet fits an additive model with trend and seasonality components, which makes it a convenient choice for forecasting time series that show strong seasonal patterns.
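Prophet expects its input as a DataFrame with a date column named `ds` and a value column named `y`. The sketch below shows one way to get there from row-level records using pandas alone. The transaction log, its column names, and the choice to count days without orders as zero are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical transaction log; columns and values are invented for this sketch.
orders = pd.DataFrame({
    "order_date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-04"],
    "amount": [120.0, 80.0, 150.0, 90.0],
})
orders["order_date"] = pd.to_datetime(orders["order_date"])

# One row per calendar day: sum the amounts for each day.
# Days without orders appear with a total of 0 (an assumption; they could also be left out).
daily = orders.set_index("order_date")["amount"].resample("D").sum()

# Rename to the column names Prophet expects: 'ds' for the date, 'y' for the value.
df_prophet = daily.rename("y").rename_axis("ds").reset_index()
print(df_prophet)
```

Whether zero-filling is appropriate depends on the data: it treats a day without records as a day without sales, which is not the same as a day with missing data.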
<code>from prophet import Prophet
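# Build a forecasting model; Prophet expects a date column named 'ds' and a value column named 'y'
df_prophet = df[['register_date', 'sales']].rename(columns={'register_date':'ds', 'sales':'y'})
model = Prophet()
model.fit(df_prophet)
future = model.make_future_dataframe(periods=365) # Extend the timeline by 365 days
forecast = model.predict(future)
model.plot(forecast)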
</code>What challenges have you encountered that these libraries cannot solve? Feel free to leave a comment and discuss!
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.