Step-by-Step Data Analysis and Machine Learning Workflow with Pandas, Matplotlib, and Scikit-learn
This guide walks through loading CSV data with pandas, cleaning missing values, filtering, grouping, visualizing, performing correlation and time‑series analysis, detecting outliers, and applying linear and logistic regression models using scikit‑learn, all illustrated with complete Python code snippets.
1. Data Loading and Preview
Scenario: Load data from a CSV file and view the first few rows.
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# View first 5 rows
print(df.head())
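Beyond head(), a quick structural check often helps before any cleaning. The sketch below uses a small inline CSV as a hypothetical stand-in for data.csv (the values and the name sample_df are made up for illustration) and reports dtypes, missing values per column, and the table shape.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for data.csv (values are made up)
csv_text = """Age,Gender,Income
25,F,52000
,M,48000
41,F,
33,M,61000
"""
sample_df = pd.read_csv(io.StringIO(csv_text))

# Column names, dtypes, and non-null counts
sample_df.info()

# Number of missing values in each column
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# (rows, columns)
print(sample_df.shape)
```

Running the same three calls on the real df shows which columns need the cleaning steps that follow.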
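2. Data Cleaning: Missing Value Handling
Scenario: Fill missing values in the 'Age' column with the column mean.
mean_age = df['Age'].mean()
# Assign the result back: fillna(inplace=True) on a selected column is deprecated in pandas 2.x and may leave df unchanged
df['Age'] = df['Age'].fillna(mean_age)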
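The mean is sensitive to extreme values, so the median is a common alternative fill value. This is a minimal sketch on a made-up Series (not the guide's data) that compares the two; which one suits a real dataset is a judgment call.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries and one extreme value (95)
ages = pd.Series([22, 25, np.nan, 31, 95, np.nan], name='Age')

# Fill with the mean of the observed values
mean_filled = ages.fillna(ages.mean())

# Fill with the median, which is less affected by the extreme value
median_filled = ages.fillna(ages.median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

Here the single large value pulls the mean well above the median, so the two strategies fill the gaps with noticeably different ages.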
3. Data Filtering
Scenario: Select rows where age is greater than 30.
filtered_df = df[df['Age'] > 30]
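Filters can also combine several conditions. The sketch below uses a made-up frame named people (hypothetical, not the guide's df) to show the element-wise & operator, which needs parentheses around each condition, next to the equivalent query() call.

```python
import pandas as pd

# Made-up data for illustration
people = pd.DataFrame({
    'Age': [25, 35, 45, 52],
    'Gender': ['F', 'M', 'F', 'M'],
})

# Combine boolean masks with & (and) or | (or); wrap each condition in parentheses
over_30_women = people[(people['Age'] > 30) & (people['Gender'] == 'F')]

# The same filter written as a query string
same_with_query = people.query("Age > 30 and Gender == 'F'")

print(over_30_women)
```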
4. Grouping and Aggregation
Scenario: Compute the average age for each gender.
grouped = df.groupby('Gender')['Age'].mean()
print(grouped)
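A group mean says little without the group size next to it. As a sketch on made-up data (the frame people is hypothetical), agg() can return several statistics per group in one call.

```python
import pandas as pd

# Made-up data for illustration
people = pd.DataFrame({
    'Gender': ['F', 'M', 'F', 'M', 'F'],
    'Age': [20, 30, 40, 50, 60],
})

# One row per gender, one column per statistic
summary = people.groupby('Gender')['Age'].agg(['mean', 'count', 'min', 'max'])
print(summary)
```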
5. Data Visualization: Bar Chart
Scenario: Plot a bar chart of user counts by gender.
import matplotlib.pyplot as plt
gender_counts = df['Gender'].value_counts()
gender_counts.plot(kind='bar')
plt.title('User Count by Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
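6. Correlation Analysis
Scenario: Compute Pearson correlation coefficients between numeric variables.
# numeric_only=True skips text columns such as 'Gender'; pandas 2.0+ raises an error on them otherwise
correlation_matrix = df.corr(numeric_only=True)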
print(correlation_matrix)
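To make the coefficient less of a black box, the sketch below computes Pearson's r by hand (the sum of products of centered values divided by the square root of the product of their sums of squares) on made-up numbers and compares it with pandas' own result. The frame data and its columns x and y are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up, nearly linear data for illustration
data = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.1, 3.9, 6.2, 8.1, 9.8],
})

# Center each variable on its mean
x_centered = data['x'] - data['x'].mean()
y_centered = data['y'] - data['y'].mean()

# Pearson r from its definition
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# pandas computes the same quantity (Pearson is the default method)
r_pandas = data['x'].corr(data['y'])

print(r_manual, r_pandas)
```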
7. Time Series Analysis
Scenario: Plot monthly sales from a date column.
df['Date'] = pd.to_datetime(df['Date'])  # assume a date column exists
df.set_index('Date', inplace=True)
# 'ME' (month end) is the resample alias in pandas 2.2+; older versions use 'M'
monthly_sales = df['Sales'].resample('ME').sum()
monthly_sales.plot()
plt.title('Monthly Sales')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()
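Monthly totals can also be computed by grouping on a monthly period, which does not depend on the resample frequency alias. This is a sketch on made-up sales figures (the frame sales is hypothetical), not a replacement for the approach above.

```python
import pandas as pd

# Made-up daily sales records for illustration
sales = pd.DataFrame({
    'Date': pd.to_datetime([
        '2024-01-05', '2024-01-20', '2024-02-03', '2024-02-28', '2024-03-15',
    ]),
    'Sales': [100, 150, 200, 50, 300],
}).set_index('Date')

# Convert each timestamp to its calendar month, then sum within each month
monthly_totals = sales['Sales'].groupby(sales.index.to_period('M')).sum()
print(monthly_totals)
```

Unlike resample, this grouping only returns months that actually appear in the data, so gaps are not filled with zeros.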
8. Outlier Detection
Scenario: Identify outliers in the 'Age' column using the IQR method.
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['Age'] < lower_bound) | (df['Age'] > upper_bound)]
print(outliers)
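A small worked example makes the bounds concrete. The sketch below applies the same IQR rule to ten made-up ages, then keeps only the rows inside the bounds; whether to drop, cap, or keep flagged values depends on the analysis.

```python
import pandas as pd

# Made-up ages with one suspicious value (95)
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 40, 95], name='Age')

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR outside the quartiles are flagged
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = ages[(ages < lower_bound) | (ages > upper_bound)]
inliers = ages[ages.between(lower_bound, upper_bound)]

print(lower_bound, upper_bound)
print(outliers.tolist())
```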
9. Simple Linear Regression
Scenario: Analyze the relationship between advertising spend and sales.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X = df[['Advertising_Spend']]
y = df['Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
predictions = model.predict(X_test)
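Predictions are only useful once they are compared with the held-out targets. The sketch below repeats the same steps on synthetic data (generated with a known slope of 3 and intercept of 20, both made up for illustration) and reports R² and RMSE on the test split along with the fitted coefficients. The names demo, demo_model, and demo_predictions are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: sales = 3 * spend + 20 + noise (made up for illustration)
rng = np.random.default_rng(0)
spend = rng.uniform(10, 100, size=200)
sales = 3.0 * spend + 20.0 + rng.normal(0, 5, size=200)
demo = pd.DataFrame({'Advertising_Spend': spend, 'Sales': sales})

X = demo[['Advertising_Spend']]
y = demo['Sales']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

demo_model = LinearRegression()
demo_model.fit(X_train, y_train)
demo_predictions = demo_model.predict(X_test)

# R^2: share of variance explained; RMSE: typical error in sales units
r2 = r2_score(y_test, demo_predictions)
rmse = np.sqrt(mean_squared_error(y_test, demo_predictions))

print('slope:', demo_model.coef_[0], 'intercept:', demo_model.intercept_)
print('R^2:', r2, 'RMSE:', rmse)
```

The slope reads as the estimated change in sales per unit of advertising spend, which is the relationship the scenario asks about.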
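10. Classification Task: Logistic Regression
Scenario: Predict whether a user will purchase a product based on features.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# Assume df has feature columns 'Age', 'Income', 'Gender' and a binary target 'Will_Purchase'
# 'Gender' is categorical text, so one-hot encode it before scaling
X = pd.get_dummies(df[['Age', 'Income', 'Gender']], columns=['Gender'], drop_first=True, dtype=float)
y = df['Will_Purchase']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the scaler on the training split only so test data does not leak into the scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)
# Predict class probabilities; columns follow the order of logreg.classes_
purchase_probabilities = logreg.predict_proba(X_test_scaled)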
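One way to keep encoding, scaling, and the classifier together is a scikit-learn Pipeline with a ColumnTransformer, which also makes evaluation on the test split straightforward. The sketch below runs on synthetic data in which purchases depend mostly on income (a made-up rule for illustration); the names demo and clf are hypothetical, and this is one possible arrangement rather than the only one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic users (made up for illustration)
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    'Age': rng.integers(18, 70, size=n),
    'Income': rng.normal(50000, 15000, size=n),
    'Gender': rng.choice(['F', 'M'], size=n),
})
# Made-up rule: users with higher income are more likely to purchase
noisy_income = demo['Income'] + rng.normal(0, 5000, size=n)
demo['Will_Purchase'] = (noisy_income > 50000).astype(int)

X = demo[['Age', 'Income', 'Gender']]
y = demo['Will_Purchase']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale numeric columns and one-hot encode the categorical one
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['Age', 'Income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Gender']),
])
clf = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression()),
])

# Preprocessing is fitted on the training split only, inside fit()
clf.fit(X_train, y_train)

probabilities = clf.predict_proba(X_test)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print('accuracy:', accuracy)
```

Because the preprocessing lives inside the pipeline, calling clf.predict on new rows applies the same encoding and scaling automatically.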
Test Development Learning Exchange