
Step-by-Step Data Analysis and Machine Learning Workflow with Pandas, Matplotlib, and Scikit-learn

This guide walks through loading CSV data with pandas, cleaning missing values, filtering, grouping, visualizing, performing correlation and time‑series analysis, detecting outliers, and applying linear and logistic regression models using scikit‑learn, all illustrated with complete Python code snippets.

Test Development Learning Exchange

1. Data Loading and Preview

Scenario: Load data from a CSV file and view the first few rows.

import pandas as pd

# Load data

df = pd.read_csv('data.csv')

# View first 5 rows

print(df.head())
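
For readers without a data.csv on hand, the loading step can be tried on an in-memory CSV. This is a minimal sketch: the column names follow the ones used later in this guide, but the values and the name sample_df are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for data.csv (values are made up).
csv_text = """Age,Gender,Income,Will_Purchase
25,F,42000,0
,M,58000,1
41,F,73000,1
36,M,,0
"""

# read_csv accepts any file-like object, not only a path.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.head())        # first rows
print(sample_df.dtypes)        # inferred column types
print(sample_df.isna().sum())  # missing values per column
```

Checking dtypes and missing-value counts right after loading shows which columns need cleaning before any analysis.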

2. Data Cleaning: Missing Value Handling

Scenario: Fill missing values in the 'Age' column with the column mean.

mean_age = df['Age'].mean()

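# Assign the result back; calling fillna(inplace=True) on a selected column may not update df in recent pandas versions

df['Age'] = df['Age'].fillna(mean_age)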
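
To confirm that a mean fill did what was intended, it helps to count missing values before and after. The sketch below uses a small hypothetical frame named ages and assigns the filled column back to the frame.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.DataFrame({"Age": [20.0, np.nan, 40.0, np.nan, 30.0]})

missing_before = int(ages["Age"].isna().sum())
mean_age = ages["Age"].mean()  # NaN values are skipped by default

# Assign the filled column back to the frame.
ages["Age"] = ages["Age"].fillna(mean_age)
missing_after = int(ages["Age"].isna().sum())

print(missing_before, missing_after, mean_age)
```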

3. Data Filtering

Scenario: Select rows where 'Age' is greater than 30.

filtered_df = df[df['Age'] > 30]
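
Filters often combine several conditions. The sketch below, on a hypothetical frame named people, shows the boolean-mask form and the equivalent DataFrame.query form.

```python
import pandas as pd

# Hypothetical data for illustration.
people = pd.DataFrame({
    "Age": [25, 34, 41, 52],
    "Gender": ["F", "M", "F", "M"],
})

# Each condition needs its own parentheses; use & for AND and | for OR.
over_30_women = people[(people["Age"] > 30) & (people["Gender"] == "F")]

# The same filter written as a query string.
same_rows = people.query("Age > 30 and Gender == 'F'")

print(over_30_women)
```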

4. Grouping and Aggregation

Scenario: Compute the average age for each gender.

grouped = df.groupby('Gender')['Age'].mean()

print(grouped)
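
A single groupby can return several statistics at once through agg. This is a small sketch on made-up data; the frame name people is hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration.
people = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F"],
    "Age": [20, 30, 40, 50, 60],
})

# One row per gender, one column per statistic.
summary = people.groupby("Gender")["Age"].agg(["mean", "min", "max", "count"])

print(summary)
```

Including the count next to the mean makes it easier to judge how much weight each group average deserves.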

5. Data Visualization: Bar Chart

Scenario: Plot a bar chart of user counts by gender.

import matplotlib.pyplot as plt

gender_counts = df['Gender'].value_counts()

gender_counts.plot(kind='bar')

plt.title('User Count by Gender')

plt.xlabel('Gender')

plt.ylabel('Count')

plt.show()

6. Correlation Analysis

Scenario: Compute Pearson correlation coefficients between numeric variables.

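# numeric_only=True skips text columns such as 'Gender', which would otherwise raise an error in pandas 2.x

correlation_matrix = df.corr(numeric_only=True)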

print(correlation_matrix)
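
To read a single coefficient out of the matrix, index it by the two column names. The sketch below uses made-up data in which Sales is an exact linear function of Advertising_Spend, so the Pearson coefficient should be 1 up to floating-point error.

```python
import pandas as pd

# Hypothetical data for illustration.
data = pd.DataFrame({
    "Advertising_Spend": [10, 20, 30, 40, 50],
    "Sales": [25, 45, 65, 85, 105],       # exactly 2 * spend + 5
    "Gender": ["F", "M", "F", "M", "F"],  # non-numeric, excluded below
})

# Pearson is the default method; numeric_only=True drops the text column.
corr = data.corr(numeric_only=True)

# Look up one pair by row and column label.
spend_vs_sales = corr.loc["Advertising_Spend", "Sales"]

print(corr)
```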

7. Time Series Analysis

Scenario: Plot monthly sales from a date column.

df['Date'] = pd.to_datetime(df['Date']) # assume a date column exists

df.set_index('Date', inplace=True)

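# 'MS' groups by calendar month and labels each total with the month start; the older 'M' alias is deprecated in pandas 2.2+

monthly_sales = df['Sales'].resample('MS').sum()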

monthly_sales.plot()

plt.title('Monthly Sales')

plt.xlabel('Month')

plt.ylabel('Sales')

plt.show()
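
The resampling step can be checked without a plot. The sketch below builds hypothetical daily data with one sale per day, so each monthly total should equal the number of days in that month. It uses the 'MS' (month start) alias, which labels each bin with the first day of the month.

```python
import pandas as pd

# Hypothetical daily sales: one unit per day for January and February 2023.
dates = pd.date_range("2023-01-01", "2023-02-28", freq="D")
daily = pd.DataFrame({"Date": dates, "Sales": 1})

# resample needs a datetime index.
daily = daily.set_index("Date")
monthly = daily["Sales"].resample("MS").sum()

print(monthly)
```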

8. Outlier Detection

Scenario: Identify outliers in the 'Age' column using the IQR method.

Q1 = df['Age'].quantile(0.25)

Q3 = df['Age'].quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR

upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['Age'] < lower_bound) | (df['Age'] > upper_bound)]

print(outliers)
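
A worked example makes the IQR bounds concrete. The ages below are made up, with one obviously extreme value; the Series name ages is hypothetical.

```python
import pandas as pd

# Hypothetical ages with one extreme value.
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 120])

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR outside the quartiles count as outliers.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = ages[(ages < lower_bound) | (ages > upper_bound)]

print(lower_bound, upper_bound)
print(outliers)
```

Whether a flagged value should be dropped, capped, or kept depends on the data: an age of 120 is probably an entry error, but a large sales figure may be genuine.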

9. Simple Linear Regression

Scenario: Analyze the relationship between advertising spend and sales.

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

X = df[['Advertising_Spend']]

y = df['Sales']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()

model.fit(X_train, y_train)

# Predict

predictions = model.predict(X_test)
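
Predictions are only useful once they are scored against the held-out data. The sketch below is one way to do that, using synthetic data generated from a known line (slope 3, intercept 20, plus noise) so that the fitted coefficients can be sanity-checked. The data and the name data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: Sales = 3 * spend + 20 + noise (made up for illustration).
rng = np.random.default_rng(0)
spend = rng.uniform(0, 100, size=200)
sales = 3.0 * spend + 20.0 + rng.normal(0, 5, size=200)
data = pd.DataFrame({"Advertising_Spend": spend, "Sales": sales})

X = data[["Advertising_Spend"]]
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# Fitted line and held-out error metrics.
slope = model.coef_[0]
intercept = model.intercept_
r2 = r2_score(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))

print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r2:.3f} RMSE={rmse:.2f}")
```

The slope estimates the change in sales per unit of advertising spend, and the RMSE is expressed in the same units as Sales.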

10. Classification Task: Logistic Regression

Scenario: Predict whether a user will purchase a product based on features.

from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import StandardScaler

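# Assume df has 'Age', 'Income', 'Gender', and a binary 'Will_Purchase' column

# One-hot encode 'Gender' so that every feature is numeric

X = pd.get_dummies(df[['Age', 'Income', 'Gender']], columns=['Gender'], drop_first=True, dtype=float)

y = df['Will_Purchase']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

logreg = LogisticRegression()

logreg.fit(X_train, y_train)

# Predict probabilities (one column per class, in the order of logreg.classes_)

purchase_probabilities = logreg.predict_proba(X_test)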
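
An alternative way to organize this step is a scikit-learn Pipeline, which keeps scaling and encoding tied to the model so they are fitted on the training split only. The sketch below is one possible arrangement, not the only one. The data is synthetic, and the purchase rule and the names users, preprocess, and clf are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic users (made up for illustration).
rng = np.random.default_rng(0)
n = 400
users = pd.DataFrame({
    "Age": rng.integers(18, 70, size=n),
    "Income": rng.normal(50_000, 15_000, size=n),
    "Gender": rng.choice(["F", "M"], size=n),
})
# Made-up rule: higher income makes a purchase more likely.
noisy_income = users["Income"] + rng.normal(0, 5_000, size=n)
users["Will_Purchase"] = (noisy_income > 50_000).astype(int)

X = users[["Age", "Income", "Gender"]]
y = users["Will_Purchase"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale numeric columns and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["Age", "Income"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
])
clf = Pipeline([("preprocess", preprocess), ("logreg", LogisticRegression())])
clf.fit(X_train, y_train)

# Probability of the positive class (Will_Purchase == 1) and test accuracy.
purchase_probabilities = clf.predict_proba(X_test)[:, 1]
accuracy = clf.score(X_test, y_test)

print(f"accuracy={accuracy:.2f}")
```

Because the preprocessing lives inside the pipeline, calling clf.predict_proba on new rows applies the same scaling and encoding that were learned from the training data.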
