Step-by-Step Data Analysis and Machine Learning Workflow with Pandas, Matplotlib, and Scikit-learn

This guide walks through loading CSV data with pandas, cleaning missing values, filtering, grouping, visualizing, performing correlation and time‑series analysis, detecting outliers, and applying linear and logistic regression models using scikit‑learn, all illustrated with complete Python code snippets.

Test Development Learning Exchange

May 21, 2024

1. Data Loading and Preview

Scenario: Load data from a CSV file and view the first few rows.

import pandas as pd

# Load data

df = pd.read_csv('data.csv')

# View first 5 rows

print(df.head())
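The article does not show the contents of data.csv. The sketch below builds a small stand-in in memory so the preview can be tried without a file; the column names are taken from the later steps, and every value is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data; the real data.csv is not shown in the article.
csv_text = """Date,Age,Gender,Income,Advertising_Spend,Sales,Will_Purchase
2024-01-05,25,Female,42000,100,1100,0
2024-01-20,,Male,58000,150,1600,1
2024-02-03,41,Female,61000,200,2050,1
2024-02-18,35,Male,39000,120,1250,0
2024-03-09,52,Female,75000,300,3100,1
"""

# read_csv accepts any file-like object, so StringIO stands in for a file path
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())        # first 5 rows
print(df.shape)         # (rows, columns)
print(df.dtypes)        # column types as inferred by pandas
print(df.isna().sum())  # number of missing values per column
```

Checking the shape, dtypes, and missing-value counts right after loading tends to surface problems (a date read as text, an unexpectedly empty column) before they affect later steps.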

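2. Data Cleaning: Missing Value Handling

Scenario: Fill missing values in the 'Age' column with the column mean.

mean_age = df['Age'].mean()

# Assign the result back: fillna(..., inplace=True) on a selected column warns in pandas 2.2+ and can fail to update df under Copy-on-Write

df['Age'] = df['Age'].fillna(mean_age)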
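To confirm that the fill worked, it helps to count missing values before and after. This is a minimal sketch on a made-up 'Age' column; it assigns the filled column back to the DataFrame rather than relying on inplace=True.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing ages
df = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan, 48.0]})

missing_before = df["Age"].isna().sum()

# mean() skips NaN by default, so only the three known ages are averaged
df["Age"] = df["Age"].fillna(df["Age"].mean())

missing_after = df["Age"].isna().sum()

print("missing before:", missing_before)
print("missing after:", missing_after)
print(df["Age"].tolist())
```

If the column is strongly skewed, the median (df['Age'].median()) is a common alternative to the mean, since it is less affected by extreme values.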

3. Data Filtering

Scenario: Select rows where age is greater than 30.

filtered_df = df[df['Age'] > 30]

4. Grouping and Aggregation

Scenario: Compute the average age for each gender.

grouped = df.groupby('Gender')['Age'].mean()

print(grouped)
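The same groupby can return several statistics at once. The sketch below uses pandas named aggregation on made-up data; the output column names (mean_age, count, max_age) are illustrative choices, not part of the original article.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
    "Age": [25, 30, 41, 35, 52],
})

# Named aggregation: each keyword becomes a column in the result
summary = df.groupby("Gender")["Age"].agg(
    mean_age="mean",
    count="count",
    max_age="max",
)

print(summary)
```

Including the group size next to the mean makes it easier to judge how much weight each average deserves.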

5. Data Visualization: Bar Chart

Scenario: Plot a bar chart of user counts by gender.

import matplotlib.pyplot as plt

gender_counts = df['Gender'].value_counts()

gender_counts.plot(kind='bar')

plt.title('User Count by Gender')

plt.xlabel('Gender')

plt.ylabel('Count')

plt.show()

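6. Correlation Analysis

Scenario: Compute Pearson correlation coefficients between numeric variables.

# numeric_only=True skips text columns such as 'Gender'; without it, pandas 2.0+ raises an error on them

correlation_matrix = df.corr(numeric_only=True)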

print(correlation_matrix)
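When only one pair of variables matters, Series.corr gives the coefficient directly. The sketch below uses made-up, perfectly linear data, and assumes pandas 1.5 or later for the numeric_only argument.

```python
import pandas as pd

# Hypothetical data: Sales is exactly 10 x Advertising_Spend
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Advertising_Spend": [100, 200, 300, 400],
    "Sales": [1000, 2000, 3000, 4000],
})

# Full matrix over numeric columns only; the text column 'Gender' is skipped
corr = df.corr(numeric_only=True)
print(corr)

# Pearson coefficient for a single pair of columns
r = df["Advertising_Spend"].corr(df["Sales"])
print(r)
```

A coefficient near 1 or -1 indicates a strong linear relationship; it says nothing about causation or about non-linear patterns.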

7. Time Series Analysis

Scenario: Plot monthly sales from a date column.

df['Date'] = pd.to_datetime(df['Date'])  # assumes a 'Date' column exists

df.set_index('Date', inplace=True)

# 'ME' (month end) replaces the deprecated 'M' alias in pandas 2.2+; use 'M' on older versions

monthly_sales = df['Sales'].resample('ME').sum()

monthly_sales.plot()

plt.title('Monthly Sales')

plt.xlabel('Month')

plt.ylabel('Sales')

plt.show()
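Monthly totals can also be computed by grouping on calendar-month periods, which avoids depending on a particular resample alias. The sketch below does that on made-up data and adds a month-over-month change as one possible follow-up.

```python
import pandas as pd

# Hypothetical sales records
df = pd.DataFrame({
    "Date": ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-18", "2024-03-09"],
    "Sales": [100, 150, 200, 50, 300],
})
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date")

# Convert each timestamp to its calendar month and sum sales within it
monthly_sales = df["Sales"].groupby(df.index.to_period("M")).sum()
print(monthly_sales)

# Relative change from the previous month (the first month has no predecessor, so it is NaN)
growth = monthly_sales.pct_change()
print(growth)
```

Note that, unlike resample, this grouping only returns months that actually contain data; months with no rows do not appear as zeros.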

8. Outlier Detection

Scenario: Identify outliers in the 'Age' column using the IQR method.

Q1 = df['Age'].quantile(0.25)

Q3 = df['Age'].quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR

upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['Age'] < lower_bound) | (df['Age'] > upper_bound)]

print(outliers)
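Once the bounds are known, the usual next question is what to do with the flagged rows. The sketch below shows two common options, dropping and capping, on made-up ages; which one fits depends on whether the extreme values are errors or genuine observations.

```python
import pandas as pd

# Hypothetical ages with one implausible value
df = pd.DataFrame({"Age": [22, 25, 27, 30, 31, 33, 35, 120]})

q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Option 1: keep only rows inside the bounds (between is inclusive on both ends)
inliers = df[df["Age"].between(lower_bound, upper_bound)]

# Option 2: keep every row but cap values at the bounds
capped = df["Age"].clip(lower=lower_bound, upper=upper_bound)

print(lower_bound, upper_bound)
print(len(df), "rows before,", len(inliers), "rows after dropping")
print(capped.tolist())
```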

9. Simple Linear Regression

Scenario: Analyze the relationship between advertising spend and sales.

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

X = df[['Advertising_Spend']]

y = df['Sales']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()

model.fit(X_train, y_train)

# Predict

predictions = model.predict(X_test)
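Predictions alone do not show how well the line fits. The sketch below repeats the same steps on synthetic data (generated here purely for illustration) and then inspects the fitted coefficients and two standard metrics from sklearn.metrics.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: sales is roughly 10 x spend + 500, plus random noise
rng = np.random.default_rng(42)
spend = rng.uniform(50, 500, size=100)
sales = 10 * spend + 500 + rng.normal(0, 50, size=100)
df = pd.DataFrame({"Advertising_Spend": spend, "Sales": sales})

X = df[["Advertising_Spend"]]
y = df["Sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Slope: estimated change in sales per unit of advertising spend
print("slope:", model.coef_[0])
print("intercept:", model.intercept_)

# Evaluate on the held-out test split
r2 = r2_score(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print("R^2:", r2)
print("RMSE:", rmse)
```

RMSE is in the same units as Sales, which often makes it easier to interpret than R^2 on its own.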

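10. Classification Task: Logistic Regression

Scenario: Predict whether a user will purchase a product based on features.

from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import StandardScaler

# Assume df has 'Age', 'Income', 'Gender', and a binary 'Will_Purchase' column

# 'Gender' is categorical, so one-hot encode it; StandardScaler cannot handle text values

X = pd.get_dummies(df[['Age', 'Income', 'Gender']], columns=['Gender'], drop_first=True, dtype=float)

y = df['Will_Purchase']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)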

logreg = LogisticRegression()

logreg.fit(X_train, y_train)

# Predict probabilities

purchase_probabilities = logreg.predict_proba(X_test)
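One way to keep preprocessing and the model together is a scikit-learn Pipeline with a ColumnTransformer, so the scaler and encoder are fitted on the training split only. This is a sketch under assumptions: the data is synthetic, and the rule that generates Will_Purchase is invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Age": rng.integers(18, 65, size=n),
    "Income": rng.normal(50000, 15000, size=n),
    "Gender": rng.choice(["Female", "Male"], size=n),
})
# Invented rule: higher income makes a purchase more likely
df["Will_Purchase"] = (
    df["Income"] + rng.normal(0, 5000, size=n) > 50000
).astype(int)

X = df[["Age", "Income", "Gender"]]
y = df["Will_Purchase"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale numeric columns and one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("logreg", LogisticRegression()),
])
clf.fit(X_train, y_train)

# Column 1 of predict_proba is the probability of class 1 (purchase)
purchase_probabilities = clf.predict_proba(X_test)[:, 1]
accuracy = clf.score(X_test, y_test)

print(purchase_probabilities[:5])
print("accuracy:", accuracy)
```

Because the pipeline is a single estimator, the same object can be passed to tools such as cross_val_score without repeating the preprocessing by hand.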
