
Illustrated Guide to the Complete Machine Learning Workflow

This article presents a hand‑drawn, illustrated walkthrough of the entire machine‑learning pipeline—from dataset definition, exploratory data analysis, preprocessing, and data splitting to model building, algorithm selection, hyper‑parameter tuning, feature selection, and evaluation for both classification and regression tasks.

DataFunTalk

1. Dataset

A dataset is the starting point of any machine‑learning project and can be viewed as an M×N matrix where M represents features (columns) and N represents samples (rows). Features are split into X (independent variables) and Y (target labels). Supervised datasets contain both X and Y, while unsupervised datasets contain only X. If Y is quantitative, the task is regression; if Y is categorical, the task is classification.
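The X/Y split above can be sketched with pandas; the column names and toy values here are invented for illustration:

```python
import pandas as pd

# A tiny supervised dataset: each row is a sample, each column a feature.
df = pd.DataFrame({
    "bill_length": [39.1, 46.5, 49.9],
    "body_mass":   [3750, 4500, 5400],
    "species":     ["Adelie", "Chinstrap", "Gentoo"],  # categorical Y -> classification
})

X = df[["bill_length", "body_mass"]]  # independent variables
Y = df["species"]                     # target labels

print(X.shape)  # (3, 2): 3 samples, 2 features
```

If `species` were replaced by a quantitative column such as body mass, the same X/Y split would define a regression task instead.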

2. Exploratory Data Analysis (EDA)

EDA provides an initial understanding of the data. Common EDA techniques include descriptive statistics (mean, median, mode, standard deviation), data visualization (heatmaps, box plots, scatter plots, PCA), and data reshaping (pivot, group, filter).
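Two of these techniques, descriptive statistics and data reshaping, can be sketched in a few lines of pandas (the data below is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 2.0, 3.0, 5.0],
})

# Descriptive statistics: count, mean, std, quartiles per column.
stats = df["value"].describe()
print(stats["mean"])  # 2.75

# Data reshaping: group rows and aggregate within each group.
by_group = df.groupby("group")["value"].mean()
print(by_group["b"])  # 4.0
```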

3. Data Preprocessing

Data preprocessing (cleaning, normalizing, standardizing, transforming) corrects missing values and errors and puts features on comparable scales. It can consume up to 80% of a data‑science project’s time, leaving model building and analysis the remaining 20%.
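A minimal preprocessing sketch with scikit-learn, assuming mean imputation and standardization are appropriate for the data at hand:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, np.nan],   # a missing value to be corrected
              [3.0, 400.0]])

# Fill missing entries with the column mean, then standardize each
# feature to zero mean and unit variance so features are comparable.
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_filled)

print(X_scaled.mean(axis=0))  # approximately [0, 0]
```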

4. Data Splitting

4.1 Train‑Test Split

The dataset is divided into a larger training set (e.g., 80%) and a smaller test set (e.g., 20%) to evaluate model performance on unseen data.
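The 80/20 split can be done with scikit-learn's `train_test_split`; the arrays here are placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 80% train / 20% test; random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 8 2
```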

4.2 Train‑Validation‑Test Split

A three‑part split adds a validation set for hyper‑parameter tuning and model selection, while the test set remains untouched until final evaluation.
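One common way to get the three parts is two consecutive `train_test_split` calls; the 60/20/20 proportions below are an illustrative choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# First carve off the untouched test set (20%), then split the rest
# into training (60% overall) and validation (20% overall) sets.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # 12 4 4
```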

4.3 Cross‑Validation

k‑fold cross‑validation (commonly 5‑ or 10‑fold) holds out each fold in turn for evaluation while training on the remaining k − 1 folds, producing k models whose performance metrics are averaged.

5. Model Building

5.1 Learning Algorithms

Algorithms fall into three categories: supervised learning (mapping X to Y), unsupervised learning (discovering structure in X alone), and reinforcement learning (optimizing actions via trial‑and‑error).

5.2 Hyper‑parameter Tuning

Hyper‑parameters such as mtry and ntree for Random Forests, or C and gamma for SVMs, must be optimized because no universal setting works for all datasets.
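A standard way to tune C and gamma for an SVM is a cross-validated grid search; the grid values below are arbitrary examples, and the iris data stands in for a real dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every (C, gamma) combination is scored by 5-fold cross-validation,
# since no single setting works for all datasets.
grid = GridSearchCV(SVC(),
                    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)

print(grid.best_params_)  # the winning C and gamma
```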

5.3 Feature Selection

Feature selection reduces the original feature set to a subset that improves model accuracy and interpretability. Techniques include evolutionary algorithms (genetic algorithms, particle swarm optimization), Monte‑Carlo methods, and hybrid approaches.
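As a sketch of the Monte-Carlo idea (not the article's own implementation), one can score randomly sampled feature subsets by cross-validation and keep the best; evolutionary methods differ by refining subsets across generations rather than sampling them independently:

```python
import random
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]
rng = random.Random(0)

# Monte-Carlo search: sample random feature subsets, keep the best scorer.
best_subset, best_score = None, -1.0
for _ in range(20):
    k = rng.randint(1, n_features)
    subset = sorted(rng.sample(range(n_features), k))
    score = cross_val_score(LogisticRegression(max_iter=1000),
                            X[:, subset], y, cv=3).mean()
    if score > best_score:
        best_subset, best_score = subset, score

print(best_subset, round(best_score, 3))
```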

6. Machine‑Learning Tasks

6.1 Classification

Classification models predict categorical labels. Common performance metrics include accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC).
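All four metrics can be computed from a confusion matrix; the toy labels below are invented for illustration:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]

# sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(accuracy_score(y_true, y_pred))    # 4/6 correct
print(tp / (tp + fn))                    # sensitivity: 2/3
print(tn / (tn + fp))                    # specificity: 2/3
print(matthews_corrcoef(y_true, y_pred)) # MCC: 1/3
```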

6.2 Regression

Regression models predict continuous outcomes (Y = f(X)). Key evaluation metrics are R², Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
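These three metrics on a toy prediction (values invented for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)              # RMSE, in the same units as Y
r2 = r2_score(y_true, y_pred)    # fraction of variance explained

print(mse, rmse, r2)  # 0.025, ~0.158, 0.98
```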

7. Visual Illustration of a Classification Task

Using the Penguins dataset (8 features, 3 species), the article demonstrates how to train a classifier, evaluate it, and optionally apply PCA for visualizing underlying data structure.
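The same train-evaluate-project pipeline can be sketched with scikit-learn; since loading Penguins typically requires a download, the built-in iris data (also three classes) stands in here, and the random-forest classifier is an illustrative choice rather than the article's:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for the Penguins data: iris also has 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train a classifier and evaluate it on the held-out test set.
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # test-set accuracy

# PCA projects the features to 2 components for visualizing structure.
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (150, 2)
```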

Overall, the hand‑drawn infographic provides a clear, step‑by‑step visual guide to the full machine‑learning process, making the subject more engaging and accessible.

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
