Illustrated Guide to the Complete Machine Learning Workflow
This article presents a hand‑drawn, illustrated walkthrough of the entire machine‑learning pipeline—from dataset definition, exploratory data analysis, preprocessing, and data splitting to model building, algorithm selection, hyper‑parameter tuning, feature selection, and evaluation for both classification and regression tasks.
1. Dataset
A dataset is the starting point of any machine‑learning project and can be viewed as an M×N matrix, where M is the number of samples (rows) and N is the number of variables (columns). The columns are split into X (the independent variables, or features) and Y (the dependent variable, or target). Supervised datasets contain both X and Y, while unsupervised datasets contain only X. If Y is quantitative, the task is regression; if Y is categorical, the task is classification.
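As a minimal sketch of the X/Y split in pandas (the file name and the "target" column are hypothetical):

```python
import pandas as pd

# Load a dataset (hypothetical file); rows are samples, columns are variables
df = pd.read_csv("dataset.csv")

# Split columns into features X and target Y
X = df.drop(columns=["target"])  # independent variables
y = df["target"]                 # dependent variable: labels for classification,
                                 # continuous values for regression
```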
2. Exploratory Data Analysis (EDA)
EDA provides an initial understanding of the data. Common EDA techniques include descriptive statistics (mean, median, mode, standard deviation), data visualization (heatmaps, box plots, scatter plots, PCA), and data reshaping (pivot, group, filter).
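Continuing with the hypothetical df from above, a few of these techniques in pandas and seaborn might look like:

```python
import seaborn as sns
import matplotlib.pyplot as plt

print(df.describe())                  # mean, std, quartiles per feature
print(df.groupby("target").median())  # group-wise summaries (reshaping)

# Correlation heatmap over the numeric features
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
```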
3. Data Preprocessing
Data preprocessing (cleaning, normalizing, standardizing, transforming) handles missing values and data errors and puts features on comparable scales. It can consume up to 80% of a data‑science project’s time, while model building and analysis take the remaining 20%.
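A hedged sketch of two common preprocessing steps with scikit-learn, assuming X holds only numeric features:

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Fill missing values with each column's median
X_imputed = SimpleImputer(strategy="median").fit_transform(X)

# Standardize features to zero mean and unit variance for comparability
X_scaled = StandardScaler().fit_transform(X_imputed)
```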
4. Data Splitting
4.1 Train‑Test Split
The dataset is divided into a larger training set (e.g., 80%) and a smaller test set (e.g., 20%) to evaluate model performance on unseen data.
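With scikit-learn, an 80/20 split is one call (the random_state value is arbitrary, fixed only for reproducibility):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
```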
4.2 Train‑Validation‑Test Split
A three‑part split adds a validation set for hyper‑parameter tuning and model selection, while the test set remains untouched until final evaluation.
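One common way to get a three-part split, here a hypothetical 60/20/20, is two successive calls:

```python
# First carve off 40%, then split that 40% in half: 60% train / 20% val / 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X_scaled, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)
```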
4.3 Cross‑Validation
k‑fold cross‑validation (commonly 5‑ or 10‑fold) repeatedly holds out one fold as test data while training on the remaining folds, producing k models whose performance metrics are averaged.
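A 5-fold cross-validation sketch; the Random Forest here is just a placeholder estimator:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Each of the 5 folds serves once as the held-out set; scores are averaged
scores = cross_val_score(RandomForestClassifier(), X_scaled, y, cv=5)
print(scores.mean(), scores.std())
```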
5. Model Building
5.1 Learning Algorithms
Algorithms fall into three categories: supervised learning (mapping X to Y), unsupervised learning (discovering structure in X alone), and reinforcement learning (optimizing actions via trial‑and‑error).
5.2 Hyper‑parameter Tuning
Hyper‑parameters such as mtry and ntree for Random Forests, or C and gamma for SVMs, must be optimized because no universal setting works for all datasets.
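Note that mtry and ntree are the R randomForest names; scikit-learn's analogues are max_features and n_estimators. A hedged grid-search sketch for the SVM's C and gamma (the grid values are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Exhaustive search over the grid, scored by 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```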
5.3 Feature Selection
Feature selection reduces the original feature set to a subset that improves model accuracy and interpretability. Techniques include evolutionary algorithms (genetic algorithms, particle swarm optimization), Monte‑Carlo methods, and hybrid approaches.
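The sketch below illustrates the idea with a toy Monte-Carlo search over random feature subsets; it is not the evolutionary algorithms named above, which refine candidate subsets iteratively rather than sampling them independently:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Sample random feature subsets and keep the one with the best CV score
rng = np.random.default_rng(0)
n_features = X_train.shape[1]
best_score, best_mask = -np.inf, None

for _ in range(50):                      # number of random trials
    mask = rng.random(n_features) < 0.5  # random subset of the columns
    if not mask.any():
        continue
    score = cross_val_score(
        RandomForestClassifier(random_state=0),
        X_train[:, mask], y_train, cv=5
    ).mean()
    if score > best_score:
        best_score, best_mask = score, mask

print("best CV accuracy:", best_score)
print("selected feature indices:", np.flatnonzero(best_mask))
```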
6. Machine‑Learning Tasks
6.1 Classification
Classification models predict categorical labels. Common performance metrics include accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC).
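Computed with scikit-learn, assuming clf is some already-fitted binary classifier (in multi-class problems, sensitivity and specificity are defined per class):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef

# Sensitivity and specificity read directly off the 2x2 confusion matrix
y_pred = clf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("accuracy   :", accuracy_score(y_test, y_pred))
print("sensitivity:", tp / (tp + fn))  # true positive rate
print("specificity:", tn / (tn + fp))  # true negative rate
print("MCC        :", matthews_corrcoef(y_test, y_pred))
```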
6.2 Regression
Regression models predict continuous outcomes (Y = f(X)). Key evaluation metrics are R², Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
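With scikit-learn, assuming reg is some already-fitted regressor:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# RMSE is simply the square root of MSE, in the units of Y
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("R^2 :", r2_score(y_test, y_pred))
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
```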
7. Visual Illustration of a Classification Task
Using the Penguins dataset (8 features, 3 species), the article demonstrates how to train a classifier, evaluate it, and optionally apply PCA for visualizing underlying data structure.
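A hedged end-to-end sketch using the copy of the Penguins data that ships with seaborn (it has fewer columns than the 8-feature set the article draws, so this is illustrative only):

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the data, drop incomplete rows, and keep the numeric features
df = sns.load_dataset("penguins").dropna()
X = df[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
y = df["species"]

# Train and evaluate a classifier on an 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# Optional: 2-component PCA of the standardized features for visualization
pcs = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
sns.scatterplot(x=pcs[:, 0], y=pcs[:, 1], hue=y)
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.show()
```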
Overall, the hand‑drawn infographic provides a clear, step‑by‑step visual guide to the full machine‑learning process, making the subject more engaging and accessible.