Illustrated Guide to the Complete Machine Learning Workflow
This article presents a hand‑drawn, illustrated walkthrough of the entire machine‑learning pipeline—from dataset definition, exploratory data analysis, preprocessing, and data splitting to model building, algorithm selection, hyper‑parameter tuning, feature selection, and evaluation for both classification and regression tasks.
1. Dataset
A dataset is the starting point of any machine‑learning project and can be viewed as an M×N matrix, where M is the number of samples (rows) and N is the number of variables (columns). The columns are split into X (the independent variables, or features) and Y (the dependent variable, or target). Supervised datasets contain both X and Y, while unsupervised datasets contain only X. If Y is quantitative, the task is regression; if Y is categorical, the task is classification.
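As a minimal sketch of the X/Y split in pandas (the file name and the "target" column are hypothetical):

```python
import pandas as pd

# Load a dataset (hypothetical file); rows are samples, columns are variables
df = pd.read_csv("dataset.csv")

# Split columns into features X and target Y
X = df.drop(columns=["target"])  # independent variables
y = df["target"]                 # dependent variable: labels for classification,
                                 # continuous values for regression
```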
2. Exploratory Data Analysis (EDA)
EDA provides an initial understanding of the data. Common EDA techniques include descriptive statistics (mean, median, mode, standard deviation), data visualization (heatmaps, box plots, scatter plots, PCA), and data reshaping (pivot, group, filter).
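Continuing with the hypothetical df from above, a few of these techniques in pandas and seaborn might look like:

```python
import seaborn as sns
import matplotlib.pyplot as plt

print(df.describe())                  # mean, std, quartiles per feature
print(df.groupby("target").median())  # group-wise summaries (reshaping)

# Correlation heatmap over the numeric features
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
```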
3. Data Preprocessing
Data preprocessing (cleaning, normalizing, standardizing, transforming) handles missing values and data errors and puts features on comparable scales. It can consume up to 80% of a data‑science project’s time, while model building and analysis take the remaining 20%.
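A hedged sketch of two common preprocessing steps with scikit-learn, assuming X holds only numeric features:

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Fill missing values with each column's median
X_imputed = SimpleImputer(strategy="median").fit_transform(X)

# Standardize features to zero mean and unit variance for comparability
X_scaled = StandardScaler().fit_transform(X_imputed)
```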
4. Data Splitting
4.1 Train‑Test Split
The dataset is divided into a larger training set (e.g., 80%) and a smaller test set (e.g., 20%) to evaluate model performance on unseen data.
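With scikit-learn, an 80/20 split is one call (the random_state value is arbitrary, fixed only for reproducibility):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
```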
4.2 Train‑Validation‑Test Split
A three‑part split adds a validation set for hyper‑parameter tuning and model selection, while the test set remains untouched until final evaluation.
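One common way to get a three-part split, here a hypothetical 60/20/20, is two successive calls:

```python
# First carve off 40%, then split that 40% in half: 60% train / 20% val / 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X_scaled, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)
```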
4.3 Cross‑Validation
k‑fold cross‑validation (commonly 5‑ or 10‑fold) repeatedly holds out one fold as test data while training on the remaining folds, producing k models whose performance metrics are averaged.
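A 5-fold cross-validation sketch; the Random Forest here is just a placeholder estimator:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Each of the 5 folds serves once as the held-out set; scores are averaged
scores = cross_val_score(RandomForestClassifier(), X_scaled, y, cv=5)
print(scores.mean(), scores.std())
```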
5. Model Building
5.1 Learning Algorithms
Algorithms fall into three categories: supervised learning (mapping X to Y), unsupervised learning (discovering structure in X alone), and reinforcement learning (optimizing actions via trial‑and‑error).
5.2 Hyper‑parameter Tuning
Hyper‑parameters such as mtry and ntree for Random Forests, or C and gamma for SVMs, must be optimized because no universal setting works for all datasets.
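Note that mtry and ntree are the R randomForest names; scikit-learn's analogues are max_features and n_estimators. A hedged grid-search sketch for the SVM's C and gamma (the grid values are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Exhaustive search over the grid, scored by 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```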
5.3 Feature Selection
Feature selection reduces the original feature set to a subset that improves model accuracy and interpretability. Techniques include evolutionary algorithms (genetic algorithms, particle swarm optimization), Monte‑Carlo methods, and hybrid approaches.
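The sketch below illustrates the idea with a toy Monte-Carlo search over random feature subsets; it is not the evolutionary algorithms named above, which refine candidate subsets iteratively rather than sampling them independently:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Sample random feature subsets and keep the one with the best CV score
rng = np.random.default_rng(0)
n_features = X_train.shape[1]
best_score, best_mask = -np.inf, None

for _ in range(50):                      # number of random trials
    mask = rng.random(n_features) < 0.5  # random subset of the columns
    if not mask.any():
        continue
    score = cross_val_score(
        RandomForestClassifier(random_state=0),
        X_train[:, mask], y_train, cv=5
    ).mean()
    if score > best_score:
        best_score, best_mask = score, mask

print("best CV accuracy:", best_score)
print("selected feature indices:", np.flatnonzero(best_mask))
```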
6. Machine‑Learning Tasks
6.1 Classification
Classification models predict categorical labels. Common performance metrics include accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC).
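Computed with scikit-learn, assuming clf is some already-fitted binary classifier (in multi-class problems, sensitivity and specificity are defined per class):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef

# Sensitivity and specificity read directly off the 2x2 confusion matrix
y_pred = clf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("accuracy   :", accuracy_score(y_test, y_pred))
print("sensitivity:", tp / (tp + fn))  # true positive rate
print("specificity:", tn / (tn + fp))  # true negative rate
print("MCC        :", matthews_corrcoef(y_test, y_pred))
```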
6.2 Regression
Regression models predict continuous outcomes (Y = f(X)). Key evaluation metrics are R², Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
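With scikit-learn, assuming reg is some already-fitted regressor:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# RMSE is simply the square root of MSE, in the units of Y
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("R^2 :", r2_score(y_test, y_pred))
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
```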
7. Visual Illustration of a Classification Task
Using the Penguins dataset (8 features, 3 species), the article demonstrates how to train a classifier, evaluate it, and optionally apply PCA for visualizing underlying data structure.
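A hedged end-to-end sketch using the copy of the Penguins data that ships with seaborn (it has fewer columns than the 8-feature set the article draws, so this is illustrative only):

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the data, drop incomplete rows, and keep the numeric features
df = sns.load_dataset("penguins").dropna()
X = df[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
y = df["species"]

# Train and evaluate a classifier on an 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# Optional: 2-component PCA of the standardized features for visualization
pcs = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
sns.scatterplot(x=pcs[:, 0], y=pcs[:, 1], hue=y)
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.show()
```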
Overall, the hand‑drawn infographic provides a clear, step‑by‑step visual guide to the full machine‑learning process, making the subject more engaging and accessible.