
Step‑by‑Step Guide to Building Machine Learning Models with Scikit‑learn Templates

This article introduces a practical, step‑by‑step tutorial on building machine learning models with scikit‑learn, covering problem types, dataset loading, splitting, and a series of reusable templates (V1.0, V2.0, V3.0) for classification, regression, clustering, cross‑validation, and hyper‑parameter tuning, complete with code examples.

Python Programming Learning Circle

The rise of artificial intelligence has created strong demand for algorithm engineers, but thanks to Python's ecosystem, building machine-learning models has become straightforward: many powerful algorithms can be called directly without implementing them from scratch.

You only need two steps to construct your own model:

Identify the type of problem you need to solve and the corresponding algorithm.

Call the appropriate algorithm from scikit‑learn to build the model.

Common problem types are classification, regression, and clustering. For example, predicting a categorical label is a classification problem (binary or multi‑class), predicting a continuous value is a regression problem, and discovering group structures without labels is a clustering problem (e.g., k‑means).
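Each of these problem types maps to a family of scikit-learn estimators sharing the same `fit`/`predict` interface. As a minimal illustrative sketch (the specific estimators and synthetic datasets here are arbitrary choices, not from the original article):

```python
# One illustrative estimator per problem type, on small synthetic data.
from sklearn.datasets import make_classification, make_regression, make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Classification: predict a discrete label
Xc, yc = make_classification(n_samples=100, random_state=0)
clf = DecisionTreeClassifier().fit(Xc, yc)

# Regression: predict a continuous value
Xr, yr = make_regression(n_samples=100, random_state=0)
reg = LinearRegression().fit(Xr, yr)

# Clustering: group unlabeled points (k-means with 3 groups)
Xb, _ = make_blobs(n_samples=100, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Xb)
print(km.labels_[:5])
```

All three follow the same pattern: construct the estimator, then call `fit`, which is what makes the templates below reusable across algorithms.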

Universal Template V1.0

In scikit-learn, estimators share the same interface and differ mainly in class name and parameter settings. With the template you can simply copy-paste and swap in a different algorithm name.

1. Load Dataset

We use the Iris dataset, a typical multi-class classification problem, loaded directly from sklearn.datasets:

<code>from sklearn.datasets import load_iris
data = load_iris()
x = data.data
y = data.target</code>

The feature matrix x is a (150, 4) array and the target vector y contains three class labels (0, 1, 2).

2. Split Dataset

To detect over-fitting we split the data using train_test_split, reserving 10% for testing:

<code>from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.1, random_state=0)</code>
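With the data split, the whole V1.0 template reduces to three calls: construct, fit, evaluate. A self-contained sketch, using KNeighborsClassifier purely as an illustrative algorithm (any classifier can be substituted):

```python
# V1.0 template sketch: swap in any scikit-learn classifier by name.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
x, y = data.data, data.target
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.1, random_state=0)

model = KNeighborsClassifier()        # 1. choose the algorithm
model.fit(train_x, train_y)           # 2. train on the training split
print('Test accuracy: %.4f' % model.score(test_x, test_y))  # 3. evaluate
```

Changing the model is a one-line edit: replace the constructor on the marked line and the rest of the template stays the same.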

Universal Template V2.0 – Adding Cross‑Validation

Running the same program repeatedly can yield varying accuracies, because train_test_split shuffles the data before splitting (unless random_state is fixed). Moreover, a model that performs well on the training set may over-fit and perform poorly on the test set.

To obtain a more reliable estimate we create a validation set inside the training data and use k‑fold cross‑validation. Scikit‑learn provides cross_val_score() to handle the splitting automatically:

<code># Example for SVM with cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
svm_model = SVC()
# cross_val_score clones and fits the model on each fold internally,
# so no separate fit() call is needed beforehand
scores = cross_val_score(svm_model, train_x, train_y, cv=5, scoring='accuracy')
print("Training accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()*2))</code>

The same approach can be applied to the test set, and multiple metrics can be evaluated with cross_validate().
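As a brief sketch of the multi-metric case (the scorer names 'accuracy' and 'f1_macro' are standard scikit-learn scoring strings; this example is not from the original article):

```python
# cross_validate() accepts a list of scorers and returns per-fold
# timings plus one 'test_<metric>' array per requested metric.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

x, y = load_iris(return_X_y=True)
results = cross_validate(SVC(), x, y, cv=5,
                         scoring=['accuracy', 'f1_macro'])
print(sorted(results.keys()))
# the dict includes 'test_accuracy' and 'test_f1_macro', one score per fold
```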

Universal Template V3.0 – Hyper‑Parameter Tuning

Default parameters may not be optimal for every dataset, so we need to tune them. Scikit-learn exposes each estimator's tunable parameters via estimator.get_params(). For SVC, for instance, this returns a dictionary of the current parameter settings and their names.

<code>SVC().get_params()</code>

Using GridSearchCV with a predefined parameter list allows systematic search for the best combination:

<code>from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
svm_model = SVC()
params = [
    {'kernel': ['linear'], 'C': [1, 10, 100]},
    {'kernel': ['poly'], 'C': [1], 'degree': [2, 3]},
    {'kernel': ['rbf'], 'C': [1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]}
]
best_model = GridSearchCV(svm_model, param_grid=params, cv=5, scoring='accuracy')
best_model.fit(train_x, train_y)
print(best_model.best_score_)
print(best_model.best_params_)
print(best_model.best_estimator_)
print(best_model.cv_results_)</code>

The resulting best_model object provides the optimal score, parameters, and the full cross‑validation results.
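After fitting, the search object can be used directly for prediction, since by default GridSearchCV refits the best parameter combination on the whole training split. A small self-contained sketch (the reduced parameter grid here is illustrative):

```python
# After GridSearchCV.fit, best_estimator_ is refit on the full training
# split; search.predict(...) delegates to it.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

x, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.1, random_state=0)

search = GridSearchCV(SVC(), {'C': [1, 10, 100]}, cv=5, scoring='accuracy')
search.fit(train_x, train_y)
pred = search.best_estimator_.predict(test_x)  # equivalent: search.predict(test_x)
print('Test accuracy: %.4f' % accuracy_score(test_y, pred))
```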

Application Cases

1. Build an SVM Classification Model (V1.0)

<code># svm classifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
svm_model = SVC()
svm_model.fit(train_x, train_y)
pred_train = svm_model.predict(train_x)
accuracy_train = accuracy_score(train_y, pred_train)
print('Training accuracy: %.4f' % accuracy_train)
pred_test = svm_model.predict(test_x)
accuracy_test = accuracy_score(test_y, pred_test)
print('Test accuracy: %.4f' % accuracy_test)</code>

Output example: Training accuracy 0.9810, Test accuracy 0.9778.

2. Build a Logistic Regression Model (V1.0)

<code># LogisticRegression classifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# raise max_iter if the default solver reports a convergence warning
lr_model = LogisticRegression()
lr_model.fit(train_x, train_y)
# ... same evaluation steps as above ...</code>

Output example: Training accuracy 0.9429, Test accuracy 0.8889.

3. Build an SVM Model with Cross‑Validation (V2.0)

<code>from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
svm_model = SVC()
scores = cross_val_score(svm_model, train_x, train_y, cv=5, scoring='accuracy')
print('Cross‑validated training accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std()*2))</code>

Similar code is shown for Logistic Regression and Random Forest.
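The Random Forest variant follows the same V2.0 pattern; since the original article's exact code is not reproduced above, the following is a hedged sketch of how it would look:

```python
# V2.0-style cross-validation with a Random Forest classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

x, y = load_iris(return_X_y=True)
rf_model = RandomForestClassifier(random_state=0)
scores = cross_val_score(rf_model, x, y, cv=5, scoring='accuracy')
print('Cross-validated accuracy: %0.2f (+/- %0.2f)'
      % (scores.mean(), scores.std() * 2))
```

Only the constructor line changes between the SVM, Logistic Regression, and Random Forest versions, which is the point of the template.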

Finally, the article provides visual references (images) illustrating the templates and the model‑selection guide.

Disclaimer: This article is compiled from online sources; copyright belongs to the original author.

Written by Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
