Top 10 New Features in Scikit‑learn 0.24
The article reviews the most important additions in scikit‑learn 0.24, including faster hyper‑parameter search methods, ICE plots, histogram‑based boosting improvements, new feature‑selection tools, polynomial‑feature approximations, a semi‑supervised classifier, MAPE metric, enhanced OneHotEncoder and OrdinalEncoder handling, and a more flexible RFE interface.
Since its first release in 2007, scikit‑learn has become a cornerstone Python library for machine learning, offering classification, regression, dimensionality reduction, clustering, feature extraction, data preprocessing, and model evaluation.
Its strengths lie in comprehensive documentation, a consistent and widely adopted API, a large collection of algorithms (including wrappers around established libraries such as LIBSVM and LIBLINEAR), and many built‑in datasets that save users time.
Version 0.24, released in 2021, introduces several noteworthy features:
1. Faster hyper‑parameter selection
HalvingGridSearchCV and HalvingRandomSearchCV combine the functionality of GridSearchCV and RandomizedSearchCV with successive halving, a tournament‑style approach: every candidate is evaluated with a small amount of resources (such as training samples) in the first round, and only the best‑performing candidates advance to later rounds with more resources, dramatically reducing computational cost. Use them when the search space is large or model training is slow; otherwise, GridSearchCV remains a sensible default.
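A minimal sketch of a halving grid search; the dataset, estimator, and parameter grid below are illustrative choices, not from the original article:

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {"max_depth": [2, 4, 8], "min_samples_split": [2, 10]}
search = HalvingGridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    param_grid,
    factor=3,  # each round keeps roughly a third of the candidates
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Losing candidates are discarded after the cheap early rounds, so only a handful of configurations ever see the full training set.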
Import the experimental classes before use:
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV, HalvingRandomSearchCV
2. ICE plots
Partial dependence plots (PDP) were introduced in 0.23; version 0.24 adds Individual Conditional Expectation (ICE) plots, which display the dependence of the prediction on a feature for each individual sample. Pass kind='individual' to plot_partial_dependence to view ICE curves, or kind='both' to draw PDP and ICE together.
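The per‑sample curves behind an ICE plot can also be computed numerically with partial_dependence, which gained the same kind parameter in 0.24; a sketch on a synthetic regression problem (the dataset and model are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=100, n_features=4, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# kind='individual' returns one curve per sample: the raw data of an ICE plot
ice = partial_dependence(model, X, features=[0], kind="individual")
print(ice["individual"].shape)  # (n_outputs, n_samples, n_grid_points)
```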
3. Histogram‑based boosting improvements
Inspired by LightGBM, HistGradientBoostingRegressor and HistGradientBoostingClassifier now accept a categorical_features argument, allowing direct handling of categorical data without one‑hot encoding, reducing training time and often improving performance. Missing values are also natively supported.
model = HistGradientBoostingRegressor(
    # boolean mask: the first column is categorical, the second is numeric
    categorical_features=[True, False]
)
4. Forward feature selection
The SequentialFeatureSelector performs forward selection by iteratively adding the most valuable feature until a stopping criterion is met (backward elimination is also available via direction='backward'), without requiring the underlying estimator to expose coef_ or feature_importances_. Because each candidate feature is scored with cross‑validation, it can be slower than RFE.
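A brief sketch on the iris dataset, using a k‑nearest‑neighbors estimator precisely because it exposes neither coef_ nor feature_importances_ (the estimator and target count are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Forward selection: start empty, greedily add the feature that
# most improves cross-validated accuracy, stop at 2 features
selector = SequentialFeatureSelector(
    KNeighborsClassifier(), n_features_to_select=2, direction="forward"
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected features
```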
5. Fast approximation of polynomial features
The new PolynomialCountSketch estimator in the kernel_approximation module provides a memory‑ and time‑efficient alternative to PolynomialFeatures, generating a fixed number of sketch features (default 100) that approximate high‑order interactions.
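A sketch of the transform on synthetic data (the dataset shape and parameters are illustrative); unlike PolynomialFeatures, the output width is fixed by n_components rather than growing combinatorially with the input dimension:

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import PolynomialCountSketch

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Approximates a degree-2 polynomial kernel with 100 sketch features,
# regardless of how many input features there are
sketch = PolynomialCountSketch(degree=2, n_components=100, random_state=0)
X_sketch = sketch.fit_transform(X)
print(X_sketch.shape)  # (200, 100)
```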
6. SelfTrainingClassifier for semi‑supervised learning
This meta‑classifier wraps any supervised classifier that can output class probabilities, allowing it to learn from unlabeled data. Unlabeled samples must be marked with -1 in the target vector.
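A minimal sketch on iris, hiding most labels to simulate a semi‑supervised setting (the base estimator and the fraction of hidden labels are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
y_partial = y.copy()
rng = np.random.default_rng(0)
# Hide 70% of the labels; unlabeled samples are marked with -1
y_partial[rng.random(len(y)) < 0.7] = -1

# Wrap any classifier that implements predict_proba
clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y_partial)
print(clf.score(X, y))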
7. Mean Absolute Percentage Error (MAPE)
The new mean_absolute_percentage_error function provides a regression metric comparable across different problems, complementing R‑squared.
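A worked example with made‑up numbers: the metric averages each absolute error relative to the true value, so its scale does not depend on the units of the target.

```python
from sklearn.metrics import mean_absolute_percentage_error

y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

# Average of |error| / |true value|: (0.10 + 0.05 + 0.10) / 3
mape = mean_absolute_percentage_error(y_true, y_pred)
print(mape)
```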
8. OneHotEncoder supports missing values
When the training data contain np.nan, the encoder now treats missing values as a category of their own and creates an extra column to represent them.
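A short illustration with made‑up category values:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], ["blue"], [np.nan]], dtype=object)
enc = OneHotEncoder()
X_enc = enc.fit_transform(X).toarray()
print(enc.categories_)  # nan appears as its own category
print(X_enc.shape)      # one column per category, including nan
```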
9. OrdinalEncoder can handle unseen categories in test data
Set handle_unknown='use_encoded_value' together with unknown_value (an integer not used in the training encoding, or np.nan) to safely encode categories that appear only in the test set.
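A brief sketch with made‑up categories, mapping an unseen value to the sentinel -1:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(np.array([["cat"], ["dog"]], dtype=object))

# "bird" never appeared during fit, so it maps to the sentinel -1
result = enc.transform(np.array([["cat"], ["bird"]], dtype=object))
print(result)
```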
10. RFE accepts a proportion of features to retain
Passing a float between 0 and 1 to n_features_to_select lets Recursive Feature Elimination keep a specified percentage of the original features, simplifying programmatic feature reduction.
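A minimal sketch on synthetic data (the estimator and fraction are illustrative): asking for 0.3 of 10 features keeps 3.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# A float fraction: keep 30% of the 10 original features
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=0.3)
rfe.fit(X, y)
print(rfe.n_features_)
```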
The original article appeared on Towards Data Science.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.