Why FLAML Is the Fast, Lightweight AutoML Framework You Should Try
This article introduces Microsoft's FLAML, a fast and lightweight AutoML library. It explains FLAML's design principles, cost‑aware search strategy, key observations and properties, and experimental results, and provides practical code examples for integrating FLAML into Python machine‑learning workflows.
Overview
AutoML has achieved many successes in recent machine‑learning competitions, and FLAML is an efficient, lightweight AutoML framework developed by Microsoft. FLAML (Fast and Lightweight AutoML Library) was introduced in a 2021 Microsoft research paper and quickly became an officially recommended automated tuning library for LightGBM.
Related Work
For background on AutoML, see the article "AutoML Framework Overview" (https://zhuanlan.zhihu.com/p/212512984) and the original paper (https://arxiv.org/pdf/1911.04706.pdf). Additional references include a KDnuggets article on FLAML + Ray Tune and a Microsoft Research stand‑up video.
Summary of Current Work
FLAML differs from other state‑of‑the‑art AutoML frameworks by focusing on lightweight search: it incorporates hyper‑parameters, learner choice, and sample size into a unified cost model that accounts for CPU time and cross‑validation overhead. The framework's core contribution is a cost‑aware search strategy that both shrinks the effective search space and accelerates the search.
Observations
Increasing sample size reduces the gap between test error and validation error, and cross‑validation yields smaller gaps than hold‑out under fixed conditions.
With fixed sample size, increasing model complexity does not always lower loss to the minimum.
Training cost grows roughly in proportion to both sample size and complexity‑related hyper‑parameters (e.g., the number of trees in a tree ensemble).
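The third observation can be made concrete with a toy cost model. This is an illustrative sketch only, not FLAML's internal cost function; the `unit_cost` constant and the linear form are assumptions for demonstration.

```python
# Toy cost model illustrating the observation above: training cost grows
# roughly linearly with both sample size and a complexity hyper-parameter
# such as the number of trees.

def training_cost(sample_size: int, n_trees: int, unit_cost: float = 1e-6) -> float:
    """Estimated cost of fitting a tree ensemble: one unit of work
    per (sample, tree) pair."""
    return unit_cost * sample_size * n_trees

# Doubling either factor roughly doubles the estimated cost.
base = training_cost(10_000, 100)
assert abs(training_cost(20_000, 100) / base - 2.0) < 1e-9
assert abs(training_cost(10_000, 200) / base - 2.0) < 1e-9
```

Under this model, starting with small samples and few trees (as FLAML does) keeps early trials cheap, leaving budget for larger configurations later.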
Properties
Choose sample size and model complexity jointly: large sample with complex model, small sample with simple model.
Use cross‑validation only for small sample sizes or when the cost budget is large; otherwise hold‑out suffices.
Allow all learners to compete fairly by considering resampling strategy and sample size in the cost model.
Select the configuration with the lowest cost when errors are comparable, avoiding excessive time on marginal improvements.
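The last property, preferring the cheapest configuration when errors are comparable, can be sketched as a simple selection rule. The tolerance value and the tuple format here are illustrative assumptions, not FLAML's actual interface.

```python
# Sketch of the "cheapest among comparable" rule: when several
# configurations reach similar validation error, prefer the one with the
# lowest cost, avoiding excessive time on marginal improvements.

def pick_config(results, tolerance=0.005):
    """results: list of (validation_error, cost, config) tuples.
    Returns the cheapest config whose error is within `tolerance`
    of the best observed error."""
    best_error = min(error for error, _, _ in results)
    comparable = [r for r in results if r[0] <= best_error + tolerance]
    return min(comparable, key=lambda r: r[1])[2]

results = [
    (0.101, 120.0, "large-model"),   # marginally better, much costlier
    (0.104,  15.0, "small-model"),   # comparable error, cheap
    (0.150,   5.0, "tiny-model"),    # cheap but clearly worse
]
print(pick_config(results))  # → small-model
```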
Design Overview
FLAML’s design emphasizes speed and efficiency by avoiding exhaustive global search. It sequentially selects a resampling strategy, learner, and hyper‑parameters, then evaluates validation error and cost. This loop repeats until a time limit is reached.
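The loop described above can be sketched in a few lines. All function names here (`choose_resampling`, `propose`, `evaluate`) are hypothetical stand-ins for illustration, not FLAML's actual internals.

```python
import time

# Minimal sketch of FLAML's sequential loop: fix a resampling strategy,
# then repeatedly propose a (learner, hyper-parameters, sample size)
# configuration, evaluate validation error and cost, and keep the best,
# stopping when the time budget runs out.

def automl_loop(time_budget, choose_resampling, propose, evaluate):
    resampling = choose_resampling()
    best = (float("inf"), None)          # (validation_error, config)
    deadline = time.monotonic() + time_budget
    while time.monotonic() < deadline:
        learner, hyperparams, sample_size = propose(best)
        error, cost = evaluate(resampling, learner, hyperparams, sample_size)
        if error < best[0]:
            best = (error, (learner, hyperparams, sample_size))
    return best
```

In the real framework the `propose` step is where the cost-aware search strategy lives; the sketch only shows the control flow.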
Search Strategy
The framework introduces Estimated Cost for Improvement (ECI) to guide search. ECI estimates the cost required to achieve the next loss improvement, considering three cases: (1) improving loss with the current learner and sample size, (2) increasing sample size for the current learner, and (3) switching to a new learner.
Search Detailed Procedure
Select an appropriate resampling strategy (cross‑validation or hold‑out) based on dataset size and time budget.
Choose a learner using ECI‑based probabilities, ensuring fair competition among learners.
Search hyper‑parameters and sample size with a randomized direct search method, starting with small sample sizes and expanding as training progresses.
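The ECI-based learner choice in step 2 can be sketched as follows. The inverse-proportional formula is an assumption for illustration; the paper defines ECI and the selection probabilities more precisely.

```python
# Illustrative sketch of ECI-guided learner selection: each learner's
# chance of being picked is inversely proportional to its estimated cost
# for the next improvement, so cheap learners are tried more often while
# expensive ones still keep a nonzero chance to compete.

def selection_probabilities(eci: dict) -> dict:
    """Map learner -> ECI into learner -> selection probability ∝ 1/ECI."""
    inverse = {name: 1.0 / cost for name, cost in eci.items()}
    total = sum(inverse.values())
    return {name: weight / total for name, weight in inverse.items()}

probs = selection_probabilities({"lgbm": 2.0, "xgboost": 4.0, "rf": 8.0})
# The cheapest learner (lgbm) gets the largest share, but every learner
# retains a nonzero selection probability.
```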
Example
An illustration (from the Microsoft stand‑up) shows FLAML’s iterative process, where early iterations use simple models and small samples, then gradually increase model complexity (e.g., LightGBM trees) and sample size as loss improves.
FLAML Comparison Experiments
FLAML was evaluated on 53 datasets (39 classification, 14 regression). Compared with Auto‑sklearn, H2O AutoML, TPOT, a cloud‑based AutoML service, and HpBandSter, FLAML achieved comparable or better results within the same time budget. Notably, FLAML trained for 1 minute outperformed the other libraries trained for 10 minutes on 62–83% of datasets, and FLAML trained for 10 minutes outperformed the others trained for 1 hour on 72–89% of datasets.
Implementation
FLAML follows the scikit‑learn API. A basic usage example:
<code># A minimal FLAML run: tune LightGBM for 60 seconds.
from flaml import AutoML

automl = AutoML()
automl.fit(X_train=X_train, y_train=y_train, time_budget=60, estimator_list=['lgbm'])
print('Best ML model:', automl.model)
print('Best hyperparameter config:', automl.best_config)</code>
Custom learners can be defined by subclassing existing estimators and providing a custom search space. Example for a customized XGBoost learner:
<code># Create an XGBoost learner class with a customized search space.
from flaml.model import XGBoostSklearnEstimator
from flaml import tune

class MyXGB(XGBoostSklearnEstimator):
    '''XGBoostSklearnEstimator with a customized search space.'''

    @classmethod
    def search_space(cls, data_size, **params):
        upper = min(2**15, int(data_size))
        return {
            'n_estimators': {
                'domain': tune.lograndint(lower=4, upper=upper),
                'low_cost_init_value': 4,
            },
            'max_leaves': {
                'domain': tune.lograndint(lower=4, upper=upper),
                'low_cost_init_value': 4,
            },
        }

# Use CFO in FLAML to tune XGBoost with the custom learner.
from flaml import AutoML

automl = AutoML()
automl.add_learner(learner_name='my_xgboost', learner_class=MyXGB)

# Quick 15-second run with the custom learner and the CFO search method.
automl.fit(X_train=X_train, y_train=y_train, time_budget=15,
           estimator_list=['my_xgboost'], hpo_method='cfo')

# A longer run configured through a settings dict.
settings = {
    "time_budget": 240,
    "metric": 'r2',
    "estimator_list": ['my_xgboost'],
    "task": 'regression',
    "log_file_name": 'houses_experiment.log',
    "hpo_method": 'cfo',
    "seed": 7654321,
}
automl.fit(X_train=X_train, y_train=y_train, **settings)</code>
Because FLAML adheres to the scikit‑learn API, it can be integrated into scikit‑learn pipelines for further convenience.
Practical insights from the GuanYuan Data Tech Team