
Ensemble-based Offline-to-Online Reinforcement Learning (ENOTO): Methodology, Experiments, and Analysis

ENOTO introduces ensemble Q‑networks into the offline‑to‑online reinforcement‑learning pipeline, using minimum‑Q targets and uncertainty‑driven exploration to stabilize fine‑tuning, boost learning efficiency, and achieve 10–25% higher cumulative returns with minimal online interaction across MuJoCo and AntMaze benchmarks.

Bilibili Tech

Reinforcement Learning (RL) has two primary training paradigms: Online RL, which requires interaction with the environment and incurs high exploration costs, and Offline RL, which trains solely on pre-collected datasets but is limited by data quality and coverage.

To combine the advantages of both, researchers have proposed an Offline-to-Online RL paradigm. First, an offline policy is trained on existing datasets; then this policy is fine‑tuned online with a small amount of interaction data. This approach aims to overcome offline data limitations while requiring far fewer online interactions than pure online RL. Two main challenges arise: (1) performance degradation due to distribution shift when fine‑tuning an offline policy, and (2) achieving high learning efficiency with minimal online interactions.

At IJCAI 2024, Bilibili AI Platform and Tianjin University introduced ENsemble‑based Offline‑To‑Online RL (ENOTO), which incorporates ensemble Q‑networks into the offline‑to‑online pipeline. ENOTO leverages the uncertainty estimates from the ensemble to stabilize the transition between offline and online phases and to encourage efficient exploration. The framework can be combined with various base RL algorithms and has demonstrated improved stability and learning efficiency across MuJoCo, AntMaze, and multiple quality datasets, yielding 10%‑25% higher cumulative returns compared to prior methods.

Motivation

Early offline RL methods such as Conservative Q‑Learning (CQL) penalize out‑of‑distribution Q‑values, effectively restricting the policy to actions present in the dataset. Extending the single Q‑network to an ensemble of N Q‑networks (Q‑ensembles) yields surprisingly large gains in the offline‑to‑online setting. Experiments with CQL on MuJoCo show that naïve online fine‑tuning (CQL→CQL) learns slowly, while switching directly to an online algorithm (CQL→SAC) causes sharp performance drops early in fine‑tuning. Introducing an ensemble (CQL‑N→SAC‑N) achieves both stability and faster learning.
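The conservative penalty behind CQL can be illustrated with a minimal NumPy sketch: a logsumexp term pushes down Q‑values at sampled candidate actions (which may lie outside the dataset distribution) while keeping Q‑values at dataset actions high. The function name and the use of externally sampled candidate actions are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cql_penalty(q_candidates, q_data):
    """Illustrative CQL-style regularizer (sketch, not the exact paper loss).

    q_candidates: (batch, m) Q-values at m sampled candidate actions,
                  which may be out-of-distribution.
    q_data:       (batch,)  Q-values at the dataset actions.

    logsumexp over candidates upper-bounds their max, so minimizing this
    term pushes down over-estimated out-of-distribution Q-values while
    preserving high Q-values on dataset actions.
    """
    m = q_candidates.max(axis=1, keepdims=True)            # stable logsumexp
    lse = m.squeeze(1) + np.log(np.exp(q_candidates - m).sum(axis=1))
    return float(np.mean(lse - q_data))
```

Raising the candidate Q‑values raises the penalty, which is exactly the pressure that keeps the learned policy close to the dataset's action support.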

Method

ENOTO consists of three progressive steps:

1. Stabilize the transition: Replace the single Q‑network in both the offline and online phases with an ensemble of N Q‑networks and use the minimum Q‑value across the ensemble (MinQ) as the target. This curbs over‑optimism during fine‑tuning.

2. Improve online efficiency: Replace MinQ with a less pessimistic estimator, WeightedMinPair, which better trades off conservatism and optimism for online learning.

3. Uncertainty‑driven exploration: Use the standard deviation of the ensemble's Q‑values as an uncertainty measure. During action selection, combine the Q‑value with its uncertainty (weighted by a hyper‑parameter) to favor actions with higher uncertainty, encouraging exploration of less‑certain regions.
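The three estimators above can be sketched in NumPy. MinQ takes the elementwise minimum over all N ensemble members; the WeightedMinPair form shown here simply averages the minima over all pairs of members, a simplifying assumption of the paper's pair‑weighting scheme; the exploration score adds a multiple of the ensemble standard deviation to the mean. The array shape (N, batch) and the hyper‑parameter name beta are illustrative.

```python
import numpy as np

def min_q(q):
    """MinQ target: elementwise minimum across all N ensemble members
    (most pessimistic, used to stabilize the offline-to-online switch)."""
    return q.min(axis=0)

def weighted_min_pair(q):
    """Pairwise-min estimator (sketch): average the elementwise minimum
    over every pair of ensemble members. Less pessimistic than MinQ,
    since each pair ignores the remaining members' low estimates."""
    n = q.shape[0]
    mins = [np.minimum(q[i], q[j]) for i in range(n) for j in range(i + 1, n)]
    return np.mean(mins, axis=0)

def exploration_score(q, beta=1.0):
    """Uncertainty-driven action score: mean Q plus beta times the
    ensemble standard deviation, favoring less-certain actions."""
    return q.mean(axis=0) + beta * q.std(axis=0)
```

By construction MinQ ≤ WeightedMinPair ≤ the ensemble mean, matching the intended ordering from most to least conservative, while the exploration score moves in the opposite, optimistic direction.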

Experiments

Experiments were conducted on the MuJoCo benchmark (HalfCheetah, Walker2d, Hopper) using D4RL datasets of varying quality (medium, medium‑replay, medium‑expert). ENOTO‑CQL consistently outperformed baselines such as SAC trained from scratch, IQL, AWAC, Balanced Replay (BR), PEX, and Cal‑QL in both stability and learning speed. Notably, ENOTO‑CQL started from a strong offline policy and improved quickly with few online steps.

Further validation on the more challenging AntMaze tasks (umaze, medium, large) with both “play” and “diverse” datasets showed that ENOTO‑LAPO (ENOTO instantiated on LAPO) achieved higher initial performance and stable, rapid improvement compared to IQL, PEX, and Cal‑QL.

Conclusion

The ENOTO framework introduces ensemble Q‑networks to offline‑to‑online RL, providing robust transition stability, enhanced online efficiency, and uncertainty‑guided exploration. Empirical results on MuJoCo and AntMaze demonstrate that ENOTO not only improves offline performance but also enables fast, stable online fine‑tuning without degrading the pretrained policy.

Reference: https://arxiv.org/abs/2306.06871
