Normalizing-Flow Policy
Models action chunks with an invertible transformation conditioned on observations, enabling exact log-likelihoods and expressive multimodal behavior.
arXiv:2602.09580 (Feb 2026)
*Equal contribution; Soft Robotics Lab, D-MAVT, ETH Zurich
SERNF is a sample-efficient reinforcement learning algorithm based on normalizing-flow for dexterous manipulation. It combines a normalizing-flow policy (exact likelihoods for multimodal action chunks) with an action-chunked critic (value learning aligned with chunked execution). It enables stable offline-to-online adaptation on real robots under a limited interaction budget.
Real-world fine-tuning of dexterous manipulation policies remains challenging due to
limited real-world interaction budgets and highly multimodal action distributions.
Diffusion-based policies, while expressive, do not permit conservative likelihood-based
updates during fine-tuning because action probabilities are intractable. In contrast,
conventional Gaussian policies collapse under multimodality, particularly when actions
are executed in chunks, and standard per-step critics fail to align with chunked execution,
leading to poor credit assignment.
We present SERNF, a Sample-Efficient Reinforcement learning with Normalizing Flows to address these challenges. The normalizing-flow policy yields
exact likelihoods for multimodal action chunks, allowing conservative, stable policy updates
through likelihood regularization and thereby improving sample efficiency. An
action-chunked critic evaluates entire action sequences, aligning value estimation with
the policy’s temporal structure and improving long-horizon credit assignment.
We evaluate SERNF on two challenging real-world dexterous manipulation tasks:
scissors retrieval & tape cutting, and in-hand palm-done cube rotation.
On these tasks, SERNF achieves stable, sample-efficient adaptation.
Key idea: keep the policy expressive (multimodal action chunks) while making it tractable (exact log-likelihoods), and align value learning with the same temporal abstraction.
Models action chunks with an invertible transformation conditioned on observations, enabling exact log-likelihoods and expressive multimodal behavior.
Estimates value over entire chunks, matching the control interface and improving credit assignment under long horizons.
Starts from imitation learning (or sim-to-real distillation), then warm-starts the critic, runs offline RL, and finally performs online RL with a limited rollout budget.
marks successful rollout completion and failure.
Duck pick-and-place — final after online RL.
Scissors retrieval & tape cutting — final after online RL.
In-hand palm-down cube rotation — final after online RL.
This task uses only 30 teleoperated demonstrations. The robot must pick the duck from varying table poses and place it into the bowl, making low-data adaptation substantially more challenging.
Step 1: Imitation learning
Step 2: Offline RL
Step 3: Online RL
Step 1: Imitation learning
Step 2: Offline RL
Step 3: Online RL
Step 1: Simulation teacher (PPO + domain randomization)
Step 2: Distilled policy (real)
Step 3: After critic warm-up
Step 4: After online RL
With only 30 teleoperated demonstrations, the duck task is a challenging low-data setting. Even under varying pickup locations and orientations, offline RL reaches 68.75% success, and subsequent online RL with just 15 additional rollouts improves this to 87.5%, compared with 75% after imitation learning. On the subset of rollouts where all three policies succeed, RL also speeds up execution: average completion time drops from 8.86s for imitation learning to 7.08s after offline RL and 6.91s after online RL.
Starting from imitation learning, SERNF improves grasping with offline RL and enables successful cutting after limited online fine-tuning. In the paper’s summary table, SERNF reaches 70% grasp and 70% cutting success in the final setting.
Online fine-tuning refines a distilled sim policy into robust real-world rotation. The reported peak performance reaches 6.25 rotations/min after about 105 minutes of real-world data.
| Task | Stage | Summary |
|---|---|---|
| Duck pick-and-place | NF Imitation | 75% success with an average completion time of 8.86s. |
| Duck pick-and-place | SOFT-FLOW (offline only) | 68.75% success with an average completion time of 7.08s. |
| Duck pick-and-place | SOFT-FLOW (full) | 87.5% success after online RL with an average completion time of 6.91s. |
| Scissors + tape | NF Imitation | 50% grasping, 10% cutting. |
| Scissors + tape | SOFT-FLOW (offline only) | 80% grasping, but cutting not yet solved. |
| Scissors + tape | SOFT-FLOW (full) | 70% grasping, 70% cutting after online RL. |
| Cube rotation | Distilled -> online RL | Performance grows to a peak of 6.25 rotations/min with stable continuous turns. |
@article{yang2026sernf,
title={Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows},
author={Yang, Chenyu and Tarasov, Denis and Liconti, Davide and Zheng, Hehui and Katzschmann, Robert K},
journal={arXiv e-prints},
pages={arXiv--2602},
year={2026}
}
}