SOFT-FLOW: Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows
arXiv:2602.09580 (Feb 2026)
*Equal contribution; Soft Robotics Lab, D-MAVT, ETH Zurich
SOFT-FLOW is a sample-efficient off-policy fine-tuning method for dexterous manipulation that combines a normalizing-flow policy (exact likelihoods for multimodal action chunks) with an action-chunked critic (value learning aligned with chunked execution). It enables stable offline-to-online adaptation on real robots under a limited interaction budget.
Real-world fine-tuning of dexterous manipulation policies remains challenging due to
limited real-world interaction budgets and highly multimodal action distributions.
Diffusion-based policies, while expressive, do not permit conservative likelihood-based
updates during fine-tuning because action probabilities are intractable. In contrast,
conventional Gaussian policies collapse under multimodality, particularly when actions
are executed in chunks, and standard per-step critics fail to align with chunked execution,
leading to poor credit assignment.
We present SOFT-FLOW, a sample-efficient off-policy fine-tuning framework that addresses
these challenges with normalizing flows. The normalizing-flow policy yields
exact likelihoods for multimodal action chunks, allowing conservative, stable policy updates
through likelihood regularization and thereby improving sample efficiency. An
action-chunked critic evaluates entire action sequences, aligning value estimation with
the policy’s temporal structure and improving long-horizon credit assignment.
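To make the policy side concrete, below is a minimal sketch (our own illustration, not the paper's released code) of a conditional normalizing flow over flattened action chunks that supports both sampling and exact log-likelihoods. The RealNVP-style affine couplings, the even-chunk-dimension assumption, the precomputed observation embedding, and names such as `AffineCoupling` and `FlowPolicy` are all assumptions made for this sketch.

```python
# Minimal sketch of a conditional normalizing-flow policy over flattened action
# chunks. Assumptions (ours, not the paper's): RealNVP-style affine couplings,
# an even chunk dimension, and a precomputed observation embedding as conditioner.
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Transforms one half of the chunk conditioned on the other half and on obs."""
    def __init__(self, dim, cond_dim, hidden=256, swap=False):
        super().__init__()
        assert dim % 2 == 0, "sketch assumes an even flattened chunk dimension"
        self.swap = swap
        self.net = nn.Sequential(
            nn.Linear(dim // 2 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),  # predicts log-scale and shift for the other half
        )

    def _scale_shift(self, kept, cond):
        s, t = self.net(torch.cat([kept, cond], dim=-1)).chunk(2, dim=-1)
        return torch.tanh(s), t  # bounded log-scale keeps training stable

    def forward(self, a, cond):  # action chunk -> latent, returns log|det J|
        a1, a2 = a.chunk(2, dim=-1)
        if self.swap:
            a1, a2 = a2, a1
        s, t = self._scale_shift(a1, cond)
        z2 = (a2 - t) * torch.exp(-s)
        return torch.cat((z2, a1) if self.swap else (a1, z2), dim=-1), -s.sum(-1)

    def inverse(self, z, cond):  # latent -> action chunk
        z1, z2 = z.chunk(2, dim=-1)
        if self.swap:
            z1, z2 = z2, z1
        s, t = self._scale_shift(z1, cond)
        a2 = z2 * torch.exp(s) + t
        return torch.cat((a2, z1) if self.swap else (z1, a2), dim=-1)

class FlowPolicy(nn.Module):
    """Stack of couplings: exact chunk log-likelihoods and multimodal samples."""
    def __init__(self, chunk_dim, obs_dim, n_layers=6):
        super().__init__()
        self.chunk_dim = chunk_dim
        self.layers = nn.ModuleList(
            [AffineCoupling(chunk_dim, obs_dim, swap=(i % 2 == 1)) for i in range(n_layers)]
        )

    def log_prob(self, chunk, obs):
        z, logdet = chunk, 0.0
        for layer in self.layers:
            z, ld = layer(z, obs)
            logdet = logdet + ld
        log_base = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(-1)  # standard normal base
        return log_base + logdet

    def sample(self, obs):
        z = torch.randn(obs.shape[0], self.chunk_dim, device=obs.device)
        for layer in reversed(self.layers):
            z = layer.inverse(z, obs)
        return z

policy = FlowPolicy(chunk_dim=8 * 4, obs_dim=64)   # e.g. 8-step chunks of 4-D actions
obs = torch.randn(16, 64)
chunks = policy.sample(obs)                        # (16, 32) reparameterized chunk samples
log_likelihoods = policy.log_prob(chunks, obs)     # exact, usable for regularization
```

Because sampling composes invertible maps, the chunks are reparameterized (gradients flow through them) and the log-likelihood is exact rather than approximated, which is what makes conservative, likelihood-regularized updates tractable for a multimodal chunk policy.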
We evaluate SOFT-FLOW on two challenging real-world dexterous manipulation tasks:
scissors retrieval & tape cutting, and in-hand palm-down cube rotation.
On these tasks, SOFT-FLOW achieves stable, sample-efficient adaptation.
Key idea: keep the policy expressive (multimodal action chunks) while making it tractable (exact log-likelihoods), and align value learning with the same temporal abstraction.
- **Normalizing-flow policy:** models action chunks with an invertible transformation conditioned on observations, enabling exact log-likelihoods and expressive multimodal behavior.
- **Action-chunked critic:** estimates value over entire chunks, matching the control interface and improving credit assignment under long horizons (see the sketch after this list).
- **Training pipeline:** starts from imitation learning (or sim-to-real distillation), then warm-starts the critic, runs offline RL, and finally performs online RL with a limited rollout budget.
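The sketch below illustrates the chunk-level value alignment. It is our own illustration under stated assumptions, not the paper's implementation: it assumes a replay batch of chunk transitions (observation, executed chunk, per-step rewards, next observation, done flag) and a policy exposing `sample()`/`log_prob()` as in the flow sketch above; names such as `ChunkCritic`, `chunk_td_target`, and the coefficient `beta` are hypothetical.

```python
# Minimal sketch of an action-chunked critic with a chunk-level TD target.
# Assumptions (ours): chunk transitions with per-step rewards, float done flags,
# and a policy exposing sample()/log_prob() as in the FlowPolicy sketch above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChunkCritic(nn.Module):
    """Q(s, a_{t:t+H}): scores an observation together with a whole action chunk."""
    def __init__(self, obs_dim, chunk_dim, hidden=256):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(obs_dim + chunk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, chunk):
        return self.q(torch.cat([obs, chunk], dim=-1)).squeeze(-1)

def chunk_td_target(critic_target, policy, batch, gamma=0.99):
    """Discounted H-step return of the executed chunk plus a bootstrapped chunk value."""
    rewards = batch["rewards"]                             # shape (B, H), one reward per step
    H = rewards.shape[1]
    discounts = gamma ** torch.arange(H, dtype=rewards.dtype)
    chunk_return = (rewards * discounts).sum(-1)           # sum_{k<H} gamma^k r_k
    with torch.no_grad():
        next_chunk = policy.sample(batch["next_obs"])      # next chunk from the current policy
        bootstrap = critic_target(batch["next_obs"], next_chunk)
    return chunk_return + (gamma ** H) * (1.0 - batch["done"]) * bootstrap

def losses(critic, critic_target, policy, batch, beta=1.0):
    """Illustrative losses: TD error for the chunked critic, and a conservative
    policy loss that raises Q on on-policy chunks while regularizing toward the
    exact likelihood of the dataset chunks (beta is a hypothetical coefficient)."""
    target = chunk_td_target(critic_target, policy, batch)
    critic_loss = F.mse_loss(critic(batch["obs"], batch["chunk"]), target)
    new_chunks = policy.sample(batch["obs"])               # differentiable (reparameterized)
    policy_loss = (-critic(batch["obs"], new_chunks)
                   - beta * policy.log_prob(batch["chunk"], batch["obs"])).mean()
    return critic_loss, policy_loss
```

Read against the staged pipeline above, one natural use of these pieces is to warm-start the critic with the TD loss alone on the imitation data, then apply both losses during offline RL and again during online RL under the limited rollout budget; the exact split of losses across stages is our assumption, not a detail stated on this page.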
Videos play at 2x speed by default for convenience. For the scissors task, ✔︎ marks successful sub-task completion and ✖︎ marks failure.
Step 1: Imitation learning [✔︎] [✖︎]
Step 2: Offline RL [✔︎] [✖︎]
Step 3: Online RL [✔︎] [✔︎]
Step 1: Simulation teacher (PPO + domain randomization)
Step 2: Distilled policy (real)
Step 3: After critic warm-up
Step 4: After online RL
Starting from imitation learning, SOFT-FLOW improves grasping with offline RL and enables successful cutting after limited online fine-tuning. In the paper’s summary table, SOFT-FLOW reaches 70% grasping and 70% cutting success in the final setting.
Online fine-tuning refines a distilled sim policy into robust real-world rotation. The reported peak performance reaches 6.25 rotations/min after about 105 minutes of real-world data.
| Task | Stage | Summary |
|---|---|---|
| Scissors + tape | NF Imitation | 50% grasping, 10% cutting. |
| Scissors + tape | SOFT-FLOW (offline only) | 80% grasping, but cutting not yet solved. |
| Scissors + tape | SOFT-FLOW (full) | 70% grasping, 70% cutting after online RL. |
| Cube rotation | Distilled → online RL | Performance grows to a peak of 6.25 rotations/min with stable continuous turns. |
@article{yang2026softflow,
title = {Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows},
author = {Yang, Chenyu and Tarasov, Denis and Liconti, Davide and Zheng, Hehui and Katzschmann, Robert K.},
journal = {arXiv preprint arXiv:2602.09580},
year = {2026}
}