Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows

arXiv:2602.09580 (Feb 2026)

Chenyu Yang*, Denis Tarasov*, Davide Liconti, Hehui Zheng, Robert K. Katzschmann

*Equal contribution; Soft Robotics Lab, D-MAVT, ETH Zurich


SOFT-FLOW is a sample-efficient off-policy fine-tuning method for dexterous manipulation that combines a normalizing-flow policy (exact likelihoods for multimodal action chunks) with an action-chunked critic (value learning aligned with chunked execution). It enables stable offline-to-online adaptation on real robots under a limited interaction budget.

Abstract

Real-world fine-tuning of dexterous manipulation policies remains challenging due to limited real-world interaction budgets and highly multimodal action distributions. Diffusion-based policies, while expressive, do not permit conservative likelihood-based updates during fine-tuning because action probabilities are intractable. In contrast, conventional Gaussian policies collapse under multimodality, particularly when actions are executed in chunks, and standard per-step critics fail to align with chunked execution, leading to poor credit assignment.

We present SOFT-FLOW, a sample-efficient off-policy fine-tuning framework built on normalizing flows to address these challenges. The normalizing-flow policy yields exact likelihoods for multimodal action chunks, allowing conservative, stable policy updates through likelihood regularization and thereby improving sample efficiency. An action-chunked critic evaluates entire action sequences, aligning value estimation with the policy's temporal structure and improving long-horizon credit assignment. We evaluate SOFT-FLOW on two challenging real-world dexterous manipulation tasks: scissors retrieval & tape cutting, and in-hand palm-down cube rotation. On both tasks, SOFT-FLOW achieves stable, sample-efficient adaptation.

Overview

Key idea: keep the policy expressive (multimodal action chunks) while making it tractable (exact log-likelihoods), and align value learning with the same temporal abstraction.

  • Normalizing-flow actor models action chunks and provides exact likelihoods for conservative updates (see the sketch after this list).
  • Action-chunked critic evaluates entire sequences to improve long-horizon credit assignment.
  • Practical pipeline for limited on-robot data: imitation / distillation → critic warm-up → offline RL → online RL.
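
The conservative, likelihood-based update mentioned in the first bullet can be written as an ordinary actor loss with a log-likelihood regularizer. Below is a minimal PyTorch-style sketch under assumed interfaces: a flow_policy exposing log_prob and sample, a chunk-level critic, and a regularization weight bc_weight. These names are illustrative, not the released code.

import torch

def actor_loss(flow_policy, critic, obs, data_chunks, bc_weight=1.0):
    # Likelihood regularizer: exact log-probability of dataset action chunks,
    # keeping the policy close to demonstrated (or replayed) behavior.
    log_prob_data = flow_policy.log_prob(data_chunks, obs)        # [batch]
    # Value term: the chunk critic scores chunks sampled from the current policy.
    sampled_chunks = flow_policy.sample(obs)                      # [batch, chunk_dim]
    q_value = critic(obs, sampled_chunks)                         # [batch]
    # Maximize Q while staying likely under the data distribution.
    return -(q_value.mean() + bc_weight * log_prob_data.mean())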
Figure: SOFT-FLOW overview schematic. SOFT-FLOW combines a likelihood-based normalizing-flow policy with a chunk-aligned critic for stable off-policy fine-tuning.

Method at a Glance

Normalizing-Flow Policy

Models action chunks with an invertible transformation conditioned on observations, enabling exact log-likelihoods and expressive multimodal behavior.
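
To illustrate where the exact likelihood comes from, here is a minimal conditional flow over flattened action chunks in PyTorch. A single observation-conditioned affine transform is shown for brevity (on its own it is still unimodal); an expressive flow would stack several invertible layers such as coupling blocks, but the change-of-variables bookkeeping is the same. All class and argument names are illustrative, not the paper's architecture.

import torch
import torch.nn as nn

class ConditionalAffineFlow(nn.Module):
    def __init__(self, obs_dim, chunk_dim, hidden=256):
        super().__init__()
        # Predicts a per-dimension shift and log-scale from the observation.
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * chunk_dim),
        )
        self.chunk_dim = chunk_dim

    def log_prob(self, action_chunk, obs):
        shift, log_scale = self.net(obs).chunk(2, dim=-1)
        z = (action_chunk - shift) * torch.exp(-log_scale)   # inverse transform
        base = torch.distributions.Normal(0.0, 1.0)
        # Exact density: base log-prob plus log|det| of the inverse Jacobian.
        return base.log_prob(z).sum(-1) - log_scale.sum(-1)

    def sample(self, obs):
        shift, log_scale = self.net(obs).chunk(2, dim=-1)
        z = torch.randn(obs.shape[0], self.chunk_dim, device=obs.device)
        return shift + torch.exp(log_scale) * z               # forward transform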

Action-Chunked Critic

Estimates value over entire chunks, matching the control interface and improving credit assignment under long horizons.
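
A chunk-aligned critic changes what a TD target looks like: the bootstrap skips ahead by the chunk length H rather than a single environment step. A hedged sketch, with hypothetical critic_target and flow_policy interfaces matching the actor above:

import torch

def chunk_td_target(critic_target, flow_policy, rewards, next_obs, not_done, gamma=0.99):
    # rewards: [batch, H] per-step rewards collected while the H-step chunk executed.
    H = rewards.shape[1]
    discounts = gamma ** torch.arange(H, dtype=rewards.dtype, device=rewards.device)
    chunk_return = (rewards * discounts).sum(dim=1)           # discounted in-chunk return
    with torch.no_grad():
        next_chunk = flow_policy.sample(next_obs)             # next H-step action chunk
        bootstrap = critic_target(next_obs, next_chunk)
    # Bootstrap H steps ahead; not_done masks terminal transitions.
    return chunk_return + not_done * (gamma ** H) * bootstrap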

Offline → Online Fine-Tuning

Starts from imitation learning (or sim-to-real distillation), then warm-starts the critic, runs offline RL, and finally performs online RL with a limited rollout budget.
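
To make the ordering concrete, here is a hedged orchestration sketch of the four stages. Every callable is a placeholder standing in for the corresponding training step, not an API from the paper's codebase.

def finetune_pipeline(train_imitation, warm_up_critic, offline_rl,
                      collect_episode, off_policy_update,
                      demos, env, replay_buffer, online_budget):
    # 1) Imitation learning on demonstrations (or distillation from a sim teacher).
    policy = train_imitation(demos)
    # 2) Warm-start the chunk-level critic on the same offline data.
    critic = warm_up_critic(policy, demos)
    # 3) Offline RL: conservative, likelihood-regularized updates without new rollouts.
    policy, critic = offline_rl(policy, critic, demos)
    # 4) Online RL under a small real-robot rollout budget.
    for _ in range(online_budget):
        replay_buffer.extend(collect_episode(env, policy))
        policy, critic = off_policy_update(policy, critic, replay_buffer)
    return policy, critic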

Figure: Transformer-based normalizing-flow actor and chunk-level critic used in SOFT-FLOW.

Videos

Videos play at 2x speed by default. For the scissors task, ✔︎ marks successful sub-task completion and ✖︎ marks failure.

Real-World: Scissors Retrieval & Tape Cutting

Step 1: Imitation learning [✔︎] [✖︎]

Step 2: Offline RL [✔︎] [✖︎]

Step 3: Online RL [✔︎] [✔︎]


Real-World: In-hand Palm-down Cube Rotation

Step 1: Simulation teacher (PPO + domain randomization)

Step 2: Distilled policy (real)

Step 3: After critic warm-up

Step 4: After online RL

Results

Scissors Retrieval & Tape Cutting

Starting from imitation learning, SOFT-FLOW improves grasping with offline RL and enables successful cutting after limited online fine-tuning. In the paper's summary table, SOFT-FLOW reaches 70% grasping and 70% cutting success in the final setting.

Figure: Scissors task learning curve.

In-hand Palm-down Cube Rotation

Online fine-tuning refines a distilled sim policy into robust real-world rotation. The reported peak performance reaches 6.25 rotations/min after about 105 minutes of real-world data.

Figure: Cube rotation learning curve.
Task | Stage | Summary
Scissors + tape | NF imitation | 50% grasping, 10% cutting.
Scissors + tape | SOFT-FLOW (offline only) | 80% grasping; cutting not yet solved.
Scissors + tape | SOFT-FLOW (full) | 70% grasping, 70% cutting after online RL.
Cube rotation | Distilled → online RL | Grows to a peak of 6.25 rotations/min with stable continuous turns.

Citation

@article{yang2026softflow,
  title        = {Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows},
  author       = {Yang, Chenyu and Tarasov, Denis and Liconti, Davide and Zheng, Hehui and Katzschmann, Robert K.},
  journal      = {arXiv preprint arXiv:2602.09580},
  year         = {2026}
}