Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows

arXiv:2602.09580 (Feb 2026)

Chenyu Yang*, Denis Tarasov*, Davide Liconti, Hehui Zheng, Robert K. Katzschmann

*Equal contribution; Soft Robotics Lab, D-MAVT, ETH Zurich


SOFT-FLOW is a sample-efficient off-policy fine-tuning method for dexterous manipulation that combines a normalizing-flow policy (exact likelihoods for multimodal action chunks) with an action-chunked critic (value learning aligned with chunked execution). It enables stable offline-to-online adaptation on real robots under a limited interaction budget.

Abstract

Real-world fine-tuning of dexterous manipulation policies remains challenging due to limited real-world interaction budgets and highly multimodal action distributions. Diffusion-based policies, while expressive, do not permit conservative likelihood-based updates during fine-tuning because action probabilities are intractable. In contrast, conventional Gaussian policies collapse under multimodality, particularly when actions are executed in chunks, and standard per-step critics fail to align with chunked execution, leading to poor credit assignment.

We present SOFT-FLOW, a sample-efficient off-policy fine-tuning framework built on normalizing flows to address these challenges. The normalizing-flow policy yields exact likelihoods for multimodal action chunks, allowing conservative, stable policy updates through likelihood regularization and thereby improving sample efficiency. An action-chunked critic evaluates entire action sequences, aligning value estimation with the policy's temporal structure and improving long-horizon credit assignment. We evaluate SOFT-FLOW on two challenging real-world dexterous manipulation tasks: scissors retrieval & tape cutting, and in-hand palm-down cube rotation. On both tasks, SOFT-FLOW achieves stable, sample-efficient adaptation.

Overview

Key idea: keep the policy expressive (multimodal action chunks) while making it tractable (exact log-likelihoods), and align value learning with the same temporal abstraction.

  • Normalizing-flow actor models action chunks and provides exact likelihoods for conservative updates (see the sketch after this list).
  • Action-chunked critic evaluates entire sequences to improve long-horizon credit assignment.
  • Practical pipeline for limited on-robot data: imitation / distillation → critic warm-up → offline RL → online RL.
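
The conservative, likelihood-based update mentioned in the first bullet can be written as an ordinary actor loss with a log-likelihood regularizer. Below is a minimal PyTorch-style sketch under assumed interfaces: a flow_policy exposing log_prob and sample, a chunk-level critic, and a regularization weight bc_weight. These names are illustrative, not the released code.

import torch

def actor_loss(flow_policy, critic, obs, data_chunks, bc_weight=1.0):
    # Likelihood regularizer: exact log-probability of dataset action chunks,
    # keeping the policy close to demonstrated (or replayed) behavior.
    log_prob_data = flow_policy.log_prob(data_chunks, obs)        # [batch]
    # Value term: the chunk critic scores chunks sampled from the current policy.
    sampled_chunks = flow_policy.sample(obs)                      # [batch, chunk_dim]
    q_value = critic(obs, sampled_chunks)                         # [batch]
    # Maximize Q while staying likely under the data distribution.
    return -(q_value.mean() + bc_weight * log_prob_data.mean())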
Figure: SOFT-FLOW overview schematic. SOFT-FLOW combines a likelihood-based normalizing-flow policy with a chunk-aligned critic for stable off-policy fine-tuning.

Method at a Glance

Normalizing-Flow Policy

Models action chunks with an invertible transformation conditioned on observations, enabling exact log-likelihoods and expressive multimodal behavior.
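
To illustrate where the exact likelihood comes from, here is a minimal conditional flow over flattened action chunks in PyTorch. A single observation-conditioned affine transform is shown for brevity (on its own it is still unimodal); an expressive flow would stack several invertible layers such as coupling blocks, but the change-of-variables bookkeeping is the same. All class and argument names are illustrative, not the paper's architecture.

import torch
import torch.nn as nn

class ConditionalAffineFlow(nn.Module):
    def __init__(self, obs_dim, chunk_dim, hidden=256):
        super().__init__()
        # Predicts a per-dimension shift and log-scale from the observation.
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * chunk_dim),
        )
        self.chunk_dim = chunk_dim

    def log_prob(self, action_chunk, obs):
        shift, log_scale = self.net(obs).chunk(2, dim=-1)
        z = (action_chunk - shift) * torch.exp(-log_scale)   # inverse transform
        base = torch.distributions.Normal(0.0, 1.0)
        # Exact density: base log-prob plus log|det| of the inverse Jacobian.
        return base.log_prob(z).sum(-1) - log_scale.sum(-1)

    def sample(self, obs):
        shift, log_scale = self.net(obs).chunk(2, dim=-1)
        z = torch.randn(obs.shape[0], self.chunk_dim, device=obs.device)
        return shift + torch.exp(log_scale) * z               # forward transform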

Action-Chunked Critic

Estimates value over entire chunks, matching the control interface and improving credit assignment under long horizons.
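
A chunk-aligned critic changes what a TD target looks like: the bootstrap skips ahead by the chunk length H rather than a single environment step. A hedged sketch, with hypothetical critic_target and flow_policy interfaces matching the actor above:

import torch

def chunk_td_target(critic_target, flow_policy, rewards, next_obs, not_done, gamma=0.99):
    # rewards: [batch, H] per-step rewards collected while the H-step chunk executed.
    H = rewards.shape[1]
    discounts = gamma ** torch.arange(H, dtype=rewards.dtype, device=rewards.device)
    chunk_return = (rewards * discounts).sum(dim=1)           # discounted in-chunk return
    with torch.no_grad():
        next_chunk = flow_policy.sample(next_obs)             # next H-step action chunk
        bootstrap = critic_target(next_obs, next_chunk)
    # Bootstrap H steps ahead; not_done masks terminal transitions.
    return chunk_return + not_done * (gamma ** H) * bootstrap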

Offline → Online Fine-Tuning

Starts from imitation learning (or sim-to-real distillation), then warm-starts the critic, runs offline RL, and finally performs online RL with a limited rollout budget.
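
To make the ordering concrete, here is a hedged orchestration sketch of the four stages. Every callable is a placeholder standing in for the corresponding training step, not an API from the paper's codebase.

def finetune_pipeline(train_imitation, warm_up_critic, offline_rl,
                      collect_episode, off_policy_update,
                      demos, env, replay_buffer, online_budget):
    # 1) Imitation learning on demonstrations (or distillation from a sim teacher).
    policy = train_imitation(demos)
    # 2) Warm-start the chunk-level critic on the same offline data.
    critic = warm_up_critic(policy, demos)
    # 3) Offline RL: conservative, likelihood-regularized updates without new rollouts.
    policy, critic = offline_rl(policy, critic, demos)
    # 4) Online RL under a small real-robot rollout budget.
    for _ in range(online_budget):
        replay_buffer.extend(collect_episode(env, policy))
        policy, critic = off_policy_update(policy, critic, replay_buffer)
    return policy, critic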

Figure: Transformer-based normalizing-flow actor and chunk-level critic used in SOFT-FLOW.

Videos

Videos play at 2x speed by default. For the scissors task, ✔︎ marks successful sub-task completion and ✖︎ marks failure.

Real-World: Scissors Retrieval & Tape Cutting

Step 1: Imitation learning [✔︎] [✖︎]

Step 2: Offline RL [✔︎] [✖︎]

Step 3: Online RL [✔︎] [✔︎]


Real-World: In-hand Palm-down Cube Rotation

Step 1: Simulation teacher (PPO + domain randomization)

Step 2: Distilled policy (real)

Step 3: After critic warm-up

Step 4: After online RL

Results

Scissors Retrieval & Tape Cutting

Starting from imitation learning, SOFT-FLOW improves grasping with offline RL and enables successful cutting after limited online fine-tuning. In the paper's summary table, SOFT-FLOW reaches 70% grasping and 70% cutting success in the final setting.

Figure: Scissors task learning curve.

In-hand Palm-down Cube Rotation

Online fine-tuning refines a distilled sim policy into robust real-world rotation. The reported peak performance reaches 6.25 rotations/min after about 105 minutes of real-world data.

Figure: Cube rotation learning curve.
Task | Stage | Summary
Scissors + tape | NF imitation | 50% grasping, 10% cutting.
Scissors + tape | SOFT-FLOW (offline only) | 80% grasping; cutting not yet solved.
Scissors + tape | SOFT-FLOW (full) | 70% grasping, 70% cutting after online RL.
Cube rotation | Distilled → online RL | Grows to a peak of 6.25 rotations/min with stable continuous turns.

Citation

@article{yang2026softflow,
  title        = {Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows},
  author       = {Yang, Chenyu and Tarasov, Denis and Liconti, Davide and Zheng, Hehui and Katzschmann, Robert K.},
  journal      = {arXiv preprint arXiv:2602.09580},
  year         = {2026}
}