[Project page] [Paper] [Colab (PushT)]
Shuang Li, Yihuai Gao, Dorsa Sadigh, Shuran Song
Stanford University
We provide a Colab notebook for UVA on PushT using the pretrained checkpoint.
Install the conda environment:
$ mamba env create -f conda_environment.yaml
Download the pretrained checkpoints from the following links and put them in the `checkpoints/` folder.
CUDA_VISIBLE_DEVICES=0 python eval_sim.py --checkpoint checkpoints/pusht.ckpt --output_dir checkpoints/pusht
CUDA_VISIBLE_DEVICES=0 python eval_sim.py --checkpoint checkpoints/pusht_multitask.ckpt --output_dir checkpoints/pusht_multitask
CUDA_VISIBLE_DEVICES=0 python eval_sim.py --checkpoint checkpoints/libero10.ckpt --output_dir checkpoints/libero10
We start from a pretrained VAE model and a pretrained image generation model MAR. Run the following command to download the pretrained models.
python unified_video_action/utils/download.py
We found that two-stage training works better than training on both video and action tasks directly. In the first stage, the model is trained on video generation
task, and in the second stage, it is fine-tuned on both video and action tasks.
To train the UVA model for the video generation task, we set `predict_action=False` and `selected_training_mode=video_model`. We did not incorporate additional video data during training. We believe that pretraining the model on large-scale web video datasets could substantially improve its generalization capabilities, and we plan to explore this approach in future work.
UVA's performance may currently be constrained by the model size. To evaluate it on larger or more complex real-world tasks, please consider using a larger UVA model.
Training the video and action model takes longer than training a policy model alone. We recommend using at least 4 GPUs for training, and adjusting `--num_processes` in the commands below to match the number of available GPUs. To train the UVA model on the PushT dataset, run the following command:
accelerate launch --num_processes=8 train.py \
--config-dir=. \
--config-name=uva_pusht.yaml \
model.policy.action_model_params.predict_action=False \
model.policy.selected_training_mode=video_model \
model.policy.optimizer.learning_rate=1e-4 \
logging.project=uva \
hydra.run.dir="checkpoints/uva_pusht_video_model"
To train the UVA model on the joint video and action tasks, we set `predict_action=True` and remove `selected_training_mode=video_model`.
To train the UVA model on the PushT dataset for the joint video and action tasks, run the following command:
accelerate launch --num_processes=8 train.py \
--config-dir=. \
--config-name=uva_pusht.yaml \
model.policy.autoregressive_model_params.pretrained_model_path=checkpoints/uva_pusht_video_model/checkpoints/latest.ckpt \
model.policy.action_model_params.predict_action=True \
model.policy.optimizer.learning_rate=1e-4 \
logging.project=uva \
hydra.run.dir="uva_pusht_video_act_model"
Be careful when conducting real robot experiments. The robot moves quickly and can be dangerous.
Download the pretrained checkpoints from the following links and put them in the `checkpoints/` folder.
- Checkpoint trained on UMI Multitask. This checkpoint is trained on 500 samples from each of the three datasets: Cup, Towel, and Mouse.
Please follow the instructions in arx5-sdk to set up the ARX X5 robot controller. Other robot arm models can be used by modifying the arguments when running the controller.
To set up the UMI-related hardware (camera, gripper, etc.), please refer to the UMI-on-Legs codebase and check out the 3D printing and assembly instructions.
We recommend first deploying the umi-arx codebase to test the hardware setup. For UVA deployment, please check out the `uva` branch, which includes updates with more safety checks.
Instead of running `detached_policy_inference.py` in the UMI codebase, please run `sh scripts/eval/eval_real.sh` to serve the UVA model. You can modify the parameters in `eval_real.sh` to use different checkpoints and TCP ports. The rest of the deployment process is the same as in the original UMI codebase.
To train the video generation model on the UMI multi-task dataset, run the following command:
accelerate launch --num_processes=8 train.py \
--config-dir=. \
--config-name=uva_umi_multi.yaml \
model.policy.action_model_params.predict_action=False \
model.policy.selected_training_mode=video_model \
model.policy.different_history_freq=True \
model.policy.optimizer.learning_rate=1e-4 \
task.dataset.dataset_root_dir=${dataset_path} \
logging.project=uva \
hydra.run.dir="checkpoints/uva_umi_multitask_video"
For all real-world experiments, we set `different_history_freq=True` to use distinct history frequencies during training. Since the control frequency on the real robot may differ from the data frequency in the collected dataset, training with different history frequencies helps the model perform better at test time.
To train the UVA model on the UMI multi-task dataset, run the following command:
accelerate launch --num_processes=8 train.py \
--config-dir=. \
--config-name=uva_umi_multi.yaml \
model.policy.autoregressive_model_params.pretrained_model_path=checkpoints/uva_umi_multitask_video/checkpoints/latest.ckpt \
model.policy.action_model_params.predict_action=True \
model.policy.use_proprioception=True \
model.policy.predict_proprioception=True \
model.policy.shift_action=False \
model.policy.different_history_freq=True \
model.policy.optimizer.learning_rate=1e-4 \
task.dataset.dataset_root_dir=${dataset_path} \
task.dataset.used_episode_indices_file=${indices_file} \
logging.project=uva \
hydra.run.dir="uva_umi_multitask_video_action"
All datasets are publicly available except for PushT-M. We extend the PushT task by incorporating various target "T" positions and have collected a new dataset containing 247 demonstrations. Download the datasets and put them in the `data` folder.
- PushT from Diffusion Policy.
- PushT-M from us. Download the file, extract its contents, and place them in the `data` folder.
- Libero10 from LIBERO. We replayed the data to extract the absolute actions and appended language tokens from CLIP using `AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")` (see the short tokenizer sketch after this list). Download both the original hdf5 file and the converted dataset, then extract their contents and place them in the `data` folder.
- Toolhang from Diffusion Policy. We use the file `ph/image_abs.hdf5`; place it at `data/tool_hang/ph/image_abs.hdf5`.
- UMI Cup Arrangement from UMI.
- UMI Towel Folding from Data Scaling Laws in Imitation Learning for Robotic Manipulation.
- UMI Mouse Arrangement from Data Scaling Laws in Imitation Learning for Robotic Manipulation.
- More UMI datasets for large-scale training. Please run `process_dataset/download_dataset.py` to download and process them.
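For reference, here is a minimal sketch of how language tokens can be produced with the CLIP tokenizer mentioned above. The task string, padding strategy, and context length are illustrative assumptions, not necessarily the exact settings used for the Libero10 conversion.

```python
from transformers import AutoTokenizer

# Same tokenizer as referenced above for the Libero10 language annotations.
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical task description; padding/max_length are illustrative choices.
tokens = tokenizer(
    "put the black bowl in the bottom drawer of the cabinet",
    padding="max_length",
    max_length=77,            # CLIP's standard context length
    truncation=True,
    return_tensors="np",
)
print(tokens["input_ids"].shape)  # (1, 77)
```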
We modified the UMI dataloader to support multiple UMI datasets. We also optimized the memory usage and data loading speed, especially when running on a SLURM system for large-scale training.
The pipeline for processing the datasets is as follows; see `process_dataset/download_dataset.py` for more details (a simplified sketch follows this list):
- Download the dataset (`.zarr.zip` format) from the corresponding URLs. You can comment out the lines you don't need.
- Copy the dataset into shared memory (`/dev/shm`) and decompress it to a `.zarr` folder. The script processes all the selected datasets in parallel, so please make sure the server has enough available memory (at least 500GB). If not, you can run the `process_dataset` function (in `download_dataset.py`) inside a `for` loop.
- Compress the dataset using `lz4` for faster compression and decompression, then copy the `.zarr.tar.lz4` files back to your `data_dir`.
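The sketch below mirrors the second and third steps for a single dataset, using the `lz4` Python bindings and hypothetical file names for illustration; the actual script derives the paths from the dataset list and parallelizes across datasets.

```python
import shutil, tarfile, zipfile
import lz4.frame  # pip install lz4

# Hypothetical paths; the real script derives these from the dataset list.
src_zip = "data_dir/umi_cup.zarr.zip"
shm_zarr = "/dev/shm/umi_cup.zarr"
out_lz4 = "data_dir/umi_cup.zarr.tar.lz4"

# Step 2: decompress the .zarr.zip into shared memory.
with zipfile.ZipFile(src_zip) as zf:
    zf.extractall(shm_zarr)

# Step 3: re-compress with lz4 (fast to decompress at training time)
# and copy the archive back to data_dir.
with lz4.frame.open(out_lz4, "wb") as f, tarfile.open(fileobj=f, mode="w|") as tar:
    tar.add(shm_zarr, arcname="umi_cup.zarr")

# Free the shared memory once the archive is written.
shutil.rmtree(shm_zarr)
```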
During training, you can run `process_dataset/extract_umi_data.py` to extract multiple datasets into shared memory (`/dev/shm`) or onto a local disk on a SLURM system. When loading data batches, the dataloader `unified_video_action/dataset/umi_multi_dataset.py` randomly chooses a UMI dataset and fetches the data from shared memory in a "lazy" manner, i.e., it only copies the data into program memory when needed and releases it afterwards. Therefore, during training there is no duplicated data in memory even if you are training on multiple GPUs.
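A minimal sketch of this "lazy" access pattern, assuming an extracted `.zarr` store in shared memory (the paths and array keys are illustrative, not the dataloader's actual layout):

```python
import numpy as np
import zarr

# Opening the store only reads metadata, not the array contents.
store = zarr.open("/dev/shm/umi_cup.zarr", mode="r")

def fetch_sample(start: int, length: int) -> np.ndarray:
    # Only the requested slice is copied into program memory; the rest of
    # the dataset stays in /dev/shm and is shared across GPU workers.
    frames = np.asarray(store["data/camera0_rgb"][start:start + length])
    return frames

batch = fetch_sample(0, 16)   # the copy is released when `batch` goes out of scope
```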
Note that we do not use mirrors in the deployment setup. Therefore, we mask out the mirror regions in every dataset whose gripper has mirrors. You can modify the `mask_mirror` option in `umi_multi.yaml` to specify this individually for each dataset.
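Conceptually, masking a mirror region just blanks out the corresponding pixels in each frame; a toy sketch is shown below (the region coordinates and image size are made up, and the real mask geometry is defined in the codebase):

```python
import numpy as np

def mask_mirror_region(frame: np.ndarray, box=(0, 0, 224, 40)) -> np.ndarray:
    """Zero out a rectangular mirror region (top, left, height, width) in an HxWxC frame."""
    top, left, h, w = box
    out = frame.copy()
    out[top:top + h, left:left + w] = 0
    return out

frame = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
masked = mask_mirror_region(frame)   # left strip of the image blanked out
```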
For multi-node training, please refer to `scripts/training/train_uva_umi_multi_node.sh` if you are using SLURM.
To add your own task, you need to implement a dataset, an environment runner, and a task configuration file. For guidance, please refer to the following examples from existing tasks:
unified_video_action/config/task/umi_multi.yaml
unified_video_action/dataset/umi_multi_dataset.py
Make sure that `shape_meta` corresponds to the input and output shapes of your task. Make sure `env_runner._target_` and `dataset._target_` point to the new classes you have added. When training, add `task=<your_task_name>` to `train.py`'s arguments.
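As a rough skeleton, a new dataset might look like the sketch below, assuming the Diffusion Policy-style conventions this codebase inherits; the method names and exact base class may differ, so treat it as a starting point rather than the actual API.

```python
import torch
from torch.utils.data import Dataset

class MyTaskDataset(Dataset):
    """Hypothetical dataset whose outputs must match `shape_meta` in the task yaml."""

    def __init__(self, dataset_root_dir: str, horizon: int = 16):
        self.horizon = horizon
        self.episodes = ...  # load or memory-map your demonstrations here

    def __len__(self) -> int:
        return 1000  # number of sampled windows

    def __getitem__(self, idx: int) -> dict:
        # Shapes must agree with shape_meta, e.g. image: (T, 3, H, W), action: (T, action_dim).
        obs = {"image": torch.zeros(self.horizon, 3, 224, 224)}
        action = torch.zeros(self.horizon, 7)
        return {"obs": obs, "action": action}
```

Point `dataset._target_` in your task configuration at this class and replace the placeholders with your real data loading.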
To add your own model, you need to implement a configuration file, a workspace, and a policy file. For guidance, please refer to the following examples from existing models:
unified_video_action/config/model/uva.yaml
unified_video_action/workspace/train_unified_video_action_workspace.py
unified_video_action/policy/unified_video_action_policy.py
Are there any tips for training UVA?
We found that two-stage training works better than training on both video and action tasks simultaneously. In the first stage, the model is trained on video generation, and in the second stage, it is fine-tuned on both video and action tasks.
How long does it take to train UVA?
Training time depends on both the size of the dataset and the complexity of the task. For the UMI task, we sampled 500 trajectories from each of the three datasets and trained the model using 8 H100 GPUs. The video generation task was trained for 2 days, and the joint video and action generation required an additional 2 days.
What's the next step for UVA?
We believe there is still significant potential in UVA that remains unexplored, and we leave this for future work.
Additional video data: UVA can leverage large amounts of actionless video data, which could provide valuable additional supervision. We plan to pretrain UVA on additional video data in the future.
Multi-modality: UVA can be naturally extended to predict modalities beyond video and action, such as sound and force, by incorporating additional diffusion heads, offering a more comprehensive and versatile framework.
Better architecture: The model architecture could be further improved by replacing the diffusion heads with flow matching.
Larger model size: UVA's performance may currently be limited by the model size. We plan to explore larger models in the future.
This repository is provided under the MIT license. For more details, please refer to LICENSE.
- Much of the code is inherited from Diffusion Policy and MAR.
- For real-world UMI experiments, we use the public datasets collected by UMI and Data Scaling Laws in Imitation Learning for Robotic Manipulation.