
ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data

[Project page] [Paper] [Dataset]

Zeyi Liu1, Cheng Chi1,2, Eric Cousineau3, Naveen Kuppuswamy3, Benjamin Burchfiel3, Shuran Song1,2

1Stanford University, 2Columbia University, 3Toyota Research Institute

Preparation

The hardware and software are built on top of the Universal Manipulation Interface (UMI). Please review the UMI paper and the UMI GitHub repository beforehand to learn about the context.

Software Installation

Please refer to the UMI repository for installing docker and system-level dependencies.

We provide a new conda_environment.yaml with additional dependencies. To create a conda environment named maniwav:

$ cd maniwav
$ mamba env create -f conda_environment.yaml
$ conda activate maniwav
(maniwav)$

If you see a PortAudio not found error when installing the sounddevice package, run

sudo apt-get install libportaudio2

Hardware Installation

To add a contact microphone to the UMI gripper:

For the rest of the device, please see the UMI hardware guide for reference.

Dataset

We release the in-the-wild bagel flipping dataset and a policy checkpoint at https://real.stanford.edu/maniwav/, which you can directly use for training or for evaluation on your own robot. We encourage you to organize all of your datasets under the data/ folder in the repository root.

Download the zarr dataset:

wget https://real.stanford.edu/maniwav/data/bagel_in_wild/replay_buffer.zarr.zip

Download the original demo videos with SLAM results (can be skipped if you don't need the raw mp4 videos):

wget --recursive --no-parent --no-host-directories --cut-dirs=2 --relative https://real.stanford.edu/maniwav/data/bagel_in_wild/demos/

If you have your own data (namely mp4 videos with sound, and actions extracted from SLAM following the same procedure as UMI), you can run the following script to create a replay buffer with audio data. Check the demos folder for the expected file structure.

python scripts_slam_pipeline/07_generate_replay_buffer.py <your-dataset-folder-path> -o <your-dataset-folder-path>/replay_buffer.zarr.zip -ms
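
To sanity check a replay buffer (either the downloaded one or one you generated yourself), you can list its arrays with zarr, which is already included in the conda environment. The snippet below is a minimal sketch; the array names it prints depend entirely on your dataset.

import zarr

# Open the zipped replay buffer read-only.
store = zarr.ZipStore("data/replay_buffer.zarr.zip", mode="r")
root = zarr.open_group(store, mode="r")

# Print every array in the hierarchy with its shape and dtype
# (camera frames, audio, robot poses, episode metadata, ...).
def show(name, obj):
    if isinstance(obj, zarr.Array):
        print(name, obj.shape, obj.dtype)

root.visititems(show)
store.close()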

Follow this link to download the ESC-50 dataset for noise augmentation and place the folder under data/: https://github.com/karolpiczak/ESC-50#download. For robot noises, we provide an example under data/robot-noise-calib for a UR5 robot, but you are encouraged to record the noises of your specific robot.
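
For context, the core of noise augmentation is mixing background noise (an ESC-50 clip or recorded robot noise) into the gripper audio at a chosen signal-to-noise ratio. The function below is an illustrative numpy sketch of that operation, not the exact augmentation code used in this repo; it assumes both waveforms are mono float arrays at the same sample rate.

import numpy as np

def mix_at_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into signal so the result has roughly the given SNR in dB."""
    # Tile or crop the noise to match the signal length.
    reps = int(np.ceil(len(signal) / len(noise)))
    noise = np.tile(noise, reps)[: len(signal)]
    sig_power = np.mean(signal ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(sig_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise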

Training

Tested on NVIDIA GeForce RTX 3090 24 GB.

Example of multi-GPU training with accelerate:

CUDA_VISIBLE_DEVICES=<GPU-device-ids> HYDRA_FULL_ERROR=1 accelerate launch --multi_gpu --num_processes <ngpus> --main_process_port 29501 train.py --config-name train_diffusion_unet_maniwav_workspace task.dataset_path=data/replay_buffer.zarr.zip training.num_epochs=60 dataloader.batch_size=64 val_dataloader.batch_size=64

Single-GPU training example:

python train.py --config-name train_diffusion_unet_maniwav_workspace task.dataset_path=data/replay_buffer.zarr.zip training.num_epochs=60 dataloader.batch_size=64 val_dataloader.batch_size=64 training.device=<GPU-device-id>

Real World Evaluation

Congratulations🎉! At this point, you have a robot manipulation policy ready to be deployed in the real world. Tested on Ubuntu 22.04. Please review the UMI documentation on real-world evaluation first.

Example to run the evaluation script:

python scripts_real/eval_real_umi.py --audio_device_id 0 --input checkpoints/bagel_in_wild/in-the-wild-latest.ckpt --output outputs/ours_itw --camera_reorder 0 -md 120 -si 4
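
If you want to peek inside the released checkpoint before deploying it, the sketch below loads it and prints its top-level contents. It assumes the UMI / diffusion-policy convention of a torch payload pickled with dill that stores the hydra config and model state dicts; treat the key names as assumptions and adjust if the layout differs.

import dill
import torch

ckpt_path = "checkpoints/bagel_in_wild/in-the-wild-latest.ckpt"
with open(ckpt_path, "rb") as f:
    payload = torch.load(f, map_location="cpu", pickle_module=dill)

# Top-level keys typically include the training config and the model state dicts.
print("payload keys:", list(payload.keys()))
if "cfg" in payload:
    print("config keys:", list(payload["cfg"].keys()))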

To check audio device ids, run:

python -m sounddevice
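
To verify that the contact microphone signal actually reaches the desktop through the capture card, a short test recording with sounddevice is enough. The device id, sample rate, and duration below are placeholders; use the id reported by the command above.

import numpy as np
import sounddevice as sd

DEVICE_ID = 0        # id reported by `python -m sounddevice`
SAMPLE_RATE = 48000  # Hz; adjust to what your capture card reports
DURATION = 3.0       # seconds

# Record a short mono clip from the capture card and check that it is not silent.
audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE,
               channels=1, device=DEVICE_ID, dtype="float32")
sd.wait()  # block until the recording is finished
print("recorded", audio.shape, "peak amplitude", float(np.abs(audio).max()))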

Refer to the UMI repository for setting up the robot and camera. For the microphone, place the contact microphone inside the gripper holder, wrap it with grip tape, and connect the microphone with a cable to the external mic port of the GoPro Media Mod. The game capture card streams both vision and audio to the desktop, and we provide code that reads and records the audio data automatically. Check the files under umi/real_world.

NOTE: Remember to calibrate the audio latency following Appendix A.1 of the paper, and update the corresponding latency value in the code.
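
As a generic way to measure such a fixed offset (not necessarily the exact procedure from Appendix A.1), you can record the same impulsive event, e.g. a sharp tap, on a reference channel and on the gripper microphone, and find the lag that maximizes their cross-correlation:

import numpy as np

def estimate_latency(reference: np.ndarray, delayed: np.ndarray, sample_rate: int) -> float:
    """Return the delay (in seconds) of `delayed` relative to `reference`."""
    corr = np.correlate(delayed, reference, mode="full")
    lag = np.argmax(corr) - (len(reference) - 1)
    return lag / sample_rate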

Citation

If you find this codebase useful, feel free to cite our paper:

@article{liu2024maniwav,
    title={ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data},
    author={Liu, Zeyi and Chi, Cheng and Cousineau, Eric and Kuppuswamy, Naveen and Burchfiel, Benjamin and Song, Shuran},
    journal={arXiv preprint arXiv:2406.19464},
    year={2024}
}

Contact

If you have questions about the codebase, don't hesitate to reach out to Zeyi. If you open a GitHub issue, please also shoot me an email with a link to the issue.

License

This repository is released under the MIT license. See LICENSE for additional details.

Acknowledgements

  • Cheng Chi and Huy Ha for early discussions on the hardware design and codebase.
  • Toyota Research Institute (TRI) for generously providing the UR5 robot and advice on using UMI.
