
ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data

[Project page] [Paper] [Dataset]

Zeyi Liu1, Cheng Chi1,2, Eric Cousineau3, Naveen Kuppuswamy3, Benjamin Burchfiel3, Shuran Song1,2

1Stanford University, 2Columbia University, 3Toyota Research Institute

Preparation

The hardware and software are built on top of the Universal Manipulation Interface (UMI). Please review the UMI paper and the UMI GitHub repository beforehand to learn about the context.

Software Installation

Please refer to the UMI repository for installing docker and system-level dependencies.

We provide a new conda_environment.yaml with additional dependencies. To create a conda environment named maniwav:

$ cd maniwav
$ mamba env create -f conda_environment.yaml
$ conda activate maniwav
(maniwav)$

If you see a PortAudio not found error when installing the sounddevice package, run

sudo apt-get install libportaudio2

Hardware Installation

To add a contact microphone to the UMI gripper:

For the rest of the device, please see the UMI hardware guide for reference.

Dataset

We release the in-the-wild bagel flipping dataset and a policy checkpoint at https://real.stanford.edu/maniwav/, which you can directly use for training or for evaluation on your own robot. We encourage you to organize all of your datasets under the data/ folder in the repository root.

Download the zarr dataset:

wget https://real.stanford.edu/maniwav/data/bagel_in_wild/replay_buffer.zarr.zip

Download the original demo videos with SLAM results (can be skipped if you don't need the raw mp4 videos):

wget --recursive --no-parent --no-host-directories --cut-dirs=2 --relative https://real.stanford.edu/maniwav/data/bagel_in_wild/demos/

If you have your own data (namely mp4 videos with sound, and actions extracted from SLAM following the same procedure as UMI), you can run the following script to create a replay buffer with audio data. Check the demos folder for the expected file structure.

python scripts_slam_pipeline/07_generate_replay_buffer.py <your-dataset-folder-path> -o <your-dataset-folder-path>/replay_buffer.zarr.zip -ms
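
To sanity check a replay buffer (either the downloaded one or one you generated yourself), you can list its arrays with zarr, which is already included in the conda environment. The snippet below is a minimal sketch; the array names it prints depend entirely on your dataset.

import zarr

# Open the zipped replay buffer read-only.
store = zarr.ZipStore("data/replay_buffer.zarr.zip", mode="r")
root = zarr.open_group(store, mode="r")

# Print every array in the hierarchy with its shape and dtype
# (camera frames, audio, robot poses, episode metadata, ...).
def show(name, obj):
    if isinstance(obj, zarr.Array):
        print(name, obj.shape, obj.dtype)

root.visititems(show)
store.close()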

Follow this link to download the ESC-50 dataset for noise augmentation and place the folder under data/: https://github.com/karolpiczak/ESC-50#download. For robot noises, we provide an example under data/robot-noise-calib for a UR5 robot, but you are encouraged to record the noises of your specific robot.
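
For context, the core of noise augmentation is mixing background noise (an ESC-50 clip or recorded robot noise) into the gripper audio at a chosen signal-to-noise ratio. The function below is an illustrative numpy sketch of that operation, not the exact augmentation code used in this repo; it assumes both waveforms are mono float arrays at the same sample rate.

import numpy as np

def mix_at_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into signal so the result has roughly the given SNR in dB."""
    # Tile or crop the noise to match the signal length.
    reps = int(np.ceil(len(signal) / len(noise)))
    noise = np.tile(noise, reps)[: len(signal)]
    sig_power = np.mean(signal ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(sig_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise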

Training

Tested on NVIDIA GeForce RTX 3090 24 GB.

Example of multi-GPU training with accelerate:

CUDA_VISIBLE_DEVICES=<GPU-device-ids> HYDRA_FULL_ERROR=1 accelerate launch --multi_gpu --num_processes <ngpus> --main_process_port 29501 train.py --config-name train_diffusion_unet_maniwav_workspace task.dataset_path=data/replay_buffer.zarr.zip training.num_epochs=60 dataloader.batch_size=64 val_dataloader.batch_size=64

Single-GPU training example:

python train.py --config-name train_diffusion_unet_maniwav_workspace task.dataset_path=data/replay_buffer.zarr.zip training.num_epochs=60 dataloader.batch_size=64 val_dataloader.batch_size=64 training.device=<GPU-device-id>

Real World Evaluation

Congratulations🎉! At this point, you have a robot manipulation policy ready to be deployed in the real world. Tested on Ubuntu 22.04. Please review the UMI documentation on real-world evaluation first.

Example to run the evaluation script:

python scripts_real/eval_real_umi.py --audio_device_id 0 --input checkpoints/bagel_in_wild/in-the-wild-latest.ckpt --output outputs/ours_itw --camera_reorder 0 -md 120 -si 4
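
If you want to peek inside the released checkpoint before deploying it, the sketch below loads it and prints its top-level contents. It assumes the UMI / diffusion-policy convention of a torch payload pickled with dill that stores the hydra config and model state dicts; treat the key names as assumptions and adjust if the layout differs.

import dill
import torch

ckpt_path = "checkpoints/bagel_in_wild/in-the-wild-latest.ckpt"
with open(ckpt_path, "rb") as f:
    payload = torch.load(f, map_location="cpu", pickle_module=dill)

# Top-level keys typically include the training config and the model state dicts.
print("payload keys:", list(payload.keys()))
if "cfg" in payload:
    print("config keys:", list(payload["cfg"].keys()))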

To check audio device ids, run:

python -m sounddevice
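
To verify that the contact microphone signal actually reaches the desktop through the capture card, a short test recording with sounddevice is enough. The device id, sample rate, and duration below are placeholders; use the id reported by the command above.

import numpy as np
import sounddevice as sd

DEVICE_ID = 0        # id reported by `python -m sounddevice`
SAMPLE_RATE = 48000  # Hz; adjust to what your capture card reports
DURATION = 3.0       # seconds

# Record a short mono clip from the capture card and check that it is not silent.
audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE,
               channels=1, device=DEVICE_ID, dtype="float32")
sd.wait()  # block until the recording is finished
print("recorded", audio.shape, "peak amplitude", float(np.abs(audio).max()))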

Refer to the UMI repository for setting up the robot and camera. For the microphone, place the contact microphone inside the gripper holder, wrap it with grip tape, and connect the microphone with a cable to the external mic port of the GoPro Media Mod. The game capture card streams both vision and audio to the desktop, and we provide code that reads and records the audio data automatically. Check the files under umi/real_world.

NOTE: Remember to calibrate the audio latency following Appendix A.1 of the paper, and update the corresponding latency value in the code.
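
As a generic way to measure such a fixed offset (not necessarily the exact procedure from Appendix A.1), you can record the same impulsive event, e.g. a sharp tap, on a reference channel and on the gripper microphone, and find the lag that maximizes their cross-correlation:

import numpy as np

def estimate_latency(reference: np.ndarray, delayed: np.ndarray, sample_rate: int) -> float:
    """Return the delay (in seconds) of `delayed` relative to `reference`."""
    corr = np.correlate(delayed, reference, mode="full")
    lag = np.argmax(corr) - (len(reference) - 1)
    return lag / sample_rate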

Citation

If you find this codebase useful, feel free to cite our paper:

@article{liu2024maniwav,
    title={ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data},
    author={Liu, Zeyi and Chi, Cheng and Cousineau, Eric and Kuppuswamy, Naveen and Burchfiel, Benjamin and Song, Shuran},
    journal={arXiv preprint arXiv:2406.19464},
    year={2024}
}

Contact

If you have questions about the codebase, don't hesitate to reach out to Zeyi. If you open a GitHub issue, please also shoot me an email with a link to the issue.

License

This repository is released under the MIT license. See LICENSE for additional details.

Acknowledgements

  • Cheng Chi and Huy Ha for early discussions on the hardware design and codebase.
  • Toyota Research Institute (TRI) for generously providing the UR5 robot and advice on using UMI.
