Inspired by Model Cards for Model Reporting (Mitchell et al.) and Lessons from Archives (Jo & Gebru), we’re providing some accompanying information about the VIMA model.
VIMA (VisuoMotor Attention) is a novel Transformer agent that ingests multimodal prompts and outputs robot arm control actions autoregressively. VIMA is developed primarily by researchers at Stanford and NVIDIA.
October 2022
The VIMA model consists of a pretrained T5 model as the prompt encoder, several tokenizers to process multimodal inputs, and a causal decoder that autoregressively predicts actions given the prompt and interaction history.
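To make this control flow concrete, below is a minimal, hypothetical PyTorch sketch of a prompt-conditioned causal decoder. The module and variable names (`TinyVIMALikePolicy`, `prompt_encoder`, `action_head`, etc.) are illustrative stand-ins and do not correspond to the released implementation.

```python
# Toy sketch of the high-level flow: encode the prompt once, then causally
# decode actions conditioned on the prompt and the interaction history.
import torch
import torch.nn as nn


class TinyVIMALikePolicy(nn.Module):
    """Stand-in for a prompt encoder + causal action decoder (not the real model)."""

    def __init__(self, d_model: int = 64, n_actions: int = 8):
        super().__init__()
        # Stand-in for the pretrained T5 prompt encoder.
        self.prompt_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Causal decoder that cross-attends to the encoded prompt.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, prompt_tokens: torch.Tensor, history_tokens: torch.Tensor):
        # Encode the multimodal prompt once; every decoding step conditions on it.
        prompt_memory = self.prompt_encoder(prompt_tokens)
        # Causal mask: each step attends only to earlier interaction history.
        T = history_tokens.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T)
        hidden = self.decoder(history_tokens, prompt_memory, tgt_mask=causal_mask)
        # Predict the next action from the latest history token.
        return self.action_head(hidden[:, -1])


# Usage: batch of 1, prompt of 5 tokens, interaction history of 3 tokens.
policy = TinyVIMALikePolicy()
action_logits = policy(torch.randn(1, 5, 64), torch.randn(1, 3, 64))
print(action_logits.shape)  # torch.Size([1, 8])
```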
We release 7 checkpoints covering a spectrum of model capacities, from 2M to 200M parameters.
The model is intended to be used alongside VIMA-Bench to study general robot manipulation with multimodal prompts.
The primary intended users of these models are AI researchers working on robotics, multimodal learning, embodied agents, and foundation models.
The models were trained with data generated by oracles implemented in VIMA-Bench. The dataset includes 650K successful trajectories for behavior cloning: 600K trajectories are used for training, and the remaining 50K are held out for validation.
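For concreteness, the 600K/50K partition amounts to a split like the following; the trajectory IDs here are synthetic placeholders, not how the released dataset is actually indexed:

```python
# Hypothetical sketch of the train/validation split described above.
import random

all_traj_ids = list(range(650_000))  # placeholder IDs for the 650K trajectories
random.seed(0)
random.shuffle(all_traj_ids)

train_ids, val_ids = all_traj_ids[:600_000], all_traj_ids[600_000:]
assert len(train_ids) == 600_000 and len(val_ids) == 50_000
```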
We quantify the performance of trained models using task success percentage aggregated over multiple tasks. We evaluate models on the task suite from VIMA-Bench and follow its proposed evaluation protocol. See our paper for more details.
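The aggregate metric is the mean of per-task success percentages. A small illustrative computation follows; the task names and success counts are made up for illustration, not results from the paper:

```python
# Illustrative aggregation of per-task success rates into a single score.
success_counts = {"task_a": 87, "task_b": 91, "task_c": 78}  # successes per task (fake numbers)
episodes_per_task = 100

per_task_success = {
    task: 100.0 * n_success / episodes_per_task
    for task, n_success in success_counts.items()
}
# Aggregate score: unweighted mean of per-task success percentages.
aggregate = sum(per_task_success.values()) / len(per_task_success)
print(per_task_success, f"aggregate = {aggregate:.1f}%")
```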
Our provided model checkpoints are pre-trained on VIMA-Bench and may not directly generalize to other simulators or the real world. Limitations are further discussed in the paper.
Our paper is posted on arXiv. If you find our work useful, please consider citing us!
@article{jiang2022vima,
  title   = {VIMA: General Robot Manipulation with Multimodal Prompts},
  author  = {Yunfan Jiang and Agrim Gupta and Zichen Zhang and Guanzhi Wang and Yongqiang Dou and Yanjun Chen and Li Fei-Fei and Anima Anandkumar and Yuke Zhu and Linxi Fan},
  year    = {2022},
  journal = {arXiv preprint arXiv:2210.03094}
}