Cosmos Autoregressive-based World Foundation Models


This page details the steps for using the Cosmos autoregressive-based world foundation models.

Getting Started

Set Up Docker Environment

Follow our Installation Guide to set up the Docker environment. All commands on this page should be run inside the Docker container.

Download Checkpoints

  1. Generate a Hugging Face access token. Set the access token to 'Read' permission (default is 'Fine-grained').

  2. Log in to Hugging Face with the access token:

huggingface-cli login
  3. Download the Cosmos model weights from Hugging Face:
PYTHONPATH=$(pwd) python cosmos1/scripts/download_autoregressive.py --model_sizes 4B 5B 12B 13B
  4. The downloaded files should be in the following structure:
checkpoints/
├── Cosmos-1.0-Autoregressive-4B
│   ├── model.pt
│   └── config.json
├── Cosmos-1.0-Autoregressive-5B-Video2World
│   ├── model.pt
│   └── config.json
├── Cosmos-1.0-Autoregressive-12B
│   ├── model.pt
│   └── config.json
├── Cosmos-1.0-Autoregressive-13B-Video2World
│   ├── model.pt
│   └── config.json
├── Cosmos-1.0-Tokenizer-CV8x8x8
│   ├── decoder.jit
│   ├── encoder.jit
│   └── mean_std.pt
├── Cosmos-1.0-Tokenizer-DV8x16x16
│   ├── decoder.jit
│   └── encoder.jit
├── Cosmos-1.0-Diffusion-7B-Decoder-DV8x16x16ToCV8x8x8
│   ├── aux_vars.pt
│   └── model.pt
└── Cosmos-1.0-Guardrail
    ├── aegis/
    ├── blocklist/
    ├── face_blur_filter/
    └── video_content_safety_filter/
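
If a download is interrupted, files may be missing. Below is a minimal, unofficial sketch that checks for the checkpoint files listed above; the layout is copied directly from the tree, with the Cosmos-1.0-Guardrail subdirectories omitted for brevity:

```python
from pathlib import Path

# Expected layout, copied from the tree above; the Cosmos-1.0-Guardrail
# subdirectories are omitted for brevity.
EXPECTED = {
    "Cosmos-1.0-Autoregressive-4B": ["model.pt", "config.json"],
    "Cosmos-1.0-Autoregressive-5B-Video2World": ["model.pt", "config.json"],
    "Cosmos-1.0-Autoregressive-12B": ["model.pt", "config.json"],
    "Cosmos-1.0-Autoregressive-13B-Video2World": ["model.pt", "config.json"],
    "Cosmos-1.0-Tokenizer-CV8x8x8": ["decoder.jit", "encoder.jit", "mean_std.pt"],
    "Cosmos-1.0-Tokenizer-DV8x16x16": ["decoder.jit", "encoder.jit"],
    "Cosmos-1.0-Diffusion-7B-Decoder-DV8x16x16ToCV8x8x8": ["aux_vars.pt", "model.pt"],
}

root = Path("checkpoints")
missing = [root / model / name
           for model, names in EXPECTED.items()
           for name in names
           if not (root / model / name).is_file()]
if missing:
    print("Missing files:")
    for path in missing:
        print(f"  {path}")
else:
    print("All expected checkpoint files are present.")
```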

Usage

Model Types

There are two model types available for autoregressive world generation:

  1. Base: Supports world generation from image/video input
  • Models: Cosmos-1.0-Autoregressive-4B and Cosmos-1.0-Autoregressive-12B
  • Inference script: base.py
  2. Video2World: Supports world generation from image/video input and text input
  • Models: Cosmos-1.0-Autoregressive-5B-Video2World and Cosmos-1.0-Autoregressive-13B-Video2World
  • Inference script: video2world.py

Our models now support video extension up to 33 frames. Starting from either a single image or a 9-frame video input, they can generate the remaining frames to reach the 33-frame length (generating 32 or 24 frames, respectively).

We have evaluated all eight possible configurations (4 models × 2 vision input types: image or video) using 100 test videos on physical AI topics. Below are the failure rates for each configuration:

| Model | Image input | Video input (9 frames) |
|-------|-------------|------------------------|
| Cosmos-1.0-Autoregressive-4B | 15% | 1% |
| Cosmos-1.0-Autoregressive-5B-Video2World | 7% | 2% |
| Cosmos-1.0-Autoregressive-12B | 2% | 1% |
| Cosmos-1.0-Autoregressive-13B-Video2World | 3% | 0% |

We define failure cases as videos with severe distortions, such as:

  • Sudden appearance of large unexpected objects
  • Video degrading to a single solid color

Note that the following are not considered failures in our analysis:

  • Static video frames
  • Minor object distortions or artifacts

Single and Batch Generation

We support both single and batch video generation.

For generating a single video, base mode requires the input argument --input_image_or_video_path (image/video input), while video2world mode requires both --input_image_or_video_path (image/video input) and --prompt (text input).

Note that our models only work with 1024x640 resolution videos. If the input image/video is not in this resolution, it will be resized and cropped.
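
The inference scripts handle this preprocessing internally. Purely as an illustration of resizing and cropping to 1024x640, here is a hypothetical sketch using Pillow; the scale-then-center-crop strategy and the input.jpg filename are assumptions, not necessarily what the pipeline actually does:

```python
from PIL import Image

TARGET_W, TARGET_H = 1024, 640

def resize_and_center_crop(img: Image.Image) -> Image.Image:
    # Scale so the image fully covers 1024x640, then center-crop the excess.
    # This is one plausible strategy; the actual pipeline may differ.
    scale = max(TARGET_W / img.width, TARGET_H / img.height)
    resized = img.resize((round(img.width * scale), round(img.height * scale)))
    left = (resized.width - TARGET_W) // 2
    top = (resized.height - TARGET_H) // 2
    return resized.crop((left, top, left + TARGET_W, top + TARGET_H))

# "input.jpg" is a placeholder path for illustration.
resize_and_center_crop(Image.open("input.jpg")).save("input_1024x640.jpg")
```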

For generating a batch of videos, both base and video2world require --batch_input_path (path to a JSONL file). For base, each line of the JSONL file must contain a "visual_input" field:

{"visual_input": "path/to/video1.mp4"}
{"visual_input": "path/to/video2.mp4"}

For video2world, each line in the JSONL file must contain both "prompt" and "visual_input" fields:

{"prompt": "prompt1", "visual_input": "path/to/video1.mp4"}
{"prompt": "prompt2", "visual_input": "path/to/video2.mp4"}

Sample Commands

There are two main demo scripts for autoregressive world generation: base.py and video2world.py. Below you will find sample commands for single and batch generation, as well as commands for running with low-memory GPUs using model offloading. We also provide a memory usage table comparing different offloading strategies to help with configuration.

Base (base.py): 4B and 12B

Generates world from image/video input.

The input_type argument can be either video or image. We have tuned the sampling parameters top_p and temperature to achieve the best performance. Please use the provided values in the command examples.
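
For background, top-p (nucleus) sampling draws from the smallest set of tokens whose cumulative probability covers top_p, after dividing the logits by temperature. The following is a generic sketch of the technique, not the model's actual implementation:

```python
import numpy as np

def sample_top_p(logits, top_p, temperature, rng):
    # Temperature-scale logits, then apply a numerically stable softmax.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative probability covers top_p.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    kept = order[:cutoff]
    # Renormalize within the nucleus and sample one token id.
    return int(rng.choice(kept, p=probs[kept] / probs[kept].sum()))

rng = np.random.default_rng(0)
token_id = sample_top_p(np.array([2.0, 1.0, 0.5, -1.0]), top_p=0.8, temperature=1.0, rng=rng)
print(token_id)
```

Lower top_p values restrict sampling to fewer, higher-probability tokens, which is why the tuned values differ per model.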

Note that the command examples below all use video input. If you want to use image input, please change the input_type to image.

Single Generation
# Example using 4B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --video_save_name=Cosmos-1.0-Autoregressive-4B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-4B \
    --top_p=0.8 \
    --temperature=1.0

# Example for low-memory GPUs using 4B model with model offloading
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --video_save_name=Cosmos-1.0-Autoregressive-4B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-4B \
    --top_p=0.8 \
    --temperature=1.0 \
    --offload_guardrail_models \
    --offload_diffusion_decoder \
    --offload_ar_model \
    --offload_tokenizer

# Example using 12B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --video_save_name=Cosmos-1.0-Autoregressive-12B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-12B \
    --top_p=0.9 \
    --temperature=1.0

# Example for low-memory GPUs using 12B model with model offloading
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --video_save_name=Cosmos-1.0-Autoregressive-12B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-12B \
    --top_p=0.9 \
    --temperature=1.0 \
    --offload_guardrail_models \
    --offload_diffusion_decoder \
    --offload_ar_model \
    --offload_tokenizer
Batch Generation
# Example using 4B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/base.jsonl \
    --video_save_folder=outputs/Cosmos-1.0-Autoregressive-4B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-4B \
    --top_p=0.8 \
    --temperature=1.0

# Example using 12B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
    --input_type=video \
    --batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/base.jsonl \
    --video_save_folder=outputs/Cosmos-1.0-Autoregressive-12B \
    --ar_model_dir=Cosmos-1.0-Autoregressive-12B \
    --top_p=0.9 \
    --temperature=1.0
Example Output

Here is an example output video generated by base.py from image input with Cosmos-1.0-Autoregressive-12B:

output_from_image_input_12b.mp4

The input image used to generate this video can be found in cosmos1/models/autoregressive/assets/v1p0/input.jpg. The image is from the BDD dataset.

Here is an example output video generated by base.py from 9-frame video input with Cosmos-1.0-Autoregressive-12B:

output_from_video_input_12b.mp4

The input video used to generate this video can be found in cosmos1/models/autoregressive/assets/v1p0/input.mp4.

Inference Time and GPU Memory Usage

These numbers may vary based on system specifications and are provided for reference only.

| Offloading Strategy | Cosmos-1.0-Autoregressive-4B | Cosmos-1.0-Autoregressive-12B |
|---------------------|------------------------------|-------------------------------|
| No offloading | 31.3 GB | 47.5 GB |
| Guardrails | 28.9 GB | 45.2 GB |
| Guardrails & Diffusion decoder | 28.5 GB | 43.1 GB |
| Guardrails & Diffusion decoder & Tokenizer | 27.3 GB | 42.9 GB |
| Guardrails & Diffusion decoder & Tokenizer & AR model | 18.7 GB | 27.4 GB |

End-to-end inference runtime on one H100 without offloading and after model initialization:

| Cosmos-1.0-Autoregressive-4B | Cosmos-1.0-Autoregressive-12B |
|------------------------------|-------------------------------|
| ~62 seconds | ~119 seconds |
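
If you want to compare your own hardware against these numbers, one option is to time the run and query PyTorch's peak-memory counters. This is an unofficial sketch assuming PyTorch with CUDA is available (as it is inside the Docker environment); the profile_gpu helper is hypothetical, not part of the Cosmos scripts:

```python
import time
import torch

def profile_gpu(fn, *args, **kwargs):
    """Run fn(*args, **kwargs); report wall-clock time and peak GPU memory."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"runtime: {elapsed:.1f} s | peak GPU memory: {peak_gb:.1f} GB")
    return result

# Placeholder workload for demonstration; wrap your inference call instead.
profile_gpu(lambda: torch.randn(4096, 4096, device="cuda") @ torch.randn(4096, 4096, device="cuda"))
```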

Video2World (video2world.py): 5B and 13B

Generates world from image/video and text input.

The input_type argument can be either text_and_video or text_and_image. We have tuned the sampling parameters top_p and temperature to achieve the best performance. Please use the provided values in the command examples.

Note that the command examples below all use video input. If you want to use image input, please change the input_type to text_and_image.

Single Generation
# Example using 5B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \
    --video_save_name=Cosmos-1.0-Autoregressive-5B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-5B-Video2World \
    --top_p=0.7 \
    --temperature=1.0

# Example for low-memory GPUs using 5B model with model offloading
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \
    --video_save_name=Cosmos-1.0-Autoregressive-5B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-5B-Video2World \
    --top_p=0.7 \
    --temperature=1.0 \
    --offload_guardrail_models \
    --offload_diffusion_decoder \
    --offload_ar_model \
    --offload_tokenizer \
    --offload_text_encoder_model

# Example using 13B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \
    --video_save_name=Cosmos-1.0-Autoregressive-13B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-13B-Video2World \
    --top_p=0.8 \
    --temperature=1.0 \
    --offload_guardrail_models

# Example for low-memory GPUs using 13B model with model offloading
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
    --prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \
    --video_save_name=Cosmos-1.0-Autoregressive-13B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-13B-Video2World \
    --top_p=0.8 \
    --temperature=1.0 \
    --offload_guardrail_models \
    --offload_diffusion_decoder \
    --offload_ar_model \
    --offload_tokenizer \
    --offload_text_encoder_model
Batch Generation
# Example using 5B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/video2world.jsonl \
    --video_save_folder=outputs/Cosmos-1.0-Autoregressive-5B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-5B-Video2World \
    --top_p=0.7 \
    --temperature=1.0

# Example using 13B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
    --input_type=text_and_video \
    --batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/video2world.jsonl \
    --video_save_folder=outputs/Cosmos-1.0-Autoregressive-13B-Video2World \
    --ar_model_dir=Cosmos-1.0-Autoregressive-13B-Video2World \
    --top_p=0.8 \
    --temperature=1.0 \
    --offload_guardrail_models
Example Output

Here is an example output video generated by video2world.py from image input with Cosmos-1.0-Autoregressive-13B-Video2World:

output_from_image_input_13b.mp4

The input image used to generate this video can be found in cosmos1/models/autoregressive/assets/v1p0/input.jpg. The prompt for generating the video is:

A driving video captures a serene urban street scene on a sunny day. The camera is mounted on the dashboard of a moving vehicle, providing a first-person perspective as it travels down a two-lane road. The street is lined with parked cars on both sides, predominantly black and silver sedans and SUVs. The road is flanked by a mix of residential and commercial buildings, with a prominent red-brick building on the left side, featuring multiple windows and a flat roof. The sky is clear with a few scattered clouds, casting soft shadows on the street. Trees with lush green foliage line the right side of the road, providing a natural contrast to the urban environment. The camera remains steady, maintaining a consistent forward motion, suggesting a leisurely drive. Traffic is light, with a few vehicles moving in the opposite direction, including a black sedan and a yellow taxi. Street signs are visible, including a no-parking sign on the right. The overall atmosphere is calm and peaceful, with no pedestrians visible, emphasizing the focus on the drive and the surrounding urban landscape.

Here is an example output video generated by video2world.py from 9-frame video input with Cosmos-1.0-Autoregressive-13B-Video2World:

output_from_video_input_13b.mp4

The input video used to generate this video can be found in cosmos1/models/autoregressive/assets/v1p0/input.mp4. The prompt for generating the video is:

A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions.
Inference Time and GPU Memory Usage

These numbers may vary based on system specifications and are provided for reference only.

| Offloading Strategy | Cosmos-1.0-Autoregressive-5B-Video2World | Cosmos-1.0-Autoregressive-13B-Video2World |
|---------------------|------------------------------------------|-------------------------------------------|
| No offloading | 66.2 GB | > 80 GB |
| Guardrails | 58.7 GB | 76.6 GB |
| Guardrails & T5 encoder | 41.3 GB | 58.0 GB |
| Guardrails & T5 encoder & Diffusion decoder | 29.0 GB | 46.9 GB |
| Guardrails & T5 encoder & Diffusion decoder & Tokenizer | 28.8 GB | 46.7 GB |
| Guardrails & T5 encoder & Diffusion decoder & Tokenizer & AR model | 21.1 GB | 30.9 GB |

End-to-end inference runtime on one H100, with no offloading for the 5B model and guardrail offloading for the 13B model, after model initialization:

| Cosmos-1.0-Autoregressive-5B-Video2World | Cosmos-1.0-Autoregressive-13B-Video2World |
|------------------------------------------|-------------------------------------------|
| ~73 seconds | ~150 seconds |

Arguments

Common Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| --checkpoint_dir | Directory containing model weights | "checkpoints" |
| --video_save_name | Output video filename for single video generation | "output" |
| --video_save_folder | Folder where all output videos are stored | "outputs/" |
| --input_image_or_video_path | Input image or video path; required for single video generation | None |
| --batch_input_path | Path to a JSONL file of input images or videos; required for batch video generation | None |
| --num_input_frames | Number of input frames to use for Video2World prediction | 9 |
| --temperature | Temperature used while sampling | 1.0 (use the values in the sample commands) |
| --top_p | Top-p value for top-p (nucleus) sampling | 0.8 (use the values in the sample commands) |
| --seed | Random seed | 0 |
| --disable_diffusion_decoder | When set to True, use the discrete tokenizer to decode discrete tokens to video; otherwise, use the diffusion decoder | False |
| --offload_guardrail_models | Offload guardrail models after inference; for low-memory GPUs | False |
| --offload_diffusion_decoder | Offload diffusion decoder after inference; for low-memory GPUs | False |
| --offload_ar_model | Offload AR model after inference; for low-memory GPUs | False |
| --offload_tokenizer | Offload tokenizer after inference; for low-memory GPUs | False |
| --offload_prompt_upsampler | Offload prompt upsampler after inference; for low-memory GPUs | False |

Base Specific Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| --ar_model_dir | Directory containing AR model weights | "Cosmos-1.0-Autoregressive-4B" |
| --input_type | Input type, either video or image | "video" |

Video2World Specific Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| --ar_model_dir | Directory containing AR model weights | "Cosmos-1.0-Autoregressive-4B" |
| --input_type | Input type, either text_and_video or text_and_image | "text_and_video" |
| --prompt | Text prompt for single video generation; required for single video generation | None |
| --input_prompts_path | Path to JSONL file for batch video generation; required for batch video generation | None |
| --offload_text_encoder_model | Offload text encoder after inference; for low-memory GPUs | False |

Safety Features

The models use a built-in safety guardrail system that cannot be disabled. Generating human faces is not allowed; any faces that appear in the output are blurred by the guardrail.

For more information, check out the Cosmos Guardrail Documentation.