This page details the steps for using the Cosmos autoregressive-based world foundation models.
Follow our Installation Guide to set up the Docker environment. All commands on this page should be run inside Docker.
- Generate a Hugging Face access token. Set the access token to 'Read' permission (default is 'Fine-grained').
- Log in to Hugging Face with the access token:
huggingface-cli login
- Download the Cosmos model weights from Hugging Face:
PYTHONPATH=$(pwd) python cosmos1/scripts/download_autoregressive.py --model_sizes 4B 5B 12B 13B
- The downloaded files should be in the following structure:
checkpoints/
├── Cosmos-1.0-Autoregressive-4B
│ ├── model.pt
│ └── config.json
├── Cosmos-1.0-Autoregressive-5B-Video2World
│ ├── model.pt
│ └── config.json
├── Cosmos-1.0-Autoregressive-12B
│ ├── model.pt
│ └── config.json
├── Cosmos-1.0-Autoregressive-13B-Video2World
│ ├── model.pt
│ └── config.json
├── Cosmos-1.0-Tokenizer-CV8x8x8
│ ├── decoder.jit
│ ├── encoder.jit
│ └── mean_std.pt
├── Cosmos-1.0-Tokenizer-DV8x16x16
│ ├── decoder.jit
│ └── encoder.jit
├── Cosmos-1.0-Diffusion-7B-Decoder-DV8x16x16ToCV8x8x8
│ ├── aux_vars.pt
│ └── model.pt
└── Cosmos-1.0-Guardrail
├── aegis/
├── blocklist/
├── face_blur_filter/
└── video_content_safety_filter/
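After downloading, you can quickly verify that the checkpoints match the layout above. The following is a small sanity-check script of our own (not part of the repo), hard-coding the expected structure:

# check_checkpoints.py -- illustrative sanity check, not part of the Cosmos repo.
# Verifies that the "checkpoints/" directory matches the layout shown above.
from pathlib import Path

EXPECTED = {
    "Cosmos-1.0-Autoregressive-4B": ["model.pt", "config.json"],
    "Cosmos-1.0-Autoregressive-5B-Video2World": ["model.pt", "config.json"],
    "Cosmos-1.0-Autoregressive-12B": ["model.pt", "config.json"],
    "Cosmos-1.0-Autoregressive-13B-Video2World": ["model.pt", "config.json"],
    "Cosmos-1.0-Tokenizer-CV8x8x8": ["decoder.jit", "encoder.jit", "mean_std.pt"],
    "Cosmos-1.0-Tokenizer-DV8x16x16": ["decoder.jit", "encoder.jit"],
    "Cosmos-1.0-Diffusion-7B-Decoder-DV8x16x16ToCV8x8x8": ["aux_vars.pt", "model.pt"],
    "Cosmos-1.0-Guardrail": [],  # holds subdirectories; we only check it exists
}

root = Path("checkpoints")
missing = [str(root / name) for name in EXPECTED if not (root / name).is_dir()]
missing += [str(root / name / f) for name, files in EXPECTED.items()
            for f in files if not (root / name / f).is_file()]
print("All expected checkpoints found." if not missing
      else "Missing:\n" + "\n".join(missing))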
There are two model types available for autoregressive world generation:
- Base: Supports world generation from image/video input
  - Models: `Cosmos-1.0-Autoregressive-4B` and `Cosmos-1.0-Autoregressive-12B`
  - Inference script: `base.py`
- Video2World: Supports world generation from image/video input and text input
  - Models: `Cosmos-1.0-Autoregressive-5B-Video2World` and `Cosmos-1.0-Autoregressive-13B-Video2World`
  - Inference script: `video2world.py`
Our models now support video extension up to 33 frames. Starting from either a single image or a 9-frame video input, they can generate the remaining frames to reach the 33-frame length (generating 32 or 24 frames, respectively).
We have evaluated all eight possible configurations (4 models × 2 vision input types: image or video) using 100 test videos on physical AI topics. Below are the failure rates for each configuration:
| Model | Image input | Video input (9 frames) |
|---|---|---|
| Cosmos-1.0-Autoregressive-4B | 15% | 1% |
| Cosmos-1.0-Autoregressive-5B-Video2World | 7% | 2% |
| Cosmos-1.0-Autoregressive-12B | 2% | 1% |
| Cosmos-1.0-Autoregressive-13B-Video2World | 3% | 0% |
We define failure cases as videos with severe distortions, such as:
- Sudden appearance of large unexpected objects
- Video degrading to a single solid color
Note that the following are not considered failures in our analysis:
- Static video frames
- Minor object distortions or artifacts
We support both single and batch video generation.
For generating a single video, `base` mode requires the input argument `--input_image_or_video_path` (image/video input), while `video2world` mode requires both `--input_image_or_video_path` (image/video input) and `--prompt` (text input).
Note that our model only works with 1024x640 resolution videos. If the input image/video is not in this resolution, it will be resized and cropped.
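The inference scripts apply this preprocessing for you. For intuition, here is a rough sketch of an equivalent resize-then-center-crop in Pillow (our own illustration; the repo's exact preprocessing may differ):

# Rough sketch: scale the image to cover 1024x640, then center-crop the excess.
# Illustrative only -- the repo's actual preprocessing may differ in detail.
from PIL import Image

TARGET_W, TARGET_H = 1024, 640

def resize_and_crop(img: Image.Image) -> Image.Image:
    scale = max(TARGET_W / img.width, TARGET_H / img.height)
    resized = img.resize((round(img.width * scale), round(img.height * scale)))
    left = (resized.width - TARGET_W) // 2
    top = (resized.height - TARGET_H) // 2
    return resized.crop((left, top, left + TARGET_W, top + TARGET_H))

resize_and_crop(Image.open("my_input.jpg")).save("my_input_1024x640.jpg")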
For generating a batch of videos, both `base` and `video2world` require `--batch_input_path` (path to a JSONL file). For `base`, the JSONL file should contain one visual input per line, and each line must contain a `"visual_input"` field:
{"visual_input": "path/to/video1.mp4"}
{"visual_input": "path/to/video2.mp4"}
For `video2world`, each line in the JSONL file must contain both `"prompt"` and `"visual_input"` fields:
{"prompt": "prompt1", "visual_input": "path/to/video1.mp4"}
{"prompt": "prompt2", "visual_input": "path/to/video2.mp4"}
There are two main demo scripts for autoregressive world generation: `base.py` and `video2world.py`. Below you will find sample commands for single and batch generation, as well as commands for running on low-memory GPUs with model offloading. We also provide a table of memory usage under different offloading strategies to help with configuration.
The `base.py` script generates a world from image/video input.
The `input_type` argument can be either `video` or `image`. We have tuned the sampling parameters `top_p` and `temperature` to achieve the best performance; please use the values provided in the command examples.

Note that the command examples below all use video input. If you want to use image input, change `input_type` to `image`.
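For background: temperature rescales the logits before the softmax, and top-p (nucleus) sampling then draws only from the smallest set of tokens whose cumulative probability exceeds `top_p`. A minimal PyTorch sketch of the general technique (not the repo's actual implementation):

# Generic temperature + top-p (nucleus) sampling -- illustrative only.
import torch

def sample_top_p(logits: torch.Tensor, top_p: float, temperature: float) -> torch.Tensor:
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Zero out the tail once the cumulative probability (excluding the current
    # token) exceeds top_p; the most likely token is always kept.
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)

logits = torch.randn(1, 64000)  # one decoding step over a dummy vocabulary
next_token = sample_top_p(logits, top_p=0.8, temperature=1.0)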
# Example using 4B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
--input_type=video \
--input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
--video_save_name=Cosmos-1.0-Autoregressive-4B \
--ar_model_dir=Cosmos-1.0-Autoregressive-4B \
--top_p=0.8 \
--temperature=1.0
# Example for low-memory GPUs using 4B model with model offloading
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
--input_type=video \
--input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
--video_save_name=Cosmos-1.0-Autoregressive-4B \
--ar_model_dir=Cosmos-1.0-Autoregressive-4B \
--top_p=0.8 \
--temperature=1.0 \
--offload_guardrail_models \
--offload_diffusion_decoder \
--offload_ar_model \
--offload_tokenizer
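Conceptually, each `--offload_*` flag moves that component off the GPU once its stage finishes, trading transfer time for a lower peak memory footprint. A generic sketch of the pattern (our own illustration, not the repo's code):

# Generic offloading pattern -- illustrative only, not the repo's implementation.
import torch

def run_stage(model: torch.nn.Module, inputs: torch.Tensor, offload: bool) -> torch.Tensor:
    model.to("cuda")                  # load this stage's weights onto the GPU
    with torch.no_grad():
        outputs = model(inputs.to("cuda"))
    if offload:
        model.to("cpu")               # drop the GPU copy once the stage is done
        torch.cuda.empty_cache()      # release the freed blocks back to the driver
    return outputs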
# Example using 12B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
--input_type=video \
--input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
--video_save_name=Cosmos-1.0-Autoregressive-12B \
--ar_model_dir=Cosmos-1.0-Autoregressive-12B \
--top_p=0.9 \
--temperature=1.0
# Example for low-memory GPUs using 12B model with model offloading
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
--input_type=video \
--input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
--video_save_name=Cosmos-1.0-Autoregressive-12B \
--ar_model_dir=Cosmos-1.0-Autoregressive-12B \
--top_p=0.9 \
--temperature=1.0 \
--offload_guardrail_models \
--offload_diffusion_decoder \
--offload_ar_model \
--offload_tokenizer
# Example using 4B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
--input_type=video \
--batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/base.jsonl \
--video_save_folder=outputs/Cosmos-1.0-Autoregressive-4B \
--ar_model_dir=Cosmos-1.0-Autoregressive-4B \
--top_p=0.8 \
--temperature=1.0
# Example using 12B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/base.py \
--input_type=video \
--batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/base.jsonl \
--video_save_folder=outputs/Cosmos-1.0-Autoregressive-12B \
--ar_model_dir=Cosmos-1.0-Autoregressive-12B \
--top_p=0.9 \
--temperature=1.0
Here is an example output video generated using base.py with image input, using `Cosmos-1.0-Autoregressive-12B`:
output_from_image_input_12b.mp4
The input image used to generate this video can be found in `cosmos1/models/autoregressive/assets/v1p0/input.jpg`. The image is from the BDD dataset.
Here is an example output video generated using base.py with 9-frame video input, using `Cosmos-1.0-Autoregressive-12B`:
output_from_video_input_12b.mp4
The input video used to generate this video can be found in `cosmos1/models/autoregressive/assets/v1p0/input.mp4`.
The table below shows GPU memory usage under different offloading strategies. These numbers may vary based on system specifications and are provided for reference only.
| Offloading Strategy | Cosmos-1.0-Autoregressive-4B | Cosmos-1.0-Autoregressive-12B |
|---|---|---|
| No offloading | 31.3 GB | 47.5 GB |
| Guardrails | 28.9 GB | 45.2 GB |
| Guardrails & Diffusion decoder | 28.5 GB | 43.1 GB |
| Guardrails & Diffusion decoder & Tokenizer | 27.3 GB | 42.9 GB |
| Guardrails & Diffusion decoder & Tokenizer & AR model | 18.7 GB | 27.4 GB |
End-to-end inference runtime on one H100, without offloading and after model initialization:

| Cosmos-1.0-Autoregressive-4B | Cosmos-1.0-Autoregressive-12B |
|---|---|
| ~62 seconds | ~119 seconds |
The `video2world.py` script generates a world from image/video input and a text prompt.
The `input_type` argument can be either `text_and_video` or `text_and_image`. We have tuned the sampling parameters `top_p` and `temperature` to achieve the best performance; please use the values provided in the command examples.

Note that the command examples below all use video input. If you want to use image input, change `input_type` to `text_and_image`.
# Example using 5B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
--input_type=text_and_video \
--input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
--prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \
--video_save_name=Cosmos-1.0-Autoregressive-5B-Video2World \
--ar_model_dir=Cosmos-1.0-Autoregressive-5B-Video2World \
--top_p=0.7 \
--temperature=1.0
# Example for low-memory GPUs using 5B model with model offloading
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
--input_type=text_and_video \
--input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
--prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \
--video_save_name=Cosmos-1.0-Autoregressive-5B-Video2World \
--ar_model_dir=Cosmos-1.0-Autoregressive-5B-Video2World \
--top_p=0.7 \
--temperature=1.0 \
--offload_guardrail_models \
--offload_diffusion_decoder \
--offload_ar_model \
--offload_tokenizer \
--offload_text_encoder_model
# Example using 13B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
--input_type=text_and_video \
--input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
--prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \
--video_save_name=Cosmos-1.0-Autoregressive-13B-Video2World \
--ar_model_dir=Cosmos-1.0-Autoregressive-13B-Video2World \
--top_p=0.8 \
--temperature=1.0 \
--offload_guardrail_models
# Example for low-memory GPUs using 13B model with model offloading
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
--input_type=text_and_video \
--input_image_or_video_path=cosmos1/models/autoregressive/assets/v1p0/input.mp4 \
--prompt="A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions." \
--video_save_name=Cosmos-1.0-Autoregressive-13B-Video2World \
--ar_model_dir=Cosmos-1.0-Autoregressive-13B-Video2World \
--top_p=0.8 \
--temperature=1.0 \
--offload_guardrail_models \
--offload_diffusion_decoder \
--offload_ar_model \
--offload_tokenizer \
--offload_text_encoder_model
# Example using 5B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
--input_type=text_and_video \
--batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/video2world.jsonl \
--video_save_folder=outputs/Cosmos-1.0-Autoregressive-5B-Video2World \
--ar_model_dir=Cosmos-1.0-Autoregressive-5B-Video2World \
--top_p=0.7 \
--temperature=1.0
# Example using 13B model
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$(pwd) python cosmos1/models/autoregressive/inference/video2world.py \
--input_type=text_and_video \
--batch_input_path=cosmos1/models/autoregressive/assets/v1p0/batch_inputs/video2world.jsonl \
--video_save_folder=outputs/Cosmos-1.0-Autoregressive-13B-Video2World \
--ar_model_dir=Cosmos-1.0-Autoregressive-13B-Video2World \
--top_p=0.8 \
--temperature=1.0 \
--offload_guardrail_models
Here is an example output video generated using video2world.py with image input, using `Cosmos-1.0-Autoregressive-13B-Video2World`:
output_from_image_input_13b.mp4
The input image used to generate this video can be found in `cosmos1/models/autoregressive/assets/v1p0/input.jpg`. The prompt for generating the video is:
A driving video captures a serene urban street scene on a sunny day. The camera is mounted on the dashboard of a moving vehicle, providing a first-person perspective as it travels down a two-lane road. The street is lined with parked cars on both sides, predominantly black and silver sedans and SUVs. The road is flanked by a mix of residential and commercial buildings, with a prominent red-brick building on the left side, featuring multiple windows and a flat roof. The sky is clear with a few scattered clouds, casting soft shadows on the street. Trees with lush green foliage line the right side of the road, providing a natural contrast to the urban environment. The camera remains steady, maintaining a consistent forward motion, suggesting a leisurely drive. Traffic is light, with a few vehicles moving in the opposite direction, including a black sedan and a yellow taxi. Street signs are visible, including a no-parking sign on the right. The overall atmosphere is calm and peaceful, with no pedestrians visible, emphasizing the focus on the drive and the surrounding urban landscape.
Here is an example output video generated using video2world.py with 9-frame video input, using `Cosmos-1.0-Autoregressive-13B-Video2World`:
output_from_video_input_13b.mp4
The input video used to generate this video can be found in `cosmos1/models/autoregressive/assets/v1p0/input.mp4`. The prompt for generating the video is:
A video recorded from a moving vehicle's perspective, capturing roads, buildings, landscapes, and changing weather and lighting conditions.
The table below shows GPU memory usage under different offloading strategies. These numbers may vary based on system specifications and are provided for reference only.
| Offloading Strategy | Cosmos-1.0-Autoregressive-5B-Video2World | Cosmos-1.0-Autoregressive-13B-Video2World |
|---|---|---|
| No offloading | 66.2 GB | > 80 GB |
| Guardrails | 58.7 GB | 76.6 GB |
| Guardrails & T5 encoder | 41.3 GB | 58.0 GB |
| Guardrails & T5 encoder & Diffusion decoder | 29.0 GB | 46.9 GB |
| Guardrails & T5 encoder & Diffusion decoder & Tokenizer | 28.8 GB | 46.7 GB |
| Guardrails & T5 encoder & Diffusion decoder & Tokenizer & AR model | 21.1 GB | 30.9 GB |
End-to-end inference runtime on one H100 (no offloading for the 5B model, guardrail offloading for the 13B model), after model initialization:

| Cosmos-1.0-Autoregressive-5B-Video2World | Cosmos-1.0-Autoregressive-13B-Video2World |
|---|---|
| ~73 seconds | ~150 seconds |
The following parameters are common to both `base.py` and `video2world.py`:

| Parameter | Description | Default |
|---|---|---|
| `--checkpoint_dir` | Directory containing model weights | "checkpoints" |
| `--video_save_name` | Output video filename for single video generation | "output" |
| `--video_save_folder` | Folder where all output videos are stored | "outputs/" |
| `--input_image_or_video_path` | Input image or video path. Required for single video generation | None |
| `--batch_input_path` | Path to a JSONL file of input images or videos. Required for batch video generation | None |
| `--num_input_frames` | Number of input frames to use for Video2World prediction | 9 |
| `--temperature` | Temperature used while sampling | 1.0 (recommend using the values in the sample commands provided) |
| `--top_p` | Top-p value for top-p sampling | 0.8 (recommend using the values in the sample commands provided) |
| `--seed` | Random seed | 0 |
| `--disable_diffusion_decoder` | When set to True, use the discrete tokenizer to decode discrete tokens to video; otherwise, use the diffusion decoder | False |
| `--offload_guardrail_models` | Offload guardrail models after inference; used for low-memory GPUs | False |
| `--offload_diffusion_decoder` | Offload diffusion decoder after inference; used for low-memory GPUs | False |
| `--offload_ar_model` | Offload AR model after inference; used for low-memory GPUs | False |
| `--offload_prompt_upsampler` | Offload prompt upsampler after inference; used for low-memory GPUs | False |
`base.py` specific parameters:

| Parameter | Description | Default |
|---|---|---|
| `--ar_model_dir` | Directory containing AR model weights | "Cosmos-1.0-Autoregressive-4B" |
| `--input_type` | Input type, either `video` or `image` | "video" |
`video2world.py` specific parameters:

| Parameter | Description | Default |
|---|---|---|
| `--ar_model_dir` | Directory containing AR model weights | "Cosmos-1.0-Autoregressive-4B" |
| `--input_type` | Input type, either `text_and_video` or `text_and_image` | "text_and_video" |
| `--prompt` | Text prompt for single video generation. Required for single video generation | None |
| `--input_prompts_path` | Path to a JSONL file for batch video generation. Required for batch video generation | None |
| `--offload_text_encoder_model` | Offload text encoder after inference; used for low-memory GPUs | False |
The model uses a built-in safety guardrail system that cannot be disabled. Generating human faces is not allowed; any human faces in the output will be blurred by the guardrail.
For more information, check out the Cosmos Guardrail Documentation.