SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers is from NVIDIA and MIT HAN Lab, by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han.
The abstract from the paper is:
We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.
Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
This pipeline was contributed by lawrence-cj and chenjy2003. The original codebase can be found here. The original weights can be found under hf.co/Efficient-Large-Model.
Available models:
| Model | Recommended dtype |
|:------|:------------------|
| `Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers` | `torch.bfloat16` |
| `Efficient-Large-Model/Sana_1600M_1024px_diffusers` | `torch.float16` |
| `Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers` | `torch.float16` |
| `Efficient-Large-Model/Sana_1600M_512px_diffusers` | `torch.float16` |
| `Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers` | `torch.float16` |
| `Efficient-Large-Model/Sana_600M_1024px_diffusers` | `torch.float16` |
| `Efficient-Large-Model/Sana_600M_512px_diffusers` | `torch.float16` |
Refer to this collection for more information.
Note: The recommended dtype mentioned is for the transformer weights. The text encoder and VAE weights must stay in `torch.bfloat16` or `torch.float32` for the model to work correctly. Please refer to the inference example below to see how to load the model with the recommended dtype.
Make sure to pass the `variant` argument for downloaded checkpoints to use lower disk space. Set it to `"fp16"` for models with a recommended dtype of `torch.float16`, and `"bf16"` for models with a recommended dtype of `torch.bfloat16`. By default, `torch.float32` weights are downloaded, which use twice the amount of disk storage. Additionally, `torch.float32` weights can be downcast on-the-fly by specifying the `torch_dtype` argument. Read about it in the docs.
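As a minimal loading sketch for the `Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers` checkpoint listed above (the prompt and output filename are placeholders chosen for illustration):

```python
import torch
from diffusers import SanaPipeline

# Load the transformer in the recommended dtype and request the matching variant
pipeline = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
    variant="bf16",
    torch_dtype=torch.bfloat16,
)
pipeline.to("cuda")

# Keep the text encoder and VAE in torch.bfloat16 (or torch.float32)
pipeline.text_encoder.to(torch.bfloat16)
pipeline.vae.to(torch.bfloat16)

prompt = "a tiny astronaut hatching from an egg on the moon"
image = pipeline(prompt=prompt).images[0]
image.save("sana_bf16.png")  # placeholder output path
```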
Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have a varying impact on image quality depending on the model.
Refer to the Quantization overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`SanaPipeline`] for inference with bitsandbytes.
```python
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, SanaTransformer2DModel, SanaPipeline
from transformers import BitsAndBytesConfig as BitsAndBytesConfig, AutoModel

# Quantize the text encoder to 8-bit with the transformers bitsandbytes config
quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = AutoModel.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

# Quantize the transformer to 8-bit with the diffusers bitsandbytes config
quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = SanaTransformer2DModel.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

# Assemble the pipeline from the quantized components
pipeline = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "a tiny astronaut hatching from an egg on the moon"
image = pipeline(prompt).images[0]
image.save("sana.png")
```
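Note that `device_map="balanced"` lets the pipeline place its components across the available devices, which is why the example does not call `pipeline.to("cuda")` explicitly.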
[[autodoc]] SanaPipeline
- all
- __call__
[[autodoc]] SanaPAGPipeline
- all
- __call__
[[autodoc]] pipelines.sana.pipeline_output.SanaPipelineOutput