[WIP] SD3.5 IP-Adapter Pipeline Integration #9987

Merged 55 commits on Dec 20, 2024
Changes from all commits
0af910b
Initial pipeline for SD3.5-Large-IP-Adapter
guiyrt Nov 21, 2024
5567438
Added support for single IPAdapter on SD3.5 pipeline
guiyrt Dec 6, 2024
50d09d9
Merge branch 'main' into sd3.5_IPAdapter
guiyrt Dec 6, 2024
0ef36dd
Fixed typo and reverted removal of skip_layers in SD3Transformer2DModel
guiyrt Dec 7, 2024
de8909a
Added new SD3IPAdapterMixin loader
guiyrt Dec 9, 2024
ab0d904
ip_adapter image embeds now considers num_images_per_prompt
guiyrt Dec 9, 2024
d868ddb
Merge branch 'main' into sd3.5_IPAdapter
guiyrt Dec 9, 2024
5aed1d3
Removed usage of einops
guiyrt Dec 9, 2024
4383175
Reverted joint_attention_kwargs default for consistency
guiyrt Dec 9, 2024
461ab73
Corrected einops removal
guiyrt Dec 9, 2024
8323240
Quality and style checks
guiyrt Dec 9, 2024
89c4e63
Quality and style checks
guiyrt Dec 9, 2024
27d574f
Handle None joint_attention_kwargs in JointTransformerBlock
guiyrt Dec 9, 2024
0a48648
Fix test_components_function
hlky Dec 9, 2024
10d0a06
Remove from img2img/inpaint for now
hlky Dec 9, 2024
c78c4fd
Fixed loading ip_adapter state dict
guiyrt Dec 10, 2024
0f6c607
Simpler image encoding
guiyrt Dec 10, 2024
53fd40d
Style check
guiyrt Dec 10, 2024
8039599
Better checks for image prompt considering ip_adapter scale
guiyrt Dec 10, 2024
7333bfc
Minor change correcting checking for ip_adapter embeds
guiyrt Dec 10, 2024
a87895e
Removing old check of ip_adapter scale
guiyrt Dec 10, 2024
4ba374a
Refactor of image_proj (testing)
guiyrt Dec 10, 2024
819dd3e
Revert "Removing old check of ip_adapter scale"
guiyrt Dec 10, 2024
262a3bb
Merge branch 'main' into sd3.5_IPAdapter
guiyrt Dec 10, 2024
ea32e13
Corrected property check
guiyrt Dec 10, 2024
f60751f
Corrected forward() of IPAdapterTimeImageProjectionBlock
guiyrt Dec 11, 2024
b0aa5cb
IPAdapterTimeImageProjectionBlock now uses original attention impleme…
guiyrt Dec 12, 2024
b3dc69a
Clean-up and make style
guiyrt Dec 12, 2024
84aa4a3
Minor changes in code structure
guiyrt Dec 13, 2024
34793fb
make style && make quality
guiyrt Dec 13, 2024
27fe083
Merge branch 'main' into sd3.5_IPAdapter
guiyrt Dec 13, 2024
68169f8
Updated dosctrings and doc entries
guiyrt Dec 16, 2024
d824451
Merge branch 'main' into sd3.5_IPAdapter
hlky Dec 16, 2024
24e6880
make
hlky Dec 16, 2024
43d2e77
More docs and small refactors
guiyrt Dec 17, 2024
05f49e6
Merge remote-tracking branch 'origin' into sd3.5_IPAdapter
guiyrt Dec 17, 2024
44e3847
Fix in loading state dict
guiyrt Dec 18, 2024
178e513
Enabled cpu offload
guiyrt Dec 18, 2024
7899c6a
Merge branch 'main' into sd3.5_IPAdapter
guiyrt Dec 18, 2024
8daca65
Renaming from transformers_sd3 to transformer_sd3
guiyrt Dec 18, 2024
7c918db
Missing rename
guiyrt Dec 18, 2024
99a6d59
Updated docs for SD3 pipeline
guiyrt Dec 18, 2024
3916298
Merge branch 'main' into sd3.5_IPAdapter
guiyrt Dec 18, 2024
02a6d90
Update docs/source/en/api/pipelines/stable_diffusion/stable_diffusion…
guiyrt Dec 18, 2024
64ab7f9
Minor doc correction
guiyrt Dec 18, 2024
b254aa3
Updated img source to hf/documentation-images
guiyrt Dec 18, 2024
5c28161
image_proj is now called from SD3Transformer2DModel
guiyrt Dec 19, 2024
1313501
Merge branch 'main' into sd3.5_IPAdapter
guiyrt Dec 19, 2024
b882e1b
ip_adapter_image_embeds go through joint_attention_kwargs
guiyrt Dec 19, 2024
988447f
Warning for sequential cpu offloading with image_encoder
guiyrt Dec 19, 2024
66c4866
Merge branch 'main' into sd3.5_IPAdapter
guiyrt Dec 19, 2024
98f4521
make style quality
guiyrt Dec 19, 2024
5eef7f1
Merge branch 'main' into sd3.5_IPAdapter
yiyixuxu Dec 19, 2024
18cd8e4
Update src/diffusers/models/attention.py
yiyixuxu Dec 19, 2024
65b477f
Update src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_dif…
yiyixuxu Dec 20, 2024
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -238,6 +238,8 @@
title: Textual Inversion
- local: api/loaders/unet
title: UNet
- local: api/loaders/transformer_sd3
title: SD3Transformer2D
- local: api/loaders/peft
title: PEFT
title: Loaders
2 changes: 2 additions & 0 deletions docs/source/en/api/attnprocessor.md
@@ -86,6 +86,8 @@ An attention processor is a class for applying different types of attention mech

[[autodoc]] models.attention_processor.IPAdapterAttnProcessor2_0

[[autodoc]] models.attention_processor.SD3IPAdapterJointAttnProcessor2_0

## JointAttnProcessor2_0

[[autodoc]] models.attention_processor.JointAttnProcessor2_0
6 changes: 6 additions & 0 deletions docs/source/en/api/loaders/ip_adapter.md
@@ -24,6 +24,12 @@ Learn how to load an IP-Adapter checkpoint and image in the IP-Adapter [loading]

[[autodoc]] loaders.ip_adapter.IPAdapterMixin

## SD3IPAdapterMixin

[[autodoc]] loaders.ip_adapter.SD3IPAdapterMixin
- all
- is_ip_adapter_active
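
A minimal sketch of how this mixin surfaces on [`StableDiffusion3Pipeline`] — loading an adapter, setting its scale, and checking `is_ip_adapter_active`. The checkpoint ids and the active/inactive behaviour are assumptions based on the SD3 pipeline docs in this PR, not canonical usage.

```python
import torch
from diffusers import StableDiffusion3Pipeline
from transformers import SiglipImageProcessor, SiglipVisionModel

# Vision encoder used to embed the image prompt (assumed checkpoint id).
image_encoder_id = "google/siglip-so400m-patch14-384"
feature_extractor = SiglipImageProcessor.from_pretrained(image_encoder_id)
image_encoder = SiglipVisionModel.from_pretrained(image_encoder_id, torch_dtype=torch.float16)

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.float16,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
).to("cuda")

# SD3IPAdapterMixin methods are exposed directly on the pipeline.
pipe.load_ip_adapter("InstantX/SD3.5-Large-IP-Adapter")
pipe.set_ip_adapter_scale(0.6)

# Expected to be True while an adapter is loaded and its scale is non-zero.
print(pipe.is_ip_adapter_active)
```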

## IPAdapterMaskProcessor

[[autodoc]] image_processor.IPAdapterMaskProcessor
29 changes: 29 additions & 0 deletions docs/source/en/api/loaders/transformer_sd3.md
@@ -0,0 +1,29 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# SD3Transformer2D

This class is useful when *only* loading weights into a [`SD3Transformer2DModel`]. If you need to load weights into the text encoder, or into both a text encoder and the [`SD3Transformer2DModel`], use the [`SD3LoraLoaderMixin`](lora#diffusers.loaders.SD3LoraLoaderMixin) class instead.

The [`SD3Transformer2DLoadersMixin`] class currently only loads IP-Adapter weights, but will be used in the future to save weights and load LoRAs.

<Tip>

To learn more about how to load LoRA weights, see the [LoRA](../../using-diffusers/loading_adapters#lora) loading guide.

</Tip>
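
For orientation, the sketch below only checks class relationships. It assumes [`SD3Transformer2DModel`] inherits this mixin, as the loader layout in this PR suggests, and that `_load_ip_adapter_weights` is normally invoked for you by [`~loaders.SD3IPAdapterMixin.load_ip_adapter`] rather than called directly.

```python
from diffusers import SD3Transformer2DModel
from diffusers.loaders import SD3Transformer2DLoadersMixin

# The transformer gains IP-Adapter weight loading through this mixin, so the
# (private) loading method lives on the model rather than on the pipeline.
print(issubclass(SD3Transformer2DModel, SD3Transformer2DLoadersMixin))
print(hasattr(SD3Transformer2DModel, "_load_ip_adapter_weights"))
```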

## SD3Transformer2DLoadersMixin

[[autodoc]] loaders.transformer_sd3.SD3Transformer2DLoadersMixin
- all
- _load_ip_adapter_weights
docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_3.md
@@ -59,9 +59,76 @@ image.save("sd3_hello_world.png")
- [`stabilityai/stable-diffusion-3.5-large`](https://huggingface.co/stabilityai/stable-diffusion-3.5-large)
- [`stabilityai/stable-diffusion-3.5-large-turbo`](https://huggingface.co/stabilityai/stable-diffusion-3.5-large-turbo)

## Image Prompting with IP-Adapters

An IP-Adapter lets you prompt SD3 with images in addition to the text prompt. This is especially useful for complex concepts that are difficult to articulate through text alone, when you have reference images to work from. To load and use an IP-Adapter, you need:

- `image_encoder`: Pre-trained vision model used to obtain image features, usually a CLIP image encoder.
- `feature_extractor`: Image processor that prepares the input image for the chosen `image_encoder`.
- `ip_adapter_id`: Checkpoint containing the parameters of the image cross-attention layers and the image projection.

IP-Adapters are trained for a specific model architecture, so they also work with fine-tuned variants of the base model. You can use the [`~SD3IPAdapterMixin.set_ip_adapter_scale`] method to adjust how strongly the output aligns with the image prompt: the higher the value, the more closely the model follows the image prompt. A value of 0.5 is typically a good balance, ensuring the model considers the text and image prompts equally.

```python
import torch
from PIL import Image

from diffusers import StableDiffusion3Pipeline
from transformers import SiglipVisionModel, SiglipImageProcessor

image_encoder_id = "google/siglip-so400m-patch14-384"
ip_adapter_id = "InstantX/SD3.5-Large-IP-Adapter"

feature_extractor = SiglipImageProcessor.from_pretrained(
    image_encoder_id,
    torch_dtype=torch.float16
)
image_encoder = SiglipVisionModel.from_pretrained(
    image_encoder_id,
    torch_dtype=torch.float16
).to("cuda")

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.float16,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
).to("cuda")

pipe.load_ip_adapter(ip_adapter_id)
pipe.set_ip_adapter_scale(0.6)

ref_img = Image.open("image.jpg").convert("RGB")

image = pipe(
    width=1024,
    height=1024,
    prompt="a cat",
    negative_prompt="lowres, low quality, worst quality",
    num_inference_steps=24,
    guidance_scale=5.0,
    ip_adapter_image=ref_img,
).images[0]

image.save("result.jpg")
```

<div class="justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sd3_ip_adapter_example.png"/>
<figcaption class="mt-2 text-sm text-center text-gray-500">IP-Adapter examples with prompt "a cat"</figcaption>
</div>
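
Continuing from the example above (reusing `pipe` and `ref_img`), a short sketch of sweeping the adapter scale to see how strongly the reference image steers the output; the scale values here are arbitrary:

```python
# Lower scales lean on the text prompt; higher scales follow the reference image more closely.
for scale in [0.3, 0.6, 0.9]:
    pipe.set_ip_adapter_scale(scale)
    image = pipe(
        width=1024,
        height=1024,
        prompt="a cat",
        negative_prompt="lowres, low quality, worst quality",
        num_inference_steps=24,
        guidance_scale=5.0,
        ip_adapter_image=ref_img,
    ).images[0]
    image.save(f"result_scale_{scale}.jpg")
```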


<Tip>

Check out [IP-Adapter](../../../using-diffusers/ip_adapter) to learn more about how IP-Adapters work.

</Tip>


## Memory Optimisations for SD3

SD3 uses three text encoders, one if which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using `fp16` precision. The following section outlines a few memory optimizations in Diffusers that make it easier to run SD3 on low resource hardware.
SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using `fp16` precision. The following section outlines a few memory optimizations in Diffusers that make it easier to run SD3 on low resource hardware.

### Running Inference with Model Offloading
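
A minimal sketch of the model offloading pattern this subsection refers to (assuming the standard `enable_model_cpu_offload` API; the sampling settings here are arbitrary):

```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.float16,
)
# Keeps each sub-model on the CPU and moves it to the GPU only while it runs,
# trading some speed for a much lower peak VRAM footprint.
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="a cat",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("sd3_offloaded.png")
```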

12 changes: 10 additions & 2 deletions src/diffusers/loaders/__init__.py
@@ -56,6 +56,7 @@ def text_encoder_attn_modules(text_encoder):
if is_torch_available():
_import_structure["single_file_model"] = ["FromOriginalModelMixin"]

_import_structure["transformer_sd3"] = ["SD3Transformer2DLoadersMixin"]
_import_structure["unet"] = ["UNet2DConditionLoadersMixin"]
_import_structure["utils"] = ["AttnProcsLayers"]
if is_transformers_available():
@@ -74,19 +75,26 @@ def text_encoder_attn_modules(text_encoder):
"SanaLoraLoaderMixin",
]
_import_structure["textual_inversion"] = ["TextualInversionLoaderMixin"]
_import_structure["ip_adapter"] = ["IPAdapterMixin"]
_import_structure["ip_adapter"] = [
"IPAdapterMixin",
"SD3IPAdapterMixin",
]

_import_structure["peft"] = ["PeftAdapterMixin"]


if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
if is_torch_available():
from .single_file_model import FromOriginalModelMixin
from .transformer_sd3 import SD3Transformer2DLoadersMixin
from .unet import UNet2DConditionLoadersMixin
from .utils import AttnProcsLayers

if is_transformers_available():
from .ip_adapter import IPAdapterMixin
from .ip_adapter import (
IPAdapterMixin,
SD3IPAdapterMixin,
)
from .lora_pipeline import (
AmusedLoraLoaderMixin,
CogVideoXLoraLoaderMixin,