[WIP] SD3.5 IP-Adapter Pipeline Integration #9987
Conversation
I now have a more robust integration of InstantX/SD3.5-Large-IP-Adapter, but there are still some things to clean up and improve :) A few points that I noted:
A couple of things that need to change before merging (I think):
Also this (I had a few more points, but can share them later):
As I mentioned in the issue, as a proud member of the GPU-poor community, I can't fit the pipeline in my puny 16GB GPU 😅

import torch
from PIL import Image
from diffusers.models.transformers import SD3Transformer2DModel
from diffusers.pipelines.stable_diffusion_3.pipeline_stable_diffusion_3 import StableDiffusion3Pipeline
from transformers import SiglipVisionModel, SiglipImageProcessor
model_path = 'stabilityai/stable-diffusion-3.5-large'
image_encoder_path = "google/siglip-so400m-patch14-384"
ip_adapter_path = "InstantX/SD3.5-Large-IP-Adapter"
device = "cuda"
image_name = "image.png"
transformer = SD3Transformer2DModel.from_pretrained(
    model_path, subfolder="transformer", torch_dtype=torch.bfloat16
)
feature_extractor = SiglipImageProcessor.from_pretrained(
    image_encoder_path, torch_dtype=torch.bfloat16
)
image_encoder = SiglipVisionModel.from_pretrained(
    image_encoder_path, torch_dtype=torch.bfloat16
)
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_path,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
).to(device)
pipe.load_ip_adapter(ip_adapter_path, subfolder="", weight_name="ip-adapter.bin")
pipe.set_ip_adapter_scale(0.6)
ref_img = Image.open(image_name).convert('RGB')
# please note that SD3.5 Large is sensitive to high-res generation like 1536x1536
image = pipe(
    width=1024,
    height=1024,
    prompt='a cat',
    negative_prompt="lowres, low quality, worst quality",
    num_inference_steps=24,
    guidance_scale=5.0,
    generator=torch.Generator(device).manual_seed(42),
    ip_adapter_image=ref_img,
).images[0]
image.save('result.jpg')
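In case it helps other members of the GPU-poor club, this is roughly what I'd try to squeeze it under 16GB (a sketch using the standard diffusers helpers; I haven't verified the actual memory numbers with this pipeline):

# Instead of calling .to(device) above, keep submodels on CPU and move each
# one to the GPU only for its forward pass:
pipe.enable_model_cpu_offload()

# VAE tiling further trims peak memory during decoding:
pipe.vae.enable_tiling()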
Thanks for your contribution @guiyrt
@guiyrt Could you share some example outputs here?
Thanks @hlky for fixing the tests 😃
Thanks! I'll go through your comments:
Yes, makes sense, we can create a new loader class for SD3.5 (see the sketch after this list).
Let's follow the original code for
Already resolved.
Yes I think we can do that.
Should be ok because
Looks like this is resolved, will check it.
It would be great if we could use existing modules.
Can always be done in a follow-up PR; I don't know how common it is to use multiple IP-Adapter images anyway.
You can try
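For illustration, here's roughly the shape I have in mind for the dedicated loader (a sketch only; names like `SD35IPAdapterMixin` and `_load_ip_adapter_weights` are placeholders, not the final API):

from typing import Dict

import torch


class SD35IPAdapterMixin:
    """Placeholder sketch of a dedicated SD3.5 IP-Adapter loader mixin."""

    def load_ip_adapter(self, weight_path: str) -> None:
        # InstantX checkpoints ship "image_proj" and "ip_adapter" state dicts.
        state_dict: Dict[str, torch.Tensor] = torch.load(weight_path, map_location="cpu")
        # Hypothetical hook on the transformer that builds the image projection
        # model and swaps in IP-Adapter attention processors:
        self.transformer._load_ip_adapter_weights(state_dict)

    def set_ip_adapter_scale(self, scale: float) -> None:
        # Propagate the scale to every attention processor that supports it.
        for processor in self.transformer.attn_processors.values():
            if hasattr(processor, "scale"):
                processor.scale = scale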
Thanks! Just opened it https://huggingface.co/datasets/huggingface/documentation-images/discussions/404
I was trying to figure out why `enable_sequential_cpu_offload` breaks with the image encoder. Maybe this was already known before I spiraled into this rabbit hole, but the easy fix for now is to always keep `image_encoder` in `_exclude_from_cpu_offload`.
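For the curious, here's a minimal sketch of the failure mode as I understand it (not verified against accelerate internals, so take it with a grain of salt): `torch.nn.MultiheadAttention` passes `self.out_proj.weight` directly into `F.multi_head_attention_forward`, so `out_proj.forward` is never called and the per-module offload hook never moves that weight to the execution device.

import torch
from accelerate import cpu_offload

# SigLIP's attention-pooling head uses torch.nn.MultiheadAttention.
mha = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2).eval()
cpu_offload(mha, execution_device=torch.device("cuda"))

x = torch.randn(1, 1, 8, device="cuda")
# out_proj.weight is read as a plain attribute inside forward, bypassing the
# hook that would move it to the GPU, so this should raise a device error:
out, _ = mha(x, x, x)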
Oh, I should mention that this started because I was also breaking the
Let me know if I should change something, as this is changing the signature of
Would you mind posting the traceback? Maybe there's something we can do, but if the issue is in SigLIP we may need to raise it with the Transformers team. We probably don't want to keep SigLIP on GPU; it's relatively heavy, like CLIP Vision, right? Has passing
This happens with `torch.nn.MultiheadAttention` inside SigLIP. SigLIP from "google/siglip-so400m-patch14-384" has about 430M params and takes about 1GB of VRAM in bfloat16.

Traceback when trying to include image_encoder in CPU offloading
Code to reproduce (make sure to comment out `_exclude_from_cpu_offload` in `StableDiffusion3Pipeline`, line 186):

import torch
from PIL import Image
from diffusers import StableDiffusion3Pipeline
from transformers import SiglipVisionModel, SiglipImageProcessor
model_path = "stabilityai/stable-diffusion-3.5-large"
image_encoder_path = "google/siglip-so400m-patch14-384"
ip_adapter_path = "InstantX/SD3.5-Large-IP-Adapter"
feature_extractor = SiglipImageProcessor.from_pretrained(
    image_encoder_path, torch_dtype=torch.bfloat16
)
image_encoder = SiglipVisionModel.from_pretrained(
    image_encoder_path, torch_dtype=torch.bfloat16
)
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
)
pipe.load_ip_adapter(ip_adapter_path, revision="f1f54ca369ae759f9278ae9c87d46def9f133c78")
pipe.set_ip_adapter_scale(0.6)
pipe.enable_sequential_cpu_offload()
ref_img = Image.open("image.jpg").convert('RGB')
# please note that SD3.5 Large is sensitive to high-res generation like 1536x1536
image = pipe(
    width=1024,
    height=1024,
    prompt="a cat",
    negative_prompt="lowres, low quality, worst quality",
    num_inference_steps=24,
    guidance_scale=5.0,
    generator=torch.manual_seed(42),
    ip_adapter_image=ref_img,
).images[0]
image.save("result.jpg")
This happened with
Yes, it fixed the issue with
But as long as we don't access `image_proj` directly, it works.

Traceback accessing `image_proj` from `StableDiffusion3Pipeline` (fixed in last commit)
TL;DR:
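In runnable form, the stop-gap is just the following (this mirrors what the pipeline's `_exclude_from_cpu_offload` default already does):

# Keep SigLIP on the execution device and offload everything else sequentially.
pipe._exclude_from_cpu_offload.append("image_encoder")
pipe.enable_sequential_cpu_offload()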
Awesome, thanks for checking into this. I think the
As for
As for the warning, maybe something like:

def enable_sequential_cpu_offload(self, *args, **kwargs):
    if "image_encoder" not in self._exclude_from_cpu_offload:
        logger.warning(
            "`pipe.enable_sequential_cpu_offload()` might fail for `image_encoder` if it uses "
            "`torch.nn.MultiheadAttention`. You can exclude `image_encoder` from CPU offloading by calling "
            "`pipe._exclude_from_cpu_offload.append('image_encoder')` before `pipe.enable_sequential_cpu_offload()`."
        )
    super().enable_sequential_cpu_offload(*args, **kwargs)
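The upside of warning instead of silently excluding is that users whose image encoder offloads fine keep the memory savings, while everyone else gets an actionable hint instead of a cryptic device error.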
Makes sense, I was afraid that when
Sure, let's add that warning, nice idea. Thank you once again for all the iterations on this.
I enjoyed and learned a lot over the course of this PR, thanks a lot for the guidance @hlky @yiyixuxu @stevhliu :) Not bad for a first PR haha

Unless we need to update any pipeline tests or you have more change suggestions, I'd say we're golden! We can add IP-Adapters to the rest of the SD3 pipelines, but that could maybe be a different PR; I've seen interest in using it especially with the ControlNet pipelines (#10129). If that's up for grabs, I'm happy to go for it :)

We should also try to merge "Update checkpoints according to diffusers integration" once this is merged, so the checkpoints can be used.
I think we can allow the user to pass the
* Added support for single IPAdapter on SD3.5 pipeline --------- Co-authored-by: hlky <[email protected]> Co-authored-by: Steven Liu <[email protected]> Co-authored-by: YiYi Xu <[email protected]>
What does this PR do?
Integrates IP-Adapter support for the SD3.5 pipeline, as discussed in #9966.
Before submitting
Who can review?
@yiyixuxu
@sayakpaul
@DN6
@asomoza
@fabiorigano
@haofanwang