
Can InstantID be faster? #42

Open
duongna21 opened this issue Jan 23, 2024 · 21 comments

@duongna21

Thank you for your great work!
I'm curious if we can make InstantID faster? It's 3x slower than the standard SDXL pipeline on my 4090 (2.73it/s vs 8it/s). What slows it down so much?

@haofanwang
Member

I think LCM-LoRA and SDXL-Turbo would be good choices. Worth a shot.

@haofanwang
Member

You can also turn on xFormers for optimization.
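
In diffusers this is usually a one-liner on the pipeline object once xformers is installed (a minimal sketch; pipe stands for the already-constructed InstantID pipeline):

# pip install xformers
pipe.enable_xformers_memory_efficient_attention()  # memory-efficient attention via xFormers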

@dm33tri

dm33tri commented Jan 23, 2024

You can also turn on xFormers for optimization.

I found inference to be faster when using AttnProcessor2_0 (from the InstantID repo) than when using xFormers. Perhaps the code is unfinished, as there is an is_torch2_available function but it's not used?

@haofanwang
Member

Yes, your gain comes from torch 2.0.

@Syndulla

You can also turn on xFormers for optimization.

I found inference to be faster when using AttnProcessor2_0 (from the InstantID repo) than when using xFormers. Perhaps the code is unfinished, as there is an is_torch2_available function but it's not used?

Would you mind sharing how you implemented AttnProcessor2_0? I'd like to test it.

@wangqixun
Member

The simplest solution:
set num_inference_steps=20

The image quality is still good enough, and it speeds things up.

@dm33tri

dm33tri commented Jan 24, 2024

You can also turn on xFormers for optimization.

I found inference to be faster when using AttnProcessor2_0 (from the InstantID repo) than when using xFormers. Perhaps the code is unfinished, as there is an is_torch2_available function but it's not used?

Would you mind sharing how you implemented AttnProcessor2_0? I'd like to test it.

It's already implemented; you just swap AttnProcessor to AttnProcessor2_0 in pipeline_stable_diffusion_xl_instantid.py.
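
A sketch of what that swap (plus a torch-version gate in the spirit of the unused is_torch2_available helper) could look like; pick_attn_processor_cls is a hypothetical name, not code from the repo:

import torch
from diffusers.models.attention_processor import AttnProcessor, AttnProcessor2_0

def pick_attn_processor_cls():
    # torch >= 2.0 ships scaled_dot_product_attention, which AttnProcessor2_0 relies on
    if hasattr(torch.nn.functional, "scaled_dot_product_attention"):
        return AttnProcessor2_0
    return AttnProcessor

# In pipeline_stable_diffusion_xl_instantid.py, wherever a plain AttnProcessor()
# is created for the non-IP-adapter attention layers, instantiate the class
# returned by pick_attn_processor_cls() instead.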

@ResearcherXman
Member

We will update soon

@ResearcherXman
Member

Besides PyTorch 2.0, we have already added an example for LCM-LoRA, which allows for fewer steps and can improve speed quite a bit.

@zvit

zvit commented Jan 27, 2024

@haofanwang Where do you turn on xFormers?

@inck86

inck86 commented Jan 27, 2024

Besides PyTorch 2.0, we have already added an example for LCM-LoRA, which allows for fewer steps and can improve speed quite a bit.

"To use it, you just need to load it and infer with a small num_inference_steps"

I added this in app.py and the file was downloaded:
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="latent-consistency/lcm-lora-sdxl", filename="pytorch_lora_weights.safetensors", local_dir="./checkpoints")

but where should I insert the following to make it work?

from diffusers import LCMScheduler

lcm_lora_path = "./checkpoints/pytorch_lora_weights.safetensors"

pipe.load_lora_weights(lcm_lora_path)
pipe.fuse_lora()
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

num_inference_steps = 10
guidance_scale = 0
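
For reference, the general diffusers LCM-LoRA recipe puts those lines right after the pipeline is constructed and before any generation call, with the two numbers passed into the pipe(...) call itself. A minimal sketch (enable_lcm_lora is a hypothetical helper, and the exact spot inside app.py is an assumption, not the repo's official answer):

from diffusers import LCMScheduler

def enable_lcm_lora(pipe, lora_path="./checkpoints/pytorch_lora_weights.safetensors"):
    # Load and fuse the LCM LoRA, then swap in the matching scheduler.
    pipe.load_lora_weights(lora_path)
    pipe.fuse_lora()
    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
    return pipe

# Call enable_lcm_lora(pipe) right after the StableDiffusionXLInstantIDPipeline is built,
# then pass num_inference_steps=10 and guidance_scale=0 in the pipe(...) generation call.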

@inck86

inck86 commented Jan 28, 2024

Thank you for your great work! I'm curious if we can make InstantID faster? It's 3x slower than the standard SDXL pipeline on my 4090 (2.73it/s vs 8it/s). What slows it down so much?

Someone on my channel wrote that they installed it locally on Ubuntu with a 3090 card, and an image takes about 22 seconds on average.
Maybe it's an Ubuntu thing?

@johndpope
Contributor

johndpope commented Jan 29, 2024

I'm on a 3090 with 24 GB VRAM, 100 GB of DDR5 RAM, and a Samsung NVMe drive.

I asked ChatGPT to help profile; it suggested cProfile (which comes out of the box).
I'm seeing 16.289 seconds.

I'm running this faceswap.py code (LCM):
https://gist.github.com/johndpope/71b70e876a28128228db8bcee2355b0a

#89

import cProfile
import faceswap  # Replace with your script/module name
import pstats


cProfile.run('faceswap.main()', 'profile_output')
p = pstats.Stats('profile_output')
p.sort_stats('cumulative').print_stats(10)  # Adjust as needed


Mon Jan 29 17:34:06 2024    profile_output

         16026416 function calls (11379666 primitive calls) in 16.289 seconds

   Ordered by: cumulative time
   List reduced from 4085 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    334/1    0.006    0.000   16.295   16.295 {built-in method builtins.exec}
        1    0.036    0.036   16.295   16.295 <string>:1(<module>)
        1    0.004    0.004   16.260   16.260 /media/2TB/InstantID/faceswap.py:98(main)
        1    0.000    0.000    4.307    4.307 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/torch/utils/_contextlib.py:112(decorate_context)
        1    0.002    0.002    4.307    4.307 /media/2TB/InstantID/pipeline_stable_diffusion_xl_instantid_inpaint.py:154(__call__)
        1    0.000    0.000    3.390    3.390 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/diffusers/loaders/lora.py:1407(load_lora_weights)
     33/4    0.099    0.003    3.333    0.833 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py:102(_inner_fn)
        1    0.000    0.000    2.889    2.889 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/diffusers/loaders/lora.py:374(load_lora_into_unet)
        1    0.000    0.000    2.742    2.742 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/peft/mapping.py:136(inject_adapter_in_model)
        1    0.000    0.000    2.742    2.742 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/peft/tuners/lora/model.py:110(__init__)

@haofanwang
Member

dreamshaperXL_turboDpmppSDE is a good turbo model and can generate good results with fewer steps.
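
For what it's worth, a hedged sketch of pointing the InstantID pipeline at a local turbo checkpoint; the filename is a placeholder, and whether StableDiffusionXLInstantIDPipeline exposes from_single_file depends on the diffusers/InstantID versions in use (the face-adapter setup from the repo's README is omitted):

import torch
from diffusers.models import ControlNetModel
from pipeline_stable_diffusion_xl_instantid import StableDiffusionXLInstantIDPipeline

controlnet = ControlNetModel.from_pretrained("./checkpoints/ControlNetModel", torch_dtype=torch.float16)
pipe = StableDiffusionXLInstantIDPipeline.from_single_file(
    "./checkpoints/dreamshaperXL_turbo.safetensors",  # placeholder path to the turbo checkpoint
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Turbo checkpoints are tuned for few steps and low CFG, e.g.
# num_inference_steps=6, guidance_scale=2 in the pipe(...) call.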

@inck86

inck86 commented Feb 12, 2024

On Forge UI it works much faster!
About 1.5 min for a 1280x1024 image on my 3060 12 GB!

@patientx

patientx commented Feb 14, 2024

Ryzen 3600X, 8 GB RX 6600, 16 GB RAM, swap on an NVMe drive. Using ComfyUI (--directml --use-split-cross-attention --lowvram).

Normal SDXL generation is around 10 seconds per step on my machine, but when I use InstantID it jumps to around six times that, 60 to 70 seconds per step :/ Even when I use optimized turbo models that take 7 steps, the result is the same.

I can see that my GPU is only being used for a portion of each step; most of the work is happening on the CPU, and since I'm using the --lowvram option to avoid out-of-memory errors (thankfully I don't get them with this setup), I suspect most of the extra time comes from swapping system memory to the NVMe drive. Mind you, the actual generation shows up in GPU power usage: if one step takes 60 seconds, power is maxed for only about 10 of them, exactly the time a step takes without InstantID. Normally GPU power usage is maxed the whole time while generating, but here the GPU is idling most of the time, so the GPU itself isn't the bottleneck.

How can I improve this? Would pruned models help, if that's even possible? Thanks.

@rupeshs

rupeshs commented Feb 22, 2024

@patientx Check it out: https://github.com/rupeshs/instantidcpu

@patientx

@patientx Check it out: https://github.com/rupeshs/instantidcpu

Tried it in its first few hours, of course :) Unfortunately it is very slow for me, slower than the GPU option. Thanks for the reminder, though.

@onnmah

onnmah commented Apr 28, 2024

Ryzen 3600X, 8 GB RX 6600, 16 GB RAM, swap on an NVMe drive. Using ComfyUI (--directml --use-split-cross-attention --lowvram).

Normal SDXL generation is around 10 seconds per step on my machine, but when I use InstantID it jumps to around six times that, 60 to 70 seconds per step :/ Even when I use optimized turbo models that take 7 steps, the result is the same.

I can see that my GPU is only being used for a portion of each step; most of the work is happening on the CPU, and since I'm using the --lowvram option to avoid out-of-memory errors (thankfully I don't get them with this setup), I suspect most of the extra time comes from swapping system memory to the NVMe drive. Mind you, the actual generation shows up in GPU power usage: if one step takes 60 seconds, power is maxed for only about 10 of them, exactly the time a step takes without InstantID. Normally GPU power usage is maxed the whole time while generating, but here the GPU is idling most of the time, so the GPU itself isn't the bottleneck.

How can I improve this? Would pruned models help, if that's even possible? Thanks.

I had the same issue and realized it's related to onnxruntime using the CPU instead of the GPU. Uninstalling it and reinstalling onnxruntime-gpu fixed it for me.

pip uninstall onnxruntime
pip install --upgrade --force-reinstall onnxruntime-gpu

Hope it helps.
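
A quick sanity check after the reinstall (not part of the InstantID code) is to ask onnxruntime which execution providers it can actually see:

import onnxruntime as ort
print(ort.get_available_providers())  # 'CUDAExecutionProvider' should appear in this list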

@RobinChen007

Besides PyTorch 2.0, we have already added an example for LCM-LoRA, which allows for fewer steps and can improve speed quite a bit.

"To use it, you just need to load it and infer with a small num_inference_steps"

I added this in app.py and the file was downloaded:
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="latent-consistency/lcm-lora-sdxl", filename="pytorch_lora_weights.safetensors", local_dir="./checkpoints")

but where should I insert the following to make it work?

from diffusers import LCMScheduler

lcm_lora_path = "./checkpoints/pytorch_lora_weights.safetensors"

pipe.load_lora_weights(lcm_lora_path)
pipe.fuse_lora()
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

num_inference_steps = 10
guidance_scale = 0

I have already tried it and found an issue: if the LoRA model is not an LCM LoRA, the LCMScheduler will reduce image quality.

@ChaerilM

I would like to know how to speed things up as well. I'm using an AWS g5 instance, and normal photo generation takes ~20 minutes. I'm using the Replicate version since I need the API.

But it's pretty slow right from startup:

  • setting up pose, canny, and SDXL takes ~3 minutes
  • loading the pipeline components takes 20-30 s

When generating an image, it runs at 40-50 s per step for 30 steps. While the LoRA does reduce it to 5 steps, the per-step speed stays the same, and in some tests I found that the 1st step takes over 100 s, the 2nd drops to 70, and the 3rd onwards are around 50.
