
Can InstantID be faster? #42

Open
duongna21 opened this issue Jan 23, 2024 · 21 comments

@duongna21

Thank you for your great work!
I'm curious if we can make InstantID faster? It's 3x slower than the standard SDXL pipeline on my 4090 (2.73it/s vs 8it/s). What slows it down so much?

@haofanwang
Member

I think LCM-LoRA and SDXL-Turbo would be good choices. Worth a shot.

@haofanwang
Member

You can also turn on xFormers for optimization.
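
In diffusers this is usually a one-liner on the pipeline object once xformers is installed (a minimal sketch; pipe stands for the already-constructed InstantID pipeline):

# pip install xformers
pipe.enable_xformers_memory_efficient_attention()  # memory-efficient attention via xFormers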

@dm33tri

dm33tri commented Jan 23, 2024

You can also turn on xFormers for optimization.

I found inference to be faster when using AttnProcessor2_0 (from the InstantID repo) than when using xFormers. Perhaps the code is unfinished, as there is an is_torch2_available function but it's not used?

@haofanwang
Member

Yes, your gain comes from torch 2.0.

@Syndulla

You can also turn on xFormers for optimization.

I found inference to be faster when using AttnProcessor2_0 (from the InstantID repo) than when using xFormers. Perhaps the code is unfinished, as there is an is_torch2_available function but it's not used?

Would you mind sharing how you implemented AttnProcessor2_0? I'd like to test it.

@wangqixun
Member

The simplest solution:
set num_inference_steps=20

The image quality is still good enough, and it speeds things up.

@dm33tri

dm33tri commented Jan 24, 2024

You can also turn on xFormers for optimization.

I found inference to be faster when using AttnProcessor2_0 (from the InstantID repo) than when using xFormers. Perhaps the code is unfinished, as there is an is_torch2_available function but it's not used?

Would you mind sharing how you implemented AttnProcessor2_0? I'd like to test it.

It's already implemented; you just swap AttnProcessor to AttnProcessor2_0 in pipeline_stable_diffusion_xl_instantid.py.
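
A sketch of what that swap (plus a torch-version gate in the spirit of the unused is_torch2_available helper) could look like; pick_attn_processor_cls is a hypothetical name, not code from the repo:

import torch
from diffusers.models.attention_processor import AttnProcessor, AttnProcessor2_0

def pick_attn_processor_cls():
    # torch >= 2.0 ships scaled_dot_product_attention, which AttnProcessor2_0 relies on
    if hasattr(torch.nn.functional, "scaled_dot_product_attention"):
        return AttnProcessor2_0
    return AttnProcessor

# In pipeline_stable_diffusion_xl_instantid.py, wherever a plain AttnProcessor()
# is created for the non-IP-adapter attention layers, instantiate the class
# returned by pick_attn_processor_cls() instead.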

@ResearcherXman
Member

We will update soon

@ResearcherXman
Member

Besides PyTorch 2.0, we have already added an example for LCM-LoRA, which allows for fewer steps and can improve speed quite a bit.

@zvit

zvit commented Jan 27, 2024

@haofanwang Where do you turn on xFormers?

@inck86

inck86 commented Jan 27, 2024

Besides PyTorch 2.0, we have already added an example for LCM-LoRA, which allows for fewer steps and can improve speed quite a bit.

"To use it, you just need to load it and infer with a small num_inference_steps"

I added this in app.py and the file was downloaded:
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="latent-consistency/lcm-lora-sdxl", filename="pytorch_lora_weights.safetensors", local_dir="./checkpoints")

but where should I insert the following to make it work?

from diffusers import LCMScheduler

lcm_lora_path = "./checkpoints/pytorch_lora_weights.safetensors"

pipe.load_lora_weights(lcm_lora_path)
pipe.fuse_lora()
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

num_inference_steps = 10
guidance_scale = 0
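
For reference, the general diffusers LCM-LoRA recipe puts those lines right after the pipeline is constructed and before any generation call, with the two numbers passed into the pipe(...) call itself. A minimal sketch (enable_lcm_lora is a hypothetical helper, and the exact spot inside app.py is an assumption, not the repo's official answer):

from diffusers import LCMScheduler

def enable_lcm_lora(pipe, lora_path="./checkpoints/pytorch_lora_weights.safetensors"):
    # Load and fuse the LCM LoRA, then swap in the matching scheduler.
    pipe.load_lora_weights(lora_path)
    pipe.fuse_lora()
    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
    return pipe

# Call enable_lcm_lora(pipe) right after the StableDiffusionXLInstantIDPipeline is built,
# then pass num_inference_steps=10 and guidance_scale=0 in the pipe(...) generation call.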

@inck86

inck86 commented Jan 28, 2024

Thank you for your great work! I'm curious if we can make InstantID faster? It's 3x slower than the standard SDXL pipeline on my 4090 (2.73it/s vs 8it/s). What slows it down so much?

Someone on my channel wrote that they installed it locally on Ubuntu with a 3090 card, and an image takes about 22 seconds on average.
Maybe it's an Ubuntu thing?

@johndpope
Contributor

johndpope commented Jan 29, 2024

I'm on a 3090 with 24 GB VRAM, 100 GB of DDR5 RAM, and a Samsung NVMe drive.

I asked ChatGPT to help profile; it suggested cProfile (which comes out of the box).
I'm seeing 16.289 seconds.

I'm running this faceswap.py code (LCM):
https://gist.github.com/johndpope/71b70e876a28128228db8bcee2355b0a

#89

import cProfile
import faceswap  # Replace with your script/module name
import pstats


cProfile.run('faceswap.main()', 'profile_output')
p = pstats.Stats('profile_output')
p.sort_stats('cumulative').print_stats(10)  # Adjust as needed


Mon Jan 29 17:34:06 2024    profile_output

         16026416 function calls (11379666 primitive calls) in 16.289 seconds

   Ordered by: cumulative time
   List reduced from 4085 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    334/1    0.006    0.000   16.295   16.295 {built-in method builtins.exec}
        1    0.036    0.036   16.295   16.295 <string>:1(<module>)
        1    0.004    0.004   16.260   16.260 /media/2TB/InstantID/faceswap.py:98(main)
        1    0.000    0.000    4.307    4.307 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/torch/utils/_contextlib.py:112(decorate_context)
        1    0.002    0.002    4.307    4.307 /media/2TB/InstantID/pipeline_stable_diffusion_xl_instantid_inpaint.py:154(__call__)
        1    0.000    0.000    3.390    3.390 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/diffusers/loaders/lora.py:1407(load_lora_weights)
     33/4    0.099    0.003    3.333    0.833 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py:102(_inner_fn)
        1    0.000    0.000    2.889    2.889 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/diffusers/loaders/lora.py:374(load_lora_into_unet)
        1    0.000    0.000    2.742    2.742 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/peft/mapping.py:136(inject_adapter_in_model)
        1    0.000    0.000    2.742    2.742 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/peft/tuners/lora/model.py:110(__init__)

@haofanwang
Member

dreamshaperXL_turboDpmppSDE is a good turbo model and can generate good results with fewer steps.
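
For what it's worth, a hedged sketch of pointing the InstantID pipeline at a local turbo checkpoint; the filename is a placeholder, and whether StableDiffusionXLInstantIDPipeline exposes from_single_file depends on the diffusers/InstantID versions in use (the face-adapter setup from the repo's README is omitted):

import torch
from diffusers.models import ControlNetModel
from pipeline_stable_diffusion_xl_instantid import StableDiffusionXLInstantIDPipeline

controlnet = ControlNetModel.from_pretrained("./checkpoints/ControlNetModel", torch_dtype=torch.float16)
pipe = StableDiffusionXLInstantIDPipeline.from_single_file(
    "./checkpoints/dreamshaperXL_turbo.safetensors",  # placeholder path to the turbo checkpoint
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Turbo checkpoints are tuned for few steps and low CFG, e.g.
# num_inference_steps=6, guidance_scale=2 in the pipe(...) call.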

@inck86

inck86 commented Feb 12, 2024

On Forge UI it works much faster!
About 1.5 min for a 1280x1024 image on my 3060 12 GB!

@patientx

patientx commented Feb 14, 2024

Ryzen 3600X, 8 GB RX 6600, 16 GB RAM, swap on an NVMe drive. Using ComfyUI (--directml --use-split-cross-attention --lowvram).

Normal SDXL generation is around 10 seconds per step on my machine, but when I use InstantID it jumps to around six times that, 60 to 70 seconds per step :/ Even when I use optimized turbo models that take 7 steps, the result is the same.

I can see that my GPU is only being used for a portion of each step; most of the work is happening on the CPU, and since I'm using the --lowvram option to avoid out-of-memory errors (thankfully I don't get them with this setup), I suspect most of the extra time comes from swapping system memory to the NVMe drive. Mind you, the actual generation shows up in GPU power usage: if one step takes 60 seconds, power is maxed for only about 10 of them, exactly the time a step takes without InstantID. Normally GPU power usage is maxed the whole time while generating, but here the GPU is idling most of the time, so the GPU itself isn't the bottleneck.

How can I improve this? Would pruned models help, if that's even possible? Thanks.

@rupeshs

rupeshs commented Feb 22, 2024

@patientx Check it out: https://github.com/rupeshs/instantidcpu

@patientx

@patientx Check it out: https://github.com/rupeshs/instantidcpu

Tried it in its first few hours, of course :) Unfortunately it is very slow for me, slower than the GPU option. Thanks for the reminder, though.

@onnmah

onnmah commented Apr 28, 2024

Ryzen 3600X, 8 GB RX 6600, 16 GB RAM, swap on an NVMe drive. Using ComfyUI (--directml --use-split-cross-attention --lowvram).

Normal SDXL generation is around 10 seconds per step on my machine, but when I use InstantID it jumps to around six times that, 60 to 70 seconds per step :/ Even when I use optimized turbo models that take 7 steps, the result is the same.

I can see that my GPU is only being used for a portion of each step; most of the work is happening on the CPU, and since I'm using the --lowvram option to avoid out-of-memory errors (thankfully I don't get them with this setup), I suspect most of the extra time comes from swapping system memory to the NVMe drive. Mind you, the actual generation shows up in GPU power usage: if one step takes 60 seconds, power is maxed for only about 10 of them, exactly the time a step takes without InstantID. Normally GPU power usage is maxed the whole time while generating, but here the GPU is idling most of the time, so the GPU itself isn't the bottleneck.

How can I improve this? Would pruned models help, if that's even possible? Thanks.

I had the same issue and realized it's related to onnxruntime using the CPU instead of the GPU. Uninstalling it and reinstalling onnxruntime-gpu fixed it for me.

pip uninstall onnxruntime
pip install --upgrade --force-reinstall onnxruntime-gpu

Hope it helps.
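
A quick sanity check after the reinstall (not part of the InstantID code) is to ask onnxruntime which execution providers it can actually see:

import onnxruntime as ort
print(ort.get_available_providers())  # 'CUDAExecutionProvider' should appear in this list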

@RobinChen007

Besides PyTorch 2.0, we have already added an example for LCM-LoRA, which allows for fewer steps and can improve speed quite a bit.

"To use it, you just need to load it and infer with a small num_inference_steps"

I added this in app.py and the file was downloaded:
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="latent-consistency/lcm-lora-sdxl", filename="pytorch_lora_weights.safetensors", local_dir="./checkpoints")

but where should I insert the following to make it work?

from diffusers import LCMScheduler

lcm_lora_path = "./checkpoints/pytorch_lora_weights.safetensors"

pipe.load_lora_weights(lcm_lora_path)
pipe.fuse_lora()
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

num_inference_steps = 10
guidance_scale = 0

I have already tried it and found an issue: if the LoRA model is not an LCM LoRA, the LCMScheduler will reduce image quality.

@ChaerilM

I would like to know how to speed things up as well. I'm using an AWS g5 instance, and normal photo generation takes ~20 minutes. I'm using the Replicate version since I need the API.

But it's pretty slow right from startup:

  • setting up pose, canny, and SDXL takes ~3 minutes
  • loading the pipeline components takes 20-30 s

When generating an image, it runs at 40-50 s per step for 30 steps. While the LoRA does reduce it to 5 steps, the per-step speed stays the same, and in some tests I found that the 1st step takes over 100 s, the 2nd drops to 70, and the 3rd onwards are around 50.
