Can InstantID be faster? #42
Comments
I think LCM-LoRA and SDXL-Turbo would be good choices. They deserve a shot.
You can also turn on xFormers for optimization.
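For reference, a minimal sketch of what "turning on xFormers" looks like with the diffusers API, assuming the xformers package is installed and that a pipeline is built the usual way (InstantID's pipeline inherits from the same diffusers base class, so the call is the same):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Hypothetical setup: any diffusers SDXL-style pipeline works the same way here.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Enable xFormers memory-efficient attention (requires the xformers package).
pipe.enable_xformers_memory_efficient_attention()
```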
I found inference to be faster when using AttnProcessor2_0.
Yes. Your gain comes from PyTorch 2.0.
Would you mind sharing how you implemented AttnProcessor2_0? I'd like to test it.
The simplest solution: the image quality is good enough. Speed up.
It's already implemented; you just swap in AttnProcessor2_0.
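For anyone wanting to try it, a small sketch of the swap being discussed, assuming PyTorch >= 2.0 and a pipeline already loaded as `pipe` (recent diffusers versions select this processor automatically on torch 2.x, so the explicit call is often unnecessary):

```python
from diffusers.models.attention_processor import AttnProcessor2_0

# Route the UNet's attention through PyTorch 2.0's scaled_dot_product_attention.
# `pipe` is assumed to be the already-constructed InstantID/SDXL pipeline.
pipe.unet.set_attn_processor(AttnProcessor2_0())
```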
We will update soon.
Besides PyTorch 2.0, we have already added an example for LCM-LoRA, which allows fewer steps and can improve speed quite a bit.
@haofanwang Where do you turn on xFormers?
"To use it, you just need to load it and infer with a small num_inference_steps" I add in app.py and it was downloaded: but where should I insert this to make it work? : from diffusers import LCMScheduler lcm_lora_path = "./checkpoints/pytorch_lora_weights.safetensors" pipe.load_lora_weights(lcm_lora_path) num_inference_steps = 10 |
Someone from my channel wrote: installed locally on Ubuntu with a 3090 card; on average an image takes about 22 seconds.
I'm on a 3090 (24 GB VRAM) with 100 GB of RAM (DDR5) and a Samsung NVMe drive. I asked ChatGPT to help profile; it suggested cProfile (comes out of the box). I am running this faceswap.py code (LCM):

import cProfile
import pstats
import faceswap  # Replace with your script/module name

cProfile.run('faceswap.main()', 'profile_output')
p = pstats.Stats('profile_output')
p.sort_stats('cumulative').print_stats(10)  # Adjust as needed

Mon Jan 29 17:34:06 2024    profile_output
16026416 function calls (11379666 primitive calls) in 16.289 seconds

Ordered by: cumulative time
List reduced from 4085 to 10 due to restriction <10>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 334/1    0.006    0.000   16.295   16.295 {built-in method builtins.exec}
     1    0.036    0.036   16.295   16.295 <string>:1(<module>)
     1    0.004    0.004   16.260   16.260 /media/2TB/InstantID/faceswap.py:98(main)
     1    0.000    0.000    4.307    4.307 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/torch/utils/_contextlib.py:112(decorate_context)
     1    0.002    0.002    4.307    4.307 /media/2TB/InstantID/pipeline_stable_diffusion_xl_instantid_inpaint.py:154(__call__)
     1    0.000    0.000    3.390    3.390 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/diffusers/loaders/lora.py:1407(load_lora_weights)
  33/4    0.099    0.003    3.333    0.833 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py:102(_inner_fn)
     1    0.000    0.000    2.889    2.889 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/diffusers/loaders/lora.py:374(load_lora_into_unet)
     1    0.000    0.000    2.742    2.742 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/peft/mapping.py:136(inject_adapter_in_model)
     1    0.000    0.000    2.742    2.742 /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/peft/tuners/lora/model.py:110(__init__)
dreamshaperXL_turboDpmppSDE is a good turbo model and can generate good results with fewer steps.
On Forge UI it works much faster!
Ryzen 3600X, 8 GB RX 6600, 16 GB RAM, swap on an NVMe drive, using ComfyUI (--directml --use-split-cross-attention --lowvram). Normal SDXL generation is around 10 seconds per step on my machine, but when I use InstantID it jumps to roughly six times that, 60 to 70 seconds per step. Even when I use optimized turbo models that need only 7 steps, the result is the same. I can see that my GPU isn't being used for most of each step, only a portion; most of the work happens on the CPU, and since I use the --lowvram option to avoid out-of-memory errors (thankfully I don't get them with this setup), I suspect most of the extra time is spent swapping system memory to the NVMe drive. Mind you, the actual generation is visible in the GPU power usage: if a step takes 60 seconds, power is maxed for only about 10 of them, the same time it takes without InstantID. Normally GPU power usage is maxed the whole time during generation, but here the GPU is idling most of the time, so the GPU itself isn't the bottleneck. How can I improve this? Maybe pruned models would help, if that's possible at all? Thanks.
@patientx Check it out: https://github.com/rupeshs/instantidcpu
Tried it in the first few hours, of course :) Unfortunately it is very slow for me, slower than the GPU option. Thanks for the reminder anyway.
I had the same issue and realized it's related to onnxruntime using the CPU instead of the GPU. Uninstalling it and reinstalling onnxruntime-gpu fixed it for me:

pip uninstall onnxruntime
pip install onnxruntime-gpu

Hope it helps.
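A quick way to check which execution providers onnxruntime exposes after the reinstall (a sketch; if the CUDA provider is missing, the ONNX-based face-analysis step falls back to the CPU):

```python
import onnxruntime as ort

# If 'CUDAExecutionProvider' is not in this list, only the CPU build of
# onnxruntime is installed and the ONNX models will run on the CPU.
print(ort.get_available_providers())
```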
I have already tried it and found an issue: if the LoRA model is not an LCM LoRA, LCMScheduler will reduce image quality.
I would like to know how to speed things up as well. I'm using an AWS g5 instance, and a normal photo generation takes ~20 minutes. I'm using the Replicate version since I need the API, but it's pretty slow from startup.
When making an image, it runs at 40-50 seconds per step for 30 steps. While the LoRA does reduce the step count to 5, the speed is still the same, and in some tests I found that the first step takes over 100 seconds, the second about 70, and the third and onward around 50.
Thank you for your great work!
I'm curious whether we can make InstantID faster. It's about 3x slower than the standard SDXL pipeline on my 4090 (2.73 it/s vs. 8 it/s). What slows it down so much?