WIP - Web server + conv2d fused + k-quants + dynamic gpu offloading #221
base: master
Conversation
@FSSRepo is there a reason you diverge so much from what automatic1111's webui API returns? It would return a JSON containing an "images" array. You seem to return some multiline string:
@FSSRepo, very excited by the items you are working on. In particular, I am interested in "the kernel by merging the conv2d operation (im2col + GEMM)". Can you elaborate more on this? Is …
@FSSRepo I suggest changing the default port to either …
I am going to do something similar to flash attention. I am going to divide the blocks generated by im2col into smaller parts and perform matrix multiplication using tensor cores. However, upon analyzing the algorithm, it will have to iterate along the channels, which could be up to 102,400 iterations minimum sequentially. This could incur a large number of memory accesses (hence the regression in performance).
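To make the fused approach concrete, here is a minimal CPU-only sketch (my own illustration, not the actual CUDA kernel in this PR): the conv2d reads the input window directly instead of materializing the full im2col buffer, and the inner loop over channels is the long sequential iteration mentioned above; the real kernel would tile that loop and feed the tiles to tensor cores.

```cpp
#include <vector>
#include <cstdio>

// Illustrative only: a scalar "implicit im2col" conv2d. Instead of writing the
// [OH*OW, IC*KH*KW] im2col matrix to memory and then running a GEMM, each
// output element re-reads its input window directly. The ic/ky/kx loop is the
// long sequential channel iteration described above.
static void conv2d_fused(const std::vector<float>& input,   // [IC, IH, IW]
                         const std::vector<float>& kernel,  // [OC, IC, KH, KW]
                         std::vector<float>& output,        // [OC, OH, OW]
                         int IC, int IH, int IW,
                         int OC, int KH, int KW) {
    const int OH = IH - KH + 1;  // stride 1, no padding, for brevity
    const int OW = IW - KW + 1;
    output.assign((size_t)OC * OH * OW, 0.0f);
    for (int oc = 0; oc < OC; ++oc)
        for (int oy = 0; oy < OH; ++oy)
            for (int ox = 0; ox < OW; ++ox) {
                float acc = 0.0f;
                for (int ic = 0; ic < IC; ++ic)          // the costly dimension
                    for (int ky = 0; ky < KH; ++ky)
                        for (int kx = 0; kx < KW; ++kx)
                            acc += input[((size_t)ic * IH + oy + ky) * IW + ox + kx] *
                                   kernel[(((size_t)oc * IC + ic) * KH + ky) * KW + kx];
                output[((size_t)oc * OH + oy) * OW + ox] = acc;
            }
}

int main() {
    // Tiny smoke test: 2 input channels, 4x4 image, 1 output channel, 3x3 kernel.
    std::vector<float> in(2 * 4 * 4, 1.0f), k(1 * 2 * 3 * 3, 1.0f), out;
    conv2d_fused(in, k, out, 2, 4, 4, 1, 3, 3);
    std::printf("out[0] = %.1f (expected 18.0)\n", out[0]);  // 2 channels * 3x3 ones
    return 0;
}
```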
I will try to make the endpoints equivalent to Automatic1111's, so they can still be considered standard functionality, so to speak.
@leejet I'm not sure if this could lead to memory leaks, since it needs to be created for each model, and 10 MB is a lot just to store the metadata of the tensors: stable-diffusion.cpp/ggml_extend.hpp, lines 972 to 982 in afea457
There are some arbitrary memory space additions like this: stable-diffusion.cpp/control.hpp, lines 331 to 334 in afea457
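As a point of reference for the metadata cost (a rough sketch under my own assumptions, not the code in ggml_extend.hpp): when a ggml context is created with no_alloc, its mem_size only has to cover roughly ggml_tensor_overhead() bytes per tensor, so it can be sized from the expected tensor count instead of a fixed 10 MB.

```cpp
#include "ggml.h"
#include <cstdio>

int main() {
    // Sketch: size a context that holds only tensor metadata (no_alloc = true).
    // Each tensor costs roughly ggml_tensor_overhead() bytes of metadata, so
    // the context size can follow the expected tensor count.
    const size_t n_tensors = 2048;  // hypothetical upper bound for one model
    struct ggml_init_params params = {
        /*.mem_size   =*/ n_tensors * ggml_tensor_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,     // metadata only, tensor data lives elsewhere
    };
    struct ggml_context * ctx = ggml_init(params);
    struct ggml_tensor  * t   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 320, 320);
    std::printf("overhead per tensor: %zu bytes, tensor ne0 = %lld\n",
                ggml_tensor_overhead(), (long long) t->ne[0]);
    ggml_free(ctx);
    return 0;
}
```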
The additional space for …
…nt (dark mode) + vae: 60% less memory usage
@FSSRepo for the latest commit I had to use ggerganov/llama.cpp@2f538b9 (from the flash-attn llama.cpp branch)
(a lot of those)
For now, the kernel I created to avoid the overhead of im2col results in a 50% reduction in performance, even though it's only applied to the operation that generates a tensor of up to 1.2 GB for a 512x512 image. I'll try to optimize it further later on. As for the flash attention kernel, it doesn't improve the overall inference performance as expected because it's only applied when head_dim is 40, which generates a tensor [4096, 4096, 8] weighing 500 MB for a 512x512 image and 2 GB for a 1024x1024 image, though only with SD 1.5. All these changes haven't been tested on SDXL yet. I'll rent an RTX 3060 for quick tests.
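(Quick sanity check on that number, my own arithmetic rather than anything from the PR: an f32 tensor of shape [4096, 4096, 8] is 4096 × 4096 × 8 × 4 bytes = 536,870,912 bytes ≈ 512 MiB, consistent with the ~500 MB quoted for the 512x512 case.)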
@Green-Sky I'll try to do tests on an RTX 3060, mainly with CUDA Toolkit 11.8. The truth is that there isn't a standard API for stable diffusion. For example, the ComfyUI API isn't the same as Automatic1111's, nor is the one for Fooocus. It's somewhat like the LLM servers that expose a ChatML-style API emulating the OpenAI endpoints (llama.cpp, vLLM, text-generation-webui, and others).
Yea, there is no "the one" web API, but Automatic1111's API is (or was?) the one with the most usage. Truthfully, I made a small chatbot with it before ComfyUI was a thing. Thanks again for doing this PR :)
@Green-Sky It seemed easier to me to implement it this way (a streaming endpoint, like chat), since otherwise it would have required websockets or a loop calling an 'http://127.0.0.1:7680/progress' endpoint, which would be the equivalent in Automatic1111 if you want to know the real-time status.
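For comparison, here is a rough sketch of the polling loop being avoided, written against libcurl; the endpoint path and port are placeholders, not the actual API of this PR or of Automatic1111.

```cpp
#include <curl/curl.h>
#include <string>
#include <thread>
#include <chrono>
#include <cstdio>

// Collect the HTTP response body into a std::string.
static size_t collect(char * data, size_t size, size_t nmemb, void * userp) {
    static_cast<std::string *>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    for (int i = 0; i < 10 && curl; ++i) {  // poll a few times, then give up
        std::string body;
        // Hypothetical progress endpoint; the streaming endpoint avoids this loop.
        curl_easy_setopt(curl, CURLOPT_URL, "http://127.0.0.1:7680/progress");
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
        if (curl_easy_perform(curl) == CURLE_OK) {
            std::printf("progress response: %s\n", body.c_str());
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
    }
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```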
@Green-Sky I fixed it already. |
@Green-Sky try now |
@FSSRepo it indeed works now again :) Did you disable VAE tiling? Because now it fails to allocate the buffer for decoding.
I like the colors :) Edit: actually there are easily 6 GiB of VRAM available.
Try enabling SD_CONV2D_MEMORY_EFFICIENT; this reduces the VAE memory usage. Or enable VAE tiling manually in the UI.
Can the fused conv2d support the P40, bro?
@Green-Sky Thank you for fixing the error, and yes, I have been inactive because I've been feeling a bit demotivated.
This PR will probably take me a very long time since I'll have to make many modifications to the code. Due to the changes in the way the computation graph is built (similar to PyTorch, which I'm not entirely convinced about, but I believe it at least facilitates the implementation of new models) and the addition of new models like SVD and PhotoMaker, the code has grown considerably. It will require a longer inspection than usual.
My goal is to add a REST API and a web client (prompt manager) as fully as possible. As a top priority, I also want to make the SDXL model work on my 4 GB GPU. For that, I'll need to improve offloading so that only the largest tensors, which take a long time to move, stay on the GPU, while the rest of the smaller matrices reside in RAM.
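A minimal sketch of that kind of offloading policy, using hypothetical types and numbers rather than code from this PR: sort the weights by size and keep the largest ones on the GPU until a VRAM budget is exhausted; everything else stays in host RAM.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical description of one weight tensor.
struct TensorInfo {
    std::string name;
    size_t      nbytes;
    bool        on_gpu = false;
};

// Greedy policy: biggest tensors first, until the VRAM budget is spent.
// Tensors that do not fit stay in host RAM and are uploaded on demand.
static void plan_offload(std::vector<TensorInfo> & tensors, size_t vram_budget) {
    std::sort(tensors.begin(), tensors.end(),
              [](const TensorInfo & a, const TensorInfo & b) { return a.nbytes > b.nbytes; });
    size_t used = 0;
    for (auto & t : tensors) {
        if (used + t.nbytes <= vram_budget) {
            t.on_gpu = true;
            used += t.nbytes;
        }
    }
    std::printf("placed %zu MiB of a %zu MiB budget on the GPU\n",
                used >> 20, vram_budget >> 20);
}

int main() {
    // Made-up tensor names and sizes, purely for illustration.
    std::vector<TensorInfo> tensors = {
        {"unet.attn.to_q", 512ull << 20}, {"unet.attn.to_k", 512ull << 20},
        {"unet.conv_in",    16ull << 20}, {"unet.time_embed",  4ull << 20},
    };
    plan_offload(tensors, 3ull << 30);  // pretend ~3 GiB of VRAM is free
    return 0;
}
```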
I'll also implement flash attention for the GPU backend to reduce memory usage in the UNet phase, and create a kernel that merges the conv2d operation (im2col + GEMM) at the cost of a slight reduction in performance in the VAE phase (with a 60% reduction in memory usage, even lower if I implement FA in the attention phases of the VAE).
I'll work on this in my spare time since I don't have much time due to working on some projects related to llama.cpp.
Build and run server:
The images have questionable quality since I'm using VAE tiling; for that reason, I want to reduce the amount of memory consumed by the VAE phase.
2024-04-05.17-38-44.mp4
Adding the k-quants is going to take a lot of time since I'll have to test the model by trial and error (my computer is somewhat slow for repetitive tasks, which is very frustrating) to see how much quantization affects the different parts of the model. However, there's also the limitation that only the attention weights are quantized, and these usually represent only 45% of the total model weight.
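As an illustration of what that selection might look like (hypothetical name patterns, not the PR's actual rules): a tiny policy that marks only attention weights for k-quant types, which is why the saving is capped by attention being roughly 45% of the model.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Illustrative policy only: choose which tensors get a k-quant type based on
// their names. The patterns here are hypothetical; in practice they would have
// to match the tensor names actually used by stable-diffusion.cpp.
static bool is_attention_weight(const std::string & name) {
    return name.find("attn") != std::string::npos &&
           (name.find(".to_q")   != std::string::npos ||
            name.find(".to_k")   != std::string::npos ||
            name.find(".to_v")   != std::string::npos ||
            name.find(".to_out") != std::string::npos);
}

int main() {
    std::vector<std::string> names = {
        "model.diffusion_model.attn1.to_q.weight",
        "model.diffusion_model.attn1.to_out.0.weight",
        "model.diffusion_model.conv_in.weight",  // not attention: keep as-is
    };
    for (const auto & n : names)
        std::printf("%-48s -> %s\n", n.c_str(),
                    is_attention_weight(n) ? "quantize (e.g. Q4_K)" : "keep f16");
    return 0;
}
```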
In the last few months, there have been a considerable number of changes in the ggml repository, and I want to use the latest version of ggml. This is going to be a headache.