Commit
support 12GB cards
ToTheBeginning committed Sep 19, 2024
1 parent 476c6a0 commit 0eae18d
Showing 3 changed files with 14 additions and 2 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -21,6 +21,7 @@ We will actively update and maintain this repository in the near future, so plea
 - [x] Local gradio demo is ready now
 - [x] Online HuggingFace demo is ready now [![flux](https://img.shields.io/badge/🤗-PuLID_FLUX_demo-orange)](https://huggingface.co/spaces/yanze/PuLID-FLUX)
 - [x] We have optimized the code to support consumer-grade GPUs, and now **PuLID-FLUX can run on a 16GB graphics card**. Check the details [here](https://github.com/ToTheBeginning/PuLID/blob/main/docs/pulid_for_flux.md#local-gradio-demo)
+- [x] Support for 12GB graphics cards


 The results below are generated with PuLID-FLUX.
5 changes: 4 additions & 1 deletion docs/pulid_for_flux.md
@@ -25,7 +25,10 @@ Run `python app_flux.py --offload --fp8 --onnx_provider cpu`, the peak memory is
 
 For users with 24GB of graphics memory, you can run `python app_flux.py --offload --fp8`; the peak memory is under 17GB.
 
-However, there is a difference in image quality between fp8 and bf16, with some degradation in the former.
+For users with 12GB of graphics memory, you can run `python app_flux.py --aggressive_offload --fp8 --onnx_provider cpu`; the peak memory is about 11GB.
+However, aggressive offload (which works like sequential offload) is very slow, because weights must be transferred between CPU and GPU at every timestep.
+
+Please note that there is a difference in image quality between fp8 and bf16, with some degradation in the former.
 Specifically, the details of the face may be slightly worse, but the layout is similar. If you want the best results
 from PuLID-FLUX, or if you have the resources, please use bf16 rather than fp8.
 We have included a comparison in the table below.
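
The docs above quote a peak-memory figure for each flag combination. As a quick way to check those numbers on your own hardware, here is a minimal sketch that records peak GPU allocation around an inference call. It assumes a CUDA build of PyTorch; `generate` is a hypothetical stand-in for your actual inference function, not part of the repository.

```python
import torch

def report_peak_gpu_memory(fn, *args, **kwargs):
    """Run fn and print the peak GPU memory PyTorch allocated during the call.

    Note: this tracks PyTorch's own allocations, not total GPU usage,
    so it will read slightly below what nvidia-smi shows.
    """
    torch.cuda.reset_peak_memory_stats()
    result = fn(*args, **kwargs)
    torch.cuda.synchronize()
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"peak GPU memory: {peak_gb:.2f} GB")
    return result

# Hypothetical usage:
# image = report_peak_gpu_memory(generate, prompt="portrait photo of a woman")
```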
10 changes: 9 additions & 1 deletion flux/model.py
@@ -127,8 +127,16 @@ def forward(

         img = torch.cat((txt, img), 1)
         if aggressive_offload:
-            self.single_blocks = self.single_blocks.to(DEVICE)
+            # put the first half of the single blocks on the GPU
+            for i in range(len(self.single_blocks) // 2):
+                self.single_blocks[i] = self.single_blocks[i].to(DEVICE)
         for i, block in enumerate(self.single_blocks):
+            if aggressive_offload and i == len(self.single_blocks) // 2:
+                # move the first half of the single blocks to the CPU and the second half to the GPU
+                for j in range(len(self.single_blocks) // 2):
+                    self.single_blocks[j].cpu()
+                for j in range(len(self.single_blocks) // 2, len(self.single_blocks)):
+                    self.single_blocks[j] = self.single_blocks[j].to(DEVICE)
             x = block(img, vec=vec, pe=pe)
         real_img, txt = x[:, txt.shape[1]:, ...], x[:, :txt.shape[1], ...]
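
The diff above shows only the changed region of `forward`. To make the pattern easier to see in isolation, below is a self-contained sketch of the same half-and-half offload idea, using dummy `nn.Linear` blocks in place of the FLUX single blocks: the first half of the blocks starts on the GPU, and at the midpoint of the loop the halves are swapped so each block is resident exactly when it runs. `DEVICE`, the block count, and the feature size are placeholder assumptions, not values from the repository.

```python
import torch
from torch import nn

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Dummy stand-ins for the FLUX single blocks.
single_blocks = nn.ModuleList([nn.Linear(64, 64) for _ in range(8)])

def forward_with_aggressive_offload(x):
    half = len(single_blocks) // 2
    # Put the first half of the blocks on the GPU up front.
    for i in range(half):
        single_blocks[i].to(DEVICE)
    for i, block in enumerate(single_blocks):
        if i == half:
            # Midpoint: send the first half back to the CPU, bring in the second half.
            for j in range(half):
                single_blocks[j].cpu()
            for j in range(half, len(single_blocks)):
                single_blocks[j].to(DEVICE)
        x = block(x)
    return x

x = torch.randn(2, 64, device=DEVICE)
out = forward_with_aggressive_offload(x)
print(out.shape)  # torch.Size([2, 64])
```

Only half of the blocks occupy GPU memory at any time, at the cost of a bulk CPU-to-GPU transfer at every timestep during sampling, which is the slowdown the docs warn about.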

