- Please see and Comment - Only 4 steps, amazing images #1205
Replies: 5 comments 2 replies
-
I'll give this a shot. 12 GB VRAM RTX 3060 and 16 GB system RAM. I'll let you know my results later, because these look amazing.
-
Wow, that's quite the feat for just 4 steps! We've seen before with Turbo and similar models that "creativity" was a bit limited, with results somewhat similar to each other across random seeds. How does this merge perform?
-
I get a weird error when I try to load this and Forge crashes.
-
I finally got it working :D I only tested with one prompt, but it's pretty nice. Thanks for the recommendation!
-
Hi mate,
When I use the flux1-dev-bnb-nf4-v2.safetensors version, image generation takes a while at first; the time decreases the more I use it, and I don't understand why.
So I'm testing a merged version that I found on Hugging Face, https://huggingface.co/drbaph/FLUX.1-schnell-dev-merged-fp8-4step, and I'm liking the results.
With this merged version I can generate very good images with only 4 steps, at roughly 20 s per image.
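For anyone who prefers scripting it outside Forge, here is a minimal sketch of 4-step generation with diffusers. Assumptions: diffusers >= 0.30 (which added `FluxPipeline`; the 0.29.2 in my setup below is what my Forge install bundles), a CUDA GPU, and a diffusers-format checkpoint — the linked repo ships a single .safetensors, so you may need to convert it first or just load it through Forge. The function name and `model_id` are illustrative:

```python
def generate_4step(prompt, model_id, seed=0):
    """Sketch: sample a schnell/dev merge in 4 steps.
    Assumes diffusers >= 0.30 (FluxPipeline) and a diffusers-format model."""
    import torch
    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    pipe.enable_model_cpu_offload()  # keeps an 8 GB card from OOMing, similar to Forge's CPU swap
    generator = torch.Generator("cpu").manual_seed(seed)
    result = pipe(
        prompt,
        num_inference_steps=4,   # the whole point of the merge
        guidance_scale=0.0,      # schnell-style: CFG off, negatives ignored
        generator=generator,
    )
    return result.images[0]
```

The `guidance_scale=0.0` mirrors what Forge logs below ("Skipping unconditional conditioning when CFG = 1").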
My setup is:
Windows 11 Pro - Ryzen 5 3500X
32 GB RAM
RTX 3050, 8 GB VRAM
torch: 2.3.1+cu121, autocast half
cuda: 12.1
cudnn: 8907
driver: 560.70
diffusers: 0.29.2
transformers: 4.44.0
python: 3.10.6
See some results:
All these images were generated with the FLUX.1-schnell-dev-merged-fp8-4step model, with only 4 steps, without VAE, with BNB-NF4. The average time per generation was only about 20 s.
The first load takes a while, but once the model is resident, from the second generation onwards the time drops considerably.
See more details below:
Stable Diffusion PATH: F:\ForgeFlux\webui
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: f2.0.1v1.10.1-previous-310-g695aad95
Commit hash: 695aad95e45a1cd24c016d3232d2be08b7faad17
CUDA 12.1
Launching Web UI with arguments: --precision full --opt-split-attention --always-batch-cond-uncond --no-half --skip-torch-cuda-test --pin-shared-memory --cuda-malloc --cuda-stream --ckpt-dir 'F:\ModelsForge\Checkpoints' --lora-dir 'F:\ModelsForge\Loras'
Using cudaMallocAsync backend.
Total VRAM 8191 MB, total RAM 32705 MB
pytorch version: 2.3.1+cu121
Set vram state to: NORMAL_VRAM
Always pin shared GPU memory
Device: cuda:0 NVIDIA GeForce RTX 3050 : cudaMallocAsync
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: True
Using pytorch cross attention
Using pytorch attention for VAE
ControlNet preprocessor location: F:\ForgeFlux\webui\models\ControlNetPreprocessor
[-] ADetailer initialized. version: 24.8.0, num models: 10
sd-webui-prompt-all-in-one background API service started successfully.
17:32:07 - ReActor - STATUS - Running v0.7.1-a1 on Device: CUDA
2024-08-16 17:32:09,701 - ControlNet - INFO - ControlNet UI callback registered.
Model selected: {'checkpoint_info': {'filename': 'F:\ModelsForge\Checkpoints\FLUX1-SchnellDev-Merged-fp8-4step.safetensors', 'hash': '9e0fb423'}, 'additional_modules': [], 'unet_storage_dtype': 'nf4'}
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
IIB Database file has been successfully backed up to the backup folder.
Startup time: 35.7s (prepare environment: 12.8s, launcher: 2.4s, import torch: 3.8s, initialize shared: 0.1s, other imports: 1.1s, opts onchange: 0.8s, list SD models: 0.4s, load scripts: 5.5s, create ui: 5.0s, gradio launch: 2.7s, app_started_callback: 1.0s).
Environment vars changed: {'stream': False, 'inference_memory': 1024.0, 'pin_shared_memory': False}
Loading Model: {'checkpoint_info': {'filename': 'F:\ModelsForge\Checkpoints\FLUX1-SchnellDev-Merged-fp8-4step.safetensors', 'hash': '9e0fb423'}, 'additional_modules': [], 'unet_storage_dtype': 'nf4'}
[Unload] Trying to free 953674316406250018963456.00 MB for cuda:0 with 0 models keep loaded ...
StateDict Keys: {'transformer': 776, 'vae': 244, 'text_encoder': 198, 'text_encoder_2': 220, 'ignore': 0}
Using Default T5 Data Type: torch.float16
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
K-Model Created: {'storage_dtype': 'nf4', 'computation_dtype': torch.bfloat16}
Model loaded in 93.1s (unload existing model: 0.2s, forge model load: 92.8s).
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
To load target model JointTextEncoder
Begin to load 1 model
[Unload] Trying to free 13465.80 MB for cuda:0 with 0 models keep loaded ...
[Memory Management] Current Free GPU Memory: 7184.00 MB
[Memory Management] Required Model Memory: 9570.62 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: -3410.62 MB
[Memory Management] Loaded to GPU for backward capability: 73.14 MB
[Memory Management] Loaded to CPU Swap: 4790.00 MB (blocked method)
[Memory Management] Loaded to GPU: 4852.99 MB
Moving model(s) has taken 18.67 seconds
Distilled CFG Scale will be ignored for Schnell
To load target model KModel
Begin to load 1 model
[Unload] Trying to free 9137.91 MB for cuda:0 with 0 models keep loaded ...
[Unload] Current free memory is 1637.82 MB ...
[Unload] Unload model JointTextEncoder
[Memory Management] Current Free GPU Memory: 7121.19 MB
[Memory Management] Required Model Memory: 6241.47 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: -144.28 MB
[Memory Management] Loaded to CPU Swap: 1426.84 MB (blocked method)
[Memory Management] Loaded to GPU: 4814.55 MB
Moving model(s) has taken 151.16 seconds
100%|██████████| 4/4 [00:32<00:00, 8.12s/it]
To load target model IntegratedAutoencoderKL
Begin to load 1 model
[Unload] Trying to free 2730.93 MB for cuda:0 with 0 models keep loaded ...
[Unload] Current free memory is 2037.39 MB ...
[Unload] Unload model KModel
[Memory Management] Current Free GPU Memory: 7114.03 MB
[Memory Management] Required Model Memory: 159.87 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: 5930.16 MB
Moving model(s) has taken 1.79 seconds
Total progress: 100%|██████████| 4/4 [00:28<00:00, 7.11s/it]
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
To load target model JointTextEncoder
Begin to load 1 model
[Unload] Trying to free 13560.04 MB for cuda:0 with 0 models keep loaded ...
[Unload] Current free memory is 6944.74 MB ...
[Unload] Unload model IntegratedAutoencoderKL
[Memory Management] Current Free GPU Memory: 7104.61 MB
[Memory Management] Required Model Memory: 9643.11 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: -3562.50 MB
[Memory Management] Loaded to GPU for backward capability: 145.52 MB
[Memory Management] Loaded to CPU Swap: 4998.00 MB (blocked method)
[Memory Management] Loaded to GPU: 4789.86 MB
Moving model(s) has taken 1.51 seconds
Distilled CFG Scale will be ignored for Schnell
To load target model KModel
Begin to load 1 model
[Unload] Trying to free 9137.91 MB for cuda:0 with 0 models keep loaded ...
[Unload] Current free memory is 1796.03 MB ...
[Unload] Unload model JointTextEncoder
[Memory Management] Current Free GPU Memory: 7100.03 MB
[Memory Management] Required Model Memory: 6241.47 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: -165.44 MB
[Memory Management] Loaded to CPU Swap: 1446.64 MB (blocked method)
[Memory Management] Loaded to GPU: 4794.75 MB
Moving model(s) has taken 2.27 seconds
100%|██████████| 4/4 [00:13<00:00, 3.41s/it]
To load target model IntegratedAutoencoderKL
Begin to load 1 model
[Unload] Trying to free 2730.93 MB for cuda:0 with 0 models keep loaded ...
[Unload] Current free memory is 2040.43 MB ...
[Unload] Unload model KModel
[Memory Management] Current Free GPU Memory: 7099.45 MB
[Memory Management] Required Model Memory: 159.87 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: 5915.57 MB
Moving model(s) has taken 1.00 seconds
Total progress: 100%|██████████| 4/4 [00:12<00:00, 3.02s/it]
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
To load target model KModel
Begin to load 1 model
[Unload] Trying to free 9137.91 MB for cuda:0 with 0 models keep loaded ...
[Unload] Current free memory is 6939.57 MB ...
[Unload] Unload model IntegratedAutoencoderKL
[Memory Management] Current Free GPU Memory: 7099.45 MB
[Memory Management] Required Model Memory: 6241.47 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: -166.02 MB
[Memory Management] Loaded to CPU Swap: 1446.64 MB (blocked method)
[Memory Management] Loaded to GPU: 4794.75 MB
Moving model(s) has taken 1.66 seconds
100%|██████████| 4/4 [00:13<00:00, 3.42s/it]
To load target model IntegratedAutoencoderKL
Begin to load 1 model
[Unload] Trying to free 2730.93 MB for cuda:0 with 0 models keep loaded ...
[Unload] Current free memory is 1995.33 MB ...
[Unload] Unload model KModel
[Memory Management] Current Free GPU Memory: 7098.29 MB
[Memory Management] Required Model Memory: 159.87 MB
[Memory Management] Required Inference Memory: 1024.00 MB
[Memory Management] Estimated Remaining GPU Memory: 5914.42 MB
Moving model(s) has taken 1.00 seconds
Total progress: 100%|██████████| 4/4 [00:12<00:00, 3.01s/it]
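The `[Memory Management]` lines in the log above follow a simple budget: estimated remaining = current free − required model − required inference, and when that goes negative, part of the model is kept in CPU swap ("blocked method"). A minimal sketch of that arithmetic, with numbers copied from the first JointTextEncoder load above (the function name is illustrative, not Forge's actual code):

```python
def gpu_budget(free_mb, model_mb, inference_mb):
    """Compute remaining GPU memory the way the log lines do,
    and whether part of the model must be swapped to CPU."""
    remaining = free_mb - model_mb - inference_mb
    return remaining, remaining < 0

# First JointTextEncoder load: 7184.00 - 9570.62 - 1024.00
remaining, needs_swap = gpu_budget(7184.00, 9570.62, 1024.00)
print(round(remaining, 2), needs_swap)  # -3410.62 True
```

The same formula reproduces the KModel line (7121.19 − 6241.47 − 1024.00 = −144.28), which is why even the sampler itself gets a small CPU-swap slice on 8 GB.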
See, only 17 s to generate the image:
Result in 4 steps, 17 s...
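That 17 s roughly decomposes from the log: 4 sampler steps at ~3.0 s/it, plus the text-encoder/VAE model moves around them. A quick sketch of the arithmetic (the s/it figure is from the log; the ~5 s overhead is my estimate, not a measured value):

```python
def total_image_time(steps, sec_per_it, overhead_s):
    """Estimated wall-clock time for one image: sampling plus fixed overhead."""
    return steps * sec_per_it + overhead_s

# 4 steps at ~3.01 s/it, plus ~5 s of model moves and VAE decode
print(round(total_image_time(4, 3.01, 5.0), 2))  # 17.04
```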
Feel free to test it too.
Best regards from Brazil!