Does Multi GPU help? - having a hard time finetuning stable-diffusion 3.5 medium on a 16gb GPU #1307

NirmalF7 · 2025-01-29T18:14:36Z

I am trying to fine-tune Stable Diffusion 3.5 Medium and keep running out of memory on my 16 GB GPU. I upgraded to a 4x16 GB machine and set training_num_processes=4, but then all 4 GPUs crashed. For context, here is my config file (I have tried to make everything as small as possible, even using a LoRA rank of 4 just for testing). I think it may be a similar problem to discussion #1306.

{
  "--model_type": "lora",
  "--model_family": "sd3",
  "--resume_from_checkpoint": "latest",
  "--checkpointing_steps": 200,
  "--checkpoints_total_limit": 60,
  "--learning_rate": 1.05e-3,
  "--pretrained_model_name_or_path": "stabilityai/stable-diffusion-3.5-medium",
  "--report_to": "wandb",
  "--tracker_project_name": "sd35-training",
  "--tracker_run_name": "simpletuner-esa-satellite",
  "--max_train_steps": 3300,
  "--num_train_epochs": 0,
  "--data_backend_config": "-----------------------",
  "--output_dir": "----------------------",
  "--push_to_hub": false,
  "--push_checkpoints_to_hub": false,
  "--hub_model_id": "sd35-training",
  "--resolution": 1024,
  "--resolution_type": "pixel",
  "--minimum_image_size": 1024,
  "--instance_prompt": "abcx",
  "--validation_prompt": "abcx, a car on a motorway",
  "--validation_guidance": 7.5,
  "--validation_guidance_rescale": 0.0,
  "--validation_steps": 200,
  "--validation_num_inference_steps": 30,
  "--validation_negative_prompt": "blurry, cropped, ugly",
  "--validation_seed": 42,
  "--validation_resolution": 1024,
  "--train_batch_size": 1,
  "--gradient_accumulation_steps": 1,
  "--lr_scheduler": "cosine",
  "--lr_warmup_steps": 330,
  "--caption_dropout_probability": 0.1,
  "--metadata_update_interval": 65,
  "--vae_batch_size": 1,
  "--delete_unwanted_images": false,
  "--delete_problematic_images": false,
  "--training_scheduler_timestep_spacing": "trailing",
  "--inference_scheduler_timestep_spacing": "trailing",
  "--snr_gamma": 5,
  "--enable_xformers_memory_efficient_attention": true,
  "--gradient_checkpointing": true,
  "--allow_tf32": true,
  "--optimizer": "adamw_bf16",
  "--use_ema": false,
  "--ema_decay": 0.999,
  "--seed": 42,
  "--mixed_precision": "bf16",
  "--lora_rank": 4,
  "--lora_alpha": 4,
  "--lora_type": "standard",
  "--base_model_precision": "int2-quanto",
  "--text_encoder_1_precision": "no_change",
  "--text_encoder_2_precision": "no_change",
  "--text_encoder_3_precision": "no_change",
  "--quantize_via": "cpu"
}

It fails every time here:
Discovering cache objects..
Processing bucket 1.0: 0%| | 0/2 [00:00<?, ?it/s]
2025-01-29 17:58:15,446 [ERROR] Error encoding images

This is a Dockerized environment, if that makes any difference, but I have made sure my container can see the GPUs and I copied the template Dockerfile. Thanks for any help :)

bghira · 2025-01-29T18:38:48Z

Maintainer

This is --vae_batch_size being too high.

5 replies

bghira Jan 29, 2025
Maintainer

Unfortunately, it looks like you already have it set to 1. You might have to try enabling VAE tiling and slicing.
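For readers unfamiliar with the idea, here is a rough conceptual sketch of why tiling can lower peak memory: the image is processed as several overlapping tiles instead of in one pass, so only one tile's worth of data is in flight at a time. This is not SimpleTuner's or diffusers' actual implementation; `tile_boxes` is a hypothetical helper, and the tile size and overlap are made-up values. Slicing, as I understand it, applies the same idea along the batch dimension, so with a batch size of 1 it probably helps less than tiling.

```python
def tile_boxes(height, width, tile, overlap):
    """Return (top, left, bottom, right) boxes covering the image with overlapping tiles."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap  # how far each tile is shifted from the previous one
    boxes = []
    top = 0
    while True:
        bottom = min(top + tile, height)
        left = 0
        while True:
            right = min(left + tile, width)
            boxes.append((top, left, bottom, right))
            if right == width:  # reached the right edge of the image
                break
            left += stride
        if bottom == height:  # reached the bottom edge of the image
            break
        top += stride
    return boxes


# Hypothetical settings: a 1024x1024 image, 512-pixel tiles, 64 pixels of overlap.
boxes = tile_boxes(1024, 1024, tile=512, overlap=64)
largest = max((b - t) * (r - l) for t, l, b, r in boxes)
print(len(boxes), "tiles; largest tile has", largest, "pixels vs", 1024 * 1024, "for the full image")
```

The trade-off is more passes through the VAE (and some blending work at the seams in a real implementation) in exchange for a smaller peak allocation per pass.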

NirmalF7 Jan 29, 2025
Author

If I try this and see no improvement, I assume my only option is a bigger GPU?

bghira Jan 29, 2025
Maintainer

I would say that is a safe bet.

NirmalF7 Jan 30, 2025
Author

Why would a single 32 GB GPU be better than 4x16 GB GPUs?

bghira Jan 30, 2025
Maintainer

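Because the parameters are not sharded. Each GPU holds its own full copy of the model, so each of the four processes still has to fit everything into 16 GB.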
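A back-of-the-envelope sketch of that point, counting model weights only (activations, the VAE, text encoders, and optimizer state are ignored). The parameter count and bytes per parameter are assumed round numbers for illustration, not measurements from this setup, and `per_gpu_weight_gib` is a hypothetical helper.

```python
def per_gpu_weight_gib(num_params, bytes_per_param, num_gpus, sharded):
    """Approximate GiB of model weights resident on each GPU."""
    total_bytes = num_params * bytes_per_param
    # Replicated (non-sharded) training keeps a full copy on every GPU;
    # sharded training would split the weights evenly across GPUs.
    per_gpu_bytes = total_bytes / num_gpus if sharded else total_bytes
    return per_gpu_bytes / 2**30


# Assumed figures for illustration only: 2.5e9 parameters at 2 bytes each (bf16).
NUM_PARAMS = 2.5e9
BYTES_PER_PARAM = 2

for gpus in (1, 4):
    replicated = per_gpu_weight_gib(NUM_PARAMS, BYTES_PER_PARAM, gpus, sharded=False)
    sharded = per_gpu_weight_gib(NUM_PARAMS, BYTES_PER_PARAM, gpus, sharded=True)
    print(f"{gpus} GPU(s): replicated {replicated:.2f} GiB/GPU, sharded {sharded:.2f} GiB/GPU")
```

Without sharding, the per-GPU figure is the same for 1 GPU and for 4, so extra GPUs add throughput but not headroom. A single larger GPU raises the ceiling that each process has to fit under.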
