I am trying to fine-tune Stable Diffusion 3.5 Medium and keep running out of memory on my 16 GB GPU. I upgraded to a 4x16 GB machine and set training_num_processes=4, but then all 4 GPUs crashed. For context, here is my config file (I have tried to reduce everything as far as possible, even using a LoRA rank of 4 just for testing). I think it may be a similar problem to discussion #1306.
{
  "--model_type": "lora",
  "--model_family": "sd3",
  "--resume_from_checkpoint": "latest",
  "--checkpointing_steps": 200,
  "--checkpoints_total_limit": 60,
  "--learning_rate": 1.05e-3,
  "--pretrained_model_name_or_path": "stabilityai/stable-diffusion-3.5-medium",
  "--report_to": "wandb",
  "--tracker_project_name": "sd35-training",
  "--tracker_run_name": "simpletuner-esa-satellite",
  "--max_train_steps": 3300,
  "--num_train_epochs": 0,
  "--data_backend_config": "-----------------------",
  "--output_dir": "----------------------",
  "--push_to_hub": false,
  "--push_checkpoints_to_hub": false,
  "--hub_model_id": "sd35-training",
  "--resolution": 1024,
  "--resolution_type": "pixel",
  "--minimum_image_size": 1024,
  "--instance_prompt": "abcx",
  "--validation_prompt": "abcx, a car on a motorway",
  "--validation_guidance": 7.5,
  "--validation_guidance_rescale": 0.0,
  "--validation_steps": 200,
  "--validation_num_inference_steps": 30,
  "--validation_negative_prompt": "blurry, cropped, ugly",
  "--validation_seed": 42,
  "--validation_resolution": 1024,
  "--train_batch_size": 1,
  "--gradient_accumulation_steps": 1,
  "--lr_scheduler": "cosine",
  "--lr_warmup_steps": 330,
  "--caption_dropout_probability": 0.1,
  "--metadata_update_interval": 65,
  "--vae_batch_size": 1,
  "--delete_unwanted_images": false,
  "--delete_problematic_images": false,
  "--training_scheduler_timestep_spacing": "trailing",
  "--inference_scheduler_timestep_spacing": "trailing",
  "--snr_gamma": 5,
  "--enable_xformers_memory_efficient_attention": true,
  "--gradient_checkpointing": true,
  "--allow_tf32": true,
  "--optimizer": "adamw_bf16",
  "--use_ema": false,
  "--ema_decay": 0.999,
  "--seed": 42,
  "--mixed_precision": "bf16",
  "--lora_rank": 4,
  "--lora_alpha": 4,
  "--lora_type": "standard",
  "--base_model_precision": "int2-quanto",
  "--text_encoder_1_precision": "no_change",
  "--text_encoder_2_precision": "no_change",
  "--text_encoder_3_precision": "no_change",
  "--quantize_via": "cpu"
}
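As a quick sanity check on a config like this, a minimal Python sketch can list which text encoder precision settings are still at "no_change". It uses only a subset of the keys above, and the helper name `unquantized_text_encoders` is hypothetical. It also assumes that "no_change" means the encoder stays at whatever precision it was loaded in, which is an assumption rather than something confirmed in this thread.

```python
import json

# Subset of the config above (values copied from the post).
CONFIG_TEXT = """
{
  "--resolution": 1024,
  "--train_batch_size": 1,
  "--gradient_checkpointing": true,
  "--mixed_precision": "bf16",
  "--lora_rank": 4,
  "--base_model_precision": "int2-quanto",
  "--text_encoder_1_precision": "no_change",
  "--text_encoder_2_precision": "no_change",
  "--text_encoder_3_precision": "no_change",
  "--vae_batch_size": 1
}
"""


def unquantized_text_encoders(config):
    """Hypothetical helper: return text-encoder precision keys set to "no_change"."""
    return sorted(
        key
        for key, value in config.items()
        if key.startswith("--text_encoder_")
        and key.endswith("_precision")
        and value == "no_change"
    )


config = json.loads(CONFIG_TEXT)
for key in unquantized_text_encoders(config):
    # Assumption: "no_change" leaves this encoder unquantized.
    print(f"{key} is set to no_change")
```

This only reads the settings; it does not say whether they are the cause of the out-of-memory error.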
It fails every time at this point:
Discovering cache objects..
Processing bucket 1.0: 0%| | 0/2 [00:00<?, ?it/s]
2025-01-29 17:58:15,446 [ERROR] Error encoding images
This is a Dockerized environment, if that makes any difference, but I have made sure my container can see the GPUs and copied the template Dockerfile. Thanks for any help :)
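One way to confirm GPU visibility from inside a container is sketched below. It assumes the `nvidia-smi` tool is available on the container's PATH, and the function name `visible_gpus` is hypothetical. This is only an illustration, not necessarily the check that was used here.

```python
import shutil
import subprocess


def visible_gpus():
    """Return one 'index, name, memory' string per GPU reported by nvidia-smi.

    Returns an empty list if nvidia-smi is missing or exits with an error.
    """
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,name,memory.total",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"{len(gpus)} GPU(s) visible")
    for gpu in gpus:
        print(gpu)
```

On the 4x16 GB machine, four lines of output would be expected; fewer would suggest the container is not exposing every device.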