Failed to execute the training process: #171

Open

JianZhang-chick opened this issue Jul 24, 2024 · 2 comments

Comments

@JianZhang-chick

When I execute the following command, an error occurs. Can anyone help me solve it?
accelerate launch -m --config_file accelerate_config.yaml --machine_rank 0 --main_process_ip 0.0.0.0 --main_process_port 20055 --num_machines 1 --num_processes 4 scripts.train_stage1 --config ./configs/train/stage1.yaml
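
For reference, a simpler run often surfaces the underlying exception without the distributed noise. This is only a sketch reusing the same module and config file as the command above; it assumes the paths and accelerate_config.yaml match your setup:

```bash
# Hypothetical single-process run to isolate the error before going back to 4 GPUs.
# Same module and stage-1 config as above; the multi-machine/rank flags are dropped.
accelerate launch \
  --config_file accelerate_config.yaml \
  --num_machines 1 \
  --num_processes 1 \
  -m scripts.train_stage1 \
  --config ./configs/train/stage1.yaml
```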

@JianZhang-chick
Author

[2024-07-25 01:18:44,519] torch.distributed.run: [WARNING]
[2024-07-25 01:18:44,519] torch.distributed.run: [WARNING] *****************************************
[2024-07-25 01:18:44,519] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-25 01:18:44,519] torch.distributed.run: [WARNING] *****************************************
[2024-07-25 01:18:49,819] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-25 01:18:49,820] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-25 01:18:49,820] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-25 01:18:49,820] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
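
The async_io and CUTLASS warnings above come from DeepSpeed's optional op builders and are unlikely to be the cause of the failure further down, but they can be silenced as the messages themselves suggest. A sketch (the CUTLASS path is only a placeholder for a local checkout):

```bash
# Optional: satisfy DeepSpeed's async_io and CUTLASS checks referenced in the warnings above.
sudo apt-get install libaio-dev            # provides the libaio headers and .so object
export CUTLASS_PATH=/path/to/local/cutlass # placeholder; point at your own CUTLASS repo checkout
```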
[2024-07-25 01:18:50,592] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-25 01:18:50,592] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-25 01:18:50,592] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-07-25 01:18:50,600] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-25 01:18:50,621] [INFO] [comm.py:637:init_distributed] cdb=None
INFO:__main__:Distributed environment: DEEPSPEED Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: no
ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 1, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'none', 'nvme_path': None}, 'offload_param': {'device': 'none', 'nvme_path': None}, 'stage3_gather_16bit_weights_on_model_save': False}, 'steps_per_print': inf, 'fp16': {'enabled': False}, 'bf16': {'enabled': False}}

INFO:__main__:Distributed environment: DEEPSPEED Backend: nccl
Num processes: 4
Process index: 3
Local process index: 3
Device: cuda:3

Mixed precision type: no
ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 1, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'none', 'nvme_path': None}, 'offload_param': {'device': 'none', 'nvme_path': None}, 'stage3_gather_16bit_weights_on_model_save': False}, 'steps_per_print': inf, 'fp16': {'enabled': False}, 'bf16': {'enabled': False}}

INFO:__main__:Distributed environment: DEEPSPEED Backend: nccl
Num processes: 4
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: no
ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 1, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'none', 'nvme_path': None}, 'offload_param': {'device': 'none', 'nvme_path': None}, 'stage3_gather_16bit_weights_on_model_save': False}, 'steps_per_print': inf, 'fp16': {'enabled': False}, 'bf16': {'enabled': False}}

INFO:__main__:Distributed environment: DEEPSPEED Backend: nccl
Num processes: 4
Process index: 2
Local process index: 2
Device: cuda:2

Mixed precision type: no
ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 1, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'none', 'nvme_path': None}, 'offload_param': {'device': 'none', 'nvme_path': None}, 'stage3_gather_16bit_weights_on_model_save': False}, 'steps_per_print': inf, 'fp16': {'enabled': False}, 'bf16': {'enabled': False}}

{'scaling_factor', 'latents_std', 'latents_mean', 'force_upcast'} was not found in config. Values will be initialized to default values.
The config attributes {'center_input_sample': False, 'out_channels': 4} were passed to UNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
{'_landmark_net', 'use_linear_projection', 'resnet_time_scale_shift', 'transformer_layers_per_block', 'conv_in_kernel', 'time_embedding_act_fn', 'dual_cross_attention', 'time_cond_proj_dim', 'addition_time_embed_dim', 'num_attention_heads', 'encoder_hid_dim', 'time_embedding_type', 'mid_block_type', 'mid_block_only_cross_attention', 'only_cross_attention', 'num_class_embeds', '_center_input_sample', 'attention_type', 'addition_embed_type_num_heads', 'encoder_hid_dim_type', 'time_embedding_dim', 'reverse_transformer_layers_per_block', 'addition_embed_type', 'class_embeddings_concat', '_out_channels', 'dropout', 'projection_class_embeddings_input_dim', 'timestep_post_act', 'class_embed_type', 'upcast_attention'} was not found in config. Values will be initialized to default values.
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel:
['conv_norm_out.bias, conv_norm_out.weight, conv_out.bias, conv_out.weight']
INFO:hallo.models.unet_3d:loaded temporal unet's pretrained weights from /home/hch/PycharmProjects/hallo/scripts/pretrained_models/stable-diffusion-v1-5/unet ...
INFO:hallo.models.unet_3d:loaded temporal unet's pretrained weights from /home/hch/PycharmProjects/hallo/scripts/pretrained_models/stable-diffusion-v1-5/unet ...
INFO:hallo.models.unet_3d:loaded temporal unet's pretrained weights from /home/hch/PycharmProjects/hallo/scripts/pretrained_models/stable-diffusion-v1-5/unet ...
The config attributes {'center_input_sample': False} were passed to UNet3DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
{'use_linear_projection', 'resnet_time_scale_shift', 'dual_cross_attention', 'class_embed_type', 'only_cross_attention', 'num_class_embeds', 'upcast_attention'} was not found in config. Values will be initialized to default values.
INFO:hallo.models.unet_3d:loaded temporal unet's pretrained weights from /home/hch/PycharmProjects/hallo/scripts/pretrained_models/stable-diffusion-v1-5/unet ...
Load motion module params from /home/hch/PycharmProjects/hallo/scripts/pretrained_models/motion_module/mm_sd_v15_v2.ckpt
Load motion module params from /home/hch/PycharmProjects/hallo/scripts/pretrained_models/motion_module/mm_sd_v15_v2.ckpt
Load motion module params from /home/hch/PycharmProjects/hallo/scripts/pretrained_models/motion_module/mm_sd_v15_v2.ckpt
Load motion module params from /home/hch/PycharmProjects/hallo/scripts/pretrained_models/motion_module/mm_sd_v15_v2.ckpt
INFO:hallo.models.unet_3d:Loaded 453.20928M-parameter motion module
INFO:hallo.models.unet_3d:Loaded 453.20928M-parameter motion module
INFO:hallo.models.unet_3d:Loaded 453.20928M-parameter motion module
INFO:hallo.models.unet_3d:Loaded 453.20928M-parameter motion module
INFO:__main__:Total trainable params 1226
[2024-07-25 01:19:15,570] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
[2024-07-25 01:19:16,396] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-07-25 01:19:16,400] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-07-25 01:19:16,400] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-07-25 01:19:16,593] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW8bit
[2024-07-25 01:19:16,593] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW8bit type=<class 'bitsandbytes.optim.adamw.AdamW8bit'>
[2024-07-25 01:19:16,593] [WARNING] [engine.py:1179:do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2024-07-25 01:19:16,593] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float32 ZeRO stage 2 optimizer
[2024-07-25 01:19:16,593] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500,000,000
[2024-07-25 01:19:16,593] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500,000,000
[2024-07-25 01:19:16,593] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-07-25 01:19:16,593] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-07-25 01:19:18,040] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-07-25 01:19:18,041] [INFO] [utils.py:782:see_memory_usage] MA 5.36 GB Max_MA 5.69 GB CA 6.03 GB Max_CA 6 GB
[2024-07-25 01:19:18,041] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 27.3 GB, percent = 21.7%
[2024-07-25 01:19:18,203] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-07-25 01:19:18,204] [INFO] [utils.py:782:see_memory_usage] MA 5.36 GB Max_MA 6.01 GB CA 6.69 GB Max_CA 7 GB
[2024-07-25 01:19:18,204] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 27.43 GB, percent = 21.8%
[2024-07-25 01:19:18,204] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
[2024-07-25 01:19:18,363] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-07-25 01:19:18,363] [INFO] [utils.py:782:see_memory_usage] MA 5.36 GB Max_MA 5.36 GB CA 6.69 GB Max_CA 7 GB
[2024-07-25 01:19:18,364] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 27.57 GB, percent = 21.9%
[2024-07-25 01:19:18,370] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2024-07-25 01:19:18,370] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-07-25 01:19:18,370] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-07-25 01:19:18,370] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1e-05], mom=[(0.9, 0.999)]
[2024-07-25 01:19:18,374] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] amp_enabled .................. False
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] amp_params ................... False
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] bfloat16_enabled ............. False
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f0eb5e81ab0>
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] communication_data_type ...... None
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-07-25 01:19:18,375] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] dataloader_drop_last ......... False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] disable_allgather ............ False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] dump_state ................... False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... None
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] elasticity_enabled ........... False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] fp16_auto_cast ............... None
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] fp16_enabled ................. False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] global_rank .................. 0
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] grad_accum_dtype ............. None
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 1
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] gradient_clipping ............ 0.0
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] graph_harvesting ............. False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 65536
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] load_universal_checkpoint .... False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] loss_scale ................... 0
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] memory_breakdown ............. False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] mics_shard_size .............. -1
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] optimizer_name ............... None
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] optimizer_params ............. None
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] pld_enabled .................. False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] pld_params ................... False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] prescale_gradients ........... False
[2024-07-25 01:19:18,376] [INFO] [config.py:1001:print] scheduler_name ............... None
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] scheduler_params ............. None
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] sparse_attention ............. None
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] steps_per_print .............. inf
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] train_batch_size ............. 4
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 1
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] use_data_before_expert_parallel False
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] use_node_local_storage ....... False
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] weight_quantization_config ... None
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] world_size ................... 4
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] zero_allow_untested_optimizer True
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] zero_enabled ................. True
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True
[2024-07-25 01:19:18,377] [INFO] [config.py:1001:print] zero_optimization_stage ...... 2
[2024-07-25 01:19:18,377] [INFO] [config.py:987:print_user_config] json = {
"train_batch_size": 4,
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "none",
"nvme_path": null
},
"offload_param": {
"device": "none",
"nvme_path": null
},
"stage3_gather_16bit_weights_on_model_save": false
},
"steps_per_print": inf,
"fp16": {
"enabled": false
},
"bf16": {
"enabled": false
},
"zero_allow_untested_optimizer": true
}
INFO:__main__:save config to ./exp_output/stage2
INFO:__main__:***** Running training *****
INFO:__main__: Num examples = 20
INFO:__main__: Num Epochs = 6000
INFO:__main__: Instantaneous batch size per device = 1
INFO:__main__: Total train batch size (w. parallel, distributed & accumulation) = 4
INFO:__main__: Gradient Accumulation steps = 1
INFO:__main__: Total optimization steps = 30000
INFO:__main__:Loading checkpoint from ./exp_output/stage2/checkpoints
Could not find checkpoint under ./exp_output/stage2/checkpoints, start training from scratch
Steps: 0%| | 0/30000 [00:00<?, ?it/s][2024-07-25 01:19:22,864] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
Steps: 0%| | 1/30000 [00:04<36:48:19, 4.42s/it]Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'use_tf32': '1', 'prefer_nhwc': '0', 'tunable_op_max_tuning_duration_ms': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_tuning_enable': '0', 'tunable_op_enable': '0', 'use_ep_level_unified_stream': '0', 'device_id': '0', 'has_user_compute_stream': '0', 'gpu_external_empty_cache': '0', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'cudnn_conv1d_pad_to_nc1d': '0', 'gpu_mem_limit': '18446744073709551615', 'gpu_external_alloc': '0', 'gpu_external_free': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'do_copy_in_default_stream': '1', 'enable_cuda_graph': '0', 'user_compute_stream': '0', 'cudnn_conv_use_max_workspace': '1'}}
find model: /home/hch/PycharmProjects/hallo/scripts/pretrained_models/face_analysis/models/1k3d68.onnx landmark_3d_68 ['None', 3, 192, 192] 0.0 1.0
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'use_tf32': '1', 'prefer_nhwc': '0', 'tunable_op_max_tuning_duration_ms': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_tuning_enable': '0', 'tunable_op_enable': '0', 'use_ep_level_unified_stream': '0', 'device_id': '0', 'has_user_compute_stream': '0', 'gpu_external_empty_cache': '0', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'cudnn_conv1d_pad_to_nc1d': '0', 'gpu_mem_limit': '18446744073709551615', 'gpu_external_alloc': '0', 'gpu_external_free': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'do_copy_in_default_stream': '1', 'enable_cuda_graph': '0', 'user_compute_stream': '0', 'cudnn_conv_use_max_workspace': '1'}}
find model: /home/hch/PycharmProjects/hallo/scripts/pretrained_models/face_analysis/models/2d106det.onnx landmark_2d_106 ['None', 3, 192, 192] 0.0 1.0
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'use_tf32': '1', 'prefer_nhwc': '0', 'tunable_op_max_tuning_duration_ms': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_tuning_enable': '0', 'tunable_op_enable': '0', 'use_ep_level_unified_stream': '0', 'device_id': '0', 'has_user_compute_stream': '0', 'gpu_external_empty_cache': '0', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'cudnn_conv1d_pad_to_nc1d': '0', 'gpu_mem_limit': '18446744073709551615', 'gpu_external_alloc': '0', 'gpu_external_free': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'do_copy_in_default_stream': '1', 'enable_cuda_graph': '0', 'user_compute_stream': '0', 'cudnn_conv_use_max_workspace': '1'}}
find model: /home/hch/PycharmProjects/hallo/scripts/pretrained_models/face_analysis/models/genderage.onnx genderage ['None', 3, 96, 96] 0.0 1.0
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'use_tf32': '1', 'prefer_nhwc': '0', 'tunable_op_max_tuning_duration_ms': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_tuning_enable': '0', 'tunable_op_enable': '0', 'use_ep_level_unified_stream': '0', 'device_id': '0', 'has_user_compute_stream': '0', 'gpu_external_empty_cache': '0', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'cudnn_conv1d_pad_to_nc1d': '0', 'gpu_mem_limit': '18446744073709551615', 'gpu_external_alloc': '0', 'gpu_external_free': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'do_copy_in_default_stream': '1', 'enable_cuda_graph': '0', 'user_compute_stream': '0', 'cudnn_conv_use_max_workspace': '1'}}
find model: /home/hch/PycharmProjects/hallo/scripts/pretrained_models/face_analysis/models/glintr100.onnx recognition ['None', 3, 112, 112] 127.5 127.5
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'use_tf32': '1', 'prefer_nhwc': '0', 'tunable_op_max_tuning_duration_ms': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_tuning_enable': '0', 'tunable_op_enable': '0', 'use_ep_level_unified_stream': '0', 'device_id': '0', 'has_user_compute_stream': '0', 'gpu_external_empty_cache': '0', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'cudnn_conv1d_pad_to_nc1d': '0', 'gpu_mem_limit': '18446744073709551615', 'gpu_external_alloc': '0', 'gpu_external_free': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'do_copy_in_default_stream': '1', 'enable_cuda_graph': '0', 'user_compute_stream': '0', 'cudnn_conv_use_max_workspace': '1'}}
find model: /home/hch/PycharmProjects/hallo/scripts/pretrained_models/face_analysis/models/scrfd_10g_bnkps.onnx detection [1, 3, '?', '?'] 127.5 128.0
set det-size: (640, 640)
Some weights of Wav2VecModel were not initialized from the model checkpoint at /home/hch/PycharmProjects/hallo/scripts/pretrained_models/wav2vec/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:audio_separator.separator.separator:Separator version 0.18.3 instantiating with output_dir: ./exp_output/stage2/validation/.cache/audio_preprocess, output_format: WAV
INFO:audio_separator.separator.separator:Operating System: Linux #126~20.04.1-Ubuntu SMP Mon Jul 1 15:40:07 UTC 2024
INFO:audio_separator.separator.separator:System: Linux Node: user-Super-Server Release: 5.15.0-116-generic Machine: x86_64 Proc: x86_64
INFO:audio_separator.separator.separator:Python Version: 3.10.14
INFO:audio_separator.separator.separator:PyTorch Version: 2.2.2+cu118
INFO:audio_separator.separator.separator:FFmpeg installed: ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
INFO:audio_separator.separator.separator:ONNX Runtime GPU package installed with version: 1.18.1
INFO:audio_separator.separator.separator:CUDA is available in Torch, setting Torch device to CUDA
INFO:audio_separator.separator.separator:ONNXruntime has CUDAExecutionProvider available, enabling acceleration
INFO:audio_separator.separator.separator:Loading model Kim_Vocal_2.onnx...
INFO:audio_separator.separator.separator:Load model duration: 00:00:00
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1721841568.293733 21623 task_runner.cc:85] GPU suport is not available: INTERNAL: ; RET_CHECK failure (mediapipe/gpu/gl_context_egl.cc:77) display != EGL_NO_DISPLAYeglGetDisplay() returned error 0x3000
W0000 00:00:1721841568.299465 21623 face_landmarker_graph.cc:174] Sets FaceBlendshapesGraph acceleration to xnnpack by default.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
W0000 00:00:1721841568.340904 27218 inference_feedback_manager.cc:114] Feedback manager requires a model with a single signature inference. Disabling support for feedback tensors.
W0000 00:00:1721841568.351520 27218 inference_feedback_manager.cc:114] Feedback manager requires a model with a single signature inference. Disabling support for feedback tensors.
Processed and saved: ./exp_output/stage2/validation/.cache/1_sep_background.png
Processed and saved: ./exp_output/stage2/validation/.cache/1_sep_face.png
INFO:audio_separator.separator.separator:Starting separation process for audio_file_path: /home/hch/PycharmProjects/hallo/examples/masks/1.png
ERROR:root:Failed to execute the training process:
Steps: 0%| | 1/30000 [00:10<90:39:40, 10.88s/it]
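
The ERROR line above is logged without the underlying exception message, and the last message before it shows the audio separator being started on /home/hch/PycharmProjects/hallo/examples/masks/1.png, which looks like an image path being handed to the audio preprocessing step. A first step is to surface the full traceback. A minimal sketch, assuming (hypothetically) that the entry point catches exceptions around the training call; logging.exception records the stack trace instead of an empty message:

```python
# Hypothetical entry-point error handling; the real scripts.train_stage1 may differ.
# logging.exception logs the full traceback, which the empty ERROR line above hides.
import logging

def train():
    ...  # build the accelerator, models, and dataloaders, then run the training loop

if __name__ == "__main__":
    try:
        train()
    except Exception:
        logging.exception("Failed to execute the training process:")
        raise  # re-raise so the launcher reports a non-zero exit code
```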
