Training is stuck at saving checkpoint for Llama3.2 #1713

Open
apthagowda97 opened this issue Sep 29, 2024 · 3 comments
Comments

@apthagowda97
Contributor

Training is stuck at saving the checkpoint with the message below after the 1st epoch:

1|15|Loss: 2.3797190189361572: 100%|███████████████████████████████████| 15/15 [01:25<00:00,  5.37s/it]INFO:torchtune.utils._logging:Starting checkpoint save...

Config:

resume_from_checkpoint: False
save_adapter_weights_only: False

If I enable save_adapter_weights_only: True, a different error occurs.

@apthagowda97 apthagowda97 changed the title Training is stuck at saving checkpoint in for Llama3.2 Training is stuck at saving checkpoint for Llama3.2 Sep 29, 2024
@felipemello1
Contributor

felipemello1 commented Sep 29, 2024

@ebsmothers, what is that command we use at Meta for when nproc=4 and it gets stuck? Do you think it could be related?

@apthagowda97 , what error do you get when you set save_adapter_weights_only? Can you also share the command/config you use to run training?

@ebsmothers
Contributor

@felipemello1 the command we use to avoid the hangs is NCCL_SHM_DISABLE=0 but I don’t think it’s relevant for non-Meta hardware (though I guess worth a try).

@apthagowda97 would also be interested to know what kind of hardware you’re running on
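
(Side note, and just a sketch: if you can add a couple of lines to the recipe entry point you launch, the standard-library faulthandler can periodically dump every thread's stack, which should show exactly which call the save is blocked in:)

import faulthandler
import sys

# Diagnostic sketch, not part of torchtune: print all thread stacks to stderr
# every 60 seconds so you can see where the checkpoint save is hanging.
faulthandler.enable()  # also dump stacks on hard crashes
faulthandler.dump_traceback_later(timeout=60, repeat=True, file=sys.stderr)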

@apthagowda97
Contributor Author

Sorry for the delayed response.

@felipemello1 when I use save_adapter_weights_only: True, I get the error message below at the end of the 1st epoch while saving the weights.

File "/home/Llama/finetune/torchtune/torchtune/models/convert_weights.py", line 60, in get_mapped_key
    raise Exception(
Exception: Error converting the state dict. Found unexpected key: "layers.0._checkpoint_wrapped_module.attn.q_proj.lora_a.weight". Please make sure you're loading a checkpoint with the right format. 

If I disable it, i.e. save_adapter_weights_only: False, that issue disappears, but it takes approx. 5-10 minutes to save the weights.

But I don't get this problem if I do full finetuning, where the checkpoint is saved within seconds.
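
(For context, the "_checkpoint_wrapped_module." piece of that key is the prefix PyTorch's activation-checkpoint wrapper adds to wrapped module names. A minimal workaround sketch, assuming you intercept the adapter state dict before torchtune's key conversion, would strip that prefix; the helper name below is hypothetical, not a torchtune API:)

from typing import Any, Dict

# Prefix added by torch's activation-checkpoint wrapper to wrapped submodules.
_AC_PREFIX = "_checkpoint_wrapped_module."

def strip_activation_checkpoint_prefix(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    # Return a copy of the state dict with the wrapper prefix removed from every key,
    # e.g. "layers.0._checkpoint_wrapped_module.attn.q_proj.lora_a.weight"
    #   -> "layers.0.attn.q_proj.lora_a.weight".
    return {k.replace(_AC_PREFIX, ""): v for k, v in state_dict.items()}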

Here is my config:

model:
  _component_: torchtune.models.llama3_2.lora_llama3_2_3b
  lora_attn_modules: ['q_proj', 'v_proj', 'output_proj']
  apply_lora_to_mlp: True
  apply_lora_to_output: False
  lora_rank: 64
  lora_alpha: 128
  lora_dropout: 0.05

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /datadrive/llama3.2-3b/original/tokenizer.model
  max_seq_len: 2048

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /datadrive/llama3.2-3b/
  checkpoint_files: [
    model-00001-of-00002.safetensors,
    model-00002-of-00002.safetensors,
  ]
  recipe_checkpoint: null
  output_dir: /datadrive/output_v1/
  model_type: LLAMA3_2
resume_from_checkpoint: False
save_adapter_weights_only: False

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: "json"
  data_files: "/home/Llama/finetune/dataset/dataset.json"
  train_on_input: True
  split: train
  conversation_column: conversation
  conversation_style: sharegpt
seed: 42
shuffle: True
batch_size: 64

# Optimizer and Scheduler
optimizer:
  _component_: torch.optim.AdamW
  fused: True
  weight_decay: 0.01
  lr: 1e-4
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 64

loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss

# Training
epochs: 4
max_steps_per_epoch: null
gradient_accumulation_steps: 1
compile: False

# Logging
output_dir: /datadrive/output_v1/
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ${output_dir}
log_every_n_steps: 32
log_peak_memory_stats: True

# Environment
device: cuda
dtype: bf16

# Activations Memory
enable_activation_checkpointing: True
enable_activation_offloading: False

@ebsmothers I am running on a single A100 GPU with PyTorch on CUDA 12.4.
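
(A quick way to double-check which key names actually end up in the files written under output_dir; the file extensions in the globs are assumptions, adjust them to whatever torchtune writes in your run:)

import glob
import os

import torch
from safetensors import safe_open

output_dir = "/datadrive/output_v1/"

# Print the first few key names per shard to see whether the
# "_checkpoint_wrapped_module." prefix leaked into the saved files.
for path in sorted(glob.glob(os.path.join(output_dir, "*.safetensors"))):
    with safe_open(path, framework="pt", device="cpu") as f:
        print(os.path.basename(path), list(f.keys())[:5])

for path in sorted(glob.glob(os.path.join(output_dir, "*.pt"))):
    obj = torch.load(path, map_location="cpu")
    if isinstance(obj, dict):
        print(os.path.basename(path), list(obj.keys())[:5])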
