
more checkpoints #1819

Open · sadrafh opened this issue Nov 1, 2024 · 14 comments
Labels: question (Further information is requested)

sadrafh commented Nov 1, 2024

I have two questions about pretraining LLaMA-2 13B with litGPT:

Configuration for epoch, max_tokens, and max_steps: In the litgpt/config_hub/pretrain/config.yaml, I see options for epoch, max_tokens, and max_steps. I have a value set for max_tokens, but not for epoch or max_steps. Whenever I try to set either of those, I get errors. Could someone help me understand how I should configure these values?

Checkpoint Saving: Right now, there’s only one checkpoint saved at the end of training. Is there a way to save checkpoints more frequently, like after each epoch or based on max_tokens?

Thank you in advance for any guidance!

sadrafh added the question label on Nov 1, 2024
rasbt (Collaborator) commented Nov 1, 2024

Hi there. These are good points. The settings for epochs and max_steps are there in the config but not supported yet, so right now you're limited to setting the number of tokens via max_tokens.

> Checkpoint Saving: Right now, there’s only one checkpoint saved at the end of training. Is there a way to save checkpoints more frequently, like after each epoch or based on max_tokens?

I think you can do this via the train.save_interval setting.
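For reference, a minimal sketch of what the relevant train section of a pretrain config could look like (field names follow the existing config hub files, but exact names and defaults may differ between LitGPT versions; the values below are just placeholders to try out):

    train:
      save_interval: 1000        # save a checkpoint every 1000 optimizer steps
      log_interval: 1
      global_batch_size: 512
      micro_batch_size: 4
      max_tokens: 3000000000     # stop after roughly 3B tokens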

sadrafh (Author) commented Nov 1, 2024

Thank you very much for your quick response!

I have a follow-up question: how can I measure the elapsed training time at each train.save_interval checkpoint? Specifically, I’d like to log the cumulative training time at each checkpoint, just like the training time that is reported for the final checkpoint.

Thank you!

rasbt (Collaborator) commented Nov 1, 2024

Good question. That's currently not supported/implemented. You'd have to modify that in the training code here:

if train.save_interval is not None and not is_accumulating and state["step_count"] % train.save_interval == 0:

The easiest way would be to move the training time computation, which is basically just

    train_time = time.perf_counter()
    fit(fabric, devices, state, train_dataloader, val_dataloader, out_dir, tokenizer_dir, train, eval)

    # Save final checkpoint
    save_checkpoint(fabric, state, tokenizer_dir, out_dir / "final" / "lit_model.pth")

    total_tokens = state["iter_num"] * train.micro_batch_size * model.max_seq_length * fabric.world_size

    # Print formatted output
    separator = "-" * 40
    fabric.print(separator)
    fabric.print("| Performance")
    fabric.print(f"| - Total tokens  : {total_tokens:,}")
    fabric.print(f"| - Training Time : {(time.perf_counter()-train_time):.2f} s")

up into the fit function.
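To illustrate the idea, here is a minimal, self-contained sketch of that pattern (the names and numbers are made up for illustration; this is not the actual LitGPT fit function): record a start time once before the loop, then report the cumulative elapsed time whenever the save condition fires.

    import time

    def fit_sketch(num_steps: int, save_interval: int) -> None:
        # Record the start time once, before the training loop begins.
        train_start = time.perf_counter()
        for step in range(1, num_steps + 1):
            time.sleep(0.01)  # stand-in for one optimizer step
            if step % save_interval == 0:
                # This is where save_checkpoint(...) would be called;
                # report the cumulative training time at this checkpoint.
                elapsed = time.perf_counter() - train_start
                print(f"step {step}: saving checkpoint, training time so far {elapsed:.2f} s")

    fit_sketch(num_steps=50, save_interval=10)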

sadrafh (Author) commented Nov 1, 2024

Thanks again for your prompt answer.

You mean something like this:
if train.save_interval is not None and not is_accumulating and state["step_count"] % train.save_interval == 0:

    # Start the timer for this checkpoint
    checkpoint_start_time = time.perf_counter()

    save_checkpoint(fabric, state, tokenizer_dir, out_dir / f"step-{state['step_count']:08d}" / "lit_model.pth")

    # Calculate time taken for this checkpoint
    checkpoint_elapsed_time = time.perf_counter() - checkpoint_start_time
    fabric.print(f"Checkpoint time: {checkpoint_elapsed_time:.5f} seconds at step {state['step_count']}")

And I should not need to change save_checkpoint to something like this:

def save_checkpoint(fabric, state, tokenizer_dir, checkpoint_file):
    model = state["model"]
    checkpoint_file.parent.mkdir(parents=True, exist_ok=True)
    fabric.print(f"Saving checkpoint to {str(checkpoint_file)!r}")

    start_time = time.time()
    fabric.save(checkpoint_file, state)

    if fabric.global_rank == 0:
        save_hyperparameters(setup, checkpoint_file.parent)
        if tokenizer_dir is not None:
            copy_config_files(tokenizer_dir, checkpoint_file.parent)
        save_config(model.config, checkpoint_file.parent)

    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Checkpoint saved in {elapsed_time:.5f} seconds.")

Am I right?

rasbt (Collaborator) commented Nov 1, 2024

Yes, that looks correct. I would set train.max_tokens to something like 1000 and train.save_interval to something like 250 to try it out before doing a larger run.

sadrafh (Author) commented Nov 1, 2024

Thanks again for your help.

I added that to the code but still cannot see any timing results. Attached are my command, an image of the added code, and part of the output.
By the way, using fabric.print instead of print, or adding or removing the if clause, makes no difference; I have tested all of them.

[Screenshots: command, modified code, and output]

rasbt (Collaborator) commented Nov 1, 2024

Hm, that's weird. Not sure why this is happening.

Did you install LitGPT in pip development mode (-e) so that your code changes are actually picked up?

pip install -e ".[all]"

Otherwise I'm not sure; maybe the timing code needs to be moved to a different place.

rasbt (Collaborator) commented Nov 1, 2024

Arg, sorry ... it's Friday afternoon and my brain is probably already in weekend mode. Actually, train.save_interval is based on optimizer steps, not on tokens, so with your settings it's probably never triggered. It should be a much smaller number; maybe try 10 or so.

sadrafh (Author) commented Nov 1, 2024

Yup, that works. Thank you so much!

sadrafh closed this as completed Nov 1, 2024
sadrafh (Author) commented Nov 1, 2024

Just one more question: how can I calculate the relationship between steps and the number of tokens?

sadrafh reopened this Nov 1, 2024
sadrafh (Author) commented Nov 4, 2024

Just one more question: how can I calculate the relationship between steps and the number of tokens?

rasbt (Collaborator) commented Nov 5, 2024

If the microbatch size is equal to the global batch size, I think it should be the following relationship:

max_tokens = max_steps * batch_size * max_seq_length

(I think that's it, but I would verify this with a small example run)
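For a rough sanity check, something like the following (the variable names and numbers are placeholders, not LitGPT settings): with gradient accumulation, one optimizer step consumes global_batch_size sequences, where global_batch_size = micro_batch_size * devices * gradient_accumulation_steps; when the micro batch size equals the global batch size, this reduces to the formula above.

    # Placeholder numbers; plug in your actual settings.
    max_steps = 1000
    micro_batch_size = 4
    devices = 8
    gradient_accumulation_steps = 16   # global_batch_size // (micro_batch_size * devices)
    max_seq_length = 4096

    global_batch_size = micro_batch_size * devices * gradient_accumulation_steps
    tokens_per_step = global_batch_size * max_seq_length
    max_tokens = max_steps * tokens_per_step
    print(f"{max_tokens:,} tokens over {max_steps} steps ({tokens_per_step:,} tokens per step)")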

sadrafh (Author) commented Nov 5, 2024

Thank you

sadrafh closed this as completed Nov 5, 2024
sadrafh reopened this Nov 14, 2024
sadrafh (Author) commented Nov 14, 2024

Hi Sebastian,

If I use the above code for big models such as llama3-70b or llama2-70b, I get an NCCL communication error. Is there any way to fix this?

[rank31]:[E1114 18:35:13.955775341 ProcessGroupNCCL.cpp:607] [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2603, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800007 milliseconds before timing out.
[rank31]:[E1114 18:35:13.956086452 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 31] Exception (either an error or timeout) detected by watchdog at work: 2603, last enqueued NCCL work: 2603, last completed NCCL work: 2602.
[rank23]:[E1114 18:35:13.374590778 ProcessGroupNCCL.cpp:607] [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2603, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800008 milliseconds before timing out.
[rank31]:[E1114 18:35:13.956122222 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 31] Timeout at NCCL work: 2603, last enqueued NCCL work: 2603, last completed NCCL work: 2602.
[rank23]:[E1114 18:35:13.374876780 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 23] Exception (either an error or timeout) detected by watchdog at work: 2603, last enqueued NCCL work: 2603, last completed NCCL work: 2602.
[rank31]:[E1114 18:35:13.956150272 ProcessGroupNCCL.cpp:621] [Rank 31] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank23]:[E1114 18:35:13.374912098 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 23] Timeout at NCCL work: 2603, last enqueued NCCL work: 2603, last completed NCCL work: 2602.
[rank31]:[E1114 18:35:13.956188982 ProcessGroupNCCL.cpp:627] [Rank 31] To avoid data inconsistency, we are taking the entire process down.
[rank23]:[E1114 18:35:13.374938592 ProcessGroupNCCL.cpp:621] [Rank 23] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank31]:[E1114 18:35:13.957892994 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 31] Process group watchdog thread terminated with exception: [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2603, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800007 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc7780cbf86 in /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc72a7f2f62 in /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fc72a7f99a3 in /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank23]:[E1114 18:35:13.374977809 ProcessGroupNCCL.cpp:627] [Rank 23] To avoid data inconsistency, we are taking the entire process down.
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc72a7fbd8c in /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fc7778b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fc77c9e7ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fc77ca79850 in /lib/x86_64-linux-gnu/libc.so.6)
