
more checkpoints #1819

Open · sadrafh opened this issue Nov 1, 2024 · 14 comments
Labels: question (Further information is requested)

sadrafh commented Nov 1, 2024

I have two questions about pretraining LLaMA-2 13B with litGPT:

Configuration for epoch, max_tokens, and max_steps: In the litgpt/config_hub/pretrain/config.yaml, I see options for epoch, max_tokens, and max_steps. I have a value set for max_tokens, but not for epoch or max_steps. Whenever I try to set either of those, I get errors. Could someone help me understand how I should configure these values?

Checkpoint Saving: Right now, there’s only one checkpoint saved at the end of training. Is there a way to save checkpoints more frequently, like after each epoch or based on max_tokens?

Thank you in advance for any guidance!

sadrafh added the question label on Nov 1, 2024
rasbt (Collaborator) commented Nov 1, 2024

Hi there. These are good points. The settings for epochs and max_steps are there in the config but not supported yet, so right now you're limited to setting the number of tokens via max_tokens.

> Checkpoint Saving: Right now, there’s only one checkpoint saved at the end of training. Is there a way to save checkpoints more frequently, like after each epoch or based on max_tokens?

I think you can do this via the train.save_interval setting.
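For reference, a minimal sketch of what the relevant train section of a pretrain config could look like (field names follow the existing config hub files, but exact names and defaults may differ between LitGPT versions; the values below are just placeholders to try out):

    train:
      save_interval: 1000        # save a checkpoint every 1000 optimizer steps
      log_interval: 1
      global_batch_size: 512
      micro_batch_size: 4
      max_tokens: 3000000000     # stop after roughly 3B tokens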

sadrafh (Author) commented Nov 1, 2024

Thank you very much for your quick response!

I have a follow-up question: how can I measure the elapsed training time at each train.save_interval checkpoint? Specifically, I’d like to log the cumulative training time at each checkpoint, just like the training time that is reported for the final checkpoint.

Thank you!

rasbt (Collaborator) commented Nov 1, 2024

Good question. That's currently not supported/implemented. You'd have to modify that in the training code here:

if train.save_interval is not None and not is_accumulating and state["step_count"] % train.save_interval == 0:

The easiest way would be to move the training time computation, which is basically just

    train_time = time.perf_counter()
    fit(fabric, devices, state, train_dataloader, val_dataloader, out_dir, tokenizer_dir, train, eval)

    # Save final checkpoint
    save_checkpoint(fabric, state, tokenizer_dir, out_dir / "final" / "lit_model.pth")

    total_tokens = state["iter_num"] * train.micro_batch_size * model.max_seq_length * fabric.world_size

    # Print formatted output
    separator = "-" * 40
    fabric.print(separator)
    fabric.print("| Performance")
    fabric.print(f"| - Total tokens  : {total_tokens:,}")
    fabric.print(f"| - Training Time : {(time.perf_counter()-train_time):.2f} s")

up into the fit function.
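To illustrate the idea, here is a minimal, self-contained sketch of that pattern (the names and numbers are made up for illustration; this is not the actual LitGPT fit function): record a start time once before the loop, then report the cumulative elapsed time whenever the save condition fires.

    import time

    def fit_sketch(num_steps: int, save_interval: int) -> None:
        # Record the start time once, before the training loop begins.
        train_start = time.perf_counter()
        for step in range(1, num_steps + 1):
            time.sleep(0.01)  # stand-in for one optimizer step
            if step % save_interval == 0:
                # This is where save_checkpoint(...) would be called;
                # report the cumulative training time at this checkpoint.
                elapsed = time.perf_counter() - train_start
                print(f"step {step}: saving checkpoint, training time so far {elapsed:.2f} s")

    fit_sketch(num_steps=50, save_interval=10)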

sadrafh (Author) commented Nov 1, 2024

Thanks again for your prompt answer.

You mean something like this:
if train.save_interval is not None and not is_accumulating and state["step_count"] % train.save_interval == 0:

    # Start the timer for this checkpoint
    checkpoint_start_time = time.perf_counter()

    save_checkpoint(fabric, state, tokenizer_dir, out_dir / f"step-{state['step_count']:08d}" / "lit_model.pth")

    # Calculate time taken for this checkpoint
    checkpoint_elapsed_time = time.perf_counter() - checkpoint_start_time
    fabric.print(f"Checkpoint time: {checkpoint_elapsed_time:.5f} seconds at step {state['step_count']}")

And I should not need to change save_checkpoint to something like this:

def save_checkpoint(fabric, state, tokenizer_dir, checkpoint_file):
    model = state["model"]
    checkpoint_file.parent.mkdir(parents=True, exist_ok=True)
    fabric.print(f"Saving checkpoint to {str(checkpoint_file)!r}")

    start_time = time.time()
    fabric.save(checkpoint_file, state)

    if fabric.global_rank == 0:
        save_hyperparameters(setup, checkpoint_file.parent)
        if tokenizer_dir is not None:
            copy_config_files(tokenizer_dir, checkpoint_file.parent)
        save_config(model.config, checkpoint_file.parent)

    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Checkpoint saved in {elapsed_time:.5f} seconds.")

Am I right?

rasbt (Collaborator) commented Nov 1, 2024

Yes, that looks correct. I would set train.max_tokens to something like 1000 and train.save_interval to something like 250 to try it out before doing a larger run.

sadrafh (Author) commented Nov 1, 2024

Thanks again for your help.

I added that to the code but still cannot see any timing results. Attached are my command, an image of the added code, and part of the output.
By the way, using fabric.print instead of print, or adding or removing the if clause, makes no difference; I have tested all of them.

[Screenshots: command, modified code, and output]

rasbt (Collaborator) commented Nov 1, 2024

Hm, that's weird. Not sure why this is happening.

Did you install LitGPT in pip development mode (-e) so that your code changes are actually picked up?

pip install -e ".[all]"

Otherwise I'm not sure; maybe the timing code needs to be moved to a different place.

rasbt (Collaborator) commented Nov 1, 2024

Arg, sorry ... it's Friday afternoon and my brain is probably already in weekend mode. Actually, train.save_interval is based on optimizer steps, not on tokens, so with your settings it's probably never triggered. It should be a much smaller number; maybe try 10 or so.

sadrafh (Author) commented Nov 1, 2024

Yup, that works. Thank you so much!

sadrafh closed this as completed Nov 1, 2024
sadrafh (Author) commented Nov 1, 2024

Just one more question: how can I calculate the relationship between steps and the number of tokens?

sadrafh reopened this Nov 1, 2024
sadrafh (Author) commented Nov 4, 2024

Just one more question: how can I calculate the relationship between steps and the number of tokens?

rasbt (Collaborator) commented Nov 5, 2024

If the microbatch size is equal to the global batch size, I think it should be the following relationship:

max_tokens = max_steps * batch_size * max_seq_length

(I think that's it, but I would verify this with a small example run)
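For a rough sanity check, something like the following (the variable names and numbers are placeholders, not LitGPT settings): with gradient accumulation, one optimizer step consumes global_batch_size sequences, where global_batch_size = micro_batch_size * devices * gradient_accumulation_steps; when the micro batch size equals the global batch size, this reduces to the formula above.

    # Placeholder numbers; plug in your actual settings.
    max_steps = 1000
    micro_batch_size = 4
    devices = 8
    gradient_accumulation_steps = 16   # global_batch_size // (micro_batch_size * devices)
    max_seq_length = 4096

    global_batch_size = micro_batch_size * devices * gradient_accumulation_steps
    tokens_per_step = global_batch_size * max_seq_length
    max_tokens = max_steps * tokens_per_step
    print(f"{max_tokens:,} tokens over {max_steps} steps ({tokens_per_step:,} tokens per step)")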

sadrafh (Author) commented Nov 5, 2024

Thank you

sadrafh closed this as completed Nov 5, 2024
sadrafh reopened this Nov 14, 2024
sadrafh (Author) commented Nov 14, 2024

Hi Sebastian,

If I use the above code for big models such as llama3-70b or llama2-70b, I get an NCCL communication error. Is there any way to fix this?

[rank31]:[E1114 18:35:13.955775341 ProcessGroupNCCL.cpp:607] [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2603, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800007 milliseconds before timing out.
[rank31]:[E1114 18:35:13.956086452 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 31] Exception (either an error or timeout) detected by watchdog at work: 2603, last enqueued NCCL work: 2603, last completed NCCL work: 2602.
[rank23]:[E1114 18:35:13.374590778 ProcessGroupNCCL.cpp:607] [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2603, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800008 milliseconds before timing out.
[rank31]:[E1114 18:35:13.956122222 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 31] Timeout at NCCL work: 2603, last enqueued NCCL work: 2603, last completed NCCL work: 2602.
[rank23]:[E1114 18:35:13.374876780 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 23] Exception (either an error or timeout) detected by watchdog at work: 2603, last enqueued NCCL work: 2603, last completed NCCL work: 2602.
[rank31]:[E1114 18:35:13.956150272 ProcessGroupNCCL.cpp:621] [Rank 31] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank23]:[E1114 18:35:13.374912098 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 23] Timeout at NCCL work: 2603, last enqueued NCCL work: 2603, last completed NCCL work: 2602.
[rank31]:[E1114 18:35:13.956188982 ProcessGroupNCCL.cpp:627] [Rank 31] To avoid data inconsistency, we are taking the entire process down.
[rank23]:[E1114 18:35:13.374938592 ProcessGroupNCCL.cpp:621] [Rank 23] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank31]:[E1114 18:35:13.957892994 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 31] Process group watchdog thread terminated with exception: [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2603, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800007 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc7780cbf86 in /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc72a7f2f62 in /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fc72a7f99a3 in /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank23]:[E1114 18:35:13.374977809 ProcessGroupNCCL.cpp:627] [Rank 23] To avoid data inconsistency, we are taking the entire process down.
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc72a7fbd8c in /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fc7778b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fc77c9e7ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fc77ca79850 in /lib/x86_64-linux-gnu/libc.so.6)
