more checkpoints #1819
Comments
Hi there. These are good points. The settings for epochs and […]
I think you can do this via the `train.save_interval` setting.
Thank you very much for your quick response! I have a follow-up question: how can I measure the time taken for each `train.save_interval`? Specifically, I'd like to log the cumulative training time at each checkpoint, just like the training time that is printed for the final checkpoint. Thank you!
Good question. That's currently not supported/implemented. You'd have to modify that in the training code here: Line 384 in ec02064

```python
train_time = time.perf_counter()
fit(fabric, devices, state, train_dataloader, val_dataloader, out_dir, tokenizer_dir, train, eval)

# Save final checkpoint
save_checkpoint(fabric, state, tokenizer_dir, out_dir / "final" / "lit_model.pth")

total_tokens = state["iter_num"] * train.micro_batch_size * model.max_seq_length * fabric.world_size

# Print formatted output
separator = "-" * 40
fabric.print(separator)
fabric.print("| Performance")
fabric.print(f"| - Total tokens : {total_tokens:,}")
fabric.print(f"| - Training Time : {(time.perf_counter()-train_time):.2f} s")
```

The easiest way would be to move the training time computation, which is basically just `train_time = time.perf_counter()` and the final elapsed-time `fabric.print`, up into the code that saves the intermediate checkpoints.
Thanks again for your prompt answer. You mean something like: I should now change `save_checkpoint` to

```python
def save_checkpoint(fabric, state, tokenizer_dir, checkpoint_file):
```

Am I right?
Yes, that looks correct. I would set […]
Hm, that's weird. Not sure why this is happening. Did you install LitGPT with the […]?
Otherwise, I am not sure; maybe the timing needs to be moved to a different place.
Arg, sorry ... it's Friday afternoon and my brain is probably already in weekend mode. Actually, `train.save_interval` is not based on max tokens but on steps. So it's probably never triggered. It should be a much smaller number. Maybe try 10 or so.
Just one more question: how can I calculate the relationship between steps and the number of tokens? |
If the micro-batch size is equal to the global batch size, I think it should be the following relationship: `max_tokens = max_steps * batch_size * max_seq_length` (I think that's it, but I would verify this with a small example run).
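As a rough illustration of that relationship (all numbers below are made-up placeholders, and as noted above the formula itself is worth verifying with a short run):

```python
# Relate optimizer steps to training tokens, assuming the micro-batch size
# equals the global batch size (placeholder numbers, for illustration only).
batch_size = 8
max_seq_length = 4096
max_steps = 1000

# Tokens consumed after max_steps optimizer steps.
max_tokens = max_steps * batch_size * max_seq_length
print(f"max_tokens = {max_tokens:,}")  # max_tokens = 32,768,000

# Inverted: turn a per-checkpoint token budget into a step count,
# e.g. to pick a step-based value for train.save_interval.
tokens_per_checkpoint = 5_000_000
save_interval = tokens_per_checkpoint // (batch_size * max_seq_length)
print(f"save_interval = {save_interval}")  # save_interval = 152
```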
Thank you |
Hi Sebastian, if I use the above code for big models such as llama3-70b or llama2-70b, I get an NCCL communication error. Is there any way to fix this?

```
[rank31]:[E1114 18:35:13.955775341 ProcessGroupNCCL.cpp:607] [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2603, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800007 milliseconds before timing out.
```
I have two questions about pretraining LLaMA-2 13B with LitGPT:

1. Configuration for epoch, max_tokens, and max_steps: In litgpt/config_hub/pretrain/config.yaml, I see options for epoch, max_tokens, and max_steps. I have a value set for max_tokens, but not for epoch or max_steps. Whenever I try to set either of those, I get errors. Could someone help me understand how I should configure these values?
2. Checkpoint saving: Right now, only one checkpoint is saved, at the end of training. Is there a way to save checkpoints more frequently, for example after each epoch or based on max_tokens?

Thank you in advance for any guidance!