forked from bigscience-workshop/Megatron-DeepSpeed
scr: initial integration for Microsoft Megatron-Deepspeed #5
Open: adammoody wants to merge 25 commits into ms-main from scr3
Conversation
8806854 to c6eb429
…bigscience-workshop#88)
* updated curated env
* updated to use aml curated env
…nference (bigscience-workshop#99) This PR gets text generation in the `examples/generate_text.sh` example working with DS inference enabled. For the main fix, the `sample_sequence_batch` function has been updated to perform the softmax when calculating `log_probs`, instead of setting it to 1's using `torch.ones_like(...)`; a few minor fixes were applied as well. Extra whitespace is also removed.
…rkshop#100)
* staging_data_efficiency_v1 (bigscience-workshop#12)
* refactor and clean
* script refactor
* fix
* fix
* fix
* fix
* refactor
* script
* CL diff type
* script cleanup
* fix for MP
* refactor
* refactor
* fix
* apply feedback

* xpu support (bigscience-workshop#55)
* port accel abs interface
* WA for run3.6b
* move on
* fix current_device
* fix typo
* enable to run 345M GPT
* delete apex_patch
* add TODO xpu compatible tag for xpu WA
* use deepspeed launcher
* enable run3.6b bf16
* add zero2 config json
* readd enable_each_rank_log
* fix typos
* add ccl arg
* fix
* use short word
* use no-masked-softmax-fusion
* readd
* set train iters to 10
* remove duplicate line
* change assert msg
* update format
* add whitespace
* update path
* update note
* update
* fix typos
* delete notes
* update format
* update xpu check to cuda check
* update
* clean up file
* fix typos
* add python based gradient clipping
* change condition for python based path
* fix torch six import error: fixes compatibility with Torch 2.0 and fixes this issue: deepspeedai/Megatron-DeepSpeed#117. Same as this PR: deepspeedai/DeepSpeed#2863
* update torch import statement for backward compatibility
* fix a bug when running on bf16+pp
* add a space to fix the tab error
…orkshop#138)
* fix(training.py): logical bug in eval_iters_calculation
* fix(training.py): call update_train_iters to ensure correctness when rampup batch size is enabled
…op#142) Co-authored-by: Alexander Jipa <[email protected]>
…igscience-workshop#106) use int64_t instead of int32_t to avoid integer overflow. Signed-off-by: yulu.jia <[email protected]> Co-authored-by: yulu.jia <[email protected]>
Integrates SCR (the Scalable Checkpoint/Restart library) into Megatron-DeepSpeed.
This requires the work here to integrate SCR into DeepSpeed: adammoody/DeepSpeed#1
This adds new Megatron arguments for enabling and configuring SCR within a training run:
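Purely as an illustrative sketch, with placeholder flag names rather than the option names the PR actually adds, these arguments might take a form like:

```
--scr                     enable SCR-managed checkpointing for the run     (placeholder name)
--scr-flush-interval N    persist every Nth node-local checkpoint
                          to the parallel file system                      (placeholder name)
```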
With SCR, one can write checkpoints to node-local storage, which is often faster than the parallel file system at large scale. By default, SCR writes to `/dev/shm`, and it can be configured to use a node-local SSD if available. Those checkpoints can be transferred to the parallel file system upon a failure, or they can be discarded if there is no failure. This allows one to increase the defensive checkpoint frequency of a training run without adding significant run time or requiring more disk space.

For example, one might configure a run to save a checkpoint every 100 steps in `/dev/shm` but only persist a checkpoint to the parallel file system every 2000 steps. For best performance, one should also enable parallel writes of the layer checkpoint files in the DeepSpeed configuration file.
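Using the placeholder flags sketched above, that 100 / 2000-step setup might look roughly like the following (a sketch under those naming assumptions, not the PR's verbatim usage):

```bash
# Sketch only: --scr and --scr-flush-interval are placeholder names (see the
# sketch above); --save, --save-interval, --deepspeed, and --deepspeed_config
# are existing Megatron-DeepSpeed arguments. The idea is that SCR caches the
# frequent checkpoints in /dev/shm and only flushes every 20th one (every
# 2000 steps) to the parallel file system directory given by --save.
deepspeed pretrain_gpt.py \
    --save /p/gpfs/checkpoints/run1 \
    --save-interval 100 \
    --scr \
    --scr-flush-interval 2000 \
    --deepspeed \
    --deepspeed_config ds_config.json
    # model, data, and parallelism arguments omitted for brevity
```

On the DeepSpeed side, a minimal sketch of enabling parallel layer-checkpoint writes, assuming the `checkpoint.parallel_write.pipeline_stage` option in the DeepSpeed configuration JSON, would be an excerpt of `ds_config.json` such as:

```json
{
  "checkpoint": {
    "parallel_write": {
      "pipeline_stage": true
    }
  }
}
```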