scr: initial integration for Microsoft Megatron-Deepspeed #5

Open · adammoody wants to merge 25 commits into ms-main from scr3
Conversation

@adammoody (Owner) commented on Oct 27, 2022

Integrates SCR into Megatron-DeepSpeed.

This depends on the companion work that integrates SCR into DeepSpeed: adammoody/DeepSpeed#1

This adds new Megatron arguments for enabling and configuring SCR within a training run:

--scr - enable SCR for checkpointing
--scr-interval - number of steps between defensive checkpoints
--scr-seconds - number of seconds between defensive checkpoints
--scr-overhead - maximum percent runtime overhead to allow for defensive checkpoints

With SCR, one can write checkpoints to node-local storage, which is often faster than the parallel file system at large scale. By default, SCR writes to /dev/shm, and it can be configured to use a node-local SSD if available. Those checkpoints can be transferred to the parallel file system upon a failure, or they can be discarded if there is no failure. This allows one to increase the defensive checkpoint frequency of a training run without adding significant run time or requiring more disk space.
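
As a point of reference, redirecting the SCR cache from /dev/shm to a node-local SSD is handled through SCR's own configuration rather than Megatron arguments. The lines below are a minimal sketch assuming SCR's documented SCR_CACHE_BASE setting and a hypothetical /mnt/nvme mount point; consult the SCR documentation for the exact parameters your system uses:

  # SCR config file sketch (file path and mount point are hypothetical)
  SCR_CACHE_BASE=/mnt/nvme/scr_cache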

For example, one might configure a run to save a checkpoint every 100 steps in /dev/shm but only persist a checkpoint on the parallel file system every 2000 steps:

--scr
--scr-interval=100
--save-interval=2000
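
Putting that together, a launch might look roughly like the sketch below. The script name, checkpoint path, and elided arguments are placeholders for whatever the run already uses; only the --save-interval, --scr, and --scr-interval flags relate to this example:

  deepspeed pretrain_gpt.py \
      ... existing model, data, and DeepSpeed arguments ... \
      --save=/path/to/pfs/checkpoints \
      --save-interval=2000 \
      --scr \
      --scr-interval=100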

For best performance, one should also enable parallel writes of the layer checkpoint files in the DeepSpeed configuration file:

  "checkpoint": {
    "parallel_write": {
      "pipeline_stage": true
    }
  }
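
For context, a sketch of how that block might sit inside a full DeepSpeed config file is shown below; the surrounding keys (batch sizes, bf16, ZeRO stage) are placeholders for whatever the run already uses, and only the "checkpoint" block is specific to this note:

  {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": { "enabled": true },
    "zero_optimization": { "stage": 1 },
    "checkpoint": {
      "parallel_write": {
        "pipeline_stage": true
      }
    }
  }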

@adammoody force-pushed the scr3 branch 2 times, most recently from 8806854 to c6eb429 on October 27, 2022
@adammoody changed the title from "scr: initial integration" to "scr: initial integration for Microsoft Megatron-Deepspeed" on Oct 27, 2022
jomayeri and others added 25 commits November 3, 2022 13:34
…nference (bigscience-workshop#99)

This PR gets text generation in the `examples/generate_text.sh` example working with DS inference enabled. For the main fix, the `sample_sequence_batch` function has been updated to perform the softmax when calculating `log_probs`, instead of setting them to 1's using `torch.ones_like(...)`; a few minor fixes were applied as well, and extra whitespace was removed.
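
As a rough illustration of that change (a sketch with made-up shapes and variable names, not the actual diff), the log-probabilities go from a constant placeholder to a real log-softmax over the vocabulary logits:

  import torch
  import torch.nn.functional as F

  # Illustrative stand-ins: batch of 2 sequences, vocabulary of 8 tokens
  logits = torch.randn(2, 8)          # model logits at the sampled position
  new_tokens = torch.tensor([3, 5])   # tokens chosen by the sampler

  # Before the fix (sketch): log probs carried no information
  log_probs_old = torch.ones_like(new_tokens, dtype=torch.float)

  # After the fix (sketch): compute real log-probabilities with a softmax,
  # then pick out the log-prob of each sampled token
  log_probs = F.log_softmax(logits.float(), dim=-1)
  log_probs_new = log_probs.gather(1, new_tokens.unsqueeze(1)).squeeze(1)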
…rkshop#100)

* staging_data_efficiency_v1 (bigscience-workshop#12)

* refactor and clean

* script refactor

* fix

* fix

* fix

* fix

* refactor

* script

* CL diff type

* script cleanup

* fix for MP

* refactor

* refactor

* fix

* apply feedback
* xpu support (bigscience-workshop#55)

* port accel abs interfece

* WA for run3.6b

* move on

* fix current_dievice

* fix typo

* enable to run 345M GPT

* delete apex_patch

* add TODO xpu compatible tg for xpu WA

* use deepspeed launcher

* enable run3.6b bf16

* add zero2 config json

* readd enable_each_rank_log

* fix typos

* add ccl arg

* fix

* use short word

* use no-masked-softmax-fusion

* readd

* set train  iters to 10

* remove duplicate line

* change assert msg

* update format

* add whitespace

* update path

* update note

* update

* fix typos

* delete notes

* update format

* update xpu check to cuda check

* update

* clean up file

* fix typos

* add python based gradient clipping

* change condition for python based path
* fix torch six import error

Restores compatibility with Torch 2.0 and fixes this issue: deepspeedai/Megatron-DeepSpeed#117

Same as this PR: deepspeedai/DeepSpeed#2863

* update torch import statement

update torch import statement for backward compatibility
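
For reference, a common backward-compatible pattern for this kind of fix (a sketch of the idea, not necessarily the exact change in the linked PRs) is to try the new import location first and fall back to the old one:

  # torch._six was removed in PyTorch 2.0; keep a fallback for older releases
  try:
      from torch import inf
  except ImportError:
      from torch._six import inf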
* fix a bug when run on bf16+pp

* add a space to fix the tab error
…orkshop#138)

* fix(training.py): logical bug in eval_iters_calculation

* fix(training.py): call update_train_iters to ensure correctness when rampup batch size is enabled
…igscience-workshop#106)

use int64_t instead of int32_t to avoid integer overflow (a signed 32-bit index overflows once an operation spans more than 2^31 - 1, roughly 2.1 billion, elements).

Signed-off-by: yulu.jia <[email protected]>
Co-authored-by: yulu.jia <[email protected]>