scr: initial integration for Microsoft Megatron-Deepspeed #5

Open · adammoody wants to merge 25 commits into ms-main from scr3
Conversation

@adammoody (Owner) commented on Oct 27, 2022

Integrates SCR into Megatron-DeepSpeed.

This depends on the companion work that integrates SCR into DeepSpeed: adammoody/DeepSpeed#1

This adds new Megatron arguments for enabling and configuring SCR within a training run:

--scr - enable SCR for checkpointing
--scr-interval - number of steps between defensive checkpoints
--scr-seconds - number of seconds between defensive checkpoints
--scr-overhead - maximum percent runtime overhead to allow for defensive checkpoints

With SCR, one can write checkpoints to node-local storage, which is often faster than the parallel file system at large scale. By default, SCR writes to /dev/shm, and it can be configured to use a node-local SSD if available. Those checkpoints can be transferred to the parallel file system upon a failure, or they can be discarded if there is no failure. This allows one to increase the defensive checkpoint frequency of a training run without adding significant run time or requiring more disk space.
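
As a point of reference, redirecting the SCR cache from /dev/shm to a node-local SSD is handled through SCR's own configuration rather than Megatron arguments. The lines below are a minimal sketch assuming SCR's documented SCR_CACHE_BASE setting and a hypothetical /mnt/nvme mount point; consult the SCR documentation for the exact parameters your system uses:

  # SCR config file sketch (file path and mount point are hypothetical)
  SCR_CACHE_BASE=/mnt/nvme/scr_cache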

For example, one might configure a run to save a checkpoint every 100 steps in /dev/shm but only persist a checkpoint on the parallel file system every 2000 steps:

--scr
--scr-interval=100
--save-interval=2000
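
Putting that together, a launch might look roughly like the sketch below. The script name, checkpoint path, and elided arguments are placeholders for whatever the run already uses; only the --save-interval, --scr, and --scr-interval flags relate to this example:

  deepspeed pretrain_gpt.py \
      ... existing model, data, and DeepSpeed arguments ... \
      --save=/path/to/pfs/checkpoints \
      --save-interval=2000 \
      --scr \
      --scr-interval=100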

For best performance, one should also enable parallel writes of the layer checkpoint files in the DeepSpeed configuration file:

  "checkpoint": {
    "parallel_write": {
      "pipeline_stage": true
    }
  }
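
For context, a sketch of how that block might sit inside a full DeepSpeed config file is shown below; the surrounding keys (batch sizes, bf16, ZeRO stage) are placeholders for whatever the run already uses, and only the "checkpoint" block is specific to this note:

  {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": { "enabled": true },
    "zero_optimization": { "stage": 1 },
    "checkpoint": {
      "parallel_write": {
        "pipeline_stage": true
      }
    }
  }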

@adammoody force-pushed the scr3 branch 2 times, most recently from 8806854 to c6eb429 on October 27, 2022
@adammoody changed the title from "scr: initial integration" to "scr: initial integration for Microsoft Megatron-Deepspeed" on Oct 27, 2022
jomayeri and others added 25 commits November 3, 2022 13:34
…nference (bigscience-workshop#99)

This PR gets text generation in the `examples/generate_text.sh` example working with DS inference enabled. For the main fix, the `sample_sequence_batch` function has been updated to perform the softmax when calculating `log_probs`, instead of setting them to 1's using `torch.ones_like(...)`; a few minor fixes were applied as well, and extra whitespace was removed.
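
As a rough illustration of that change (a sketch with made-up shapes and variable names, not the actual diff), the log-probabilities go from a constant placeholder to a real log-softmax over the vocabulary logits:

  import torch
  import torch.nn.functional as F

  # Illustrative stand-ins: batch of 2 sequences, vocabulary of 8 tokens
  logits = torch.randn(2, 8)          # model logits at the sampled position
  new_tokens = torch.tensor([3, 5])   # tokens chosen by the sampler

  # Before the fix (sketch): log probs carried no information
  log_probs_old = torch.ones_like(new_tokens, dtype=torch.float)

  # After the fix (sketch): compute real log-probabilities with a softmax,
  # then pick out the log-prob of each sampled token
  log_probs = F.log_softmax(logits.float(), dim=-1)
  log_probs_new = log_probs.gather(1, new_tokens.unsqueeze(1)).squeeze(1)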
…rkshop#100)

* staging_data_efficiency_v1 (bigscience-workshop#12)

* refactor and clean

* script refactor

* fix

* fix

* fix

* fix

* refactor

* script

* CL diff type

* script cleanup

* fix for MP

* refactor

* refactor

* fix

* apply feedback
* xpu support (bigscience-workshop#55)

* port accel abs interfece

* WA for run3.6b

* move on

* fix current_dievice

* fix typo

* enable to run 345M GPT

* delete apex_patch

* add TODO xpu compatible tg for xpu WA

* use deepspeed launcher

* enable run3.6b bf16

* add zero2 config json

* readd enable_each_rank_log

* fix typos

* add ccl arg

* fix

* use short word

* use no-masked-softmax-fusion

* readd

* set train  iters to 10

* remove duplicate line

* change assert msg

* update format

* add whitespace

* update path

* update note

* update

* fix typos

* delete notes

* update format

* update xpu check to cuda check

* update

* clean up file

* fix typos

* add python based gradient clipping

* change condition for python based path
* fix torch six import error

Restores compatibility with Torch 2.0 and fixes this issue: deepspeedai/Megatron-DeepSpeed#117

Same as this PR: deepspeedai/DeepSpeed#2863

* update torch import statement

update torch import statement for backward compatibility
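
For reference, a common backward-compatible pattern for this kind of fix (a sketch of the idea, not necessarily the exact change in the linked PRs) is to try the new import location first and fall back to the old one:

  # torch._six was removed in PyTorch 2.0; keep a fallback for older releases
  try:
      from torch import inf
  except ImportError:
      from torch._six import inf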
* fix a bug when run on bf16+pp

* add a space to fix the tab error
…orkshop#138)

* fix(training.py): logical bug in eval_iters_calculation

* fix(training.py): call update_train_iters to ensure correctness when rampup batch size is enabled
…igscience-workshop#106)

use int64_t instead of int32_t to avoid integer overflow (a signed 32-bit index overflows once an operation spans more than 2^31 - 1, roughly 2.1 billion, elements).

Signed-off-by: yulu.jia <[email protected]>
Co-authored-by: yulu.jia <[email protected]>