parallelize writing of layer checkpoint files across data parallel instances #1419

adammoody · 2021-09-30T19:11:57Z

This is work in progress, but I wanted to open it early for discussion. Also, I wrote this before MOE was added, and it will need to be updated to account for that. I can help with that if this approach is approved.

In case a pipeline stage has multiple layers, this parallelizes the task of writing the layer checkpoint files across the data parallel group. For example, if one is running with two data parallel instances, and if a pipeline stage has 10 layers, this modifies things so that rank 0 will write 5 layers and rank 1 will write the other 5, rather than have rank 0 do all of the work. On my system, this reduces checkpoint cost. It also better balances the total bytes written across ranks.

The main change is to have all procs call _save_checkpoint, and then in module_state_dict, the list of layers is subdivided among the procs that have the same model parallel rank across all data parallel instances.
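
The subdivision described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `layers_for_rank` is a hypothetical helper, and it assumes the layers of a pipeline stage are split into contiguous, near-equal chunks indexed by data parallel rank.

```python
def layers_for_rank(num_layers, dp_rank, dp_size):
    """Return the contiguous range of layer indices one data parallel rank writes.

    Hypothetical helper for illustration only; when the layers do not divide
    evenly, the leftover layers go to the lowest-numbered ranks.
    """
    base, extra = divmod(num_layers, dp_size)
    start = dp_rank * base + min(dp_rank, extra)
    end = start + base + (1 if dp_rank < extra else 0)
    return range(start, end)


# Example from the description: 10 layers in a stage, 2 data parallel instances.
for rank in range(2):
    print(rank, list(layers_for_rank(10, rank, 2)))
```

Every layer is assigned to exactly one rank, so the set of files written is unchanged; only the writer of each file differs.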

adammoody · 2021-12-18T02:23:57Z

Hi all, I've got some time to circle back to this. I'm hoping someone on the team can take a look and provide some feedback when they get a chance.

I have rebased this PR on the latest code. It still needs to be adapted to handle MOE. I think the idea should extend to the MOE checkpoint path as well.

In the meantime, here are some measurements to demonstrate the performance difference.

I am training a model using 16 processes on 4 nodes with the following configuration:

zero stage = 1
tensor parallelism = 2
pipeline parallelism = 1
num layers = 8
hidden size = 5120

This gives 8 data parallel instances of a model that has 1 pipeline stage. There are 8 layers in that single stage.

Before this PR, the processes that have rank 0 within the data parallel group write all 8 layers of the transformer. With this PR, all processes in the job write 1 layer of the transformer (8 layers / 8 data parallel instances).
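
The arithmetic behind those numbers, as a short sketch. The variable names are illustrative and are not DeepSpeed identifiers.

```python
world_size = 16          # processes in the job
tensor_parallel = 2
pipeline_parallel = 1
num_layers = 8

# Each model replica spans tensor_parallel * pipeline_parallel processes.
data_parallel = world_size // (tensor_parallel * pipeline_parallel)
layers_per_stage = num_layers // pipeline_parallel

print(data_parallel)                      # data parallel instances
print(layers_per_stage // data_parallel)  # layers written per process with this PR
```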

I added timers into the checkpoint path. The two calls that use significant time are the cost to write the non-zero checkpoint files and the cost to write the zero checkpoint files, i.e., I've got timers to report the number of seconds around these two lines:

self._save_checkpoint(save_dir, tag, client_state=client_state)
self._save_zero_checkpoint(save_dir, tag)

The changes here reduce the cost of writing the non-zero checkpoint files.
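
For anyone who wants to take similar measurements, here is one way such timers could look. This is a sketch under assumptions, not the instrumentation actually used: `timed` is a hypothetical helper, and the two `save_*` functions are stand-ins for the two engine calls named above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock seconds spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def save_non_zero():
    # Stand-in for self._save_checkpoint(save_dir, tag, client_state=client_state)
    time.sleep(0.01)


def save_zero():
    # Stand-in for self._save_zero_checkpoint(save_dir, tag)
    time.sleep(0.01)


results = {}
with timed("non_zero", results):
    save_non_zero()
with timed("zero", results):
    save_zero()

for label, seconds in results.items():
    print(f"{label}: {seconds}")
```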

Before this change, I see the following timings for 4 different checkpoints (units of seconds where not labeled):

18: 0: non_zero: 15.127218961715698
18: 0: zero:     4.272188663482666
18: 15: time (ms) | save-checkpoint: 19613.66

18: 0: non_zero: 15.118722915649414
18: 0: zero:     4.241654872894287
18: 15: time (ms) | save-checkpoint: 19414.78

18: 0: non_zero: 14.895861148834229
18: 0: zero:     2.4203057289123535
18: 15: time (ms) | save-checkpoint: 18793.07

18: 0: non_zero: 15.461049795150757
18: 0: zero:     4.352852821350098
18: 15: time (ms) | save-checkpoint: 19825.21

When using the optimization in this PR, I get the following instead:

15: 0: non_zero: 4.961373567581177
15: 0: zero:     6.336402893066406
15: 15: time (ms) | save-checkpoint: 11619.31

15: 0: non_zero: 3.244314193725586
15: 0: zero:     7.434791803359985
15: 15: time (ms) | save-checkpoint: 10897.41

15: 0: non_zero: 4.893619537353516
15: 0: zero:     5.75852370262146
15: 15: time (ms) | save-checkpoint: 14128.98

15: 0: non_zero: 5.354466676712036
15: 0: zero:     5.717851877212524
15: 15: time (ms) | save-checkpoint: 11235.55

The total checkpoint time drops from about 19 seconds to about 11. The gains come from reducing the cost of writing the non_zero files, which drops from about 15 seconds to about 5, even though the cost to write the zero files seems to have gone up a bit.
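
The averages can be checked directly from the eight samples listed above. The sketch below only restates those numbers; nothing in it is newly measured.

```python
from statistics import mean

# Samples copied from the timings above (seconds, except total_ms).
before = {
    "non_zero": [15.127218961715698, 15.118722915649414,
                 14.895861148834229, 15.461049795150757],
    "zero": [4.272188663482666, 4.241654872894287,
             2.4203057289123535, 4.352852821350098],
    "total_ms": [19613.66, 19414.78, 18793.07, 19825.21],
}
after = {
    "non_zero": [4.961373567581177, 3.244314193725586,
                 4.893619537353516, 5.354466676712036],
    "zero": [6.336402893066406, 7.434791803359985,
             5.75852370262146, 5.717851877212524],
    "total_ms": [11619.31, 10897.41, 14128.98, 11235.55],
}

for key in before:
    b, a = mean(before[key]), mean(after[key])
    print(f"{key}: {b:.2f} -> {a:.2f} (ratio {b / a:.2f})")
```

A ratio above 1 means the cost went down with this PR, and a ratio below 1 (the zero files) means it went up.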

adammoody · 2022-03-31T21:31:04Z

@awan-10 , I see this PR has conflicts again. I'll take a look at refreshing it.

Would someone from the team have some time to look this over?

I think it could be useful to others, but if not, that'd be good to know too.

adammoody · 2022-03-31T21:44:56Z

@awan-10 , oh, this likely still needs to be updated for MoE. I started this before MoE was merged in. I'm trying to get a MoE example working, so that I can better follow its checkpoint path. I'm not quite there yet.

adammoody · 2022-04-12T19:31:03Z

@jeffra , I'm still fighting with my system to get a python+pytorch build that will let me run MoE. That could be a ways off, and it's hard to put a date on it. Aside from that, I updated this for the latest code and verified that it still works and performs as expected for the non-MoE case.

Would someone be willing to review this in its current state?

adammoody · 2022-05-12T20:55:54Z

@tjruwase , these changes reduce the checkpoint cost of some models. Is this something that could be worked in?

rocm-mici · 2022-06-09T20:19:15Z

Can one of the admins verify this patch?

adammoody · 2022-07-07T17:30:28Z

@stas00 , I noted that the checkpoints in the latest bigscience run were taking about 40 seconds.

https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md#2022-03-21

The changes in this patch may reduce that cost. This PR is out of date with the latest, though it might still apply cleanly for you depending on the version used in those runs. I also know this may come too late for the current run, but thought I'd throw it out there if you are planning more.

I have separate work that effectively helps reduce disk space consumed by the checkpoints, too. Combined with the speed improvement here, those two changes enable one to checkpoint large training runs more frequently.

stas00 · 2022-07-08T21:48:24Z

Hi Adam,

Indeed, we have finished training 176B, so hopefully this version will accept your work.

In the case of JeanZay, my many experiments suggest that IO, not the CPU, is the bottleneck. In that case, sequential writing of the data by one process might not be faster than doing the same from many processes to different places. I'd guesstimate that your way would probably still be a bit faster.

Otherwise yours is definitely a super-smart idea!

I'm all ears, do tell! (But let's probably discuss it in another Issue so that we don't derail your PR.)

p.s. Always awesome to read about your performance innovations!

tjruwase · 2022-07-15T01:48:01Z

Hi @adammoody, sorry that I did not respond to this earlier. This is a great contribution and is timely as we are looking to improve the checkpointing features of DeepSpeed. I will take a closer look at this PR to get a better understanding.

adammoody · 2022-07-15T16:20:04Z

Thanks @tjruwase . The ideas in here should still apply, but the PR itself has fallen out of sync with main and needs to be refreshed. I have found that this speeds up the I/O on our system, where one of the bottlenecks lies in the cost of calling torch.save().

A second benefit of this approach is that by spreading the bytes written more evenly across the ranks, one can gain more benefit from checkpoint libraries that write to node-local storage. We have follow-on work that I can describe, but it depends on this first change.

adammoody · 2022-09-19T22:42:38Z

@tjruwase , @stas00 , here are a couple charts showing the speedup one can get using this approach. This first chart shows the checkpoint cost in seconds when scaling up the node count, while holding the hidden size constant and increasing the number of layers.

At 128 nodes, the total bytes written in this case is 1.15e12 (~1 TiB). The "Base" plot is the time I get with the original code, and the "PR#1419" plot shows the time for writing the same checkpoint using the changes in this PR. For the node counts shown, I get a speedup of 2-3x.

While this PR improves performance on its own, it also enables one to plug in the Scalable Checkpoint/Restart (SCR) library. Just for reference, the "SCR Single" and "SCR XOR" plots show the cost of writing the checkpoint files to /dev/shm on each node with SCR. This is 3-4x faster, and the checkpoint cost stays constant as the node count increases.

adammoody · 2022-09-19T22:46:58Z

The cost to checkpoint the same model, but shown as effective write bandwidth:

The peak write bandwidth of the parallel file system is about 160 GiB/s. The "Base" and "PR#1419" plots should both approach that limit asymptotically.
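
To relate the two charts, effective bandwidth is just bytes written divided by checkpoint time. The sketch below uses only the figures quoted in this thread (1.15e12 bytes at 128 nodes, a peak of about 160 GiB/s); `effective_bandwidth_gib` is a hypothetical helper, and the time it computes is a lower bound rather than a measurement.

```python
GIB = 2**30
TIB = 2**40

total_bytes = 1.15e12   # bytes written at 128 nodes (from the thread)
peak_bw = 160 * GIB     # approximate file system peak, in bytes per second


def effective_bandwidth_gib(num_bytes, seconds):
    """Effective write bandwidth in GiB/s for num_bytes written in `seconds`."""
    return num_bytes / seconds / GIB


print(total_bytes / TIB)             # checkpoint size in TiB
min_seconds = total_bytes / peak_bw  # best-case write time at peak bandwidth
print(min_seconds)
print(effective_bandwidth_gib(total_bytes, min_seconds))
```

Under these figures, a checkpoint of this size cannot be written to the parallel file system in much less than about 7 seconds, which is the floor that the "Base" and "PR#1419" curves approach.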

The SCR plots meet or surpass the parallel file system bandwidth at 128 nodes, and they continue to scale linearly with the node count.

adammoody · 2022-09-19T22:50:41Z

My main goal here is to see if we can get the PR accepted. I think this should be useful to others.

Separately, I can also talk about SCR more if you're interested. That can be helpful at larger scales; however, it also requires MPI (mpi4py).

stas00 · 2022-09-19T23:46:30Z

As I'm not part of the DeepSpeed team my vote won't count, but your benchmarks are super-impressive and I'd say definitely go for it.

I will let @tjruwase chime in, and perhaps @jeffra could share an opinion as well.

Are there situations where this approach would be slower, in which case this behaviour should be configurable by the user?

tjruwase · 2022-09-20T02:45:31Z

@adammoody, this is certainly impressive and definitely of interest to DeepSpeed. Can you please refresh this PR? Please let me know how I can help. Thanks!

@GuanhuaWang, FYI

GuanhuaWang · 2022-09-21T04:04:54Z

Hi @adammoody , this is a great contribution. Just curious, is there any reason why zero performance is slightly worse compared to baseline?

Also for the format check, please follow these instructions here

adammoody · 2022-09-21T16:49:58Z

Hi @GuanhuaWang , I don't yet know why the cost to write the zero files slowed down with this change. That was a surprise to me.

Thanks for the formatting tip. I'll take a look at cleaning that up.

adammoody · 2022-09-21T18:43:40Z

This PR was written before the checkpoint bloat optimization was added. I see that the layer checkpoint files are much smaller after rebasing to pick up the bloat fix. That likely reduces the absolute performance benefit relative to the graphs I show above, since the baseline cost should drop significantly. I'd have to queue up a new set of runs.

tjruwase · 2022-09-21T19:24:13Z

@adammoody, I think the perf benefits are still useful regardless of the bloat fix.

The main issue as far as I can see right now is a clean configuration design to enable/disable and potentially specify the parallelism degree of checkpointing. It would be nice if the configuration knobs could be easily used for parallelization of the other non_zero checkpoints. Perhaps, that could be a TODO for now given how old this PR is. Please share your thoughts.

adammoody · 2022-09-29T20:38:02Z

@tjruwase , it should be easy to enable/disable this feature based on a ds configuration option with code like that shown here: #1419 (comment).

Want to go with something like that?

Maybe as a first step we could define a config option that leaves this disabled by default.

tjruwase · 2022-09-30T10:56:40Z

@adammoody, one possibility is to modify checkpoint configuration as below:

   "checkpoint": {
      ...,
      "parallel_write": {
           "pipeline_stage": [true|false],
           "tensor_slice": [true|false]
      }
   }

I think the above will address your PR (parallel writes of pipeline stage), and support future extension to parallelizing of tensor slice, all on the data parallel dimension. However, I am open to other suggestions.

@stas00, FYI
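
As an illustration of how such an option might be consumed, here is a minimal sketch. The key names come from the proposal above; `parallel_write_pipeline_stage` is a hypothetical helper and not DeepSpeed's actual config-parsing code. It leaves the feature disabled unless the option is explicitly set.

```python
import json


def parallel_write_pipeline_stage(ds_config):
    """Return the pipeline_stage flag, defaulting to False when any level is absent."""
    checkpoint = ds_config.get("checkpoint", {})
    parallel_write = checkpoint.get("parallel_write", {})
    return bool(parallel_write.get("pipeline_stage", False))


config = json.loads("""
{
  "checkpoint": {
    "parallel_write": {
      "pipeline_stage": true
    }
  }
}
""")

print(parallel_write_pipeline_stage(config))  # option explicitly enabled
print(parallel_write_pipeline_stage({}))      # option absent, so disabled
```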

adammoody · 2022-10-03T21:13:30Z

Thanks, @tjruwase . I'm working to add these config options now. I've pushed a commit for just the pipeline_stage value right now. A couple of questions:

Based on what I have so far, do you have any recommendation on the names, option processing, or error message?
I'll need to pass this setting down to be used in runtime/pipe/module.py. What do you recommend for that? It looks like one method would be to modify the PipelineModule constructor to take another argument, but maybe there is another way?

adammoody · 2022-10-04T01:46:18Z

@tjruwase , I found one way to get the value of that new configuration setting. This seems to work for me now.

adammoody · 2022-10-10T21:22:02Z

@tjruwase , I've only added the pipeline option so far, but how do things look at this point?

tjruwase · 2022-10-21T21:24:05Z

@adammoody, FYI, this is in 0.7.4 Patch Release

adammoody · 2022-10-26T16:23:04Z

Great. Thanks!

adammoody · 2022-10-26T18:21:00Z

I can describe the SCR work if anyone is interested. Adding SCR lets one achieve small and constant checkpointing costs as shown in the charts above. It doesn't require a lot of code to add, e.g.:

DeepSpeed: adammoody#1
Megatron: adammoody/Megatron-DeepSpeed#5

SCR uses MPI, so it requires MPI as a dependency, which in turn may require changes in how one launches the job. If mpi4py works, then that should be enough.

I see that there is now built-in support for running DeepSpeed within MPI environments:

https://www.deepspeed.ai/getting-started/#mpi-and-azureml-compatibility

@RezaYazdaniAminabadi

* Fix the layer-past for GPT based models (deepspeedai#2196) * Add gradient_average flag support for sparse grads (deepspeedai#2188) * Add gradient_average flag support for sparse grads * formatting fixes * Add tests Co-authored-by: Olatunji Ruwase <[email protected]> * Adding additional instructiosn in the compression tutorial on pre-training distillation and quantization for GPT (deepspeedai#2197) Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * Log user config exactly (deepspeedai#2201) * Fix the tensor-slicing copy for qkv parameters (deepspeedai#2198) Co-authored-by: Olatunji Ruwase <[email protected]> * Refactor Distributed Tests (deepspeedai#2180) Refactor Distributed unit tests * fix table syntax (deepspeedai#2204) Co-authored-by: Conglong Li <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * Correctly detect offload configuration (deepspeedai#2208) Co-authored-by: Jeff Rasley <[email protected]> * add cuda 11.7 (deepspeedai#2211) * add cuda 11.7 * formatting * use torch 1.9 (deepspeedai#2215) * [zero-3] print warning once and support torch parameter (deepspeedai#2127) * print warning only once. * add support for torch param and only warn on gpu 0 * remove type checking. will be done on a new PR with more tests. Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * Add support of OPT models (deepspeedai#2205) * add opt replace policy * simplify inf. api * fix opt replace policy * fix use-cash & add relu * Add support of custom MLP act. function * Revert "simplify inf. api" This reverts commit 9e910fc. * fix the inference API (temp. solution) * fix code formatting * add unit tests for OPT models. 
* refactor pre-attention layer norm configuration * add support of opt-350m model * refactor the HF model config initialization * fix hf model config issue Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> * fix typos in readme. (deepspeedai#2218) Co-authored-by: Olatunji Ruwase <[email protected]> * [device abstraction] add device abstraction to allow other device than CUDA be used * Fix regression w. dist_init_required (deepspeedai#2225) * add doc for new bert example (deepspeedai#2224) * Remove the random-generator from context during inference (deepspeedai#2228) * Fix the tensor-slicing copy for qkv parameters * remove the random-generator from context during inference * formatting Co-authored-by: Jeff Rasley <[email protected]> * allow saving ckpt w/o ckpt json + bloom copy fix (deepspeedai#2237) * Correctly detect zero_offload (deepspeedai#2213) * Correctly detect offload configuration * Correctly detect offload configuration * Handle deprecated cpu offload setting * Correcly detect zero_offload setting * Minor tweak Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Ammar Ahmad Awan <[email protected]> * update videos (deepspeedai#2249) * Refactor dist tests: Checkpointing (deepspeedai#2202) Refactor distributed tests: checkpointing Co-authored-by: Michael Wyatt <[email protected]> * Make OPT policy backward compatible with pre-OPT transformers versions (deepspeedai#2254) * fix ds-inference without policy (deepspeedai#2247) Co-authored-by: Jeff Rasley <[email protected]> * bump to 0.7.2 * Enable contiguous gradients with Z1+MoE (deepspeedai#2250) MoE training with zero stage 1 only works with `contiguous gradients=True`. 
* [rebase-202208] additional changes needed when rebase to 202208 * [rebase] cleanup direct cuda usage after merge * Correctly detect CPU optimizer usage (deepspeedai#2257) * Correctly detect CPU optimizer usage * Update nv-transformers-v100.yml (deepspeedai#2259) Co-authored-by: Jeff Rasley <[email protected]> * [precommit] fix pre-commit issues * Update half precision header guards (deepspeedai#2261) * fix deepspeedai#2240: wrong time unit in flops_profiler (deepspeedai#2241) Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * bump to 0.7.3 * Add blob storage to CI runners (deepspeedai#2260) Add blob storage to CI runners and enable for transformers cache on inference tests * Update replace_module.py, test-gptj.py related fix (deepspeedai#2269) Fix RuntimeError: Boolean value of Tensor with more than one value is ambiguous when running test-gptj.py * Fix OrderedDict import for python3.6 (deepspeedai#2267) Co-authored-by: Olatunji Ruwase <[email protected]> * Ds inference/fix mp2 (deepspeedai#2270) * Trajepl: nebula load fix (deepspeedai#2182) Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: chenguo <[email protected]> * prevent torch ext folder mkdir at tmp (deepspeedai#2274) * Ds-inference Int8 support through ZeroQuant technology (deepspeedai#2217) Co-authored-by: Jeff Rasley <[email protected]> * add a new unit test for cuda ops (deepspeedai#2278) Co-authored-by: cmikeh2 <[email protected]> * Add to codeowners file (deepspeedai#2279) * [pin_memory] make pin_memory select device type * Memory Access Utility (deepspeedai#2276) Co-authored-by: Ammar Ahmad Awan <[email protected]> * Fp32 accuracy bug fix (deepspeedai#2285) Co-authored-by: Arash Bakhtiari <[email protected]> Co-authored-by: Arash Bakhtiari <[email protected]> * Refactor universal checkpointing and tensor fragments (deepspeedai#2253) * Refactor universal checkpointing and tensor fragments * Formatting * [ds-inference] fix progress 
bar (deepspeedai#2286) when loading the non-sharded checkpoint update the progress bar (fix by @RezaYazdaniAminabadi) - I've just tested it to work. Co-authored-by: Olatunji Ruwase <[email protected]> * Offload all gradients to nvme (deepspeedai#2282) * fused bias relu unittest (deepspeedai#2297) * fix for pytest picking up local deepspeed dir instead of installed deepspeed (deepspeedai#2299) * Fix for Zero3 when MP>1 and at least one batch param undefined (deepspeedai#2289) Co-authored-by: anthony.301 <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * [downstream] merge from xpu support downstream * Unit test for bias add kernel (deepspeedai#2298) * added unit test * Update pt_binding.cpp * formatting * Update test_bias_add.py * Update relu.cu with mem_access_utils (deepspeedai#2306) * Add tensor parallel inference unit tests (deepspeedai#2232) Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Sam Ade Jacobs <[email protected]> * Fix the residual add mp scaling for GPTNeoX (deepspeedai#2310) * Add unit tests for residual_add kernels (deepspeedai#2307) * add inference eval scripts (deepspeedai#2303) * Upgrade P40 tests to torch 1.8 (deepspeedai#2316) Co-authored-by: Jeff Rasley <[email protected]> * ZeRO-Inference blog (deepspeedai#2271) * ZeRO-Inference blog * ZeRO-Inference blog * Format fixes * Apply feedback * Feedback * Update docs/_posts/2022-08-27-zero-inference.md Co-authored-by: Stas Bekman <[email protected]> * Update docs/_posts/2022-08-27-zero-inference.md Co-authored-by: Stas Bekman <[email protected]> * Address feedback * Format fixes * More tweaks * long sequence, nvme offload * Add image Co-authored-by: Stas Bekman <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * ZeRO-Inference blog - wrap up (deepspeedai#2321) * ZeRO-Inference blog - Update README (deepspeedai#2322) * refactor to use mem_access (deepspeedai#2317) * add quant unit test 
(deepspeedai#2315) * add quant unit test * add codeowner * format fix * fix undefined symbol: curandSetPseudoRandomGeneratorSeed * modify ref fn name and add comment * add comments * add 4bit quant 16groups * fix * modify groups in ref code * parameterize tensor shape * single param * detach tensor * remove -lcurand flag * add back -lcurand flag Co-authored-by: Ammar Ahmad Awan <[email protected]> * only override forward if using cuda-graph (deepspeedai#2291) * Add more options to inference benchmark (deepspeedai#2325) * bump to 0.7.4 * MOE residual matmult unit test (deepspeedai#2323) MOE residual matmul unit tests Co-authored-by: Michael Wyatt <[email protected]> Co-authored-by: Ammar Ahmad Awan <[email protected]> * [device] port cuda device to literal_device() in new tests * MOE matmult with memaccess (deepspeedai#2336) * Fix formatting * Remove redundant variable * Refactor residual add kernels (deepspeedai#2333) Co-authored-by: Ammar Ahmad Awan <[email protected]> * [accel_runtime] add pin_memory to accelerator runtime interface. 
* mem access for quantize kernel (deepspeedai#2331) * mem access for quantize kernel * format * format fp32 * modify quant kernel * modify quant kernel2 * modify format * format * fix comments in pytest * fix comments in pytest * format * rerun Co-authored-by: Ammar Ahmad Awan <[email protected]> Co-authored-by: Connor Holmes <[email protected]> * increase min pre-commit versions (deepspeedai#2346) * Extend scratch buffer for long prompts (deepspeedai#2212) Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * fix zero docs (deepspeedai#2350) * Inference profiling updates/fixes (deepspeedai#2348) (deepspeedai#2349) Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * Kernel Data Conversion Utility (deepspeedai#2327) * Unify macro definitions and constants in a single file * Conversion utility implementation. * Fix reversion from formatting * Bugfixes after testing with correct DeepSpeed * Inline markers are available on both HIP + CUDA * Add Onebit Optimzers in __init__ (deepspeedai#2340) Co-authored-by: Saeyeol Lee <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * [accelerator abstraction] merge from deepspeedai#2320 * docs(mixture-of-experts-inference): fix typo in tuto (deepspeedai#2345) Co-authored-by: Olatunji Ruwase <[email protected]> * download cifar to blob storage (deepspeedai#2342) Co-authored-by: Olatunji Ruwase <[email protected]> * Refactor gptj_residual_add kernels for better readability (deepspeedai#2358) Co-authored-by: Reza Yazdani <[email protected]> * Updated issue templates (deepspeedai#2363) * Update issue templates * fix cuda invalid config error in dequant kernel (deepspeedai#2362) * format * remove round fn * Add missing pytest fixture scope (deepspeedai#2353) Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * Extend 
residual_add kernel tests to conver pre_attn_norm (deepspeedai#2354) Co-authored-by: Jeff Rasley <[email protected]> * Refactor fused_bias_residual kernels for better readability (deepspeedai#2356) Co-authored-by: Olatunji Ruwase <[email protected]> * Capture error message during sweep tests (deepspeedai#2351) * Collect error messages in results.csv Co-authored-by: Michael Wyatt <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * fix an exception when recursively casting dicts to fp16 (deepspeedai#2370) * Refactor remaining distributed tests (deepspeedai#2216) * batch of refactored tests * more test refactoring * fp16 test refactor * more refactors * added DistributedFixture class * applied DistributedFixture to first batch of tests as a trial * added DistributedFixture test and documentation * last tests * fixes for refactored tests * remove subdirs in workflow files * fix pytest syntax error * fix another syntax error * update imports * use DistFixture with elastic checkpoint test * missing import * update to shared class tmpdir for elastic test * moved test files * avoid duplicate test file name * last refactor and moving test files * formatting * fix broken import * testing forked AMD tests * update abstract method * use blob storage for accelerate and transformers tests * upgrade torch for acclerate CI Co-authored-by: Olatunji Ruwase <[email protected]> * Fix the MLP output tensor's shape (deepspeedai#2380) * allow building with latest CUDA (11.8), it is backwards compatible (deepspeedai#2390) * pin transformers version for unit tests (deepspeedai#2402) * Change type to tuple in replace_wo_policy isinstance check (deepspeedai#2387) Update the isinstance check inside the `replace_wo_policy` function to `tuple` and `str` instead of `dict`, since the layers are provided as a `tuple` type. 
Co-authored-by: Lev Kurilenko <[email protected]> Co-authored-by: Molly Smith <[email protected]> Co-authored-by: Lok Chand Koppaka <[email protected]> Co-authored-by: Ammar Ahmad Awan <[email protected]> * Checkpoint backwards-compatbility workaround (deepspeedai#2384) * Add predicated global load (deepspeedai#2373) Co-authored-by: Reza Yazdani <[email protected]> * change call site of literal_device, on_accel_device and accel_runtime to get_accelerator() call * add new interface definition from olruwase/accelerator_abstraction * MII blog post (deepspeedai#2418) Co-authored-by: Samyam Rajbhandari <[email protected]> Co-authored-by: Ammar Ahmad Awan <[email protected]> * Fix figure reference (deepspeedai#2419) * [docs] update news items * [docs] add mii repo link * Add SLURM Multinode Runner (deepspeedai#2404) Signed-off-by: Dashiell Stander <[email protected]> Co-authored-by: Dashiell Stander <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * Fix issue with corrupted output on long generation for GPT (deepspeedai#2359) Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * MII blog title update on Readme * DeepSpeed-MII title change in website * Fix GPT Neo-X multi-gpu inference (deepspeedai#2401) Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * MII-Public and MII-Azure subheading in mii post * CI fixes related to triton (deepspeedai#2422) * [docs] update mii blog title (deepspeedai#2423) * add SD injection policy (deepspeedai#2381) Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> * [accelerator abstraction] remove name() from interface, device_name() should be used. 
* merge with master (ec13da6) * fix checkpoint loading when it is a dictionary (deepspeedai#2425) * Make error regex more generic in collect_results.py (deepspeedai#2415) Co-authored-by: Jeff Rasley <[email protected]> * fixes deepspeedai#2389 (deepspeedai#2411) truncating expert param storage for checkpointing Co-authored-by: Alexander Jipa <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * Fix for inference gpt-j test (deepspeedai#2430) * fix for gpt-j failing due to tokenizer error * limit number of gpt-j tokens generated due to low memory * Fixing bug 2361 (deepspeedai#2410) * fixing bug 2361 * adding pytest for config initialization * chaning expected output to FusedAdam * remove print statement * running yapf on modified files * running pre-commit formatting Co-authored-by: Olatunji Ruwase <[email protected]> * Universal checkpoint for zero stage 1 (deepspeedai#2284) * Refactor universal checkpointing and tensor fragments * Formatting * Support zero stage1; Expand TP dim * Remove debug prints * Detect sharded optimizer state * Format fixes * Encode reshaping guide * More symbolic constants Co-authored-by: Michael Wyatt <[email protected]> * only add deps if extra is explictly called (deepspeedai#2432) * Add TestInjectionPolicy inference unittest class for testing custom injection policies (deepspeedai#2426) This PR adds a TestInjectionPolicy inference unittest class for testing custom injection policies. This test differs from the existing tests in that the injection_policy dictionary is explicitly specified when calling the DeepSpeed init_inference API. The google/t5-v1_1-small text2text-generation model and the roberta-large fill-mask model are added as tests with the injection policy explicitly specified. This is done to expand our unittest coverage to test the path where the replace_wo_policy function is invoked (see deepspeedaiGH-2387). 
Co-authored-by: Lev Kurilenko <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * [memory estimators] new config args sync (deepspeedai#2431) Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * parallelize writing of layer checkpoint files across data parallel instances (deepspeedai#1419) * parallelize layer checkpoints across data parallel groups * use partition_uniform to determine start/end index values * formatting fix * config: add option for parallel write of layer checkpoints in pipeline stage * yapf fixes * enable parallel layer write according to config param * avoid extraneous makedir when rank 0 writes all layers Co-authored-by: Olatunji Ruwase <[email protected]> * Fix broken link to DeepSpeed Megatron fork (deepspeedai#2440) Co-authored-by: Lev Kurilenko <[email protected]> * bump to 0.7.5 * [OpBuilder] Add op builder abstraction * convert op builder usage in merged code * merge diff files from upstream * [OpBuilder] add create_op_builder interface in abstract_accelerator.py * remove files that is deleted from upstream * [OpBuilder] add left over op builder usage in tests * [OpBuilder] fix op builder usage in tests * [OpBuilder] fix <op builder>.NAME usage in tests to follow op builder abstraction design * import get_accelerator from deepspeed.accelerator directly * [OpBuilder] remove unused function and sync with main * add missing import * revert changes in device.py to avoid conflict with main * fix alexnet_model to use /tmp instead of /blob * Mingzhi/solve pr108 b (deepspeedai#115) * move ALL_OPs from __init__.py to all_Op.py to solve circular import * delete deepspeedexamples * fix import * fix regression (deepspeedai#117) * fix pin_memory * fix regression * fix error Signed-off-by: Dashiell Stander <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Mikhail Druzhinin <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> 
Co-authored-by: Minjia Zhang <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> Co-authored-by: Kamal Raj <[email protected]> Co-authored-by: Conglong Li <[email protected]> Co-authored-by: Ammar Ahmad Awan <[email protected]> Co-authored-by: Arash Bakhtiari <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Zhihong Chen <[email protected]> Co-authored-by: Siddharth Singh <[email protected]> Co-authored-by: Connor Holmes <[email protected]> Co-authored-by: 叶志晟 <[email protected]> Co-authored-by: Molly Smith <[email protected]> Co-authored-by: trajep <[email protected]> Co-authored-by: chenguo <[email protected]> Co-authored-by: Arash Bakhtiari <[email protected]> Co-authored-by: Stas Bekman <[email protected]> Co-authored-by: Quentin Anthony <[email protected]> Co-authored-by: anthony.301 <[email protected]> Co-authored-by: Sam Ade Jacobs <[email protected]> Co-authored-by: Guanhua Wang <[email protected]> Co-authored-by: Saeyeol Lee <[email protected]> Co-authored-by: Saeyeol Lee <[email protected]> Co-authored-by: Jean-Louis Queguiner <[email protected]> Co-authored-by: Matt Smith <[email protected]> Co-authored-by: Thomas-MMJ <[email protected]> Co-authored-by: lekurile <[email protected]> Co-authored-by: Lev Kurilenko <[email protected]> Co-authored-by: Molly Smith <[email protected]> Co-authored-by: Lok Chand Koppaka <[email protected]> Co-authored-by: Samyam Rajbhandari <[email protected]> Co-authored-by: Dashiell Stander <[email protected]> Co-authored-by: Dashiell Stander <[email protected]> Co-authored-by: Andrey Chernykh <[email protected]> Co-authored-by: Alexander Jipa <[email protected]> Co-authored-by: Alexander Jipa <[email protected]> Co-authored-by: Joe Mayer <[email protected]> Co-authored-by: Adam Moody <[email protected]> Co-authored-by: AGUL <[email protected]>
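
For anyone reading the commit list above, the #1419 entries ("parallelize layer checkpoints across data parallel groups", "use partition_uniform to determine start/end index values") come down to slicing a pipeline stage's layer list among the data parallel ranks. Here is a minimal sketch of that idea. The helpers `uniform_bounds` and `layers_for_rank` are hypothetical names written for this illustration; they are not DeepSpeed's `partition_uniform` or its checkpoint code, and the real implementation may handle remainders differently.

```python
from typing import List, Sequence


def uniform_bounds(num_items: int, num_parts: int) -> List[int]:
    """Hypothetical helper: return num_parts + 1 boundaries that split
    range(num_items) as evenly as possible (earlier parts get the extras)."""
    base, extra = divmod(num_items, num_parts)
    bounds = [0]
    for part in range(num_parts):
        bounds.append(bounds[-1] + base + (1 if part < extra else 0))
    return bounds


def layers_for_rank(layer_ids: Sequence[int], dp_rank: int, dp_world_size: int) -> List[int]:
    """Hypothetical helper: the layers one data parallel rank would write."""
    bounds = uniform_bounds(len(layer_ids), dp_world_size)
    return list(layer_ids[bounds[dp_rank]:bounds[dp_rank + 1]])


# The example from the PR description: 10 layers, 2 data parallel instances.
stage_layers = list(range(10))
for rank in range(2):
    print(rank, layers_for_rank(stage_layers, rank, 2))
```

With 10 layers and 2 data parallel instances, each rank gets 5 layers, matching the PR description. With more ranks than layers, some ranks get an empty slice and write nothing.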

* refactor to use mem_access (#2317) * add quant unit test (#2315) * add quant unit test * add codeowner * format fix * fix undefined symbol: curandSetPseudoRandomGeneratorSeed * modify ref fn name and add comment * add comments * add 4bit quant 16groups * fix * modify groups in ref code * parameterize tensor shape * single param * detach tensor * remove -lcurand flag * add back -lcurand flag Co-authored-by: Ammar Ahmad Awan <[email protected]> * only override forward if using cuda-graph (#2291) * Add more options to inference benchmark (#2325) * bump to 0.7.4 * MOE residual matmult unit test (#2323) MOE residual matmul unit tests Co-authored-by: Michael Wyatt <[email protected]> Co-authored-by: Ammar Ahmad Awan <[email protected]> * MOE matmult with memaccess (#2336) * Fix formatting * Remove redundant variable * Refactor residual add kernels (#2333) Co-authored-by: Ammar Ahmad Awan <[email protected]> * mem access for quantize kernel (#2331) * mem access for quantize kernel * format * format fp32 * modify quant kernel * modify quant kernel2 * modify format * format * fix comments in pytest * fix comments in pytest * format * rerun Co-authored-by: Ammar Ahmad Awan <[email protected]> Co-authored-by: Connor Holmes <[email protected]> * increase min pre-commit versions (#2346) * Extend scratch buffer for long prompts (#2212) Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * fix zero docs (#2350) * Inference profiling updates/fixes (#2348) (#2349) Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * Kernel Data Conversion Utility (#2327) * Unify macro definitions and constants in a single file * Conversion utility implementation. 
* Fix reversion from formatting * Bugfixes after testing with correct DeepSpeed * Inline markers are available on both HIP + CUDA * Add Onebit Optimzers in __init__ (#2340) Co-authored-by: Saeyeol Lee <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * docs(mixture-of-experts-inference): fix typo in tuto (#2345) Co-authored-by: Olatunji Ruwase <[email protected]> * download cifar to blob storage (#2342) Co-authored-by: Olatunji Ruwase <[email protected]> * Refactor gptj_residual_add kernels for better readability (#2358) Co-authored-by: Reza Yazdani <[email protected]> * Updated issue templates (#2363) * Update issue templates * fix cuda invalid config error in dequant kernel (#2362) * format * remove round fn * Add missing pytest fixture scope (#2353) Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * Extend residual_add kernel tests to conver pre_attn_norm (#2354) Co-authored-by: Jeff Rasley <[email protected]> * Refactor fused_bias_residual kernels for better readability (#2356) Co-authored-by: Olatunji Ruwase <[email protected]> * Capture error message during sweep tests (#2351) * Collect error messages in results.csv Co-authored-by: Michael Wyatt <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * fix an exception when recursively casting dicts to fp16 (#2370) * Refactor remaining distributed tests (#2216) * batch of refactored tests * more test refactoring * fp16 test refactor * more refactors * added DistributedFixture class * applied DistributedFixture to first batch of tests as a trial * added DistributedFixture test and documentation * last tests * fixes for refactored tests * remove subdirs in workflow files * fix pytest syntax error * fix another syntax error * update imports * use DistFixture with elastic checkpoint test * missing import * update to shared class tmpdir for elastic test * moved test files * avoid duplicate test file name * last refactor and 
moving test files * formatting * fix broken import * testing forked AMD tests * update abstract method * use blob storage for accelerate and transformers tests * upgrade torch for acclerate CI Co-authored-by: Olatunji Ruwase <[email protected]> * Fix the MLP output tensor's shape (#2380) * allow building with latest CUDA (11.8), it is backwards compatible (#2390) * pin transformers version for unit tests (#2402) * Change type to tuple in replace_wo_policy isinstance check (#2387) Update the isinstance check inside the `replace_wo_policy` function to `tuple` and `str` instead of `dict`, since the layers are provided as a `tuple` type. Co-authored-by: Lev Kurilenko <[email protected]> Co-authored-by: Molly Smith <[email protected]> Co-authored-by: Lok Chand Koppaka <[email protected]> Co-authored-by: Ammar Ahmad Awan <[email protected]> * Checkpoint backwards-compatbility workaround (#2384) * Add predicated global load (#2373) Co-authored-by: Reza Yazdani <[email protected]> * MII blog post (#2418) Co-authored-by: Samyam Rajbhandari <[email protected]> Co-authored-by: Ammar Ahmad Awan <[email protected]> * Fix figure reference (#2419) * [docs] update news items * [docs] add mii repo link * Add SLURM Multinode Runner (#2404) Signed-off-by: Dashiell Stander <[email protected]> Co-authored-by: Dashiell Stander <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * Fix issue with corrupted output on long generation for GPT (#2359) Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * MII blog title update on Readme * DeepSpeed-MII title change in website * Fix GPT Neo-X multi-gpu inference (#2401) Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * MII-Public and MII-Azure subheading in mii post * CI fixes related to triton (#2422) * [docs] update mii blog title (#2423) * add SD injection policy (#2381) Co-authored-by: Reza Yazdani <[email protected]> 
Co-authored-by: Reza Yazdani <[email protected]> * fix checkpoint loading when it is a dictionary (#2425) * Make error regex more generic in collect_results.py (#2415) Co-authored-by: Jeff Rasley <[email protected]> * fixes #2389 (#2411) truncating expert param storage for checkpointing Co-authored-by: Alexander Jipa <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * Fix for inference gpt-j test (#2430) * fix for gpt-j failing due to tokenizer error * limit number of gpt-j tokens generated due to low memory * Fixing bug 2361 (#2410) * fixing bug 2361 * adding pytest for config initialization * chaning expected output to FusedAdam * remove print statement * running yapf on modified files * running pre-commit formatting Co-authored-by: Olatunji Ruwase <[email protected]> * Universal checkpoint for zero stage 1 (#2284) * Refactor universal checkpointing and tensor fragments * Formatting * Support zero stage1; Expand TP dim * Remove debug prints * Detect sharded optimizer state * Format fixes * Encode reshaping guide * More symbolic constants Co-authored-by: Michael Wyatt <[email protected]> * only add deps if extra is explictly called (#2432) * Add TestInjectionPolicy inference unittest class for testing custom injection policies (#2426) This PR adds a TestInjectionPolicy inference unittest class for testing custom injection policies. This test differs from the existing tests in that the injection_policy dictionary is explicitly specified when calling the DeepSpeed init_inference API. The google/t5-v1_1-small text2text-generation model and the roberta-large fill-mask model are added as tests with the injection policy explicitly specified. This is done to expand our unittest coverage to test the path where the replace_wo_policy function is invoked (see GH-2387). 
Co-authored-by: Lev Kurilenko <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * [memory estimators] new config args sync (#2431) Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * parallelize writing of layer checkpoint files across data parallel instances (#1419) * parallelize layer checkpoints across data parallel groups * use partition_uniform to determine start/end index values * formatting fix * config: add option for parallel write of layer checkpoints in pipeline stage * yapf fixes * enable parallel layer write according to config param * avoid extraneous makedir when rank 0 writes all layers Co-authored-by: Olatunji Ruwase <[email protected]> * Fix broken link to DeepSpeed Megatron fork (#2440) Co-authored-by: Lev Kurilenko <[email protected]> * bump to 0.7.5 * Fix Bug #2319 (#2438) Co-authored-by: Jeff Rasley <[email protected]> * update pytorch pool operator function signiture (#2443) * update pytorch pool operator function signiture * fix the case where kwargs is None * Fix build issues on Windows (#2428) * Fix build issues on Windows * small fix to complie with new version of Microsoft C++ Build Tools Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> * rollback ds config changes (#2395) * rollback ds config changes * fix format * Fix error when output_file is a relative path without a prefix (#2397) Co-authored-by: Benjamin Steenhoek <[email protected]> * fix restuls and exprs path to use absolute path * write out optimial config after tuning * fix format * assert tuning result dir creation Co-authored-by: Benjamin Steenhoek <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * Use CUDA events for inference model profiling (#2371) * use cuda event timers for model profiling * Fixing a mismatch in basic adam test. 
(#2447) * Reduction Kernel Utility (#2436) * Initial reduction_utils.h implementation * Add initialization helper, ensures correct min/max behavior * Remove unnecessary warp sync * deepspeed/launcher/launch.py: add option '--enable_each_rank_log logdir' (#2409) * Fixes for various CI problems (#2457) * check only major CUDA version in CI * update expected torch latest version * pin torch latest to 1.12 until issues with 1.13 are resolve * wrong expected torch version * Update nv-torch18-v100.yml * remove forked from pytest option due to cuda re-initialization errors * removed expected torch version from inference tests, causing errors currently * fix various bugs that popped up * move all tests over to cu111 runners, cu113 runners having problems * Cache Allocation and Softmax Fixes (#2433) Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * fixing the checkpoint loading at inference-engine (#2429) Co-authored-by: Ammar Ahmad Awan <[email protected]> * Create a new folder structure to isolate model-specific code in DS (#2464) * don't gather partitioned activations for mp size 1 (#2454) * don't gather partitioned activations for mp size 1 * add inline comment for the change Co-authored-by: Olatunji Ruwase <[email protected]> * Updating autotune json default in docs. (#2476) * Updating autotune default in docs. * Running pre-commit. 
* Added MLFLOW environment variables for logging metrics within trainig… (#2477) * Added MLFLOW environment variables for logging metrics within trainign script * exporting MLFlow env variables from AML env Co-authored-by: Cheng Li <[email protected]> * fix accelerate link (#2481) * Add correct memory-allocation at DeepSpeed-Attention (#2474) Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Connor Holmes <[email protected]> * Fix CI issues related to cupy install (#2483) * remove any cupy install when setting up environments * revert previous changes to run on cu111 runners * fix for when no cupy is installed * remove cupy uninstall for workflows not using latest torch version * update to cu116 for inference tests * fix pip uninstall line * move python environment list to after DS install * remove cupy uninstall * re-add --forked * fix how we get cupy version (should be based on nvcc version) * [docs] add SD tutorial to news * [docs] add SD tutorial to deepspeed.ai news * Add `scale_attn_by_inverse_layer_idx` feature (#2486) * Add scale_attn_by_inverse_layer_idx feature * Fix layer_id bug * Fix scaling value Co-authored-by: Connor Holmes <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> * Stable Diffusion Enhancements (#2491) Co-authored-by: cmikeh2 <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> * stage_1_and_2.py: no allreduce needed when mp size is 1 (#2494) * Make bf16_optimizer work for non pipeline (#2470) * Fix nightly CI tests (#2493) * fix for lm-eval nightly tests and add gpt-j to MPtest because OOM on single GPU * add nv-nightly badge * Make data contiguous before the inplace reshape-copy_ function (#2489) Co-authored-by: Michael Wyatt <[email protected]> * Fix typos: deepseed -> deepspeed (#2499) * bump to 0.7.6 * DeepSpeed inference config. 
(#2459) (#2472) Changes to inference API to use accept a config dict and cleaning up Inference Engine to utilize the newly added inference config. Co-authored-by: Michael Wyatt <[email protected]> * Update docs to autogenerate pydantic config model docs (#2509) * update zero config docs * add autogenerated docs for pydantic models used in ZeRO and Inference configs * Add max_tokens alias to max_out_tokens arg to maintain backwards compatibility (#2508) This PR adds a max_tokens alias to the max_out_tokens argument in the init_inference API to support backwards compatibility after the config refactor PR https://github.com/microsoft/DeepSpeed/pull/2472. Thanks @molly-smith and @mrwyattii. * Deepspeed quantization library v0.1 (#2450) * Initial commit Deepspeed quantization library * Match function signatures * Add Quantization Kernel * adding offset comparision and precommit changes * format fixes * FIle name changes * pt_binding_changes * test name change * Integer quantization, minor refactors * Add directed test_case * format fixes * Move param calculation to constructor of params class * Use local function and add elemsPerBlock * change function to be specalized * sub block reduce * add new schedule * Add new schedule test case * fix illegal writes in sch1 * Style fixes in comments Co-authored-by: Connor Holmes <[email protected]> * Fix backward compatibility for InferenceConfig (#2516) * Make new InferenceConfig backwards compatible with previous init_inference API Co-authored-by: Jeff Rasley <[email protected]> * Add missing Inference sub-configs (#2518) * Add note about nvcc/hipcc requirement (#2519) * Update codeowners (#2525) * Initial dequant library implementation (#2521) * Fixes for torch 1.14 due to new torch.numel return type (#2522) * fixes for new torch.numel return type * address comment * Ensure is initialized for SD (#2534) * Make DS-Inference config readable from JSON (#2537) * Add MII tests (#2533) Adding MII tests to ensure changes to 
DS-Inference do not break MII * Remove mutable default parameter in init_inference() (#2540) A mutable default value is dangerous because editing it will change the value for all future calls to the function. The value is itself edited later in the function, so this problem will likely be encountered sooner or later. Co-authored-by: Michael Wyatt <[email protected]> * Change Where DS/Triton is Used in Stable Diffusion (#2536) * Change utilization of DS/Triton kernels * add config at Clip-encoder Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> * Pass down the new DS inference config to replace_transformer_layer. (#2539) * pass down the new DS inference config to replace_transformer_layer. * remove quantize_settings and rename the ep_mp_group. * Fix model_config passing. Fixes gptj issue with wrong output. * fix small bug in gpt-neo. Co-authored-by: Reza Yazdani and Michael Wyatt * Adding Gradient Accumulation Data Type Config (#2512) * Adding gradient accumulation dtype config. * Switching to new DtypeEnum * Adding standalone check function, and unit tests * Variable disambiguation * Adding checks for unsupported states. * Updating for PR comments. * Reorganizing unit test. 
Co-authored-by: Olatunji Ruwase <[email protected]> * Report progress at gradient accumulation boundary (#2553) * report progress at gradient accumulation boundary * format * format * encoded ds config into command line argument when launching child processes in autotuning (#2524) * rollback ds config changes * fix format * Fix error when output_file is a relative path without a prefix (#2397) Co-authored-by: Benjamin Steenhoek <[email protected]> * fix restuls and exprs path to use absolute path * use base64 encoded ds config as cmd arg * fix format * remove assert * write out optimial config after tuning * fix format * no need to update ds config path when encoding ds config * udpate * do not use abs path for result and expr dir * fix conflicts * fix run mode * fix format * fix format Co-authored-by: Benjamin Steenhoek <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * add missing moe deprecated fields to inference config (#2556) * Abstract accelerator (step 1) (#2504) * Establish building block of abstract accelerator * Change .*Tensor variable to @property * [op builder] add op builder reflection to allow enumerate of builders in all_ops.py and builder_names.py * change @abstractproperty to @property @abstractmethod Co-authored-by: Olatunji Ruwase <[email protected]> * Fix invalid check of recorded parameter orders in zero stage3. (#2550) Co-authored-by: Olatunji Ruwase <[email protected]> * bump to 0.7.7 * docs: Update the recent url for Megatron-LM (#2564) * use get_global_rank if available (#2567) * Add Determined to open-source DL frameworks (#2573) * Support fp32 gradaccum for bf16 model (#2566) * allow bf16 model with fp32 gradient accumulation datatype * allow fp32 gradient accumulation and bfloat16 model in amp mode * alternative fix for grad accumulation type mismatch. 
In the case of zero optimizer we should have grad accum type == model data type Co-authored-by: Olatunji Ruwase <[email protected]> * Drop Maxwell Support (#2574) * Officially drop Maxwell support * Formatting * Comparison mismatch fix * Fix quantized-inference & Add generic support of checkpoint loading (#2547) * fix checkpoint loading when it is a dictionary * fix some issues with saving ckpt & int8 inference * fix quantized-inference & add generic support of checkpoint loading * remove int8 hard-coded flag * fix mlp return tensors * fix several issue to load checkpoints of GPT-J, GPT-NEOX, and OPT with different TP-size * add more comments & description for checkpoint-loading module Co-authored-by: Michael Wyatt <[email protected]> * Fix MegatronLayerPolicy to have megatron_v2=True (#2579) This PR updates the MegatronLayerPolicy to set megatron_v2=True, which is required in order to properly transpose in the replace_with_policy() function. After the change in this PR, in conjunction with PR #99 in the Megatron-DeepSpeed fork, the Megatron text-generation example works with DS inference. * Update barrier and reduce_scatter_base to conform to PyTorch signatures (#2570) Co-authored-by: Jeff Rasley <[email protected]> * Support N-dimension input in quantization kernel (#2575) * Add support for inputs > 2D * use vec * Add N-Dim support to Dequant kernel * merge master and fix format * Bug Fix * fix num_bits * Fix dequant Co-authored-by: Connor Holmes <[email protected]> * Add checkpoint sharding unit tests (#2561) * added checkpopint sharding tests * Updating docs README (#2587) * Updating docs README with API update procedure. * Addressing comments. Co-authored-by: Jeff Rasley <[email protected]> * Updating API docs (#2586) Co-authored-by: Jeff Rasley <[email protected]> * Fix issues w. 
python 3.6 + add py-version checks to CI (#2589) * get mask token from tokenizer (#2592) * bump to 0.7.8 * DeepSpeed Data Efficiency Library (#2585) Co-authored-by: Jeff Rasley <[email protected]> * fix blog link (#2600) * Migrate ops tests to new inference_ops marker (#2599) * Migrate ops tests to new inference_ops marker * Disable by default * Add missing test cases * Reorder such that inference_ops will run[fail] first * Move layer norm to new schedule (#2590) * Move layer norm to new schedule * Pre-commit fixes * fix comments * format fixes * Merge unrolls * format fixes * camelCase * format fixes * revert unwanted file * move pow2 function * format fixes Co-authored-by: Connor Holmes <[email protected]> * [deepspeed/autotuner] Bug fix for binary search for batch size (#2162) * bug fix for binary search for batch size * fix binary search termination condition * add fix for older pydantic versions (#2611) * Use rocm/pytorch:latest (#2613) * skip torch.zeros and tensor.copy_ when model parallel is not used (#2479) Co-authored-by: Olatunji Ruwase <[email protected]> * call empty_cache to really free up GPU memory as described in comment (#2620) Co-authored-by: Olatunji Ruwase <[email protected]> * Remove GatheredParameters context from replace_with_policy (#2591) This PR removes the zero-infernece GatheredParameters context from replace_with_policy due to no longer needing zero-inference after the introduction of meta tensor support for BLOOM. 
* fixes #2498 (#2603) taking gradient accumulation steps into account for throughput calculation Co-authored-by: Alexander Jipa <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * Update AVX512 Detection (#2621) * Update cpuinfo AVX512 detection * Missing conversion from `_mm256` to `_mm256i` Co-authored-by: Olatunji Ruwase <[email protected]> * Add Megatron CI workflow (#2614) * added megatron unit test * Update nv-megatron.yml Co-authored-by: Olatunji Ruwase <[email protected]> * [inference] check for unsupported model generate args (#2627) * [launcher] parse hostfile via regex and added error checks (#2626) * Unit tests setup own venv (#2628) add reusable workflow that sets up fresh venv for each test and prints relevant environment info * add enable_each_rank_log to deepspeed/launcher/runner.py (#2571) * Fix typo in autotuner.py (#2639) * [zero-3] Handle forward parameter return correctly in nested cases (#2642) Co-authored-by: Stas Bekman <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * [inference] ds-attention refactor w.r.t. ops (#2623) * Fix issue w. 
bloom when changing tp size (#2645) * fix assertion error in zero stage 3 (#2647) * tweaks to ds-attn, distilbert policy, and mup (#2649) * [doc] fix `min_loss_scale` default (#2660) * [doc] fix `min_loss_scale` default * align * [launcher] fail gracefully if hostname -i doesn't work as expected (#2631) Co-authored-by: Olatunji Ruwase <[email protected]> * Fix Opt injection (#2541) * fix Opt injection & add injection verification check at inference test * fix several issues * remove fixture * remove check_injection when no kerenl is injected Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * Abstract accelerator (step 2) (#2560) * Abstract accelerator (step 2) * more flex op_builder path for both installation and runtime * add SpatialInferenceBuilder into cuda_accelerator.py * use reflection to make cuda_accelerator adapt to CUDA op builder change automatically * clean up deepspeed/__init__.py * add comments in cuda_accelerator for no torch path * Update deepspeed/env_report.py Change env_report.py according to suggestion Co-authored-by: Michael Wyatt <[email protected]> * reduce the range of try...except for better code clarity * Add porting for deepspeed/ops/random_ltd/dropping_utils.py * move accelerator to top directory and create symlink under deepspeed Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * Remove unnecessary device synchronization for stage 2 (#2500) * Remove unnecessary device synchronization for stage 2 * Remove unnecessary device synchronization for stage 2 Co-authored-by: liyidong.lyd <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Joe Mayer <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * [Bug Fixed] use torch.cuda.is_available() (#2661) Co-authored-by: Olatunji Ruwase <[email protected]> * [fp16] lower initial_scale_power (#2663) 
Co-authored-by: Olatunji Ruwase <[email protected]> * fix Tensor contiguous bug in model_compression (#2671) double check the unit tests * [inference] ds-mlp refactor w.r.t. ops (#2668) * real_accelerator validation check for both accelerator and deepspeed.accelerator path (#2685) * remove duplicated code in ZeRO (#2655) Co-authored-by: Olatunji Ruwase <[email protected]> * Add mlflow logging for aml (#2495) * add logging changes * try w/out abspath * undo last change * start mlflow debug * remove mlflow from export_envs * add mlflow logging for reversed * remove mlflow.start_run * add back start run * don't clean cmd * print os environment variables * remove first start run * add run_id to mlflow star * remove context managers * move last end run * add extra parent start_runs * add run id logging * add logging to run_ds_config * change run_id to run_name * add back context managers and run_id logs * remove context mng * debug environment variable * reset environment variables * add env variable deletion * clean up * remove unused import * fix yapf/whitespace errors Co-authored-by: Cheng Li <[email protected]> * fix import path to op_builder (#2687) Co-authored-by: Olatunji Ruwase <[email protected]> * Pass training flag to forward call from Eval (#2604) Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> * Extend quantization utils features (#2683) * Extend quantization utils features * remove unwanted files * fix cahce setting Co-authored-by: Connor Holmes <[email protected]> * [GatheredParameters] add support for any iterator (#2664) Co-authored-by: Olatunji Ruwase <[email protected]> * fix for latest diffusers (#2699) Co-authored-by: Jeff Rasley <[email protected]> * exclude benchmarks during install (#2698) * using correct loss scale in zero step (#2695) Co-authored-by: Olatunji Ruwase <[email protected]> * non-MoE stage 1 requires CG disabled (#2703) Co-authored-by: Olatunji Ruwase <[email protected]> * remove 
print side effect from importing deepspeed (#2704) * ZeRO3 handling frozen weights] (#2653) * bump to 0.8.1 * CUDA optional deepspeed ops (#2507) * CPU-Adam: add compile-flag to enable param-copy from CPU to GPU * guarde the CUDA-related include files and variables * remove CUDA dependency from op_builder when building against CPU * fixing the builder issues * fix formatting * return true when there is no mismatch on the cuda version * guard for when cuda is not available & test with cpu-only environment * Update cpu_adam and cpu_adagrad * Format fixes * Add configurable half precision type; Build/run in CUDA environment * Run cpu_adam and cpu_adagrad in cpu only environment * Mark CUDA only unit tests * CPU environment CI * Format fixes * Remove --forked * Add --forked * CPU only CI should pass * Format fixes * Format fixes * Remove scattered pytest.skip * Fix cpu_adam unit test * Update .github/workflows/nv-torch-latest-cpu.yml Co-authored-by: Michael Wyatt <[email protected]> * Update .github/workflows/nv-torch-latest-cpu.yml Co-authored-by: Michael Wyatt <[email protected]> * Address PR feedback * OpenMP linking * Fix unit tests Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * remove master branch from CI triggers (#2712) * [install] only add deepspeed pkg at install (#2714) Co-authored-by: Olatunji Ruwase <[email protected]> * update for lm-eval==0.3.0 (#2713) Co-authored-by: Jeff Rasley <[email protected]> * BF16 optimizer for BF16+ZeRO Stage 1 (#2706) * BF16 optimizer only with ZeRO stage 1. * Updating to grad accum of fp32 for BF16 ZeRO1 case. 
Co-authored-by: Olatunji Ruwase <[email protected]> * fix typo (#2718) Co-authored-by: Jeff Rasley <[email protected]> * Inference Refactor (replace_with_policy, model_implementations) (#2554) Co-authored-by: Lev Kurilenko <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * Change zero_grad() argument to match pytorch (#2741) * Automatic tensor parallelism v2 (#2670) * loop through pipe.model * tp_parser first draft * client_module must be type object * Simplify layernorm tracking. Add unittest. * cleanup * Add more models to unittest * cleanup inference pytest for merging * Add unittest * cleanup * pre-commit * unittest id and pytest marker * try marian for unittest * precommit * Move tp code to seperate file * Add new auto tp file * pre-commit and type * Update deepspeed/module_inject/auto_tp.py Co-authored-by: Michael Wyatt <[email protected]> * Update deepspeed/module_inject/auto_tp.py Co-authored-by: Michael Wyatt <[email protected]> * Update tests/unit/inference/test_inference.py Co-authored-by: Michael Wyatt <[email protected]> * remove unused fillmask function Co-authored-by: Michael Wyatt <[email protected]> * fixing optimizer sanity check (#2742) Co-authored-by: Olatunji Ruwase <[email protected]> * [GatheredParameters] fix memory leak (#2665) * [GatheredParameters] fix memory leak * simplify * cleanup and move * style * Formatting * fix test * fix test * fix test take 2 * Trigger CI Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Joe Mayer <[email protected]> * Abstract accelerator (step 3) (#2677) * Integrate accelerator abstraction interface into deepspeed/ * Fix error message in fp16/fused_optimizer * fix error message in fp16/unfused_optimizer.py * assign get_accelerator().pin_memory() result to input Tensor name * no need to check cuda and whether nvtx supported * move try-except into inner most block * call Event() and Stream() in get_accelerator() for data type 
* Make Stream and Event as properties of abstract interface so they can be used as data type in deepspeed * Apply op_builder backend api change from #2705 from @jeffra * fix tests where Builder NAME is used * keep original ...Builder.NAME interface instead of ...Builder().NAME interface * fix builder closure for installation * fix randomltd builder * add comments to clarify create_op_builder and get_op_builder * fix compatibility with pip install -e Co-authored-by: Cheng Li <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * Fix autotuning so that it records Floating Point Operations per second, not microsecond (#2711) * Fix how autotuning reports TFLOPS so that they are reported in FLOPS per second, not millisecond Co-authored-by: Nick Sarkauskas <[email protected]> Co-authored-by: Quentin Anthony <[email protected]> Signed-off-by: Dashiell Stander <[email protected]> * Actually it is microseconds -> seconds Signed-off-by: Dashiell Stander <[email protected]> * Actually it is microseconds -> seconds Signed-off-by: Dashiell Stander <[email protected]> Signed-off-by: Dashiell Stander <[email protected]> Co-authored-by: Nick Sarkauskas <[email protected]> Co-authored-by: Quentin Anthony <[email protected]> * fix a mispelled attribute (#2750) Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * [zero] remove misleading dtype log (#2732) Co-authored-by: Olatunji Ruwase <[email protected]> * Fix softmax backward (#2709) * Reset KV-cache at the beginning of text-generation * Add new backward kernel to handle large softmax-length * remove unrelated changes Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Connor Holmes <[email protected]> * Skip test_bias_gelu unit test if torch < 1.12 (#2754) This PR adds a torch version check in the test_bias_gelu unit test to skip if the torch version < 1.12. This is due to gelu implementation differences in versions prior to 1.12. 
* Add environment variable to make nvcc compilation more verbose (#2759) * Bing/formatting correction (#2764) * modify engine.py for formatting * commit formatting changes on engine.py * Add links to new azureML examples (#2756) Co-authored-by: Jeff Rasley <[email protected]> * Fix hardcoded instances to fp16 in optimizer creation log messages to the correct dtype. (#2743) * Remove hardcoded instances to fp16 in log messages. * Add model_dtype to print the correct format * Respond to PR feedback --------- Co-authored-by: Olatunji Ruwase <[email protected]> * Refactor/Pydantify monitoring config (#2640) * pydantify monitoring configs --------- Co-authored-by: Olatunji Ruwase <[email protected]> * Pin minimum `packaging` requirement (#2771) Co-authored-by: Jeff Rasley <[email protected]> * Fix for diffusers v0.12.0 (#2753) Co-authored-by: Jeff Rasley <[email protected]> * some fix in flops_profiler (#2068) * bugs in profiler: 1. Tensor.bmm missed in _patch_tensor_methods function 2. missed funtions in _reload_functionals and _reload_tensor_methods functions 3. torch.mm and torch.Tensor.mm will have same __name__ in wrapFunc, my suggustion is use __str__ instead. 
* formatting --------- Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Cheng Li <[email protected]> * fix upsample flops compute by skipping unused kargs (#2773) * fix upsample flops compute by skipping unused kargs * fix format * Fix broken kernel inject bug (#2776) * Fix Checkpoint-loading with Meta-tensor (#2781) * Reset KV-cache at the beginning of text-generation * Pass the ckpt-loading arguments to work with meta-tensor * remove unrelated changes * add support for hjson config files (#2783) Co-authored-by: Olatunji Ruwase <[email protected]> * Reset KV-cache at the beginning of text-generation (#2669) Co-authored-by: Martin Cai <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * Container param cleanup + remove qkv_merging (#2780) This PR cleans up some container items and removes an unused qkv_merging parameter: - Remove qkv_merging=True from BERT containers - Change containers config object to ds_model_config - Remove qkv_merging param * Common location to install libaio-dev (#2779) * Common location to install libaio-dev * Update .github/workflows/setup-venv/action.yml Co-authored-by: Michael Wyatt <[email protected]> --------- Co-authored-by: Michael Wyatt <[email protected]> * Fixing broken link to azureml-examples recipes (#2795) * remove outdated comment (#2786) Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * Enable page-locked tensors without CUDA (#2775) * Enable page-locked memory in cpu only env * Enable page-locked memory in cpu only env * Formatting * Add TODOs; Release page-locked memory * Update perf microbenchmark; Reduce unit test memory * Reduce CI mem usage * Add container load checkpoint error reporting + refactor (#2792) This PR refactors the organization of meta tensor checkpoint loading as follows: - Move get_param_names() abstract method definition from TransformerPolicy into MetaTensorContainer - Model-specific get_param_names() definitions 
moved from policy into model-specific container - selected_policy_g, megatron_v2_g, and transformer_config_g globals replaced with a single container_g global, since the container will contain all of the information those globals previously captured - ckpt_load_enabled flag added to containers that's set to False by default in the base.py container and gets set to True when the MetaTensorContainer feature is inherited - Assertion added to replace_transformer_layer before performing checkpoint loading to check if ckpt_load_enabled ==True, otherwise an error message will be printed saying that the container does not support meta tensor checkpoint loading. The aim of these changes is to more closely couple meta tensor checkpoint loading code to the MetaTensorContainer and to allow for better error reporting of load checkpoint use on model types that don't support this feature. * Add user defined launcher args for PDSH launcher (#2804) * Add user defined launcher args for PDSH launcher * Formatting fixes * Fix Slurm launcher user args (#2806) Fix missing connections from --launcher_args to Slurm srun command. 
* Handle hanged tests in CI (#2808) * Fix inference CI device error (#2824) * Fix permissions issue with pip upgrade (#2823) * fix permissions issue with pip upgrade * install to .local instead of use sudo * upgrade pip in venv * Update action.yml * fix typos * Fix cpu-only CI hangs (#2825) * don't run tests in parallel * make AsyncIO test sequential * Fix Pipeline Parallel resize unit test (#2833) * fix overlapping checkpoint names in unit tests * remove running cpu-only on master merge * Fix auto TP for duplicate modules with different gems (#2784) * Fix auto TP for duplicate modules with different gems * precommit and comments * Comment * Combine gem list of same named modules * remove duplicates from gem_list before updating policy * Add module attribute with name variation for ProphetNet --------- Co-authored-by: Jeff Rasley <[email protected]> * Refactor DS inference API. No longer need replace_method. (#2831) Co-authored-by: Michael Wyatt <[email protected]> * Port Reza's INT8-quantization fix to container architecture (#2725) Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Heyang Qin <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * Fix gpt-Neox rotary embedding implementation (#2782) Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * fix for cpu-only tests (#2849) * bump to 0.8.2 * add auto-generated PR workflow (#2822) * add auto-generated PR for private repo * change variable names * fix typo in autosync workflow (#2850) * Fix example command when building wheel with dev version specified (#2815) * Create tensor parallelism blog/tutorial (#2766) Co-authored-by: Michael Wyatt <[email protected]> * Data efficiency library update (#2866) * data efficiency library update * data efficiency library update * data efficiency update * data efficiency update * Make z3 respect comm dtype (#2807) * Make z3 respect comm dtype * 
Support fp32 comm dtype * Remove obsolete assert * Code cleanup * Automatic Tensor Parallelism Blog Links (#2877) * Modify table for compatible web format * Add tutorial links to navigation * Add news bit to main readme * Update docs/_tutorials/automatic-tensor-parallelism.md Co-authored-by: Michael Wyatt <[email protected]> --------- Co-authored-by: Michael Wyatt <[email protected]> * Check device count before running dist tests (#2799) * Check device count before running dist tests * fixing format for "Check device count before running dist tests" * Check device count against max world size * Check GPU count before launching dist tests * double-check GPU actually exists --------- Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * AutoTP tutorial web formatting and news (#2883) Co-authored-by: Jeff Rasley <[email protected]> * Remove deprecated `torch._six` imports (#2863) * Remove deprecated `torch._six` imports Closes #2845. * Support older versions of PyTorch as well. 
--------- Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * Reduce I/O size (#2814) * add missing license info to top of all source code (#2889) Co-authored-by: Michael Wyatt <[email protected]> Co-authored-by: Conglong Li <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * Enable tensor fragments for zero 2 & 3 (#2727) * Enable tensor fragments for zero 2 * Update deepspeed/utils/tensor_fragment.py Co-authored-by: Stas Bekman <[email protected]> * Update deepspeed/utils/tensor_fragment.py Co-authored-by: Stas Bekman <[email protected]> * Support offload * Support multi-gpu * Cleanup * WIP * Update deepspeed/runtime/zero/stage3.py Co-authored-by: Stas Bekman <[email protected]> * Support padding * Update deepspeed/runtime/zero/stage3.py Co-authored-by: Stas Bekman <[email protected]> * z3 optimizer state support; aligned api * Support frozen z3 params * Unit tests * Check NVMe offload capability * Formatting * Docs * More docs * More docs * Update docs/code-docs/source/zero3.rst Co-authored-by: Stas Bekman <[email protected]> * More docs * Update docs/code-docs/source/zero3.rst Co-authored-by: Stas Bekman <[email protected]> * More docs * More docs * Update docs/code-docs/source/zero3.rst Co-authored-by: Stas Bekman <[email protected]> * Update deepspeed/utils/tensor_fragment.py Co-authored-by: Stas Bekman <[email protected]> * More docs * Support unsharded fp32 grad * Remove debug prints * Fix off-by-one detection of empty grads * Update deepspeed/utils/tensor_fragment.py Co-authored-by: Stas Bekman <[email protected]> * Update deepspeed/utils/tensor_fragment.py Co-authored-by: Stas Bekman <[email protected]> * Update deepspeed/utils/tensor_fragment.py Co-authored-by: Stas Bekman <[email protected]> * Update deepspeed/runtime/zero/stage3.py Co-authored-by: Stas Bekman <[email protected]> * Fix off-by-one error * Skip ranks with no gradient data * Formatting * Add license * Fix license 
--------- Co-authored-by: Stas Bekman <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * better eval sampler (#2907) Co-authored-by: Olatunji Ruwase <[email protected]> * using container when loading inference checkpoints (#2875) This PR updates the replace_fn function when loading inference checkpoints. The container will now be passed to the load_model_with_checkpoint() so we can call load_params() from there. load_params() is also updated to access the variables in the policy. * Fix CPUAdam for when `vendor_id_raw` is not provided (#2836) * #1213: Fix CPUAdam for when `vendor_id_raw` is not provided * formatting (yapf) fix --------- Co-authored-by: Olatunji Ruwase <[email protected]> * Always convert input mask to half (#2851) * Fixes `AttributeError` in #2853 (#2854) Updates `deepspeed/monitor/monitor.py` to instantiate objects with correct configs Relevant issue: https://github.com/microsoft/DeepSpeed/issues/2853 Co-authored-by: Olatunji Ruwase <[email protected]> * Add MPICH Multinode Runner (#2839) * MPICH support * MPICH changes * MPICH changes * MPICH changes * MPICH changes * accelerator runtime modifications * Accelerator runtime changes * Accelerator runtime modifications * Remove redundant print from single node * Move hostfile to tmp * Code cleanup for MPICH class * Code cleanup, rm whitespace * Removing mpiexec environment check details * Not needed tmp hostfile as pass directly * Remove debugging comments * rm print statement * Revert comm changes as WA not needed * Use MPICHRunner name for class * Use MPICHRunner as class name * No need to use args.force_multi and args.launcher . 
This should be set in deepspeedexamples gpt-3.6b .sh script as: $launcher=MPICH run_cmd=" deepspeed --hostfile=${hostfile_ds} --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --launcher=${launcher} --force_multi pretrain_gpt2.py $@ ${gpt_options}" * Adhere to code pattern * Rm empty lines in MPICHRunner class * Uncomment check for num nodes and workers when used hostfile_deepspeed in gpt-3.6b.sh * pass MPICH hostfile through launcher_args in gpt-3.6b.sh * Clean code and remove args hostfile * fix merge * fix merge --------- Co-authored-by: Abhilash Majumder <[email protected]> * clean up and fix format * add ut --------- Co-authored-by: Abhilash Majumder <[email protected]> Co-authored-by: Ammar Ahmad Awan <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * TP unsupported models and assertions (#2810) Co-authored-by: Jeff Rasley <[email protected]> * AutoTP Assert Kernel Injection Support (#2939) * check kernel injection supported models * Clarify why user should use kernel injection * Check for local CUDA graphs when enable_cuda_graph=True (#2941) * Improve overflow handling (#2944) Co-authored-by: Jeff Rasley <[email protected]> * [RFC] add device abstraction to allow other device than CUDA be used (#2221) Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * deepspeed.init_distributed() support for TCP protocols (#2905) Co-authored-by: Jeff Rasley <[email protected]> * bump to 0.8.3 * bug fix for skipping mbs (#2171) Co-authored-by: Rajhans Samdani <[email protected]> * Fix issue between our abstract accelerator and colossalai's version of op_builder (#2963) Co-authored-by: Logan Adams <[email protected]> * [zero] prevent poor configs from running w. 
zero-offload (#2971) --------- Signed-off-by: Dashiell Stander <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> Co-authored-by: Guanhua Wang <[email protected]> Co-authored-by: Ammar Ahmad Awan <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Sam Ade Jacobs <[email protected]> Co-authored-by: Arash Bakhtiari <[email protected]> Co-authored-by: Connor Holmes <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Saeyeol Lee <[email protected]> Co-authored-by: Saeyeol Lee <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Jean-Louis Queguiner <[email protected]> Co-authored-by: Molly Smith <[email protected]> Co-authored-by: Matt Smith <[email protected]> Co-authored-by: Thomas-MMJ <[email protected]> Co-authored-by: lekurile <[email protected]> Co-authored-by: Lev Kurilenko <[email protected]> Co-authored-by: Molly Smith <[email protected]> Co-authored-by: Lok Chand Koppaka <[email protected]> Co-authored-by: Samyam Rajbhandari <[email protected]> Co-authored-by: Dashiell Stander <[email protected]> Co-authored-by: Dashiell Stander <[email protected]> Co-authored-by: Andrey Chernykh <[email protected]> Co-authored-by: Alexander Jipa <[email protected]> Co-authored-by: Alexander Jipa <[email protected]> Co-authored-by: Joe Mayer <[email protected]> Co-authored-by: Stas Bekman <[email protected]> Co-authored-by: Adam Moody <[email protected]> Co-authored-by: Cheng Li <[email protected]> Co-authored-by: eltonzheng <[email protected]> Co-authored-by: Benjamin Steenhoek <[email protected]> Co-authored-by: Guo Yejun <[email protected]> Co-authored-by: savitamittal1 <[email protected]> Co-authored-by: kyoto7250 <[email protected]> Co-authored-by: Kevin Ko <[email protected]> Co-authored-by: lokoppakmsft <[email protected]> Co-authored-by: iLeGend <[email protected]> Co-authored-by: Alex Hedges <[email 
protected]> Co-authored-by: ShijieZZZZ <[email protected]> Co-authored-by: Ma, Guokai <[email protected]> Co-authored-by: AGUL <[email protected]> Co-authored-by: Jeongseok Kang <[email protected]> Co-authored-by: Hayden <[email protected]> Co-authored-by: Conglong Li <[email protected]> Co-authored-by: Rahil Bathwal <[email protected]> Co-authored-by: Jithun Nair <[email protected]> Co-authored-by: Ikko Ashimine <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> Co-authored-by: li-yi-dong <[email protected]> Co-authored-by: liyidong.lyd <[email protected]> Co-authored-by: JackieWu <[email protected]> Co-authored-by: Xiaoxia (Shirley) Wu <[email protected]> Co-authored-by: cassieesvelt <[email protected]> Co-authored-by: Masahiro Tanaka <[email protected]> Co-authored-by: loadams <[email protected]> Co-authored-by: Nick Sarkauskas <[email protected]> Co-authored-by: Bing Xie <[email protected]> Co-authored-by: Carlos Mocholí <[email protected]> Co-authored-by: swli <[email protected]> Co-authored-by: Martin Cai <[email protected]> Co-authored-by: Razvan Tanase <[email protected]> Co-authored-by: Heyang Qin <[email protected]> Co-authored-by: Yasyf Mohamedali <[email protected]> Co-authored-by: Mayank Mishra <[email protected]> Co-authored-by: Farzan Taj <[email protected]> Co-authored-by: Sam Foreman <[email protected]> Co-authored-by: Abhilash Majumder <[email protected]> Co-authored-by: noabauma <[email protected]> Co-authored-by: Rajhans Samdani <[email protected]>

adammoody requested review from awan-10, cli99, conglongli, eltonzheng, jeffra, minjiaz, niumanar, RezaYazdaniAminabadi, samyam, ShadenSmith and tjruwase as code owners September 30, 2021 19:11

adammoody force-pushed the layerckpt branch from 1ac9895 to 9fbeb42 Compare December 18, 2021 01:07

adammoody changed the title ~~WIP: parallelize layer checkpoints across data parallel instances~~ parallelize writing of layer checkpoint files across data parallel instances Dec 18, 2021

adammoody force-pushed the layerckpt branch from 9fbeb42 to 1cee52d Compare March 31, 2022 21:40

adammoody mentioned this pull request Jul 14, 2022

DeepSpeed checkpoint performance improvements bigscience-workshop/Megatron-DeepSpeed#312

Open

formatting fix

6f8c9d1

Merge branch 'master' into layerckpt

fa99397

Merge branch 'master' into layerckpt

f05dc91

config: add option for parallel write of layer checkpoints in pipeline stage

ed8bc48

adammoody force-pushed the layerckpt branch from e6a45fd to ed8bc48 Compare October 3, 2022 21:09

adammoody added 2 commits October 3, 2022 14:23

yapf fixes

92f6a84

enable parallel layer write according to config param

2dbf0a4

avoid extraneous makedir when rank 0 writes all layers

2f311e9

Merge branch 'master' into layerckpt

27002cf

tjruwase approved these changes Oct 10, 2022

tjruwase added 3 commits October 12, 2022 22:31

Merge branch 'master' into layerckpt

f54324a

Merge branch 'master' into layerckpt

6d5518b

Merge branch 'master' into layerckpt

0e4d92b

tjruwase merged commit b8fb9c3 into deepspeedai:master Oct 21, 2022

adammoody deleted the layerckpt branch October 26, 2022 17:57

tjruwase mentioned this pull request Mar 15, 2023

should DeepSpeedEngine.save_checkpoint be only under main_process #2993

Closed

adammoody commented Sep 30, 2021

adammoody commented Dec 18, 2021 (edited)

adammoody commented Mar 31, 2022

adammoody commented Mar 31, 2022 (edited)

adammoody commented Apr 12, 2022

adammoody commented May 12, 2022

rocm-mici commented Jun 9, 2022

adammoody commented Jul 7, 2022 (edited)

stas00 commented Jul 8, 2022

tjruwase commented Jul 15, 2022

adammoody commented Jul 15, 2022

adammoody commented Sep 19, 2022 (edited)

adammoody commented Sep 19, 2022

adammoody commented Sep 19, 2022

stas00 commented Sep 19, 2022

tjruwase commented Sep 20, 2022

GuanhuaWang commented Sep 21, 2022 (edited)

adammoody commented Sep 21, 2022

adammoody commented Sep 21, 2022

tjruwase commented Sep 21, 2022

adammoody commented Sep 29, 2022

tjruwase commented Sep 30, 2022 (edited)

adammoody commented Oct 3, 2022

adammoody commented Oct 4, 2022

adammoody commented Oct 10, 2022

tjruwase commented Oct 21, 2022

adammoody commented Oct 26, 2022

adammoody commented Oct 26, 2022 (edited)
