
[BUG] Errors in GPT-MoE models Inferences #6973

Open
1155157110 opened this issue Jan 25, 2025 · 0 comments
Assignees
Labels
bug Something isn't working inference

Comments

@1155157110

Describe the bug
Model inference fails for pretrained GPT-MoE models produced by ds_pretrain_gpt_125M_MoE64.sh

To Reproduce
Steps to reproduce the behavior:

  1. Train a GPT-MoE model using the examples referenced in the DeepSpeed MoE tutorial
  2. Run generate_text.sh (with minor adjustments to the paths and model size) to perform model inference

Expected behavior
The model runs in expert-parallel mode and generates output tokens.

ds_report output

DeepSpeed general environment info:
torch install path ............... ['/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch']
torch version .................... 2.5.1+cu121
deepspeed install path ........... ['/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.16.3, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.5, cuda 12.1
shared memory (/dev/shm) size .... 125.88 GB

Screenshots
Error Log:

Context prompt (stop to exit) >>> hi
layer_past = None, get_key_value = True, forward_method_parallel_output = False
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/a6000/Desktop/Megatron-DeepSpeed/tools/generate_samples_gpt.py", line 222, in <module>
[rank1]:     main()
[rank1]:   File "/home/a6000/Desktop/Megatron-DeepSpeed/tools/generate_samples_gpt.py", line 162, in main
[rank1]:     generate_samples_interactive(model)
[rank1]:   File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 291, in generate_samples_interactive
[rank1]:     for counter, decode_tokens in enumerate(token_stream):
[rank1]:   File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 422, in get_token_stream
[rank1]:     for tokens, lengths in batch_token_iterator:
[rank1]:   File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 543, in sample_sequence_batch
[rank1]:     output, layer_past = forward_step(model, tokens2use,
[rank1]:   File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 467, in forward_step
[rank1]:     output_tensor = model(tokens, position_ids, attention_mask,
[rank1]:   File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/model/distributed.py", line 58, in forward
[rank1]:     return self.module(*inputs, **kwargs)
[rank1]:   File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/model/module.py", line 191, in forward
[rank1]:     outputs = self.module(*inputs, **kwargs)
[rank1]:   File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]: TypeError: GPTModel.forward() got an unexpected keyword argument 'layer_past'
[rank1]:[W125 14:48:27.674636513 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
[2025-01-25 14:48:30,771] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2188779
[2025-01-25 14:48:31,117] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2188780
[2025-01-25 14:48:31,118] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2188781
[2025-01-25 14:48:31,704] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2188782
[2025-01-25 14:48:32,330] [ERROR] [launch.py:325:sigkill_handler] ['/home/a6000/miniconda3/envs/DeepSpeed/bin/python', '-u', '/home/a6000/Desktop/Megatron-DeepSpeed/tools/generate_samples_gpt.py', '--local_rank=3', '--tensor-model-parallel-size', '1', '--num-layers', '12', '--hidden-size', '768', '--num-attention-heads', '12', '--max-position-embeddings', '2048', '--tokenizer-type', 'GPT2BPETokenizer', '--fp16', '--num-experts', '64', '64', '64', '64', '64', '64', '--mlp-type', 'standard', '--micro-batch-size', '4', '--seq-length', '2048', '--out-seq-length', '1024', '--temperature', '1.0', '--vocab-file', '/home/a6000/Desktop/Megatron-DeepSpeed/examples_deepspeed/MoE/data/gpt2-vocab.json', '--merge-file', '/home/a6000/Desktop/Megatron-DeepSpeed/examples_deepspeed/MoE/data/gpt2-merges.txt', '--genfile', 'unconditional_samples.json', '--top_p', '0.9', '--log-interval', '1', '--num-samples', '0', '--load', '/home/a6000/Desktop/Megatron-DeepSpeed/examples_deepspeed/MoE/output/checkpoint/gpt-0.125B-lr-4.5e-4-minlr-4.5e-06-bs-256-gpus-4-mp-1-pp-1-ep-64-mlc-0.01-cap-1.0-drop-true'] exits with return code = 1
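The traceback shows that `forward_step` in `text_generation_utils.py` passes `layer_past` (and `get_key_value`) keyword arguments that this `GPTModel.forward()` does not define. As a temporary diagnostic workaround (not a fix for the underlying API mismatch), one could filter out keyword arguments the model's `forward()` signature does not accept before calling it. The helper below is hypothetical and not part of Megatron-DeepSpeed:

```python
import inspect

def call_forward_with_supported_kwargs(model, *args, **kwargs):
    """Call model.forward(), dropping kwargs its signature does not accept.

    Hypothetical workaround sketch: the generation path passes `layer_past`
    and `get_key_value`, which this GPTModel.forward() does not define.
    """
    params = inspect.signature(model.forward).parameters
    # If forward() accepts **kwargs, pass everything through unchanged.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return model.forward(*args, **kwargs)
    # Otherwise keep only keyword arguments that forward() actually declares.
    supported = {k: v for k, v in kwargs.items() if k in params}
    return model.forward(*args, **supported)
```

Note that silently dropping `layer_past` disables KV caching during generation, so this only confirms where the signature mismatch lies; the proper fix is aligning the generation utilities with the model's `forward()` signature.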

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: one machine with 4 A6000 GPUs
  • deepspeed 0.16.3
  • Hugging Face Transformers 4.48.1
  • accelerate 1.3.0
  • python 3.10.16

Other information
deepspeedai/Megatron-DeepSpeed#458

@1155157110 1155157110 added bug Something isn't working inference labels Jan 25, 2025
@hwchen2017 hwchen2017 self-assigned this Jan 25, 2025