
[BUG] Errors in GPT-MoE models Inferences #6973

Open
1155157110 opened this issue Jan 25, 2025 · 0 comments
Assignees
Labels
bug Something isn't working inference

Comments

@1155157110

Describe the bug
Model inference fails for pretrained GPT-MoE models produced by ds_pretrain_gpt_125M_MoE64.sh

To Reproduce
Steps to reproduce the behavior:

  1. Train a GPT-MoE model using the examples referenced in the DeepSpeed MoE tutorial
  2. Run generate_text.sh (with minor adjustments to the paths and model size) to perform model inference

Expected behavior
The model runs in expert-parallel mode and generates output tokens.

ds_report output

DeepSpeed general environment info:
torch install path ............... ['/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch']
torch version .................... 2.5.1+cu121
deepspeed install path ........... ['/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.16.3, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.5, cuda 12.1
shared memory (/dev/shm) size .... 125.88 GB

Screenshots
Error Log:

Context prompt (stop to exit) >>> hi
layer_past = None, get_key_value = True, forward_method_parallel_output = False
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/a6000/Desktop/Megatron-DeepSpeed/tools/generate_samples_gpt.py", line 222, in <module>
[rank1]:     main()
[rank1]:   File "/home/a6000/Desktop/Megatron-DeepSpeed/tools/generate_samples_gpt.py", line 162, in main
[rank1]:     generate_samples_interactive(model)
[rank1]:   File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 291, in generate_samples_interactive
[rank1]:     for counter, decode_tokens in enumerate(token_stream):
[rank1]:   File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 422, in get_token_stream
[rank1]:     for tokens, lengths in batch_token_iterator:
[rank1]:   File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 543, in sample_sequence_batch
[rank1]:     output, layer_past = forward_step(model, tokens2use,
[rank1]:   File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 467, in forward_step
[rank1]:     output_tensor = model(tokens, position_ids, attention_mask,
[rank1]:   File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/model/distributed.py", line 58, in forward
[rank1]:     return self.module(*inputs, **kwargs)
[rank1]:   File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/model/module.py", line 191, in forward
[rank1]:     outputs = self.module(*inputs, **kwargs)
[rank1]:   File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]: TypeError: GPTModel.forward() got an unexpected keyword argument 'layer_past'
[rank1]:[W125 14:48:27.674636513 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
[2025-01-25 14:48:30,771] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2188779
[2025-01-25 14:48:31,117] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2188780
[2025-01-25 14:48:31,118] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2188781
[2025-01-25 14:48:31,704] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2188782
[2025-01-25 14:48:32,330] [ERROR] [launch.py:325:sigkill_handler] ['/home/a6000/miniconda3/envs/DeepSpeed/bin/python', '-u', '/home/a6000/Desktop/Megatron-DeepSpeed/tools/generate_samples_gpt.py', '--local_rank=3', '--tensor-model-parallel-size', '1', '--num-layers', '12', '--hidden-size', '768', '--num-attention-heads', '12', '--max-position-embeddings', '2048', '--tokenizer-type', 'GPT2BPETokenizer', '--fp16', '--num-experts', '64', '64', '64', '64', '64', '64', '--mlp-type', 'standard', '--micro-batch-size', '4', '--seq-length', '2048', '--out-seq-length', '1024', '--temperature', '1.0', '--vocab-file', '/home/a6000/Desktop/Megatron-DeepSpeed/examples_deepspeed/MoE/data/gpt2-vocab.json', '--merge-file', '/home/a6000/Desktop/Megatron-DeepSpeed/examples_deepspeed/MoE/data/gpt2-merges.txt', '--genfile', 'unconditional_samples.json', '--top_p', '0.9', '--log-interval', '1', '--num-samples', '0', '--load', '/home/a6000/Desktop/Megatron-DeepSpeed/examples_deepspeed/MoE/output/checkpoint/gpt-0.125B-lr-4.5e-4-minlr-4.5e-06-bs-256-gpus-4-mp-1-pp-1-ep-64-mlc-0.01-cap-1.0-drop-true'] exits with return code = 1
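The traceback shows that `forward_step` in `text_generation_utils.py` passes `layer_past` (and `get_key_value`) keyword arguments that this `GPTModel.forward()` does not define. As a temporary diagnostic workaround (not a fix for the underlying API mismatch), one could filter out keyword arguments the model's `forward()` signature does not accept before calling it. The helper below is hypothetical and not part of Megatron-DeepSpeed:

```python
import inspect

def call_forward_with_supported_kwargs(model, *args, **kwargs):
    """Call model.forward(), dropping kwargs its signature does not accept.

    Hypothetical workaround sketch: the generation path passes `layer_past`
    and `get_key_value`, which this GPTModel.forward() does not define.
    """
    params = inspect.signature(model.forward).parameters
    # If forward() accepts **kwargs, pass everything through unchanged.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return model.forward(*args, **kwargs)
    # Otherwise keep only keyword arguments that forward() actually declares.
    supported = {k: v for k, v in kwargs.items() if k in params}
    return model.forward(*args, **supported)
```

Note that silently dropping `layer_past` disables KV caching during generation, so this only confirms where the signature mismatch lies; the proper fix is aligning the generation utilities with the model's `forward()` signature.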

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: one machine with 4 A6000 GPUs
  • deepspeed 0.16.3
  • Hugging Face Transformers 4.48.1
  • accelerate 1.3.0
  • python 3.10.16

Other information
deepspeedai/Megatron-DeepSpeed#458

@1155157110 1155157110 added bug Something isn't working inference labels Jan 25, 2025
@hwchen2017 hwchen2017 self-assigned this Jan 25, 2025