Describe the bug
Could not do model inference for pretrained GPT-MoE models produced by ds_pretrain_gpt_125M_MoE64.sh
To Reproduce
Steps to reproduce the behavior:
Run generate_text.sh (with some minor adjustments to the paths and model size) to do model inference.
Expected behavior
Model runs with expert parallelism and generates output tokens
ds_report output
DeepSpeed general environment info:
torch install path ............... ['/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch']
torch version .................... 2.5.1+cu121
deepspeed install path ........... ['/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.16.3, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.5, cuda 12.1
shared memory (/dev/shm) size .... 125.88 GB
Screenshots
Error Log:
Context prompt (stop to exit) >>> hi
layer_past = None, get_key_value = True, forward_method_parallel_output = False
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/a6000/Desktop/Megatron-DeepSpeed/tools/generate_samples_gpt.py", line 222, in <module>
[rank1]: main()
[rank1]: File "/home/a6000/Desktop/Megatron-DeepSpeed/tools/generate_samples_gpt.py", line 162, in main
[rank1]: generate_samples_interactive(model)
[rank1]: File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 291, in generate_samples_interactive
[rank1]: for counter, decode_tokens in enumerate(token_stream):
[rank1]: File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 422, in get_token_stream
[rank1]: for tokens, lengths in batch_token_iterator:
[rank1]: File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 543, in sample_sequence_batch
[rank1]: output, layer_past = forward_step(model, tokens2use,
[rank1]: File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 467, in forward_step
[rank1]: output_tensor = model(tokens, position_ids, attention_mask,
[rank1]: File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/model/distributed.py", line 58, in forward
[rank1]: return self.module(*inputs, **kwargs)
[rank1]: File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/home/a6000/Desktop/Megatron-DeepSpeed/megatron/model/module.py", line 191, in forward
[rank1]: outputs = self.module(*inputs, **kwargs)
[rank1]: File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/home/a6000/miniconda3/envs/DeepSpeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: TypeError: GPTModel.forward() got an unexpected keyword argument 'layer_past'
[rank1]:[W125 14:48:27.674636513 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[2025-01-25 14:48:30,771] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2188779
[2025-01-25 14:48:31,117] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2188780
[2025-01-25 14:48:31,118] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2188781
[2025-01-25 14:48:31,704] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2188782
[2025-01-25 14:48:32,330] [ERROR] [launch.py:325:sigkill_handler] ['/home/a6000/miniconda3/envs/DeepSpeed/bin/python', '-u', '/home/a6000/Desktop/Megatron-DeepSpeed/tools/generate_samples_gpt.py', '--local_rank=3', '--tensor-model-parallel-size', '1', '--num-layers', '12', '--hidden-size', '768', '--num-attention-heads', '12', '--max-position-embeddings', '2048', '--tokenizer-type', 'GPT2BPETokenizer', '--fp16', '--num-experts', '64', '64', '64', '64', '64', '64', '--mlp-type', 'standard', '--micro-batch-size', '4', '--seq-length', '2048', '--out-seq-length', '1024', '--temperature', '1.0', '--vocab-file', '/home/a6000/Desktop/Megatron-DeepSpeed/examples_deepspeed/MoE/data/gpt2-vocab.json', '--merge-file', '/home/a6000/Desktop/Megatron-DeepSpeed/examples_deepspeed/MoE/data/gpt2-merges.txt', '--genfile', 'unconditional_samples.json', '--top_p', '0.9', '--log-interval', '1', '--num-samples', '0', '--load', '/home/a6000/Desktop/Megatron-DeepSpeed/examples_deepspeed/MoE/output/checkpoint/gpt-0.125B-lr-4.5e-4-minlr-4.5e-06-bs-256-gpus-4-mp-1-pp-1-ep-64-mlc-0.01-cap-1.0-drop-true'] exits with return code = 1
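For reference, the failure appears to come from megatron/text_generation_utils.py passing layer_past and get_key_value keyword arguments that GPTModel.forward() does not declare in this code path. Below is a minimal diagnostic sketch (a hypothetical helper, not part of the repo) that prints which keyword arguments the loaded model's forward actually accepts; the unwrapping loop assumes the DistributedDataParallel/Float16 style wrappers visible in the traceback, each of which exposes the wrapped model as .module:

```python
import inspect

def report_forward_signature(model):
    """Print the parameter names of the underlying GPTModel.forward."""
    # Unwrap DistributedDataParallel / fp16 wrappers seen in the traceback.
    unwrapped = model
    while hasattr(unwrapped, "module"):
        unwrapped = unwrapped.module
    params = list(inspect.signature(unwrapped.forward).parameters)
    print(f"{type(unwrapped).__name__}.forward accepts: {params}")
    # If 'layer_past' and 'get_key_value' are missing from this list, the call in
    # text_generation_utils.forward_step will raise the TypeError shown above.
```

Calling this right after the model is built in tools/generate_samples_gpt.py should confirm whether the mismatch is in the model class or in the generation utility.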
System info (please complete the following information):
OS: Ubuntu 20.04
GPU count and types: one machine with 4 A6000 GPUs
Other information
deepspeedai/Megatron-DeepSpeed#458