
'numpy.ndarray' object is not callable in gpt2/1/lib/triton_decoder.py, line 160, in convert_triton_request #702

@freedom-168

Description

System Info

Docker Image: nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
CPU: x86_64
GPU: H100
The container also includes the following:
Ubuntu 24.04 (including Python 3.12)
NVIDIA CUDA 12.6.3
NVIDIA cuBLAS 12.6.4.1
cuDNN 9.6.0.74
NVIDIA NCCL 2.23.4
NVIDIA TensorRT 10.7.0.23
OpenUCX 1.15.0
GDRCopy 2.4.1
NVIDIA HPC-X 2.21
OpenMPI 4.1.7
nvImageCodec 0.2.0.7
ONNX Runtime 1.20.1
Intel OpenVINO
DCGM 3.3.6
TensorRT-LLM release/0.16.0
vLLM 0.5.5

Who can help?

After the Triton server launched successfully, I checked its status with triton status, which reported that the server was running and ready.

I then sent the following two requests:

1. triton infer -m gpt2 --prompt hello -i grpc -u localhost -p 8001
2. genai-perf profile -m gpt2 --service-kind triton --backend tensorrtllm --num-prompts 1000 --random-seed 123 --synthetic-input-tokens-mean 1000 --synthetic-input-tokens-stddev 0 --output-tokens-mean 512 --output-tokens-stddev 0 --output-tokens-mean-deterministic --tokenizer /root/models/gpt2/tokenizer --concurrency 16 --measurement-interval 8000 --profile-export-file my_profile_export.json --url localhost:8001

Both requests always returned the error message shown under "Actual behavior" below.
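In case it helps with reproduction, the first request can also be issued directly from Python with tritonclient; this is only a minimal sketch, and the tensor names text_input / text_output follow the usual TensorRT-LLM ensemble convention, which I am assuming here rather than taking from my model config:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Connect to the gRPC endpoint the server was launched on.
client = grpcclient.InferenceServerClient(url="localhost:8001")
assert client.is_server_ready()

# "text_input" is the conventional TensorRT-LLM ensemble input name
# (an assumption); the prompt is sent as a single BYTES element.
inp = grpcclient.InferInput("text_input", [1], "BYTES")
inp.set_data_from_numpy(np.array(["hello".encode("utf-8")], dtype=object))

result = client.infer(model_name="gpt2", inputs=[inp])
print(result.as_numpy("text_output"))  # assumed output tensor name
```

If the ensemble uses different tensor names, they can be listed with client.get_model_metadata("gpt2").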

Can anyone help with this?

Thanks/Gavin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Send the following request:

```
genai-perf profile -m gpt2 --service-kind triton --backend tensorrtllm --num-prompts 1000 --random-seed 123 --synthetic-input-tokens-mean 1000 --synthetic-input-tokens-stddev 0 --output-tokens-mean 512 --output-tokens-stddev 0 --output-tokens-mean-deterministic --tokenizer /root/models/gpt2/tokenizer --concurrency 16 --measurement-interval 8000 --profile-export-file my_profile_export.json --url localhost:8001
```
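Once a run completes, the export can be sanity-checked from Python; the file name comes from --profile-export-file above, while the JSON layout itself is an assumption here:

```python
import json

# Quick sanity check of the export after a successful run. The file name
# matches --profile-export-file above; the JSON layout is an assumption,
# so adapt the keys to whatever genai-perf actually writes.
with open("my_profile_export.json") as f:
    profile = json.load(f)

print(list(profile.keys()))
```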

Expected behavior

The output should look like this:

```
                                            LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Statistic              ┃         avg ┃        min ┃         max ┃         p99 ┃         p90 ┃         p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Request latency (ns)   │ 296,990,497 │ 43,312,449 │ 332,788,242 │ 327,475,292 │ 317,392,767 │ 310,343,333 │
│ Output sequence length │         109 │         11 │         158 │         142 │         118 │         113 │
│ Input sequence length  │           1 │          1 │           1 │           1 │           1 │           1 │
└────────────────────────┴─────────────┴────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
Output token throughput (per sec): 366.78
Request throughput (per sec): 3.37
```

Actual behavior

```
E0212 21:46:42.323909 655 model.py:120] Traceback (most recent call last):
  File "/root/models/gpt2/1/model.py", line 88, in execute
    req = self.decoder.convert_triton_request(request)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/models/gpt2/1/lib/triton_decoder.py", line 160, in convert_triton_request
    request = Request()
              ^^^^^^^^^
  File "<string>", line 3, in __init__
TypeError: 'numpy.ndarray' object is not callable

triton - ERROR - Traceback (most recent call last):
  File "/root/models/gpt2/1/model.py", line 88, in execute
    req = self.decoder.convert_triton_request(request)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/models/gpt2/1/lib/triton_decoder.py", line 160, in convert_triton_request
    request = Request()
              ^^^^^^^^^
  File "<string>", line 3, in __init__
TypeError: 'numpy.ndarray' object is not callable

triton - ERROR - Unexpected error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/triton_cli/main.py", line 51, in main
    run()
  File "/usr/local/lib/python3.12/dist-packages/triton_cli/main.py", line 45, in run
    args.func(args)
  File "/usr/local/lib/python3.12/dist-packages/triton_cli/parser.py", line 363, in handle_infer
    client.infer(model=args.model, prompt=args.prompt)
  File "/usr/local/lib/python3.12/dist-packages/triton_cli/client/client.py", line 217, in infer
    self.__async_infer(model, inputs)
  File "/usr/local/lib/python3.12/dist-packages/triton_cli/client/client.py", line 221, in __async_infer
    self.__grpc_async_infer(model, inputs)
  File "/usr/local/lib/python3.12/dist-packages/triton_cli/client/client.py", line 273, in __grpc_async_infer
    raise result
tritonclient.utils.InferenceServerException: Traceback (most recent call last):
  File "/root/models/gpt2/1/model.py", line 88, in execute
    req = self.decoder.convert_triton_request(request)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/models/gpt2/1/lib/triton_decoder.py", line 160, in convert_triton_request
    request = Request()
              ^^^^^^^^^
  File "<string>", line 3, in __init__
TypeError: 'numpy.ndarray' object is not callable
```
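The innermost frame is a dataclass-generated __init__ (the File "<string>" frame), which points at the defaults of the Request dataclass rather than at the caller. As a hypothesis only, not a confirmed diagnosis of triton_decoder.py: handing default_factory an ndarray instance instead of a callable reproduces the exact error shape:

```python
from dataclasses import dataclass, field

import numpy as np

# Hypothetical reproduction: default_factory must be a zero-argument
# callable, but here it is handed an ndarray instance instead.
@dataclass
class Request:
    text_input: np.ndarray = field(default_factory=np.array([]))

# The generated __init__ invokes the "factory"; an array is not callable:
Request()
#   File "<string>", line 3, in __init__
# TypeError: 'numpy.ndarray' object is not callable
```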

Additional notes

I checked triton_decoder.py in tensorrtllm_backend/inflight_batcher_llm; it has the same code as the gpt2 model repository.
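If the hypothesis above holds, the usual fix is to wrap the array default in a lambda so that default_factory receives a real callable; this is a sketch of that pattern with an illustrative field name, not a verified patch for triton_decoder.py:

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class Request:
    # Wrapping the array in a lambda makes the factory callable and gives
    # each instance a fresh array instead of a shared mutable default.
    text_input: np.ndarray = field(default_factory=lambda: np.array([]))

req = Request()        # no TypeError
print(req.text_input)  # []
```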
