latency is high #29

ben-8878 · 2024-11-27T09:47:12Z

I am using the Docker image "ghcr.io/coqui-ai/xtts-streaming-server". When I send a POST request, I get the following time to first chunk:

Time to make POST: 0.18376178992912173s
Time to first chunk: 0.8716433839872479s

When I use local inference instead, I get:
first chunk: 0.2440338134765625s
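
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of timing the first chunk of a streamed body. It is not the script used for the numbers above. `time_to_first_chunk` and `fake_stream` are hypothetical names, and the fake generator only stands in for a real streaming response.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def time_to_first_chunk(chunks: Iterable[bytes]) -> Tuple[Optional[float], Optional[bytes]]:
    """Return (seconds until the first non-empty chunk, that chunk).

    Returns (None, None) if the stream ends without yielding any data.
    """
    start = time.perf_counter()
    for chunk in chunks:
        if chunk:  # skip empty chunks
            return time.perf_counter() - start, chunk
    return None, None


def fake_stream(delay_s: float) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming response body."""
    time.sleep(delay_s)  # simulates server-side time before the first audio chunk
    yield b"\x00\x01"
    yield b"\x02\x03"


if __name__ == "__main__":
    elapsed, first = time_to_first_chunk(fake_stream(0.05))
    print(f"Time to first chunk: {elapsed:.3f}s ({len(first)} bytes)")
```

With the `requests` library, the iterable could be something like `response.iter_content(chunk_size=512)` on a response opened with `stream=True`. The timer here starts when iteration begins, so the time spent making the POST itself would have to be measured separately.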

VinMing · 2024-11-27T09:47:42Z

Received.

ben-8878 · 2024-11-27T09:58:03Z

When using FastAPI, deepspeed=True does not take effect. I modified the code to force deepspeed=True, and it then raises the following error.

    |   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    |     return forward_call(*args, **kwargs)
    |   File "/app/TTS/tts/layers/xtts/gpt_inference.py", line 97, in forward
    |     transformer_outputs = self.transformer(
    |   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    |     return forward_call(*args, **kwargs)
    |   File "/opt/conda/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 888, in forward
    |     outputs = block(
    |   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    |     return forward_call(*args, **kwargs)
    |   File "/opt/conda/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 171, in forward
    |     self.attention(input,
    |   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    |     return forward_call(*args, **kwargs)
    |   File "/opt/conda/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 171, in forward
    |     self.attention(input,
    |   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    |     return forward_call(*args, **kwargs)
    |   File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 152, in forward
    |     qkv_out = self.qkv_func(input=input,
    |   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    |     return forward_call(*args, **kwargs)
    |   File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/op_binding/qkv_gemm.py", line 82, in forward
    |     output, norm = self.qkv_gemm_func(input, weight, q_scale, bias, gamma, beta, self.config.epsilon, add_bias,
    | RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.
    +------------------------------------
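
The RuntimeError above suggests that some tensor reaching the DeepSpeed kernel may still live in host (CPU) memory. One possible way to narrow this down, as a sketch only, is to list model parameters that are not on a CUDA device. `params_off_gpu` is a hypothetical helper, not part of TTS or DeepSpeed, and the fake parameters below exist only so the example runs without a GPU.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def params_off_gpu(named_params: Iterable[Tuple[str, object]]) -> List[str]:
    """Return the names of parameters whose device type is not 'cuda'.

    Expects (name, tensor-like) pairs where each tensor has `.device.type`,
    as yielded by `torch.nn.Module.named_parameters()`.
    """
    return [name for name, p in named_params if p.device.type != "cuda"]


def fake_param(device_type: str) -> SimpleNamespace:
    """Hypothetical stand-in for a tensor; only mimics `.device.type`."""
    return SimpleNamespace(device=SimpleNamespace(type=device_type))


if __name__ == "__main__":
    fake_named_params = [
        ("gpt.wte.weight", fake_param("cuda")),
        ("gpt.h.0.attn.qkv.weight", fake_param("cpu")),  # illustrative name only
    ]
    print(params_off_gpu(fake_named_params))
```

With a real PyTorch model, the call would be `params_off_gpu(model.named_parameters())`, and the same check can be applied to `model.named_buffers()`. An empty result would point toward the input tensors rather than the weights.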
