Unsupported model type xlm-roberta #3020

Open
2 of 4 tasks
elvizlai opened this issue Feb 13, 2025 · 0 comments

System Info

docker deploy

$ nvidia-smi
Thu Feb 13 23:44:10 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:4F:00.0 Off |                    0 |
| N/A   63C    P0            297W /  300W |   41431MiB /  81920MiB |     98%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:52:00.0 Off |                    0 |
| N/A   62C    P0            155W /  300W |   41437MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe          Off |   00000000:56:00.0 Off |                    0 |
| N/A   65C    P0            165W /  300W |   41397MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe          Off |   00000000:57:00.0 Off |                    0 |
| N/A   35C    P0             48W /  300W |      14MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA L40S                    Off |   00000000:CE:00.0 Off |                    0 |
| N/A   37C    P0             95W /  350W |    2266MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA L40S                    Off |   00000000:D1:00.0 Off |                    0 |
| N/A   36C    P0             91W /  350W |     876MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA L40S                    Off |   00000000:D5:00.0 Off |                    0 |
| N/A   38C    P0             97W /  350W |   19149MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA L40S                    Off |   00000000:D6:00.0 Off |                    0 |
| N/A   38C    P0             96W /  350W |   19187MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

2025-02-13T15:38:51.005264Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-02-13T15:38:52.028582Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 10, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 323, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 743, in main
    return _main(
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 198, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
    return callback(**use_params)
  File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
    server.serve(
  File "/usr/src/server/text_generation_server/server.py", line 315, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
> File "/usr/src/server/text_generation_server/server.py", line 268, in serve_inner
    model = get_model_with_lora_adapters(
  File "/usr/src/server/text_generation_server/models/__init__.py", line 1542, in get_model_with_lora_adapters
    model = get_model(
  File "/usr/src/server/text_generation_server/models/__init__.py", line 1523, in get_model
    raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type xlm-roberta
2025-02-13T15:38:53.307253Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

2025-02-13 15:38:42.621 | INFO     | text_generation_server.utils.import_utils:<module>:80 - Detected system cuda
/usr/src/server/text_generation_server/layers/gptq/triton.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd(cast_inputs=torch.float16)
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /usr/src/server/text_generation_server/cli.py:119 in serve                   │
│                                                                              │
│   116 │   │   raise RuntimeError(                                            │
│   117 │   │   │   "Only 1 can be set between `dtype` and `quantize`, as they │
│   118 │   │   )                                                              │
│ ❱ 119 │   server.serve(                                                      │
│   120 │   │   model_id,                                                      │
│   121 │   │   lora_adapters,                                                 │
│   122 │   │   revision,                                                      │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │             dtype = None                                                 │ │
│ │       json_output = True                                                 │ │
│ │    kv_cache_dtype = None                                                 │ │
│ │      logger_level = 'INFO'                                               │ │
│ │     lora_adapters = []                                                   │ │
│ │  max_input_tokens = None                                                 │ │
│ │          model_id = 'BAAI/bge-m3'                                        │ │
│ │     otlp_endpoint = None                                                 │ │
│ │ otlp_service_name = 'text-generation-inference.router'                   │ │
│ │          quantize = None                                                 │ │
│ │          revision = None                                                 │ │
│ │            server = <module 'text_generation_server.server' from         │ │
│ │                     '/usr/src/server/text_generation_server/server.py'>  │ │
│ │           sharded = False                                                │ │
│ │         speculate = None                                                 │ │
│ │ trust_remote_code = False                                                │ │
│ │          uds_path = PosixPath('/tmp/text-generation-server')             │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /usr/src/server/text_generation_server/server.py:315 in serve                │
│                                                                              │
│   312 │   │   while signal_handler.KEEP_PROCESSING:                          │
│   313 │   │   │   await asyncio.sleep(0.5)                                   │
│   314 │                                                                      │
│ ❱ 315 │   asyncio.run(                                                       │
│   316 │   │   serve_inner(                                                   │
│   317 │   │   │   model_id,                                                  │
│   318 │   │   │   lora_adapters,                                             │
│                                                                              │
│ ╭─────────────────────────── locals ───────────────────────────╮             │
│ │             dtype = None                                     │             │
│ │    kv_cache_dtype = None                                     │             │
│ │     lora_adapters = []                                       │             │
│ │  max_input_tokens = None                                     │             │
│ │          model_id = 'BAAI/bge-m3'                            │             │
│ │          quantize = None                                     │             │
│ │          revision = None                                     │             │
│ │           sharded = False                                    │             │
│ │         speculate = None                                     │             │
│ │ trust_remote_code = False                                    │             │
│ │          uds_path = PosixPath('/tmp/text-generation-server') │             │
│ ╰──────────────────────────────────────────────────────────────╯             │
│                                                                              │
│ /opt/conda/lib/python3.11/asyncio/runners.py:190 in run                      │
│                                                                              │
│   187 │   │   │   "asyncio.run() cannot be called from a running event loop" │
│   188 │                                                                      │
│   189 │   with Runner(debug=debug) as runner:                                │
│ ❱ 190 │   │   return runner.run(main)                                        │
│   191                                                                        │
│   192                                                                        │
│   193 def _cancel_all_tasks(loop):                                           │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │  debug = None                                                            │ │
│ │   main = <coroutine object serve.<locals>.serve_inner at 0x7f8b0b7f1480> │ │
│ │ runner = <asyncio.runners.Runner object at 0x7f8b09e03890>               │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /opt/conda/lib/python3.11/asyncio/runners.py:118 in run                      │
│                                                                              │
│   115 │   │                                                                  │
│   116 │   │   self._interrupt_count = 0                                      │
│   117 │   │   try:                                                           │
│ ❱ 118 │   │   │   return self._loop.run_until_complete(task)                 │
│   119 │   │   except exceptions.CancelledError:                              │
│   120 │   │   │   if self._interrupt_count > 0:                              │
│   121 │   │   │   │   uncancel = getattr(task, "uncancel", None)             │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │        context = <_contextvars.Context object at 0x7f8b0a25d0c0>         │ │
│ │           coro = <coroutine object serve.<locals>.serve_inner at         │ │
│ │                  0x7f8b0b7f1480>                                         │ │
│ │           self = <asyncio.runners.Runner object at 0x7f8b09e03890>       │ │
│ │ sigint_handler = functools.partial(<bound method Runner._on_sigint of    │ │
│ │                  <asyncio.runners.Runner object at 0x7f8b09e03890>>,     │ │
│ │                  main_task=<Task finished name='Task-1'                  │ │
│ │                  coro=<serve.<locals>.serve_inner() done, defined at     │ │
│ │                  /usr/src/server/text_generation_server/server.py:244>   │ │
│ │                  exception=ValueError('Unsupported model type            │ │
│ │                  xlm-roberta')>)                                         │ │
│ │           task = <Task finished name='Task-1'                            │ │
│ │                  coro=<serve.<locals>.serve_inner() done, defined at     │ │
│ │                  /usr/src/server/text_generation_server/server.py:244>   │ │
│ │                  exception=ValueError('Unsupported model type            │ │
│ │                  xlm-roberta')>                                          │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /opt/conda/lib/python3.11/asyncio/base_events.py:654 in run_until_complete   │
│                                                                              │
│    651 │   │   if not future.done():                                         │
│    652 │   │   │   raise RuntimeError('Event loop stopped before Future comp │
│    653 │   │                                                                 │
│ ❱  654 │   │   return future.result()                                        │
│    655 │                                                                     │
│    656 │   def stop(self):                                                   │
│    657 │   │   """Stop running the event loop.                               │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │   future = <Task finished name='Task-1'                                  │ │
│ │            coro=<serve.<locals>.serve_inner() done, defined at           │ │
│ │            /usr/src/server/text_generation_server/server.py:244>         │ │
│ │            exception=ValueError('Unsupported model type xlm-roberta')>   │ │
│ │ new_task = False                                                         │ │
│ │     self = <_UnixSelectorEventLoop running=False closed=True             │ │
│ │            debug=False>                                                  │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /usr/src/server/text_generation_server/server.py:268 in serve_inner          │
│                                                                              │
│   265 │   │   │   server_urls = [local_url]                                  │
│   266 │   │                                                                  │
│   267 │   │   try:                                                           │
│ ❱ 268 │   │   │   model = get_model_with_lora_adapters(                      │
│   269 │   │   │   │   model_id,                                              │
│   270 │   │   │   │   lora_adapters,                                         │
│   271 │   │   │   │   revision,                                              │
│                                                                              │
│ ╭──────────────────────────── locals ─────────────────────────────╮          │
│ │     adapter_to_index = {}                                       │          │
│ │                dtype = None                                     │          │
│ │       kv_cache_dtype = None                                     │          │
│ │            local_url = 'unix:///tmp/text-generation-server-0'   │          │
│ │        lora_adapters = []                                       │          │
│ │     max_input_tokens = None                                     │          │
│ │             model_id = 'BAAI/bge-m3'                            │          │
│ │             quantize = None                                     │          │
│ │             revision = None                                     │          │
│ │          server_urls = ['unix:///tmp/text-generation-server-0'] │          │
│ │              sharded = False                                    │          │
│ │            speculate = None                                     │          │
│ │    trust_remote_code = False                                    │          │
│ │             uds_path = PosixPath('/tmp/text-generation-server') │          │
│ │ unix_socket_template = 'unix://{}-{}'                           │          │
│ ╰─────────────────────────────────────────────────────────────────╯          │
│                                                                              │
│ /usr/src/server/text_generation_server/models/__init__.py:1542 in            │
│ get_model_with_lora_adapters                                                 │
│                                                                              │
│   1539 │   adapter_to_index: Dict[str, int],                                 │
│   1540 ):                                                                    │
│   1541 │   lora_adapter_ids = [adapter.id for adapter in lora_adapters]      │
│ ❱ 1542 │   model = get_model(                                                │
│   1543 │   │   model_id,                                                     │
│   1544 │   │   lora_adapter_ids,                                             │
│   1545 │   │   revision,                                                     │
│                                                                              │
│ ╭───────────── locals ──────────────╮                                        │
│ │  adapter_to_index = {}            │                                        │
│ │             dtype = None          │                                        │
│ │    kv_cache_dtype = None          │                                        │
│ │  lora_adapter_ids = []            │                                        │
│ │     lora_adapters = []            │                                        │
│ │  max_input_tokens = None          │                                        │
│ │          model_id = 'BAAI/bge-m3' │                                        │
│ │          quantize = None          │                                        │
│ │          revision = None          │                                        │
│ │           sharded = False         │                                        │
│ │         speculate = None          │                                        │
│ │ trust_remote_code = False         │                                        │
│ ╰───────────────────────────────────╯                                        │
│                                                                              │
│ /usr/src/server/text_generation_server/models/__init__.py:1523 in get_model  │
│                                                                              │
│   1520 │   │   │   │   trust_remote_code=trust_remote_code,                  │
│   1521 │   │   │   )                                                         │
│   1522 │                                                                     │
│ ❱ 1523 │   raise ValueError(f"Unsupported model type {model_type}")          │
│   1524                                                                       │
│   1525                                                                       │
│   1526 # get_model_with_lora_adapters wraps the internal get_model function  │
│                                                                              │
│ ╭─────────────────────────────── locals ────────────────────────────────╮    │
│ │                         _ = {}                                        │    │
│ │                  auto_map = None                                      │    │
│ │ compressed_tensors_config = None                                      │    │
│ │               config_dict = {                                         │    │
│ │                             │   '_name_or_path': '',                  │    │
│ │                             │   'architectures': [                    │    │
│ │                             │   │   'XLMRobertaModel'                 │    │
│ │                             │   ],                                    │    │
│ │                             │   'attention_probs_dropout_prob': 0.1,  │    │
│ │                             │   'bos_token_id': 0,                    │    │
│ │                             │   'classifier_dropout': None,           │    │
│ │                             │   'eos_token_id': 2,                    │    │
│ │                             │   'hidden_act': 'gelu',                 │    │
│ │                             │   'hidden_dropout_prob': 0.1,           │    │
│ │                             │   'hidden_size': 1024,                  │    │
│ │                             │   'initializer_range': 0.02,            │    │
│ │                             │   ... +15                               │    │
│ │                             }                                         │    │
│ │                     dtype = None                                      │    │
│ │            kv_cache_dtype = None                                      │    │
│ │           kv_cache_scheme = None                                      │    │
│ │          lora_adapter_ids = []                                        │    │
│ │          max_input_tokens = None                                      │    │
│ │                    method = 'n-gram'                                  │    │
│ │                  model_id = 'BAAI/bge-m3'                             │    │
│ │                model_type = 'xlm-roberta'                             │    │
│ │      needs_sliding_window = False                                     │    │
│ │       quantization_config = None                                      │    │
│ │                  quantize = None                                      │    │
│ │                  revision = None                                      │    │
│ │                   sharded = False                                     │    │
│ │            sliding_window = -1                                        │    │
│ │                 speculate = 0                                         │    │
│ │                speculator = None                                      │    │
│ │         trust_remote_code = False                                     │    │
│ │        use_sliding_window = False                                     │    │
│ ╰───────────────────────────────────────────────────────────────────────╯    │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: Unsupported model type xlm-roberta rank=0

Expected behavior

The model should load and the server should start successfully.

model=BAAI/bge-m3
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus '"device=4"' --shm-size 64g -p 10003:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.1.0 \
    --model-id $model
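
For context, BAAI/bge-m3 is an XLM-RoBERTa based embedding model, so its config declares model_type "xlm-roberta", which is exactly the value TGI's get_model rejects in the traceback above. A minimal check (assuming transformers is installed; this snippet is not part of the original report) that shows the model type being dispatched on:

from transformers import AutoConfig

# Fetch only the model's config from the Hub (no weights are downloaded)
config = AutoConfig.from_pretrained("BAAI/bge-m3")
print(config.model_type)     # "xlm-roberta"
print(config.architectures)  # ["XLMRobertaModel"]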
