System Info

docker deploy
$ nvidia-smi
Thu Feb 13 23:44:10 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:4F:00.0 Off |                    0 |
| N/A   63C    P0            297W /  300W |   41431MiB /  81920MiB |     98%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:52:00.0 Off |                    0 |
| N/A   62C    P0            155W /  300W |   41437MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe          Off |   00000000:56:00.0 Off |                    0 |
| N/A   65C    P0            165W /  300W |   41397MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe          Off |   00000000:57:00.0 Off |                    0 |
| N/A   35C    P0             48W /  300W |      14MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA L40S                    Off |   00000000:CE:00.0 Off |                    0 |
| N/A   37C    P0             95W /  350W |    2266MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA L40S                    Off |   00000000:D1:00.0 Off |                    0 |
| N/A   36C    P0             91W /  350W |     876MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA L40S                    Off |   00000000:D5:00.0 Off |                    0 |
| N/A   38C    P0             97W /  350W |   19149MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA L40S                    Off |   00000000:D6:00.0 Off |                    0 |
| N/A   38C    P0             96W /  350W |   19187MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
Reproduction

model=BAAI/bge-m3
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus '"device=4"' --shm-size 64g -p 10003:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.1.0 \
    --model-id $model

2025-02-13T15:38:51.005264Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-02-13T15:38:52.028582Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 10, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 323, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 743, in main
    return _main(
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 198, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
    return callback(**use_params)
  File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
    server.serve(
  File "/usr/src/server/text_generation_server/server.py", line 315, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
> File "/usr/src/server/text_generation_server/server.py", line 268, in serve_inner
    model = get_model_with_lora_adapters(
  File "/usr/src/server/text_generation_server/models/__init__.py", line 1542, in get_model_with_lora_adapters
    model = get_model(
  File "/usr/src/server/text_generation_server/models/__init__.py", line 1523, in get_model
    raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type xlm-roberta

2025-02-13T15:38:53.307253Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2025-02-13 15:38:42.621 | INFO | text_generation_server.utils.import_utils:<module>:80 - Detected system cuda
/usr/src/server/text_generation_server/layers/gptq/triton.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd(cast_inputs=torch.float16)
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
Traceback (most recent call last):
  /usr/src/server/text_generation_server/cli.py:119 in serve
      server.serve(
      locals: model_id = 'BAAI/bge-m3', lora_adapters = [], quantize = None, dtype = None,
              sharded = False, trust_remote_code = False
  /usr/src/server/text_generation_server/server.py:315 in serve
      asyncio.run(
  /opt/conda/lib/python3.11/asyncio/runners.py:190 in run
      return runner.run(main)
  /opt/conda/lib/python3.11/asyncio/runners.py:118 in run
      return self._loop.run_until_complete(task)
  /opt/conda/lib/python3.11/asyncio/base_events.py:654 in run_until_complete
      return future.result()
  /usr/src/server/text_generation_server/server.py:268 in serve_inner
      model = get_model_with_lora_adapters(
  /usr/src/server/text_generation_server/models/__init__.py:1542 in get_model_with_lora_adapters
      model = get_model(
  /usr/src/server/text_generation_server/models/__init__.py:1523 in get_model
      raise ValueError(f"Unsupported model type {model_type}")
      locals: model_id = 'BAAI/bge-m3', model_type = 'xlm-roberta',
              config_dict = {'architectures': ['XLMRobertaModel'], ...}
ValueError: Unsupported model type xlm-roberta rank=0
Expected behavior

The model should load and the server should start serving without errors.
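bge-m3 is an embedding model, and TGI's get_model() has no branch for the xlm-roberta model type. One possible workaround is to serve the same model with text-embeddings-inference (TEI) instead; the sketch below assumes a TEI image build matching the target GPU is available, and the 1.6 tag is an assumption:

model=BAAI/bge-m3
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
# TEI supports XLM-RoBERTa embedding models; the image tag is an assumption,
# pick the build that matches the target GPU from the TEI documentation
docker run --gpus '"device=4"' -p 10003:80 -v $volume:/data \
    ghcr.io/huggingface/text-embeddings-inference:1.6 \
    --model-id $model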