System Info

Testing on 2x 4090 TI Super.
Reproduction

Both pre-quantized bnb-4bit checkpoints fail with the same sharding assertion:

- MODEL_ID=unsloth/Qwen2.5-Coder-32B-bnb-4bit
- MODEL_ID=unsloth/Mistral-Small-24B-Instruct-2501-bnb-4bit
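For context, both repos ship weights that were already quantized with bitsandbytes. A quick way to confirm that a Hub repo is pre-quantized is to read the `quantization_config` block from its `config.json` — a minimal sketch using the standard `huggingface_hub` download API (the exact contents of the block are what I'd expect for a bnb checkpoint, not verified here):

```python
# Sketch: confirm a Hub repo is pre-quantized by inspecting config.json.
# hf_hub_download is the standard huggingface_hub API; the repo id is the
# MODEL_ID from the list above.
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("unsloth/Qwen2.5-Coder-32B-bnb-4bit", "config.json")
with open(cfg_path) as fh:
    cfg = json.load(fh)

# For bitsandbytes checkpoints this block typically reports
# quant_method == "bitsandbytes" and load_in_4bit == True.
print(cfg.get("quantization_config"))
```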
```
text-generation-inference-1  | [rank1]: /usr/src/server/text_generation_server/utils/weights.py:275 in get_sharded
text-generation-inference-1  | [rank1]:
text-generation-inference-1  | [rank1]:   272     world_size = self.process_group.size()
text-generation-inference-1  | [rank1]:   273     size = slice_.get_shape()[dim]
text-generation-inference-1  | [rank1]:   274     assert (
text-generation-inference-1  | [rank1]: ❱ 275         size % world_size == 0
text-generation-inference-1  | [rank1]:   276     ), f"The choosen size {size} is not compatible with sharding o
text-generation-inference-1  | [rank1]:   277     return self.get_partial_sharded(
text-generation-inference-1  | [rank1]:   278         tensor_name, dim, to_device=to_device, to_dtype=to_dtype
text-generation-inference-1  | [rank1]:
text-generation-inference-1  | [rank1]: locals:
text-generation-inference-1  | [rank1]:   dim         = 1
text-generation-inference-1  | [rank1]:   f           = <builtins.safe_open object at 0x77297422f7b0>
text-generation-inference-1  | [rank1]:   filename    = '/data/hub/models--unsloth--Qwen2.5-Coder-32B-bnb-4bit/sn…
text-generation-inference-1  | [rank1]:   self        = <text_generation_server.utils.weights.Weights object at 0x772972e7a3d0>
text-generation-inference-1  | [rank1]:   size        = 1
text-generation-inference-1  | [rank1]:   slice_      = <builtins.PySafeSlice object at 0x772973da8f80>
text-generation-inference-1  | [rank1]:   tensor_name = 'model.layers.0.self_attn.o_proj.weight'
text-generation-inference-1  | [rank1]:   to_device   = True
text-generation-inference-1  | [rank1]:   to_dtype    = True
text-generation-inference-1  | [rank1]:   world_size  = 2
text-generation-inference-1  | [rank1]: AssertionError: The choosen size 1 is not compatible with sharding on 2 shards rank=1
text-generation-inference-1  | 2025-02-10T12:36:10.058627Z ERROR text_generation_launcher: Shard 1 failed to start
text-generation-inference-1  | 2025-02-10T12:36:10.058637Z  INFO text_generation_launcher: Shutting down shards
text-generation-inference-1  | 2025-02-10T12:36:10.065243Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=0
text-generation-inference-1  | 2025-02-10T12:36:10.065344Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
text-generation-inference-1  | 2025-02-10T12:36:10.165431Z  INFO shard-manager: text_generation_launcher: shard terminated rank=0
text-generation-inference-1  | Error: ShardCannotStart
```
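The assertion itself is plain divisibility: TGI splits `o_proj` along dim 1 across the tensor-parallel ranks, but bitsandbytes 4-bit serialization packs each weight into a column vector (roughly shape `[numel/2, 1]` of uint8, if I read the format right), so dim 1 has size 1 and cannot be split over 2 shards. A minimal sketch of the failing check (`shard_dim` is a hypothetical helper name, not TGI's):

```python
# Minimal reproduction of the check from weights.py lines 272-276 above.
def shard_dim(size: int, world_size: int) -> int:
    # Message mirrors TGI's (including the "choosen" typo in the source).
    assert size % world_size == 0, (
        f"The choosen size {size} is not compatible with sharding on {world_size} shards"
    )
    return size // world_size

print(shard_dim(8192, 2))  # unquantized o_proj dim: each of 2 ranks gets 4096

try:
    shard_dim(1, 2)  # packed bnb-4bit weight: dim 1 has size 1, as in the log
except AssertionError as exc:
    print(exc)
```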
```yaml
text-generation-inference:
  image: ghcr.io/huggingface/text-generation-inference:3.1.0
  environment:
    - HF_TOKEN=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
    # - MODEL_ID=Qwen/Qwen2.5-Coder-7B-Instruct-AWQ
    # - MODEL_ID=mistralai/Mistral-Small-24B-Instruct-2501
    # - MODEL_ID=Qwen/Qwen2.5-Coder-32B-Instruct-AWQ
    # - MODEL_ID=avoroshilov/DeepSeek-R1-Distill-Qwen-32B-GPTQ_4bit-128g
    # - MODEL_ID=Valdemardi/DeepSeek-R1-Distill-Qwen-32B-AWQ
    # - MODEL_ID=Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4
    - MODEL_ID=unsloth/Qwen2.5-Coder-32B-bnb-4bit
    # - MODEL_ID=unsloth/Mistral-Small-24B-Instruct-2501-bnb-4bit
    # - SHARDED=true
    # - SHARDS=2
    # - QUANTIZED=bitsandbytes
  ports:
    - "0.0.0.0:8099:80"
  restart: "unless-stopped"
  # command: "--quantize bitsandbytes-nf4 --max-input-tokens 30000"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['0', '1']
            capabilities: [gpu]
  shm_size: '90g'
  volumes:
    - ~/.hf-docker-data:/data
  networks:
    - llmhost
```
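To double-check that the failure comes from the on-disk shape rather than the launcher flags, one can open the downloaded safetensors shard and print the shape TGI will see. A sketch using the same safetensors APIs that appear in the traceback locals (`safe_open`, `get_slice`, `get_shape`); the snapshot hash and shard filename are placeholders, since the path in the log is truncated:

```python
# Sketch: print the on-disk shape of the tensor named in the traceback.
from safetensors import safe_open

path = (
    "/data/hub/models--unsloth--Qwen2.5-Coder-32B-bnb-4bit/"
    "snapshots/<snapshot-hash>/<shard>.safetensors"  # placeholder path
)
with safe_open(path, framework="pt", device="cpu") as f:
    name = "model.layers.0.self_attn.o_proj.weight"
    # If the checkpoint stores packed 4-bit weights, dim 1 should print as 1.
    print(name, f.get_slice(name).get_shape())
```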
Expected behavior

The pre-quantized bnb-4bit checkpoints should load and shard across both GPUs. The unquantized models work fine when quantized on the fly with "--quantize bitsandbytes-nf4 --max-input-tokens 30000".