Quantized BNB-4bit models are not working. #3005

Open
v3ss0n opened this issue Feb 10, 2025 · 0 comments
v3ss0n commented Feb 10, 2025

System Info

Testing on 2x 4090 TI Super

Models tested:

      - MODEL_ID=unsloth/Qwen2.5-Coder-32B-bnb-4bit
      - MODEL_ID=unsloth/Mistral-Small-24B-Instruct-2501-bnb-4bit
text-generation-inference-1  | [rank1]: │ /usr/src/server/text_generation_server/utils/weights.py:275 in get_sharded   │
text-generation-inference-1  | [rank1]: │                                                                              │
text-generation-inference-1  | [rank1]: │   272 │   │   world_size = self.process_group.size()                         │
text-generation-inference-1  | [rank1]: │   273 │   │   size = slice_.get_shape()[dim]                                 │
text-generation-inference-1  | [rank1]: │   274 │   │   assert (                                                       │
text-generation-inference-1  | [rank1]: │ ❱ 275 │   │   │   size % world_size == 0                                     │
text-generation-inference-1  | [rank1]: │   276 │   │   ), f"The choosen size {size} is not compatible with sharding o │
text-generation-inference-1  | [rank1]: │   277 │   │   return self.get_partial_sharded(                               │
text-generation-inference-1  | [rank1]: │   278 │   │   │   tensor_name, dim, to_device=to_device, to_dtype=to_dtype   │
text-generation-inference-1  | [rank1]: │                                                                              │
text-generation-inference-1  | [rank1]: │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
text-generation-inference-1  | [rank1]: │ │         dim = 1                                                          │ │
text-generation-inference-1  | [rank1]: │ │           f = <builtins.safe_open object at 0x77297422f7b0>              │ │
text-generation-inference-1  | [rank1]: │ │    filename = '/data/hub/models--unsloth--Qwen2.5-Coder-32B-bnb-4bit/sn… │ │
text-generation-inference-1  | [rank1]: │ │        self = <text_generation_server.utils.weights.Weights object at    │ │
text-generation-inference-1  | [rank1]: │ │               0x772972e7a3d0>                                            │ │
text-generation-inference-1  | [rank1]: │ │        size = 1                                                          │ │
text-generation-inference-1  | [rank1]: │ │      slice_ = <builtins.PySafeSlice object at 0x772973da8f80>            │ │
text-generation-inference-1  | [rank1]: │ │ tensor_name = 'model.layers.0.self_attn.o_proj.weight'                   │ │
text-generation-inference-1  | [rank1]: │ │   to_device = True                                                       │ │
text-generation-inference-1  | [rank1]: │ │    to_dtype = True                                                       │ │
text-generation-inference-1  | [rank1]: │ │  world_size = 2                                                          │ │
text-generation-inference-1  | [rank1]: │ ╰──────────────────────────────────────────────────────────────────────────╯ │
text-generation-inference-1  | [rank1]: ╰──────────────────────────────────────────────────────────────────────────────╯
text-generation-inference-1  | [rank1]: AssertionError: The choosen size 1 is not compatible with sharding on 2 shards rank=1
text-generation-inference-1  | 2025-02-10T12:36:10.058627Z ERROR text_generation_launcher: Shard 1 failed to start
text-generation-inference-1  | 2025-02-10T12:36:10.058637Z  INFO text_generation_launcher: Shutting down shards
text-generation-inference-1  | 2025-02-10T12:36:10.065243Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=0
text-generation-inference-1  | 2025-02-10T12:36:10.065344Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
text-generation-inference-1  | 2025-02-10T12:36:10.165431Z  INFO shard-manager: text_generation_launcher: shard terminated rank=0
text-generation-inference-1  | Error: ShardCannotStart

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  text-generation-inference:
    image: ghcr.io/huggingface/text-generation-inference:3.1.0
    environment:
      - HF_TOKEN=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
      # - MODEL_ID=Qwen/Qwen2.5-Coder-7B-Instruct-AWQ
      # - MODEL_ID=mistralai/Mistral-Small-24B-Instruct-2501
      # - MODEL_ID=Qwen/Qwen2.5-Coder-32B-Instruct-AWQ
      # - MODEL_ID=avoroshilov/DeepSeek-R1-Distill-Qwen-32B-GPTQ_4bit-128g
      # - MODEL_ID=Valdemardi/DeepSeek-R1-Distill-Qwen-32B-AWQ
      # - MODEL_ID=Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4
      - MODEL_ID=unsloth/Qwen2.5-Coder-32B-bnb-4bit
      # - MODEL_ID=unsloth/Mistral-Small-24B-Instruct-2501-bnb-4bit
      # - SHARDED=true
      # - SHARDS=2
      # - QUANTIZED=bitsandbytes
    ports:
      - "0.0.0.0:8099:80"
    restart: "unless-stopped"
    # command: "--quantize bitsandbytes-nf4 --max-input-tokens 30000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1']
              capabilities: [gpu]
    shm_size: '90g'
    volumes:
      - ~/.hf-docker-data:/data
    networks:
      - llmhost

Expected behavior

Unquantized models work fine with "--quantize bitsandbytes-nf4 --max-input-tokens 30000".
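
For comparison, the packed layout can be confirmed by inspecting the downloaded checkpoint directly. A hedged diagnostic sketch using the safetensors library (the file path is a placeholder for the actual snapshot under /data/hub; the tensor name comes from the traceback):

    # Print the stored shape of the tensor that failed to shard. In a
    # BNB-4bit checkpoint this is expected to be a packed shape like
    # (N, 1) rather than the usual (out_features, in_features).
    from safetensors import safe_open

    path = "/data/hub/models--unsloth--Qwen2.5-Coder-32B-bnb-4bit/snapshots/<rev>/<shard>.safetensors"
    with safe_open(path, framework="pt") as f:
        sl = f.get_slice("model.layers.0.self_attn.o_proj.weight")
        print(sl.get_shape())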
