[Bug] auto_awq fails with NaN assertion: assert torch.isnan(p).sum() == 0 #757

Open
@lzk9508

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

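lmdeploy lite auto_awq fails with an AssertionError at assert torch.isnan(p).sum() == 0 in smooth_ln_fcs, after calibration has run through model.layers.0 to model.layers.31. (Screenshot attachment: 8BD3EB78-1AC0-4977-90ED-B65C90F35202)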

Reproduction

from transformers import AutoTokenizer, AutoModel
import torch
import os

model_path = "/ossfs/workspace/checkpoint-555"
save_path = "/ossfs/workspace/checkpoint-555_fused"

# Load the model with device_map="auto" to handle device placement
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",  # Add this parameter
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True,
)

# Save the model and tokenizer
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

# Convert to AWQ models
print('begin convert to awq models')
os.system(f'lmdeploy lite auto_awq {save_path} --work-dir {save_path}_awq')
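
The last step of the script launches the CLI through os.system, which only returns the exit status and lets the script continue even when the conversion fails. A minimal sketch of the same call through subprocess, so that a failed conversion raises an exception, could look like the following. The helper names build_awq_command and run_awq are hypothetical and not part of lmdeploy; the command and the --work-dir flag are copied from the script above.

```python
import subprocess


def build_awq_command(model_dir, work_dir):
    """Build the same CLI call as the reproduction script, as an argument list."""
    return ["lmdeploy", "lite", "auto_awq", model_dir, "--work-dir", work_dir]


def run_awq(model_dir, work_dir):
    # check=True raises CalledProcessError on a non-zero exit code,
    # whereas os.system only returns the status and carries on.
    subprocess.run(build_awq_command(model_dir, work_dir), check=True)


# Hypothetical helper usage with the paths from the reproduction script.
save_path = "/ossfs/workspace/checkpoint-555_fused"
cmd = build_awq_command(save_path, f"{save_path}_awq")
print(" ".join(cmd))
```

Calling run_awq(save_path, f"{save_path}_awq") would then execute the conversion and stop the script if the CLI exits with an error.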

Environment

Linux (CentOS)
torch 2.1.2, CUDA 12.1, torchvision 0.16.2
transformers 4.40.0
lmdeploy 0.5.3

Error traceback

Using the latest cached version of the dataset since ptb_text_only couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'penn_treebank' at /root/.cache/huggingface/datasets/ptb_text_only/penn_treebank/1.1.0/fa7dfc4a32462b6a91341205a11ef3ddff7ffc0325ce3cb662e73eddb4ae1182 (last modified on Fri Dec 13 12:57:48 2024).
Using the latest cached version of the dataset since ptb_text_only couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'penn_treebank' at /root/.cache/huggingface/datasets/ptb_text_only/penn_treebank/1.1.0/fa7dfc4a32462b6a91341205a11ef3ddff7ffc0325ce3cb662e73eddb4ae1182 (last modified on Fri Dec 13 12:57:48 2024).
Token indices sequence length is longer than the specified maximum sequence length for this model (1085165 > 4096). Running this sequence through the model will result in indexing errors
model.layers.0, samples: 128, max gpu memory: 7.14 GB
model.layers.1, samples: 128, max gpu memory: 9.14 GB
model.layers.2, samples: 128, max gpu memory: 9.14 GB
model.layers.3, samples: 128, max gpu memory: 9.14 GB
model.layers.4, samples: 128, max gpu memory: 9.14 GB
model.layers.5, samples: 128, max gpu memory: 9.14 GB
model.layers.6, samples: 128, max gpu memory: 9.14 GB
model.layers.7, samples: 128, max gpu memory: 9.14 GB
model.layers.8, samples: 128, max gpu memory: 9.14 GB
model.layers.9, samples: 128, max gpu memory: 9.14 GB
model.layers.10, samples: 128, max gpu memory: 9.14 GB
model.layers.11, samples: 128, max gpu memory: 9.14 GB
model.layers.12, samples: 128, max gpu memory: 9.14 GB
model.layers.13, samples: 128, max gpu memory: 9.14 GB
model.layers.14, samples: 128, max gpu memory: 9.14 GB
model.layers.15, samples: 128, max gpu memory: 9.14 GB
model.layers.16, samples: 128, max gpu memory: 9.14 GB
model.layers.17, samples: 128, max gpu memory: 9.14 GB
model.layers.18, samples: 128, max gpu memory: 9.14 GB
model.layers.19, samples: 128, max gpu memory: 9.14 GB
model.layers.20, samples: 128, max gpu memory: 9.14 GB
model.layers.21, samples: 128, max gpu memory: 9.14 GB
model.layers.22, samples: 128, max gpu memory: 9.14 GB
model.layers.23, samples: 128, max gpu memory: 9.14 GB
model.layers.24, samples: 128, max gpu memory: 9.14 GB
model.layers.25, samples: 128, max gpu memory: 9.14 GB
model.layers.26, samples: 128, max gpu memory: 9.14 GB
model.layers.27, samples: 128, max gpu memory: 9.14 GB
model.layers.28, samples: 128, max gpu memory: 9.14 GB
model.layers.29, samples: 128, max gpu memory: 9.14 GB
model.layers.30, samples: 128, max gpu memory: 9.14 GB
model.layers.31, samples: 128, max gpu memory: 9.14 GB
Traceback (most recent call last):
  File "/opt/conda/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 36, in run
    args.run(args)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/lite.py", line 139, in auto_awq
    auto_awq(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/apis/auto_awq.py", line 108, in auto_awq
    smooth_layers(layers, fc2fcs, norm2fcs, act_scales, w_group_size,
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 277, in smooth_layers
    smooth_ln_fcs(ln, fcs, a_scales[a_name], group_size)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 134, in smooth_ln_fcs
    assert torch.isnan(p).sum() == 0
AssertionError
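
The assertion fires on a parameter that contains NaN during smoothing. The traceback does not show whether the checkpoint itself already holds non-finite weights or whether they only appear during smoothing. One way to narrow this down is to scan the fine-tuned checkpoint before quantization. The sketch below is only a diagnostic idea, not lmdeploy code: find_non_finite is a hypothetical helper, and the small Linear module stands in for the real model, for which one would pass model.named_parameters() from the reproduction script.

```python
import torch


def find_non_finite(named_tensors):
    """Return (name, nan_count, inf_count) for every tensor holding NaN or Inf."""
    bad = []
    for name, tensor in named_tensors:
        if not torch.is_floating_point(tensor):
            continue  # integer buffers cannot hold NaN/Inf
        nan_count = int(torch.isnan(tensor).sum())
        inf_count = int(torch.isinf(tensor).sum())
        if nan_count or inf_count:
            bad.append((name, nan_count, inf_count))
    return bad


# Tiny stand-in module with one NaN and one Inf injected for demonstration.
demo = torch.nn.Linear(4, 2)
with torch.no_grad():
    demo.weight[0, 0] = float("nan")
    demo.bias[1] = float("inf")

report = find_non_finite(demo.named_parameters())
for name, nan_count, inf_count in report:
    print(f"{name}: {nan_count} NaN, {inf_count} Inf")
```

If such a scan reports nothing for the checkpoint, the NaN is more likely introduced while the smoothing scales are computed than inherited from training.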
