--use_fp8 doesn't work with llama 3.1 8b #2602

Open
ShuaiShao93 opened this issue Dec 20, 2024 · 0 comments

Labels
bug Something isn't working
System Info

x86_64, Debian 11, L40S GPU

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Clone Llama 3.1 8B
  2. Install TensorRT-LLM 0.15.0
  3. Convert the checkpoint with:
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir ./Meta-Llama-3.1-8B-Instruct --output_dir ./tllm_8b_checkpoint_1gpu_fp8  --dtype float16 --use_fp8
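
If it helps triage: my read (an assumption on my side, not something the 0.15 docs confirm) is that --use_fp8 in convert_checkpoint.py expects a checkpoint that already carries FP8 scaling factors, while quantizing a plain HF checkpoint is supposed to go through the quantization example instead. A sketch of that alternative path, with flags recalled from the examples README (please double-check them against your local copy):

python3 TensorRT-LLM/examples/quantization/quantize.py --model_dir ./Meta-Llama-3.1-8B-Instruct --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir ./tllm_8b_checkpoint_1gpu_fp8 --calib_size 512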

Expected behavior

The checkpoint should be converted successfully.

actual behavior

Got the following error:

[TensorRT-LLM] TensorRT-LLM version: 0.15.0
0.15.0
[12/20/2024-20:09:51] [TRT-LLM] [W] Implicitly setting LLaMAConfig.tie_word_embeddings = False
5it [00:00, 25.38it/s]
Traceback (most recent call last):
  File "/home/ubuntu/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 555, in <module>
    main()
  File "/home/ubuntu/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 547, in main
    convert_and_save_hf(args)
  File "/home/ubuntu/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 488, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/home/ubuntu/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 495, in execute
    f(args, rank)
  File "/home/ubuntu/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 472, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/tensorrt_llm/models/llama/model.py", line 405, in from_hugging_face
    loader.generate_tllm_weights(model)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/tensorrt_llm/models/model_weights_loader.py", line 408, in generate_tllm_weights
    self.load(tllm_key,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/tensorrt_llm/models/model_weights_loader.py", line 296, in load
    v = sub_module.postprocess(tllm_key, v, **postprocess_kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/tensorrt_llm/quantization/layers.py", line 1478, in postprocess
    new_amax = max(weight_scaling_factors).reshape(1, ).to(
TypeError: '>' not supported between instances of 'NoneType' and 'NoneType'
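
For what it's worth, the TypeError at the bottom of the trace reproduces with plain Python once the scaling factors come back as None, which is presumably what happens when the source HF checkpoint carries no FP8 scales (my reading of the trace, not a confirmed diagnosis):

# Minimal sketch of the failure mode, not TensorRT-LLM code: max() over a list
# of None values raises exactly the error reported above.
weight_scaling_factors = [None, None]  # what postprocess() appears to receive here
new_amax = max(weight_scaling_factors)
# TypeError: '>' not supported between instances of 'NoneType' and 'NoneType'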

additional notes

N/A
