Hi everyone!

I'm trying to initialize Llava1.6-34b-hf with Flash Attention 2, but I get the following warning, after which the model doesn't work properly and inference isn't any faster.

The point is that I explicitly pass torch_dtype=torch.float16. How should I handle this warning, and does it affect Flash Attention during inference?
Code is below:

```python
import torch
from transformers import LlavaNextForConditionalGeneration

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-34b-hf",
    device_map="cuda",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
```
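A minimal sketch of how one could check, after loading, whether Flash Attention 2 and fp16 actually took effect (assuming a recent transformers version; the `config._attn_implementation` attribute is an assumption on my part and not from the original report):

```python
import torch
from transformers import LlavaNextForConditionalGeneration

# Same load call as above (assumption: run on a CUDA machine with
# flash-attn installed).
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-34b-hf",
    device_map="cuda",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)

# Which attention backend was actually selected.
print(model.config._attn_implementation)  # expected: "flash_attention_2"

# Whether the weights really ended up in half precision; fp32 weights
# would explain both the dtype warning and the missing speed-up.
print(model.dtype)                         # expected: torch.float16
print({p.dtype for p in model.parameters()})
```

If the printed dtype is torch.float32, the warning is likely about the dtype mismatch rather than Flash Attention itself.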