
[enhancement] support llama #575

Open · void-main wants to merge 34 commits into base: main
Conversation

void-main

Implements LLaMA as requested in issue #506.

Steps to use

First, convert the llama-7b-hf weights from Hugging Face with huggingface_llama_convert.py:

python3 huggingface_llama_convert.py -saved_dir=/path/to/export/folder/ -in_file=/path/to/llama-7b-hf -infer_gpu_num=1 -weight_data_type=fp16 -model_name=llama_7b

Next, compile and run llama_example.

Test case

start_ids.csv: [0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973, 1815, 366, 5193, 304, 592, 29973]
out: [0,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366,1136,455,2470,29973,1815,366,5193,304,592,29973,18637,29892,526,366]
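
For reference, a minimal sketch (not part of this PR) of how a start_ids.csv like this can be produced with the Hugging Face tokenizer; exact IDs can vary slightly across tokenizer versions, and the leading 0 here is the start_id from the example config rather than the tokenizer's BOS token:

```python
# Hypothetical helper (not part of this PR): build start_ids.csv from a prompt
# using the tokenizer shipped with llama-7b-hf.
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("/path/to/llama-7b-hf")
prompt = "Hey, are you consciours? Can you talk to me?"  # decodes back to the IDs above
start_id = 0  # matches the start_id shown in the config snippet later in this thread

ids = [start_id] + tokenizer.encode(prompt, add_special_tokens=False)
with open("start_ids.csv", "w") as f:
    f.write(", ".join(str(i) for i in ids) + "\n")
print(ids)
```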

@syslot

syslot commented Apr 25, 2023

This LLaMA implementation is very meaningful. Have you tested its performance? How fast is it compared with the vanilla transformers API?

@void-main
Author

This LLaMA implementation is very meaningful. Have you tested its performance? How fast is it compared with the vanilla transformers API?

I've been super busy lately and don't quite have the time for a performance comparison; hopefully someone will do us the favor and compare FT with the transformers API. :-)
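
For anyone who picks this up, here is a minimal sketch of the transformers-side baseline (paths and generation settings are placeholders; the FT side would be timed the same way through llama_example or the Triton backend):

```python
# Rough baseline sketch (not part of this PR): time Hugging Face generation
# so there is a number to compare FT against. Paths/settings are placeholders.
import time
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("/path/to/llama-7b-hf")
model = LlamaForCausalLM.from_pretrained(
    "/path/to/llama-7b-hf", torch_dtype=torch.float16
).cuda().eval()

inputs = tokenizer("Hey, are you consciours? Can you talk to me?", return_tensors="pt").to("cuda")

torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / (time.time() - start):.1f} tokens/s (HF fp16 baseline)")
```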

@152334H

152334H commented Apr 28, 2023

Does this implement int8 (or even 4bit) by any chance?

@void-main
Author

Some updates:

  • Supported bf16
  • Supported Triton decoupled mode
  • Verified that LLaMA 65B works

@pineking

pineking commented May 1, 2023


What are the kernel auto-tuning parameters for the LLaMA model?

@atyshka

atyshka commented May 1, 2023

Does this implement int8 (or even 4bit) by any chance?

FasterTransformer doesn't seem to support int4 at all right now. I would be interested in helping with int8, though; that should enable the 65B model to run tensor-parallel on my 2x A6000 GPUs.
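
Rough back-of-the-envelope math behind that expectation (weights only, ignoring activations and the KV cache):

```python
# Back-of-the-envelope memory math for LLaMA 65B with 2-way tensor parallelism.
params = 65e9
num_gpus = 2
for dtype, bytes_per_param in (("fp16", 2), ("int8", 1)):
    total_gb = params * bytes_per_param / 1e9
    per_gpu_gb = total_gb / num_gpus
    print(f"{dtype}: ~{total_gb:.0f} GB of weights, ~{per_gpu_gb:.0f} GB per GPU")
# fp16: ~130 GB of weights, ~65 GB per GPU  -> exceeds a 48 GB A6000
# int8: ~ 65 GB of weights, ~33 GB per GPU  -> weights fit, leaving room for KV cache
```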

@happytree09

Does this implement int8 (or even 4bit) by any chance?

FasterTransformer doesn't seem to support int4 at all right now. I would be interested in helping with int8, though; that should enable the 65B model to run tensor-parallel on my 2x A6000 GPUs.

+1 happy to contribute to this

@lucasjinreal

@void-main Have you compared this with ggml's llama.cpp with cuBLAS support?

num_layer = 32
rotary_embedding = 128
vocab_size = 32000
start_id = 0


Adding layernorm_eps=1e-06 will help new beginners.
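
That value matches the RMSNorm epsilon in the Hugging Face LLaMA config (rms_norm_eps); a quick way to cross-check it, as a sketch with a placeholder path:

```python
# Sanity check (illustrative, not part of this PR): the suggested layernorm_eps
# corresponds to rms_norm_eps in the Hugging Face LLaMA config.
from transformers import LlamaConfig

cfg = LlamaConfig.from_pretrained("/path/to/llama-7b-hf")
print(cfg.num_hidden_layers, cfg.vocab_size, cfg.rms_norm_eps)
# expected for llama-7b-hf: 32 32000 1e-06
```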

@handoku

handoku commented Jun 9, 2023

Hi. I've recently tested this implementation on blip2_vicuna_instruct. It uses the vit_qformer embedding as a prefix_soft_embedding, which is fed into Vicuna together with the prompt's token_ids.
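
For context, here is a conceptual sketch of what that means: the soft prefix is concatenated in front of the embedded prompt tokens before decoding (illustrative only, not FT's actual API):

```python
# Conceptual sketch only -- not FasterTransformer's API.
import torch

def build_decoder_input(prefix_embeds: torch.Tensor,
                        token_ids: torch.Tensor,
                        embedding_table: torch.Tensor) -> torch.Tensor:
    """Prepend a soft-prompt embedding (e.g. the Q-Former output) to the
    embedded text prompt, giving the sequence the decoder actually sees."""
    token_embeds = embedding_table[token_ids]                # [prompt_len, hidden]
    return torch.cat([prefix_embeds, token_embeds], dim=0)   # [prefix_len + prompt_len, hidden]
```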

According to my test results:
When testing vicuna-13b alone, FT outputs text of the same quality as Hugging Face's.
However, when token_ids are fed along with the prefix_soft_embedding, there is a noticeable quality decrease.

For example:
image: ref (attached in the original comment)
prompt: Describe the environment in which the product in the middle of the image is located

pytorch output:

. The product in the middle of this image is located within a refrigerator, surrounded by various fruits and vegetables on both sides as well

FT output:

. The refrigerator is open and filled with food.
The refrigerator is open and filled with food.

Does anyone have experience using FasterTransformer's prefix soft prompt feature? What might cause this issue? Could it be a usage mistake? I need some hints to debug it.

Thanks in advance!

[EDITED]: issue solved

@sleepwalker2017

@void-main Hi, I'm also in Beijing and I'm a developer in AI inference. Could I have your WeChat?

@void-main
Author

@void-main Hi, I'm also in Beijing and I'm a developer in AI inference. Could I have your WeChat?

Sure, try sending me an email. :-)

@frankxyy

I found that the rotary embedding results differ between FT and Hugging Face. Has anyone met similar problems?

@UnknownSwordsman

UnknownSwordsman commented Jun 28, 2023

@void-main Hello, I found a bug: after many (thousands of) batch-of-20 inferences, some batches may produce random output. But if the Triton service is restarted, inference works normally again.

When the batch size is 5, I haven't seen it yet.

The prompts mix Chinese and English.
Some example answers:

该��ate-to-p>\n\n\n\n\n\n\n\n\n\n\n\n\nIt's a\n\nIt's a new-est\nIt in the\nIt's at all about\nIt's at the\nIt's at the\nIt's at the\nIt's\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's the\nIt's 

该宏在况况冲况��'s 不, 不, 它的 它的 ��, ��\n\n\n\n\n\n\n\n\n\n\n\nJupit was not you be ��, ��\n/b\n/and ��\n/��\n/��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/\n/ ��\n/��\n/\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��\n/��

Compiled with the FasterTransformer backend.

Device: V100 / A100, 4 GPUs
Model: Vicuna-13B-v1.1
Parameters: top_k = 1, output_token_len = 500, batch_size = 20

@UnknownSwordsman

Another problem: when batch inference is used, the same prompt generates different results.

Parameters: top_k=1, random_seed=1, output_len=500
Device: T4 / A100, 4 GPUs, via Triton server

prompt = "写一篇关于爱情的故事" ("Write a story about love")

answer:

['text': '。\n\n爱情,这个词汇,它既是最美妙的东西,又是最艰难的。它可以让人感到无比的高昂,也可以让人陷入无比的低谷。\n\n有一天,有两个人,他们分别是艾尔和卡洛斯。艾尔和卡洛斯是在一个酒店里相遇的。他们在一起,聊天,发现彼此都有着相似的经历。他们都是单身,都在寻找真爱。\n\n他们决定一起去旅行,去探索这个世界。他们在旅途中,发现彼此之间的感情越来越深。他们一起经历了许多困难和挑战,但是他们始终相互扶持,相互支持。\n\n最终,他们到达了一个美丽的小镇,这里是他们想要的地方。他们决定在这里生活,共同度过剩下的生活。他们彼此扶持,一起面对未来,一起迎接挑战。\n\n他们的爱情,不是一种刻意的选择,而是一种自然的选择。他们在彼此的陪伴下,找到了真正的意义。他们的爱情,是一种强大的力量,可以在任何情况下帮助他们克服困难。\n',
   'index': 0},
  {'text': '。\n\n爱情,这个词汇,它既是最美妙的东西,又是最艰难的。它可以让人感到无比的高昂,也可以让人陷入无比的低谷。\n\n有一天,有两个人,他们分别是艾尔和卡洛斯。艾尔和卡洛斯是在一个酒店里相遇的。他们在一起,聊天,发现彼此都有着相似的经历。他们都是单身,都在寻找真爱。\n\n他们决定一起去旅行,去探索这个世界。他们在旅途中,发现彼此之间的感情越来越深。他们一起经历了许多困难和挑战,但是他们始终相互扶持,相互支持。\n\n最终,他们到达了一个美丽的小镇,这里是他们想要的地方。他们决定在这里生活,共同度过剩下的生活。他们彼此扶持,一起面对未来,一起迎接挑战。\n\n他们的爱情,不是一种刻意的选择,而是一种自然的选择。他们在彼此的陪伴下,找到了真正的意义。他们的爱情,是一种强大的力量,可以在任何情况下帮助他们克服困难。\n',
   'index': 1},
  {'text': '。\n\n爱情,这个词汇,它既是最美妙的东西,又是最艰难的。它可以让人感到无比的高昂,也可以让人陷入无比的低谷。\n\n有一天,有两个人,他们分别是艾尔和卡洛斯。艾尔和卡洛斯是在一个酒店里相遇的。他们在一起,聊天,发现彼此都有着相似的经历。他们都是单身,都在寻找真爱。\n\n他们决定一起去旅行,去探索这个世界。他们在旅途中,发现彼此之间的感情越来越深。他们一起经历了许多困难和挑战,但是他们始终相互扶持,相互支持。\n\n最终,他们到达了一个美丽的小镇,这里是他们想要的地方。他们决定在这里生活,共同度过剩下的生活。他们彼此扶持,一起面对未来,一起迎接挑战。\n\n他们的爱情,不是一种刻意的选择,而是一种自然的选择。他们在彼此的陪伴下,找到了真正的意义。他们的爱情,是一种强大的力量,可以战胜一切困难。\n\n这是一个关于�',
   'index': 2},
  {'text': '。\n\n爱情,这个词汇,它既是最美妙的东西,又是最艰难的。它可以让人感到无比的高昂,也可以让人陷入无比的低谷。\n\n有一天,有两个人,他们分别是艾尔和卡洛斯。艾尔和卡洛斯是在一个酒店里相遇的。他们在一起,聊天,发现彼此都有着相似的经历。他们都是单身,都在寻找真爱。\n\n他们决定一起去旅行,去探索这个世界。他们在旅途中,发现彼此之间的感情越来越深。他们一起经历了许多困难和挑战,但是他们始终相互扶持,相互支持。\n\n最终,他们到达了一个美丽的小镇,这里是他们想要的地方。他们决定在这里生活,共同度过剩下的生活。他们彼此扶持,一起面对未来,一起迎接挑战。\n\n他们的爱情,不是一种刻意的选择,而是一种自然的选择。他们在彼此的陪伴下,找到了真正的意义。他们的爱情,是一种强大的力量,可以在任何情况下帮助他们克服困难。\n',
   'index': 3},
  {'text': '。\n\n爱情,这个词汇,它既是最美妙的东西,又是最艰难的。它可以让人感到无比的高昂,也可以让人陷入无比的低谷。\n\n有一天,有两个人,他们分别是艾尔和卡洛斯。艾尔和卡洛斯是在一个酒店里相遇的。他们在一起,聊天,发现彼此都有着相似的经历。他们都是单身,都在寻找真爱。\n\n他们决定一起去旅行,去探索这个世界。他们在旅途中,发现彼此之间的感情越来越深。他们一起经历了许多困难和挑战,但是他们始终相互扶持,相互支持。\n\n最终,他们到达了一个美丽的小镇,这里是他们想要的地方。他们决定在这里生活,共同度过剩下的生活。他们彼此扶持,一起面对未来,一起迎接挑战。\n\n他们的爱情,不是一种刻意的选择,而是一种自然的选择。他们在彼此的陪伴下,找到了真正的意义。他们的爱情,是一种强大的力量,可以战胜一切困难。\n\n这是一个关于�', 'index': 4}]

size_t rotary_embedding_dim_;
float layernorm_eps_;

static constexpr bool neox_rotary_style_ = true;
Contributor

This should be false.


@prnake I tested it, and only true produces normal output.

@prnake
Contributor

prnake commented Jun 28, 2023

It seems that torch.cos() and the C cos function generate slightly different results, which leads to different rotary embedding outputs. Does anyone have an idea for a solution?

You are right, the model should use the basic type of rotary.
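
To make the distinction concrete, here is a simplified sketch of the two rotary styles under discussion (single head, illustrative only); the two are equivalent up to a fixed permutation of the head dimensions, so which setting is correct depends on how the converted Q/K projection weights are laid out:

```python
# Illustrative sketch of the two rotary-embedding styles discussed above.
# cos/sin have shape [..., rotary_dim // 2] in both variants.
import torch

def rotary_basic(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """'Basic' (GPT-J style): rotate adjacent pairs (x0, x1), (x2, x3), ..."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_odd * cos + x_even * sin
    return out

def rotary_neox(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """GPT-NeoX style: rotate the first half of the dimensions against the second half."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)
```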

@CN-COTER

First of all, thanks for your FT LLaMA implementation, @void-main. I pushed a PR to support int8 and shared context. Can anyone help me check it?

@void-main
Author

Hi @CN-COTER, thanks for the contribution! Really appreciate it!
I've checked your code and started a review; could you please take a look? 🎉

@double-vin


My start_ids.csv: [0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973, 1815, 366, 5193, 304, 592, 29973]
out:
0 18637 29892 526 366 1136 455 2470 29973 1815 366 5193 304 592 29973 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 2 2991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991 22991

decode out:
Hey, are you consciours? Can you talk to me?olgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolgolg

It looks like there's a problem with the out and decode out. Please give me some suggestions.

@KeKe-Deng

KeKe-Deng commented Jul 25, 2023 via email

@realgump

realgump commented Aug 1, 2023


Same issue as @UnknownSwordsman described above. Have you found a solution?

@double-vin


@void-main Please give some suggestions. Thank you!

@jcao-ai

jcao-ai commented Aug 8, 2023

Will the LLaMA-2 70B architecture be supported in the future? @void-main Thanks
