Use fuse multi head att #417
base: main
Conversation
batch size = 4, acc step = 8, AMP, with checkpointing enabled
projects/T5/models/attention.py
Outdated
if self.multihead_attn_fusion:
    hidden_states = hidden_states.transpose(0, 1)
Here, every time cross_attention is performed, a transpose is needed.
Is there a way to do the transpose only once, outside the layer?
The original fused multi-head attention does a single transpose at the very beginning of the network, so that the data throughout the transformer layers already uses the transposed dimensions (it could be transposed back at the loss stage, though Megatron seems to keep the transposed layout even in the loss part). This removes the transpose inside every layer, which is what actually delivers the speedup.
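A minimal sketch of that idea (the names embedding, transformer_layers, encoder_states, and lm_head below are placeholders for illustration, not the actual libai API):

# Hypothetical sketch: transpose once at the model entry instead of inside every layer.
hidden_states = embedding(input_ids)              # [b, sq, h]
hidden_states = hidden_states.transpose(0, 1)     # [sq, b, h], done once

for layer in transformer_layers:
    # every layer (self-attention, cross-attention, MLP) keeps the [sq, b, h]
    # layout, so no transpose is needed inside the layer
    hidden_states = layer(hidden_states, encoder_states)

# optionally transpose back before computing the loss
# (Megatron reportedly keeps the transposed layout even here)
logits = lm_head(hidden_states.transpose(0, 1))   # [b, sq, vocab]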
Reference code:
By default, what enters attention is already [sq, b, h], with batch size at dim 1.
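As a rough, hypothetical illustration of that layout (qkv_proj, num_heads, and head_size are made-up names for this sketch, not the project's actual code):

def attention_forward(hidden_states, qkv_proj, num_heads, head_size):
    # hidden_states is assumed to arrive as [sq, b, h]: sequence first, batch at dim 1
    sq, b, h = hidden_states.shape
    qkv = qkv_proj(hidden_states)                    # [sq, b, 3 * h]
    qkv = qkv.view(sq, b, num_heads, 3 * head_size)  # split heads without any transpose
    query, key, value = qkv.chunk(3, dim=-1)         # each [sq, b, num_heads, head_size]
    return query, key, value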
You can refer to the earlier code from 程鹏 and 星宇:
- 16ca685#diff-0579a91f68617ee49042d24fa0fade3f3e1171363d3a14ae611309fa54601020R385
libai/libai/models/gpt_model.py
Line 311 in b1c7d32
hidden_states = hidden_states.transpose(0, 1) # [seq, bs, dim]
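Applied to the cross-attention case discussed above, the same pattern would transpose the encoder output once before the decoder stack rather than inside each layer (again a hedged sketch with placeholder names such as decoder_embedding and decoder_layers):

# hypothetical sketch for the T5 decoder
encoder_states = encoder_states.transpose(0, 1)                         # [sq_enc, b, h], done once
decoder_states = decoder_embedding(decoder_input_ids).transpose(0, 1)   # [sq_dec, b, h]

for layer in decoder_layers:
    # cross-attention inside each layer consumes encoder_states directly,
    # with no per-call transpose
    decoder_states = layer(decoder_states, encoder_states)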
No description provided.