Hi, thank you for your impressive work :)
As mentioned in your paper: "For MHA, heads with mask '0' will not be executed. For FFN, as matrix-matrix multiplication can be transformed into multiple matrix-vector multiplications, we only need to complete the part of the computation where the vector's mask is not zero."
However, it seems that in modeling_ebert.py you simply multiply the mask with the hidden states or attention probs, so the computation is not actually reduced, and the inference FLOPs are only computed theoretically. Is that correct?
Also, if you actually pruned the channels and heads, the feature dimension (e.g. 768) of the hidden states would shrink, causing a shape mismatch in all the linear layers (e.g. in the FFN 768 -> 3072 -> 768, the up-projection weight matrix is (3072, 768), so if the intermediate dim is < 3072 the multiplication is invalid). How did you deal with this mismatch?
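To illustrate what I mean, here is a minimal PyTorch sketch (hypothetical shapes and tensor names, not the actual modeling_ebert.py code) of masking that leaves the FLOPs unchanged:

```python
import torch

# FFN 768 -> 3072 -> 768, as in BERT-base (hypothetical standalone sketch)
hidden = torch.randn(1, 128, 768)            # (batch, seq_len, hidden_dim)
w_in   = torch.randn(3072, 768)              # up-projection weight
w_out  = torch.randn(768, 3072)              # down-projection weight
mask   = (torch.rand(3072) > 0.5).float()    # predicted channel mask, 0 = pruned

inter = torch.relu(hidden @ w_in.t())        # full (..., 3072) matmul still executes
inter = inter * mask                         # pruned channels are merely zeroed
out   = inter @ w_out.t()                    # full down-projection still executes
```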
Yes, our implementation does not reduce the actual inference FLOPs; that is why we do not report real speed-up in the paper.
For example, if we reduce 3072 to 3000, then we need a weight matrix of shape (3000, 768), selected from the original (3072, 768) matrix according to the predicted masks. This selection is not easy to implement efficiently, so we simply set the unused parts to zero, which gives the same result but cannot bring a real speed-up.
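Concretely, a minimal sketch (made-up shapes, not our actual code) showing that zeroing and physical selection produce the same output, while only the latter shrinks the matmuls:

```python
import torch

hidden = torch.randn(1, 128, 768)
w_in   = torch.randn(3072, 768)
w_out  = torch.randn(768, 3072)
mask   = (torch.rand(3072) > 0.5).float()    # predicted channel mask
keep   = mask.nonzero(as_tuple=True)[0]      # indices of the kept channels

# (a) zeroing, as described above: full-size matmuls, masked channels contribute 0
out_zero = (torch.relu(hidden @ w_in.t()) * mask) @ w_out.t()

# (b) physical selection: smaller weight matrices, genuinely fewer FLOPs
out_sel = torch.relu(hidden @ w_in[keep].t()) @ w_out[:, keep].t()

print(torch.allclose(out_zero, out_sel, atol=1e-5))  # True: identical result
```

Since the masks are predicted per input, `keep` changes from example to example, and gathering the weight rows dynamically is what makes an efficient implementation hard without custom kernels.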
If we want to translate the reduction in FLOPs into a real speed-up, we need to implement some functions in CUDA for GPUs, or use special accelerators.