Hi, thank you for your impressive work :)
As mentioned in your paper: "For MHA, heads with mask '0' will not be executed. For FFN, as matrix-matrix multiplication can be transformed into multiple matrix-vector multiplications, we only need to complete the part of the computation where the vector's mask is not zero."
However, it seems that in modeling_ebert.py you simply multiply the mask with the hidden states or attention probs, so the computation is not actually reduced, and the inference FLOPs are only computed theoretically. Is that correct?
Also, if you actually pruned the channels and heads, the feature dimension (e.g. 768) of the hidden states would shrink, causing a shape mismatch in all the linear layers (e.g. in the FFN 768 -> 3072 -> 768, the up-projection weight matrix is (3072, 768), so if the intermediate dim is < 3072 the multiplication is invalid). How did you deal with this mismatch?
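To illustrate what I mean, here is a minimal PyTorch sketch (hypothetical shapes and tensor names, not the actual modeling_ebert.py code) of masking that leaves the FLOPs unchanged:

```python
import torch

# FFN 768 -> 3072 -> 768, as in BERT-base (hypothetical standalone sketch)
hidden = torch.randn(1, 128, 768)            # (batch, seq_len, hidden_dim)
w_in   = torch.randn(3072, 768)              # up-projection weight
w_out  = torch.randn(768, 3072)              # down-projection weight
mask   = (torch.rand(3072) > 0.5).float()    # predicted channel mask, 0 = pruned

inter = torch.relu(hidden @ w_in.t())        # full (..., 3072) matmul still executes
inter = inter * mask                         # pruned channels are merely zeroed
out   = inter @ w_out.t()                    # full down-projection still executes
```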
Yes, our implementation does not reduce the actual inference FLOPs; that is why we do not report real speed-up in the paper.
For example, if we reduce 3072 to 3000, then we need a weight matrix of shape (3000, 768), selected from the original (3072, 768) matrix according to the predicted masks. This selection is not easy to implement efficiently, so we simply set the unused parts to zero, which gives the same result but cannot bring a real speed-up.
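Concretely, a minimal sketch (made-up shapes, not our actual code) showing that zeroing and physical selection produce the same output, while only the latter shrinks the matmuls:

```python
import torch

hidden = torch.randn(1, 128, 768)
w_in   = torch.randn(3072, 768)
w_out  = torch.randn(768, 3072)
mask   = (torch.rand(3072) > 0.5).float()    # predicted channel mask
keep   = mask.nonzero(as_tuple=True)[0]      # indices of the kept channels

# (a) zeroing, as described above: full-size matmuls, masked channels contribute 0
out_zero = (torch.relu(hidden @ w_in.t()) * mask) @ w_out.t()

# (b) physical selection: smaller weight matrices, genuinely fewer FLOPs
out_sel = torch.relu(hidden @ w_in[keep].t()) @ w_out[:, keep].t()

print(torch.allclose(out_zero, out_sel, atol=1e-5))  # True: identical result
```

Since the masks are predicted per input, `keep` changes from example to example, and gathering the weight rows dynamically is what makes an efficient implementation hard without custom kernels.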
If we want to translate the reduction in FLOPs into a real speed-up, we need to implement some functions in CUDA for GPUs, or use special accelerators.