For each token, select the experts with the highest routing probability, then sort all tokens by their assigned expert id. For example, Mixtral-8x7B selects the top-2 experts out of 8 for each token.
Number of experts.
Number of selected experts for each token.
Input feature.
Shape:
Routing scores of each token, as produced by the MoE gate.
Shape:
Input feature X after expansion and permutation. Each token is replicated num_experts_per_token times, and the replicated rows are permuted in order of their expert id.
Shape:
The top num_experts_per_token values selected from scores for each token, normalized with softmax.
Shape:
Indices of the inverse permutation: a mapping from each permuted token index to its original token index.
Shape:
Offset of the first token of each expert in the permuted order. expert_offset[i+1] is the prefix sum of the token counts of expert_0 through expert_i.
Shape:
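
The routing described above can be sketched in plain Python as a reference for the intended semantics. This is an illustrative sketch only, not the operator's actual implementation. The function name moe_route and the output names (permuted_x, weights, src_token) are hypothetical. It also assumes that softmax is applied over the selected top-k scores only, that the sort by expert id is stable, and that expert_offset has num_experts + 1 entries starting at 0; none of these details are stated in the text above.

```python
import math


def moe_route(x, scores, num_experts, num_experts_per_token):
    """Reference sketch of top-k expert selection plus token permutation.

    x:      list of token feature rows, one row per token
    scores: list of per-token routing scores, one value per expert
    """
    k = num_experts_per_token
    expert_ids = []  # expert id of each expanded row; row r = token * k + slot
    weights = []     # per token: k softmax-normalized routing weights

    for row in scores:
        # Experts with the k largest scores, highest first.
        top = sorted(range(num_experts), key=lambda e: -row[e])[:k]
        # Assumption: softmax over the selected scores only.
        peak = max(row[e] for e in top)
        exps = [math.exp(row[e] - peak) for e in top]
        total = sum(exps)
        weights.append([v / total for v in exps])
        expert_ids.extend(top)

    # Stable sort of the expanded rows by expert id (assumed tie-breaking).
    order = sorted(range(len(expert_ids)), key=lambda r: expert_ids[r])

    # Inverse permutation: permuted row index -> original token index.
    src_token = [r // k for r in order]
    permuted_x = [list(x[t]) for t in src_token]

    # expert_offset[i + 1] = number of rows routed to experts 0..i.
    expert_offset = [0] * (num_experts + 1)
    for e in expert_ids:
        expert_offset[e + 1] += 1
    for i in range(num_experts):
        expert_offset[i + 1] += expert_offset[i]

    return permuted_x, weights, src_token, expert_offset
```

With this layout, the rows of permuted_x that belong to expert i are the slice from expert_offset[i] up to, but not including, expert_offset[i+1], so each expert can process one contiguous block of tokens.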