- FlashAttention-2 support for training
- FlashAttention-2 support for left-padding generation with KV cache (see the sketch after this list)
- FMHA for GQA & MQA
- multi-model topology support via the mpu context
- more model types for experiments (PPL, RM, ...)
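
As an illustration of the left-padding generation path with a KV cache, here is a minimal sketch using the Hugging Face Transformers FlashAttention-2 backend. This project's own model/engine API may differ; the checkpoint name and settings below are placeholders, not part of this repo.

```python
# Illustrative only: FlashAttention-2 + left-padded batched generation with a KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"            # left padding so new tokens align on the right
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as pad if no pad token is defined

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,            # FA2 requires fp16/bf16
    attn_implementation="flash_attention_2",
).cuda()

prompts = ["Hello, my name is", "The capital of France is"]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

# use_cache=True keeps the KV cache across decode steps; the attention mask
# tells the kernel which left-padded positions to ignore.
out = model.generate(**batch, max_new_tokens=32, use_cache=True)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```
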
TODO:
- GQA & MQA generation (left-padding)
- fewer model control options
- generator based on non-batched flash attention and custom fused CUDA kernels
- fixed pipeline model
- KV cache management via pre-allocation and reuse (pre-calculated sizes; see the sketch below)
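
The pre-malloc/reuse idea in the last TODO item can be sketched as a statically allocated KV cache: buffers are sized once from the maximum batch and sequence length, written into in place each decode step, and reused across requests instead of concatenating new tensors. All names below (e.g. `StaticKVCache`) are illustrative assumptions, not this project's API.

```python
# Illustrative sketch of a pre-allocated, reusable KV cache (not this repo's implementation).
import torch

class StaticKVCache:
    """Pre-allocated per-layer KV buffers reused across decode steps and requests."""

    def __init__(self, num_layers, max_batch, max_seq_len, num_kv_heads, head_dim,
                 dtype=torch.bfloat16, device="cuda"):
        shape = (num_layers, max_batch, num_kv_heads, max_seq_len, head_dim)
        self.k = torch.empty(shape, dtype=dtype, device=device)
        self.v = torch.empty(shape, dtype=dtype, device=device)
        self.lens = [0] * num_layers  # filled positions per layer

    def reset(self):
        # Reuse the same memory for the next request; no reallocation.
        self.lens = [0] * len(self.lens)

    def append(self, layer, k_new, v_new):
        # k_new / v_new: [batch, num_kv_heads, new_tokens, head_dim]
        b, _, t, _ = k_new.shape
        start = self.lens[layer]
        self.k[layer, :b, :, start:start + t] = k_new
        self.v[layer, :b, :, start:start + t] = v_new
        self.lens[layer] = start + t
        # Return views over the filled region; no copies, no new allocations.
        return self.k[layer, :b, :, :start + t], self.v[layer, :b, :, :start + t]
```
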