Multi-head Latent Attention (MLA) is described in DeepSeek-V2. Instead of caching K and V directly, MLA reconstructs them from a low-dimensional latent vector C, which is cached in their place. KVQuant can clearly be applied to this attention architecture in principle, but I am wondering whether it works out of the box, specifically with regard to the Fisher information calibration.
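For concreteness, since MLA caches the latent C rather than K and V, the Fisher-based sensitivity statistics (which KVQuant approximates with squared gradients of the cached activations) would presumably need to be collected on C instead. Below is a minimal sketch of that idea, not KVQuant's actual API: it assumes a PyTorch/HF-style model where each layer exposes a hypothetical `latent_proj` module producing C with shape (batch, seq, latent_dim); `model` and `calib_batch` are placeholders.

```python
# Minimal sketch (assumptions labeled below): collect a diagonal-Fisher proxy
# (mean squared gradient) per channel of the cached MLA latent C during a
# calibration pass, analogous to how KVQuant derives sensitivity for K/V.
import torch

latents = {}  # layer name -> list of retained latent activations

def save_latent(name):
    def hook(module, inputs, output):
        # Retain the gradient on the non-leaf latent activation so that
        # grad**2 can be read after backward() as the Fisher approximation.
        output.retain_grad()
        latents.setdefault(name, []).append(output)
    return hook

# `latent_proj` is a hypothetical name for the down-projection producing C.
handles = [
    layer.latent_proj.register_forward_hook(save_latent(f"layer{i}"))
    for i, layer in enumerate(model.layers)
]

loss = model(calib_batch).loss  # assumes an HF-style model returning .loss
loss.backward()                 # populates .grad on the retained latents

# Per-channel Fisher proxy: average squared gradient over batch and sequence
# dims, assuming activations of shape (batch, seq, latent_dim).
fisher_per_channel = {
    name: torch.stack([a.grad.pow(2).mean(dim=(0, 1)) for a in acts]).mean(0)
    for name, acts in latents.items()
}

for h in handles:
    h.remove()
```

If this is roughly how the calibration would transfer, the open question is whether sensitivity measured on C translates cleanly to the reconstructed K/V, since the up-projection mixes latent channels.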