Multi-head Latent Attention (MLA) is described in DeepSeek-V2. Instead of caching K and V directly, MLA reconstructs them from a low-dimensional latent vector C, which is cached in their place. KVQuant can clearly be applied to this attention architecture in principle, but I am wondering whether it works out of the box, specifically with regard to the Fisher information calibration.
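For concreteness, since MLA caches the latent C rather than K and V, the Fisher-based sensitivity statistics (which KVQuant approximates with squared gradients of the cached activations) would presumably need to be collected on C instead. Below is a minimal sketch of that idea, not KVQuant's actual API: it assumes a PyTorch/HF-style model where each layer exposes a hypothetical `latent_proj` module producing C with shape (batch, seq, latent_dim); `model` and `calib_batch` are placeholders.

```python
# Minimal sketch (assumptions labeled below): collect a diagonal-Fisher proxy
# (mean squared gradient) per channel of the cached MLA latent C during a
# calibration pass, analogous to how KVQuant derives sensitivity for K/V.
import torch

latents = {}  # layer name -> list of retained latent activations

def save_latent(name):
    def hook(module, inputs, output):
        # Retain the gradient on the non-leaf latent activation so that
        # grad**2 can be read after backward() as the Fisher approximation.
        output.retain_grad()
        latents.setdefault(name, []).append(output)
    return hook

# `latent_proj` is a hypothetical name for the down-projection producing C.
handles = [
    layer.latent_proj.register_forward_hook(save_latent(f"layer{i}"))
    for i, layer in enumerate(model.layers)
]

loss = model(calib_batch).loss  # assumes an HF-style model returning .loss
loss.backward()                 # populates .grad on the retained latents

# Per-channel Fisher proxy: average squared gradient over batch and sequence
# dims, assuming activations of shape (batch, seq, latent_dim).
fisher_per_channel = {
    name: torch.stack([a.grad.pow(2).mean(dim=(0, 1)) for a in acts]).mean(0)
    for name, acts in latents.items()
}

for h in handles:
    h.remove()
```

If this is roughly how the calibration would transfer, the open question is whether sensitivity measured on C translates cleanly to the reconstructed K/V, since the up-projection mixes latent channels.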