Do the discontinuous positional encodings confuse the model? #6

ovowei · 2024-07-11T02:59:40Z

Hi,

I was reading your paper and have a question about the positional encodings. In my understanding, performing attention only on selected pages leads to selecting discontinuous pages, resulting in discontinuous positional encodings. LM-Infinite and StreamingLLM directly assign continuous positional encodings or assign the same positional encodings to all tokens beyond the local window size to handle this. Does Quest need similar processing?

Thanks!

Sakits · 2024-07-20T15:16:25Z

Hi @ovowei ,

Thank you for your interest in our work! Quest does not employ a similar process to LM-Infinite or StreamingLLM. Instead, Quest directly applies the original positional embeddings to the selected pages. Here are the reasons for this approach:

1. The pages selected by Quest have accumulated sufficient attention scores (>99%). This can be viewed as pruning the unselected KV caches, which works well in the evaluations.
2. We have experimented with applying continuous positional encodings to the selected pages. However, this approach performed worse than the original method. Unlike StreamingLLM, where only a few attention sinks are discontinuous with the recent token window, Quest deals with more substantial discontinuities. Thus, we found that using continuous positional encodings for these discontinuous pages in Quest did not yield better results.
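
To make the two options concrete, here is a minimal sketch of the position IDs each one would give the selected tokens. It is only an illustration, not code from the Quest repository: the function names, the page size of 4, and the example page indices are all assumptions.

```python
def original_positions(selected_pages, page_size):
    """Keep each token's absolute position from the full sequence (gaps remain)."""
    return [p * page_size + i
            for p in sorted(selected_pages)
            for i in range(page_size)]


def continuous_positions(selected_pages, page_size):
    """Renumber the selected tokens 0..n-1, closing the gaps between pages."""
    return list(range(len(selected_pages) * page_size))


# Hypothetical example: pages 0, 3 and 7 are selected, 4 tokens per page.
pages = [0, 3, 7]
print(original_positions(pages, 4))    # positions jump between pages
print(continuous_positions(pages, 4))  # positions are dense
```

With the original positions, the relative distance between a query and a key is the same as it would be under full attention; renumbering shrinks those distances, and the shrinkage grows with the number of skipped pages.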

ovowei · 2024-07-22T03:50:19Z

Hi @Sakits

Thanks for your answers. It makes sense to me.

I used a model pre-trained on shorter sequence datasets to process longer sequence tasks. I found that applying QUEST and assigning the same positional encodings to all tokens beyond a certain distance yields better results in this case. This suggests that QUEST might help models process extremely long sequences. I will conduct more experiments to verify this. If you have conducted similar experiments, I would appreciate it if you could share your results.
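
A rough sketch of what "the same positional encodings beyond a certain distance" could look like, assuming the idea is to clamp the query-to-key distance in the style of LM-Infinite. This is not the code used in the experiment above; `clamped_key_positions` and `max_distance` are hypothetical names.

```python
def clamped_key_positions(query_pos, key_positions, max_distance):
    """Give every key farther than max_distance from the query the same position.

    Keys within max_distance keep their true position. Keys beyond it are all
    placed at query_pos - max_distance, so the model never sees a relative
    distance larger than max_distance.
    """
    return [query_pos - min(query_pos - k, max_distance) for k in key_positions]


# Hypothetical example: query at position 31, distances clamped at 8.
print(clamped_key_positions(31, [0, 1, 12, 28, 30], 8))
```

Setting `max_distance` no larger than the context length seen in pre-training keeps every relative distance inside the range the model was trained on.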

Thanks!

Sakits · 2024-10-20T08:10:33Z

Hi @ovowei ,

Sorry for the delayed reply! I’ve been busy working on a paper submission recently. Thank you for sharing your insights and interesting discussions! :)

Yes, we also found that assigning the same positional encodings to tokens beyond a certain distance can, to some extent, extend the model’s effective context range. Some interesting works discuss similar ideas, such as InfLLM and LongHeads. However, with more and more models offering extended context windows (from 128k up to 10M tokens), modifying positional encodings in this way might not be as necessary as before.

Thank you again for your interest in our work!
