
Do the discontinuous positional encodings confuse the model? #6

Open

ovowei opened this issue Jul 11, 2024 · 3 comments

Comments

ovowei commented Jul 11, 2024

Hi,

I was reading your paper and have a question about the positional encodings. In my understanding, Quest performs attention only on the selected pages, and since those pages can be discontinuous in the sequence, the positional encodings of the attended tokens are discontinuous as well. LM-Infinite and StreamingLLM handle this by either reassigning continuous positional encodings or by assigning the same positional encoding to all tokens beyond the local window size. Does Quest need similar processing?
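For concreteness, here is a toy sketch of the discontinuity and of the two workarounds I am referring to (the page size, selected pages, and window size are made up for illustration):

```python
# Toy example: page size 4, and suppose pages 0, 3, and 5 were selected.
import torch

page_size = 4
selected_pages = [0, 3, 5]

# Original position ids of the tokens in the selected pages -- discontinuous.
original_pos = torch.cat(
    [torch.arange(p * page_size, (p + 1) * page_size) for p in selected_pages]
)
# tensor([ 0,  1,  2,  3, 12, 13, 14, 15, 20, 21, 22, 23])

# Workaround A: re-assign continuous positions to the gathered tokens.
continuous_pos = torch.arange(original_pos.numel())
# tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

# Workaround B: keep exact positions inside a local window, and give every token
# farther away than the window the same capped relative distance.
local_window = 8
query_pos = 23  # position of the current query token
capped_dist = (query_pos - original_pos).clamp(max=local_window)
# tensor([8, 8, 8, 8, 8, 8, 8, 8, 3, 2, 1, 0])
```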

Thanks!

Sakits (Collaborator) commented Jul 20, 2024

Hi @ovowei,

Thank you for your interest in our work! Quest does not employ a process similar to LM-Infinite or StreamingLLM. Instead, Quest directly applies the original positional embeddings to the selected pages (a toy sketch of this is included after the list below). Here are the reasons for this approach:

  1. The pages selected by Quest account for the vast majority of the attention scores (>99%), so discarding the unselected pages can be viewed as pruning their KV cache entries, and this works well in our evaluations.

  2. We have experimented with applying continuous positional encodings to the selected pages. However, this approach performed worse than the original method. Unlike StreamingLLM, where only a few attention sinks are discontinuous with the recent token window, Quest deals with more substantial discontinuities. Thus, we found that using continuous positional encodings for these discontinuous pages in Quest did not yield better results.
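For reference, here is a minimal, self-contained sketch of what "pruning the unselected pages while keeping the original positions" looks like. This is toy code for illustration only (a single head, random tensors, and a simplified version of the per-page criterion), not our actual kernels:

```python
import torch

torch.manual_seed(0)
head_dim, page_size, n_pages, top_k = 64, 16, 32, 4
seq_len = page_size * n_pages

q = torch.randn(head_dim)            # current query
k = torch.randn(seq_len, head_dim)   # cached keys
v = torch.randn(seq_len, head_dim)   # cached values

# Per-page metadata: element-wise min/max of the keys in each page.
pages_k = k.view(n_pages, page_size, head_dim)
min_k = pages_k.min(dim=1).values
max_k = pages_k.max(dim=1).values

# Upper bound on the query-key dot product for each page, then keep the top-k pages.
page_score = torch.maximum(q * min_k, q * max_k).sum(dim=-1)
selected = page_score.topk(top_k).indices.sort().values

# Gather the selected pages together with their ORIGINAL token positions.
token_idx = (selected[:, None] * page_size + torch.arange(page_size)).reshape(-1)
k_sel, v_sel = k[token_idx], v[token_idx]
pos_sel = token_idx  # original (possibly discontinuous) position ids, used as-is

# Sparse attention over the selected tokens only. RoPE (omitted here) would be
# applied with the query at position seq_len - 1 and the keys at pos_sel.
attn = torch.softmax(q @ k_sel.T / head_dim ** 0.5, dim=-1)
out = attn @ v_sel
print(selected.tolist(), out.shape)
```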

ovowei (Author) commented Jul 22, 2024

Hi @Sakits,

Thanks for your answers. It makes sense to me.

I used a model pre-trained on shorter sequences to process longer-sequence tasks. I found that applying Quest and assigning the same positional encoding to all tokens beyond a certain distance yields better results in this case. This suggests that Quest might help models process extremely long sequences. I will conduct more experiments to verify this. If you have conducted similar experiments, I would appreciate it if you could share your results.
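Roughly, the positional-encoding part of what I am doing looks like the sketch below. It is my own toy code with a minimal RoPE helper; `max_trained_dist` is just a placeholder for the pre-training context length, and the Quest page selection is omitted:

```python
import torch

def rope(x, pos, base=10000.0):
    # Standard rotary embedding applied to x of shape (n, d) at positions pos of shape (n,).
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
    ang = pos[:, None].float() * inv_freq[None, :]
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

head_dim, seq_len, max_trained_dist = 64, 8192, 4096
q_pos = seq_len - 1
k_pos = torch.arange(seq_len)

# Clamp the relative distance: every key farther away than max_trained_dist
# gets the same positional encoding, while nearby keys keep exact positions.
dist = (q_pos - k_pos).clamp(max=max_trained_dist)

q = torch.randn(1, head_dim)
k = torch.randn(seq_len, head_dim)

# Rotating keys by -dist and the query by 0 reproduces the clamped relative
# positions in the dot product (RoPE scores depend only on position differences).
q_rot = rope(q, torch.zeros(1, dtype=torch.long))
k_rot = rope(k, -dist)
scores = (q_rot @ k_rot.T) / head_dim ** 0.5  # (1, seq_len)
```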

Thanks!

Sakits (Collaborator) commented Oct 20, 2024

Hi @ovowei,

Sorry for the delayed reply! I’ve been busy working on a paper submission recently. Thank you for sharing your insights and interesting discussions! :)

Yes, we also found that assigning the same positional encoding to tokens beyond a certain distance can extend the model’s effective context range to some extent. There are some interesting works that discuss similar ideas, such as InfLLM and LongHeads. However, with more and more models offering extended context windows (up to 128k~10M tokens), modifying positional encodings in this way might not be as necessary as before.

Thank you again for your interest in our work!
