Hello, I was deeply impressed by your paper. I expected many models to adopt attention sinks, since they resolve the issue of the initial token receiving a disproportionate share of attention weight. However, even after some time has passed, they do not seem to be adopted as widely as I expected. May I ask what the authors think the reason might be?
I am also curious whether it is better to apply attention sinks during training or at inference, and whether any performance degradation has been observed since the paper was published. Intuitively, I do not expect a large overall speedup, but I wonder whether quality should not be slightly higher. Alternatively, it seems intuitive that giving more weight to the early part of a sequence could itself improve the model's overall understanding of it.
Ultimately, the main contribution seems to be addressing the disproportionate concentration of attention weight on the initial tokens, but I am curious why the technique is not universally used. I wonder whether sink attention, which distributes attention beyond just the initial tokens to the whole sequence, can maintain quality while improving speed, and how it can best be utilized.
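For context, the cache policy being discussed can be sketched as follows. This is a minimal illustration of a StreamingLLM-style eviction rule (retain a few initial "sink" tokens plus a sliding window of recent tokens); the function name and parameters are illustrative, not taken from the paper's codebase.

```python
# Sketch of an attention-sink KV-cache policy: always keep the first
# n_sink token positions (the attention sinks) plus the most recent
# `window` positions, evicting everything in between.

def evict_kv_cache(positions, n_sink=4, window=1020):
    """Return the token positions retained in the KV cache."""
    if len(positions) <= n_sink + window:
        return list(positions)          # cache not yet full: keep everything
    sinks = list(positions[:n_sink])    # initial tokens act as attention sinks
    recent = list(positions[-window:])  # rolling window of recent context
    return sinks + recent

# Example: a 2000-token stream with a 4 + 1020 budget keeps 1024 entries.
kept = evict_kv_cache(list(range(2000)), n_sink=4, window=1020)
assert len(kept) == 1024
assert kept[:4] == [0, 1, 2, 3]   # sink tokens survive eviction
assert kept[-1] == 1999           # newest token is retained
```

The point of the rule is that the sink tokens absorb the "excess" attention mass that would otherwise destabilize generation once early tokens are evicted from a plain sliding window.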
Therefore, I am curious how the authors' thinking has evolved since the paper.
Thank you! :)