Hi folks, really enjoyed reading https://arxiv.org/pdf/2401.07004.pdf

In the evaluation section, I see that you have done a very comprehensive evaluation of several models and extension techniques on LongBench, along various fine-tuning dimensions (short vs. long training data, training data size) as well as performance on tasks across different context sizes (from 4k and up).

You also mention that the PPL/loss during fine-tuning isn't as low as YaRN's. Could this reflect a performance loss on short-context tasks?
Thank you for your question! In our evaluation, we chose not to measure perplexity (PPL) during fine-tuning, as previous research has shown that it may not be a reliable indicator of success on downstream tasks.

To assess each method's ability to maintain performance within its original context window, we enforce a trim of the context window to 4096 tokens. We have not yet tested the models on shorter tasks; that will be part of our future work. Thanks again for your interest in our work.
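For reference, here is a minimal sketch of what such a 4096-token trim could look like, assuming a LongBench-style middle truncation that keeps the head and tail of the prompt. The tokenizer name, helper function, and exact truncation scheme here are illustrative assumptions, not our released evaluation code.

```python
# Minimal sketch: trim a prompt to a 4096-token window by dropping tokens
# from the middle (keep head and tail). The tokenizer checkpoint and helper
# name are placeholders for illustration only.
from transformers import AutoTokenizer

def trim_to_window(prompt: str, tokenizer, max_tokens: int = 4096) -> str:
    """Return `prompt` unchanged if it fits, otherwise drop its middle tokens."""
    ids = tokenizer.encode(prompt, add_special_tokens=False)
    if len(ids) <= max_tokens:
        return prompt
    half = max_tokens // 2
    kept = ids[:half] + ids[-(max_tokens - half):]
    return tokenizer.decode(kept, skip_special_tokens=True)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
long_prompt = "..."  # a long LongBench-style input would go here
short_prompt = trim_to_window(long_prompt, tokenizer, max_tokens=4096)
```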