
Evaluation for short-context tasks? #1

Closed

leegao opened this issue Jan 19, 2024 · 1 comment


leegao commented Jan 19, 2024

Hi folks, I really enjoyed reading https://arxiv.org/pdf/2401.07004.pdf

In the evaluation section, I see that you conducted a very comprehensive evaluation of several models and extension techniques on LongBench, along various fine-tuning dimensions (short vs. long training data, training data size), as well as performance on tasks across different context sizes (from 4k upward).

You did mention that the PPL/loss during fine-tuning isn't as low as YaRN's; could this be reflective of a performance loss on short-context tasks?

Arist12 (Collaborator) commented Jan 25, 2024

Thank you for your question! In our evaluation, we chose not to measure perplexity (PPL) during fine-tuning, as previous research has shown that it may not be a reliable indicator of success on downstream tasks.
To assess each method's ability to maintain performance within its original context window, we enforced a context-window trim at 4096 tokens. We have not yet tested the models on shorter tasks; that will be part of our future work. Thanks again for your interest in our work.
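
For readers unfamiliar with this kind of setup, a 4096-token context-window trim at evaluation time might look roughly like the sketch below. The tokenizer (`gpt2` is just a stand-in) and the head/tail truncation strategy are illustrative assumptions, not necessarily the authors' exact procedure.

```python
# A minimal sketch of enforcing a 4096-token context window at evaluation time.
# The tokenizer choice and head/tail truncation are assumptions for illustration,
# not the paper's confirmed setup.
from transformers import AutoTokenizer

MAX_CONTEXT = 4096

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

def trim_to_context(prompt: str, max_tokens: int = MAX_CONTEXT) -> str:
    """Trim a prompt so its tokenized length fits within max_tokens.

    Keeps the head and tail of the prompt, a common convention in
    long-context benchmarks; the original evaluation may trim differently.
    """
    ids = tokenizer.encode(prompt, add_special_tokens=False)
    if len(ids) <= max_tokens:
        return prompt
    half = max_tokens // 2
    kept = ids[:half] + ids[-(max_tokens - half):]
    return tokenizer.decode(kept)

# Example: a long benchmark prompt is trimmed before being fed to the model.
long_prompt = "some very long document ... question at the end"
short_prompt = trim_to_context(long_prompt)
```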

Arist12 closed this as completed Jan 25, 2024