Low output score - reasoning output written to final output #29

Open
devishree23 opened this issue Jun 24, 2024 · 1 comment

Comments

@devishree23

I am trying to reproduce your results from the paper. I am using the Llama3 70B GPTQ model on the WebQSP dataset with the Freebase KG. However, I am getting much lower results than those reported in the paper: an exact match score of just 0.189.

One reason for the difference could be the LLM used, but based on the error analysis we performed, it seems that the LLM's reasoning output is also being written to the final output. Is this by design, or is it a bug? Most of the reasoning output is just "yes" or "no" and does not contain the answer to the question. In the reasoning chains, however, we can see that the required answer is derived from the KG.
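
For reference, this is roughly how we score exact match on our end. It is only a minimal sketch with our own helper names (nothing from this repo), but it shows why a final output that contains only the reasoning verdict cannot match the gold answers:

```python
# Minimal sketch of our exact-match scoring (helper names are ours, not from this repo):
# a prediction counts as correct if any gold answer string appears in the
# normalized prediction. A final output that is just "yes"/"no" never matches.

def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing punctuation."""
    return text.strip().strip(".").lower()

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """Return True if any normalized gold answer appears in the normalized prediction."""
    pred = normalize(prediction)
    return any(normalize(ans) in pred for ans in gold_answers)

# The final output we observe is just the reasoning verdict, so it scores 0:
print(exact_match("yes", ["Jamaican English", "Jamaican Creole English Language"]))  # False
```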

Please let us know your thoughts. Any help would be appreciated.
Thank you

@GasolSun36
Collaborator

Can you send me the exact model and commands you ran? The llama2-70b-chat model we use here doesn't work that badly. Did you make any changes to the exemplars?
