I am trying to reproduce your results from the paper. I am using the Llama3 70B GPTQ model on the WebQSP dataset with the Freebase KG. However, I am getting much lower results than those reported in the paper: an exact match score of just 0.189.
One reason for the gap could be the different LLM, but based on the error analysis we performed, it also seems that the LLM's intermediate reasoning output is being written into the final output that gets scored. Is this by design, or is it a bug? Most of the reasoning output is just "yes" or "no" and does not contain the answer to the question, yet in the reasoning chains we can see the required answer being derived from the KG. A minimal sketch of the failure mode we suspect is below.
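To make this concrete, here is a small illustration of how the exact match score would collapse if the per-step reasoning verdicts were scored instead of the answer derived in the chain. The function and variable names here are our own assumptions for illustration, not taken from your code.

```python
# Illustrative only: names and structure are assumptions, not from this repo.
# Suppose the evaluation compares the strings written to the output file
# against the gold answers.

def exact_match(predictions: list[str], gold_answers: list[str]) -> float:
    """Fraction of gold answers that appear verbatim among the predictions."""
    preds = {p.strip().lower() for p in predictions}
    gold = {g.strip().lower() for g in gold_answers}
    if not gold:
        return 0.0
    return len(preds & gold) / len(gold)

# What we see in our output files: the per-step reasoning verdicts
# ("yes"/"no") end up in the scored field, while the entity actually
# derived in the reasoning chain does not.
scored_output = ["yes", "no", "yes"]          # verdicts written to the output
chain_answer = ["answer entity from the KG"]  # what the chain actually derived

gold = ["answer entity from the KG"]
print(exact_match(scored_output, gold))  # 0.0 -- verdicts scored instead
print(exact_match(chain_answer, gold))   # 1.0 -- the derived answer matches
```

If scoring works roughly like this, it would explain why the exact match score is so depressed even when the reasoning chains reach the correct answer.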
Please let us know your thoughts. Any help would be appreciated.
Thank you
Can you send me the exact model and the commands you ran? The llama2-70b-chat model we use here doesn't perform that badly. Did you make any changes to the exemplars?