Thanks for flagging this. I was able to reproduce the discrepancy you found. You are correct that the numbers in the paper are off by a little bit for the four models we tested.
There were two paradigms that we removed from the dataset because they did not pass human validation, but it looks like we failed to remove the model results for those two paradigms when calculating the means for this table. This means there are also slight discrepancies in the Filler-Gap and Island Effects columns, the two categories from which we removed a paradigm. This was an oversight on our part, and we appreciate you bringing it to our attention.
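For reference, the corrected overall means would come from something along these lines, dropping the two removed paradigms before averaging. The file name, the `UID` column, the paradigm identifiers, and the model column names below are placeholders, not the actual names in the repo:

```r
# Sketch only: recompute the overall means after excluding the two paradigms
# that failed human validation. All identifiers here are placeholders.
results <- read.csv("results.csv")

removed <- c("filler_gap_paradigm_x", "island_paradigm_y")  # placeholder paradigm UIDs
kept <- subset(results, type == "sentence" & !(UID %in% removed))

model_cols <- c("ngram", "lstm", "txl", "gpt2", "human")
colMeans(kept[, model_cols])  # multiply by 100 if scores are stored as proportions
```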
Hi all,
I'm trying to reproduce the overall accuracy numbers in Table 3, but my calculations don't match the paper.
Using the results CSV in this repo, I calculated the mean across the 67 rows with type == "sentence" and got the following values:
- 5-gram: 61.2
- LSTM: 69.8
- TXL: 69.6
- GPT-2: 81.5
- Human: 88.6 (this one matches the paper)
Here is the R code I used:
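Roughly the following, where `results.csv` and the model column names are placeholders for the actual file and columns in the repo:

```r
# Sketch of the calculation: average per-paradigm scores over the rows
# tagged type == "sentence". File and column names are placeholders.
results <- read.csv("results.csv")

sentence_rows <- subset(results, type == "sentence")
nrow(sentence_rows)  # expect 67

model_cols <- c("ngram", "lstm", "txl", "gpt2", "human")
colMeans(sentence_rows[, model_cols])  # multiply by 100 if scores are stored as proportions
```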
Is the score calculation supposed to be done this way? Thanks.