Slight discrepancy in the overall score calculation #2

Open
lucky-bai opened this issue Sep 11, 2020 · 1 comment

@lucky-bai
Hi all,

I'm trying to reproduce the calculations for the overall accuracy in Table 3, but my numbers don't match the paper.

[Screenshot of Table 3 from the paper]

Using the result CSV in this repo, I calculated the mean across the 67 rows with type == "sentence" and got the following values:

5-gram: 61.2
LSTM: 69.8
TXL: 69.6
GPT-2: 81.5
Human: 88.6 (this one matches the paper)

Here is the R code I used:

library(tidyverse)

df <- read_csv("blimp_full_results_summary.csv")

# Mean accuracy per model across the 67 sentence-type paradigms
df %>%
  filter(type == "sentence") %>%
  select(LSTM, GPT2, TXL, `N-Gram`, human) %>%
  colMeans()

Is the score calculation supposed to be done this way? Thanks.

@Alicia-Parrish
Collaborator

Thanks for flagging this. I was able to reproduce the discrepancy you found. You are correct that the numbers in the paper are slightly off for the four models we tested.

There were two paradigms we removed from the dataset because they did not pass human validation, but it looks like we failed to exclude the model results for those two paradigms when calculating the means for this table. This means there are also slight discrepancies in the Filler-Gap and Island Effects columns, the two categories from which a paradigm was removed. This was an oversight on our part, and we appreciate you bringing it to our attention.
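
For concreteness, here is a minimal R sketch of how the two figures would relate, assuming the paper's means were taken over all 69 paradigms while the released CSV keeps 67. The values s1 and s2 stand in for the model's scores on the two removed paradigms; they are made up here, since the actual values are not given in this thread:

csv_mean <- 81.5    # e.g. GPT-2's mean over the 67 released paradigms
s1 <- 75.0          # hypothetical score on removed paradigm 1
s2 <- 80.0          # hypothetical score on removed paradigm 2
# The paper's figure would correspond to the mean over all 69 paradigms:
paper_mean <- (67 * csv_mean + s1 + s2) / 69
paper_mean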
