Thanks for flagging this. I was able to reproduce the discrepancy you found. You are correct that the numbers in the paper are off by a little bit for the four models we tested.
There were two paradigms that we removed from the dataset because they did not pass human validation, but it looks like we failed to remove the model results for those two paradigms when calculating the means for this table. This means there are also slight discrepancies in the Filler-Gap and Island Effects columns, the two categories from which we removed a paradigm. This was an oversight on our part, and we appreciate you bringing it to our attention.
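For reference, the corrected overall means would come from something along these lines, dropping the two removed paradigms before averaging. The file name, the `UID` column, the paradigm identifiers, and the model column names below are placeholders, not the actual names in the repo:

```r
# Sketch only: recompute the overall means after excluding the two paradigms
# that failed human validation. All identifiers here are placeholders.
results <- read.csv("results.csv")

removed <- c("filler_gap_paradigm_x", "island_paradigm_y")  # placeholder paradigm UIDs
kept <- subset(results, type == "sentence" & !(UID %in% removed))

model_cols <- c("ngram", "lstm", "txl", "gpt2", "human")
colMeans(kept[, model_cols])  # multiply by 100 if scores are stored as proportions
```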
Hi all,
I'm trying to reproduce the overall accuracy numbers in Table 3, but my calculations don't match the paper.
Using the results CSV in this repo, I calculated the mean across the 67 rows with type == "sentence" and got the following values:
- 5-gram: 61.2
- LSTM: 69.8
- TXL: 69.6
- GPT-2: 81.5
- Human: 88.6 (this one matches the paper)
Here is the R code I used:
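Roughly the following, where `results.csv` and the model column names are placeholders for the actual file and columns in the repo:

```r
# Sketch of the calculation: average per-paradigm scores over the rows
# tagged type == "sentence". File and column names are placeholders.
results <- read.csv("results.csv")

sentence_rows <- subset(results, type == "sentence")
nrow(sentence_rows)  # expect 67

model_cols <- c("ngram", "lstm", "txl", "gpt2", "human")
colMeans(sentence_rows[, model_cols])  # multiply by 100 if scores are stored as proportions
```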
Is the score calculation supposed to be done this way? Thanks.