calculating reliability of vtc against all human annotations together #377
-
I'm working on the Solomon dataset, where there have been 2 annotation campaigns, with 4 and 3 human annotations respectively. I'd like to get the most informed estimate of precision and recall, or more simply the F-score, for each of the 4 speaker types of VTC. To do so, I adapted a script along the following lines:
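Roughly, the adapted script has this shape (a minimal sketch, not the exact code: the path and set names are placeholders, and the exact method names/signatures may differ a bit across ChildProject versions):

```python
from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager

# placeholder path to the dataset
project = ChildProject('path/to/solomon')

am = AnnotationManager(project)
am.read()  # load the annotation index

annotations = am.annotations  # one row per imported annotation file

# keep the automated set plus the human sets (placeholder names)
annotations = annotations[
    (annotations['set'] == 'vtc') | annotations['set'].str.startswith('human')
]

# restrict the comparison to the portion covered by every selected set
intersection = am.intersect(annotations)
segments = am.get_segments(intersection)

# keep only the four key speaker types
segments = segments[segments['speaker_type'].isin(['CHI', 'OCH', 'FEM', 'MAL'])]
```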
I get an error in line 48:
My best guess right now is that something goes wrong at that line, so I started a Python session to check step by step. `annotations` ended up having 1241 rows and 12 columns, and it looks perfectly reasonable:
The following line, which raised an error when running the script as a whole, caused no problem here, and `segments` has some content:
After subsetting to the key speakers, 4576557 rows remain. Outside of Python, I unlocked the stats file, which appeared to have been created previously, so that I could update it. So the problem actually arises when creating the intersection across annotations:
That's weird! There should be some overlap between VTC and the human annotations! And indeed there is:
```
392  vtc  2_CW2_CH2_AJ01_AJ09_190609.WAV  0  0  ...  2021-03-06 19:08:24  NaN  0.0.1  65900488.0
```
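That row comes from a plain pandas check on the annotation index, something like the following (the column names are what I believe the index uses):

```python
# look for VTC annotations covering this recording in the index
vtc_rows = annotations[
    (annotations['set'] == 'vtc')
    & (annotations['recording_filename'] == '2_CW2_CH2_AJ01_AJ09_190609.WAV')
]
print(vtc_rows)
```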
-
Not yet an answer, but progress: if, instead of `startswith` + options, you use `isin`, the intersection is not empty:
`stats` contains:
Not sure how to understand these numbers, though. `intersection` contains:
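To make the change concrete, the difference is roughly this (a sketch; the human set names here are placeholders, not the real ones):

```python
# original selection: prefix match on the set name (plus further options)
with_startswith = annotations[
    (annotations['set'] == 'vtc') | annotations['set'].str.startswith('human')
]

# replacement: an explicit list of set names
with_isin = annotations[
    annotations['set'].isin(['vtc', 'human_campaign1', 'human_campaign2'])
]

# with the second selection, am.intersect(with_isin) is no longer empty
```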
-
I also checked, and the dataset passes validation:
-
The `am.read()` step does spit out a lot of garbled output:
-
Are you trying to get a global F-score, or one F-score per annotator? If you collect all annotations from the VTC + all annotators at once and apply `am.intersect` on them, you will only get the portion that is covered by ALL of them (so every annotator + the VTC), which might be none. The original code computes the intersection for each annotator separately, derives the confusion matrix, and moves on to the next annotator. Hope that clears things up a bit?
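Roughly, the per-annotator version looks like this (a sketch with placeholder set names; `compute_confusion` is just a stand-in for however you derive the confusion matrix and F-scores):

```python
results = {}

# placeholder names for the human annotation sets
for human_set in ['human1', 'human2', 'human3']:
    pair = annotations[annotations['set'].isin(['vtc', human_set])]

    # portion covered by BOTH the vtc and this annotator (only these two sets)
    intersection = am.intersect(pair)
    segments = am.get_segments(intersection)
    segments = segments[segments['speaker_type'].isin(['CHI', 'OCH', 'FEM', 'MAL'])]

    # derive the confusion matrix / F-scores for this pair, then move on
    results[human_set] = compute_confusion(segments)  # stand-in, not a real ChildProject function
```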
-
Thanks, I get it now!
I want a global score, collapsing across all human annotators. I wonder how I can hack the system to do that…
-
What I ended up doing is to get the VTC-human agreement for every human, and the human-human agreement. Then, in the paper, I reported the weighted average F-score of VTC-human (i.e., giving more weight to coders who had coded more data, NOT weighting by how much their annotations overlap), as well as the weighted average F-score of human-human. This lets us answer the question of whether VTC is more or less accurate, relative to the human coders, than those same coders are relative to one another.
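For concreteness, the weighting amounts to this (a sketch with made-up numbers; the weight is how much data each coder annotated, not how much overlaps with anyone else):

```python
import pandas as pd

# hypothetical per-coder results: F-score of VTC against that coder,
# plus the amount of data that coder annotated (the weight)
per_coder = pd.DataFrame({
    'coder':           ['A', 'B', 'C'],
    'vtc_human_f':     [0.62, 0.58, 0.70],
    'hours_annotated': [10.0, 4.0, 6.0],
})

weighted_f = (
    (per_coder['vtc_human_f'] * per_coder['hours_annotated']).sum()
    / per_coder['hours_annotated'].sum()
)
print(weighted_f)  # weighted average VTC-human F-score; same recipe for human-human
```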