how I set up an annotation campaign #216
-
Indeed. And this is a significant drawback, since the energy sampler may be slow... (even though you can just sample everything and retain only the segments for the recordings you're targeting).
Good question. segments_path = os.path.join(destination, 'segments_{}.csv'.format(date))
-
We probably want a fully public sample file to train annotators with. I suggest we use the first recording of Anae in the Paris corpus:
-
Process adapted to the Paris minicorpus:
then inside samples/selrecs.csv I put:
A few variants of the sampling command give me a concatenation error:
Notice that the last one asks for a rather low threshold, no skipping, and no spacing.
-
TODO
My goals
In this project, I want to select sections to be annotated by humans in the lab, with the end goal of having more exemplars of "other child" (OCH) and "male adult" (MAL). My annotators speak French, so I've decided to draw samples from the Lyon corpus. I've looked at prior transcriptions and determined that the children with the most OCH and MAL are: GAL, DUN, GOE2, FRH1, CUM, COF.
Prep work
source ~/ChildProjectVenv/bin/activate
pip3 install git+https://github.com/LAAC-LSCP/ChildProject.git
Whenever I get back to this project after a while, I always do these checks:
source ~/ChildProjectVenv/bin/activate
pip3 install git+https://github.com/LAAC-LSCP/ChildProject.git
git pull
and then datalad update
git status
then check which is the latest branch on https://gin.g-node.org/EL1000/lyon, and if needed switch with something like git checkout eaf-corrections
datalad get annotations/eaf/mc/converted annotations/its/converted/
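For convenience, here is that refresh routine collected into one block (just the commands above, in order; the branch checkout is only needed when a newer branch exists on the GIN repository):
# refresh the environment and the dataset before resuming work
source ~/ChildProjectVenv/bin/activate
pip3 install git+https://github.com/LAAC-LSCP/ChildProject.git
git pull
datalad update
git status
git checkout eaf-corrections    # only if a newer branch exists on gin.g-node.org/EL1000/lyon
datalad get annotations/eaf/mc/converted annotations/its/converted/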
The sampler phase
I don't want our annotations to be too biased by our current algorithms' performance on OCH and MAL; but I also don't want to have a lot of silence. Among the sampling types mentioned in the sampler docs (currently: periodic, random-vocalizations, high-volubility, energy-detection), the most appropriate to avoid silence while not biasing selection by our voice type classifier is energy-detection.
Among the options for the energy-based sampler, I need to choose the window length, spacing, offset, and count, the energy threshold, and the frequency band to consider.
This leads me to the command (the first two parameters are the path to the dataset, and the path to the folder where segments will be stored):
child-project sampler . samples/och_mal/ --recordings samples/selrecs.csv energy-detection --windows-length 30000 --windows-spacing 300000 --windows-offset 1800000 --windows-count 40 --threshold .75 --low-freq 50 --high-freq 3000 --by recording_filename
For one sample child, I got:
TODO --profile converted
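To sanity-check what the sampler produced, a quick look at the output file helps (a sketch; the segments_{date}.csv naming comes from the reply above, and the exact columns may differ across ChildProject versions):
# peek at the first few sampled segments and count how many were drawn
head -n 5 samples/och_mal/segments_*.csv
wc -l samples/och_mal/segments_*.csv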
Creating a template
This annotation will have a first-pass check by humans, who listen to 30s and decide whether or not they'll annotate that segment. Then there is a second pass, for which we'll use a template we created for this purpose; it is a variant of ACLEW's, and it differs in that:
To that end, we need to create a specific ELAN template. This is done inside ELAN, following these instructions.
exelang-template.zip
Building my .eafs
The first parameter is the destination. (Notice that I don't need to provide the path to the project for this one.) The segments file is the one output by the previous step.
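For reference, a sketch of what that call can look like; the flag names are my reading of the eaf-builder options (double-check with child-project eaf-builder --help), and the destination, segments file name, and template name are placeholders:
# sketch only: destination first, then the segments CSV produced by the sampler step
child-project eaf-builder --destination samples/och_mal/eafs/ --segments samples/och_mal/segments_YYYYMMDD.csv --eaf-type random --template exelang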
TODO check with LG: --eaf-type random
output:
Using the seated scribe for selecting sections to annotate
TODO I'm adding/removing the .wav extension -- we should fix this upstream
Setting up files for annotators to access them
to be discussed
In this step, we would take the sections marked "yes" and set them up to be annotated. But perhaps it is simpler to create a template with everything, and then, during import, designate this as the section that has been coded?
todo incorporate this, in order to split the sound files into the different recordings contained in a single wav file:
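Until that is incorporated, here is a hypothetical sketch of doing the split by hand with sox, assuming the onset and duration (in seconds) of each recording within the long wav are known (file names and times below are placeholders):
# cut individual recordings out of a longer session wav (placeholder values)
sox long_session.wav recording_01.wav trim 0 3600
sox long_session.wav recording_02.wav trim 3600 3600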