GitHub repository accompanying the paper "Measuring language development from child-centered recordings".
Repository organization
- configs/ contains all the necessary options to train and test the models.
- src/ contains all the Python source code.
- analysis/ contains the R notebooks for analysing the results.
- plots/ contains the graphics reported in the paper.
To reproduce all the experiments, you will need to:
- Prepare the data to train and test the models
- Run the training
- Run the testing
- Reproduce the analysis (plots and statistical analysis of the results)
The experiments rely on the following datasets:
- LibriSpeech: https://www.openslr.org/12
- Thomas: https://gin.g-node.org/LAAC-LSCP/thomas
- Providence: https://gin.g-node.org/LAAC-LSCP/providence
Clone this GitHub repository and move into it:
git clone https://github.com/yaya-sy/EntropyBasedCLDMetrics.git
cd EntropyBasedCLDMetrics
Create the Python environment:
conda env create -f environment.yml
and activate it:
conda activate ent_cldm
The phonemizer requires the espeak backend, which can be installed with this command: apt-get install espeak-ng
Prepare the LibriSpeech train-clean-360 data for the n-gram language model:
python src/librispeech_for_ngram_lm.py -i [LIBRISPEECH_TRAIN-CLEAN-360_FOLDER] -o data/ngram_lm/
This will create two files in data/ngram_lm: the file with the *.orthographic extension contains the orthographic utterances, and the file with the *.phonemized extension contains the phonemized utterances.
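If you want to see what the phonemization looks like, you can call the phonemizer directly. This is a minimal sketch using the phonemizer package with the espeak backend; the exact options used by src/librispeech_for_ngram_lm.py may differ.

# Illustrative only: phonemize one utterance with the espeak backend.
# The preparation script may use different separators or options.
from phonemizer import phonemize
text = "the quick brown fox"
print(phonemize(text, language="en-us", backend="espeak", strip=True))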
We first need to train the n-gram language model, as it is used to prepare the data for the other experiments.
For training the n-gram language model, you will need to install KenLM:
conda install -c anaconda cmake
git clone https://github.com/kpu/kenlm.git
cd kenlm
python setup.py develop
mkdir -p build
cd build
cmake ..
make -j 4
Once KenLM is installed, return to the repository root and run the training:
mkdir checkpoints
kenlm/build/bin/lmplz --discount_fallback -o 5 < data/ngram_lm/librispeech.phonemized > checkpoints/librispeech_360.arpa
The trained model will be stored in the checkpoints folder.
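You can sanity-check the trained ARPA model with the KenLM Python bindings installed above. A minimal sketch; the phone sequence below is a placeholder and should be replaced by a line from your *.phonemized file.

import kenlm

# Load the 5-gram phone-level model trained above.
model = kenlm.Model("checkpoints/librispeech_360.arpa")
# Placeholder phone sequence; use a line from data/ngram_lm/librispeech.phonemized instead.
utterance = "h ə l oʊ"
print(model.score(utterance, bos=True, eos=True))  # log10 probability of the sequence
print(model.perplexity(utterance))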
You will need to install the Thomas corpus (https://gin.g-node.org/LAAC-LSCP/thomas). To install it using datalad, run the following commands:
datalad install -r [email protected]:/LAAC-LSCP/thomas.git
cd thomas
datalad get annotations/cha/*
datalad get recordings/raw/*
Once installed, you can run this command to extract the utterances, their cleaned versions, and the timemarks:
python src/create_thomas_corpus.py -c [PATH_TO_THOMAS_CORPUS] -o data/Thomas
Where [PATH_TO_THOMAS_CORPUS] is the path to the installed Thomas corpus.
In the created folder, orthographic contains the raw annotations without cleaning, cleaned contains the cleaned version of the annotations, and timemarks contains the onsets and offsets of each utterance in the audio. All of these files are aligned, meaning that the ith line of each file corresponds to the ith line of the other files.
The filename.txt files contain the raw filenames and the months.txt files contain the child's age in months.
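Since these files are line-aligned, you can iterate over them jointly. A minimal sketch; the file names inside data/Thomas/ are hypothetical placeholders, replace them with the actual outputs of the script.

from pathlib import Path

# Hypothetical file names: adapt to the actual outputs in data/Thomas/.
base = Path("data/Thomas")
orthographic = (base / "orthographic" / "example.txt").read_text().splitlines()
cleaned = (base / "cleaned" / "example.txt").read_text().splitlines()
timemarks = (base / "timemarks" / "example.txt").read_text().splitlines()

# The ith line of each file describes the same utterance.
for raw, clean, marks in zip(orthographic, cleaned, timemarks):
    print(raw, "->", clean, "@", marks)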
Create the inputs for the model:
python src/prepare_childes_corpus.py -i data/Thomas/
python src/prepare_input_files.py -c data/Thomas/ -a [AUDIO_FOLDER] -m checkpoints/librispeech_360.arpa
Where [AUDIO_FOLDER] is the path to the audio folder of the data installed from the GIN repository, i.e. recordings/raw/.
Create the inputs for the regression model:
python src/prepare_librispeech_corpus.py -i [LIBRISPEECH_TRAIN-CLEAN-100_FOLDER] -o data/Librispeech/model_inputs
python src/prepare_input_files.py -c data/Librispeech/ -a [LIBRISPEECH_TRAIN-CLEAN-100_FOLDER] -m checkpoints/librispeech_360.arpa
Where [LIBRISPEECH_TRAIN-CLEAN-100_FOLDER] is the path to the folder containing the LibriSpeech train-clean-100 subset.
As with the Thomas corpus, you will also need to install the Providence corpus (https://gin.g-node.org/LAAC-LSCP/providence).
Extract the utterances of the Providence corpus:
python src/create_providence_corpus_new.py -i [PREPARED_CSV] -c [PATH_TO_PROVIDENCE_CORPUS] -o data/Providence/
Where [PREPARED_CSV] is the CSV already prepared with the cleaned utterances, the timemarks, etc.
Create the inputs for the model:
python src/prepare_childes_corpus.py -i data/Providence/
python src/prepare_input_files.py -c data/Providence/ -a [AUDIO_FOLDER] -m checkpoints/librispeech_360.arpa
Where [AUDIO_FOLDER] is the path to the audio folder of the data installed from the GIN repository, i.e. recordings/raw/.
The n-gram language model was already trained during data preparation and is saved in checkpoints/librispeech_360.arpa, so we will not retrain it.
Run the regression model training on Thomas:
python src/train.py -c configs/thomas.yaml
The trained model will be stored in the checkpoints folder as Thomas_30h_Librispeech360_en.pt.
Run the regression model training on librispeech train-clean-100:
python src/train.py -c configs/librispeech.yaml
The trained model will be stored in the checkpoints folder as Librispeech_100h_Librispeech360_en.pt.
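If you want to check a trained checkpoint, you can inspect it with PyTorch. A minimal sketch, assuming the .pt files are standard PyTorch checkpoints; their exact contents depend on src/train.py.

import torch

# Hypothetical inspection: the stored object may be a state dict, a full model,
# or a custom dictionary, depending on how src/train.py saves it.
checkpoint = torch.load("checkpoints/Thomas_30h_Librispeech360_en.pt", map_location="cpu")
print(type(checkpoint))
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys()))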
Compute the entropies with the n-gram language model:
python src/compute_entropies_ngram_lm.py
This will create a CSV file named Librispeech_360h.csv in the results folder.
Compute the entropies with the model trained on Thomas:
python src/compute_entropies_whisper.py -c configs/test.yaml -m checkpoints/Thomas_30h_Librispeech360_en.pt
This will create a CSV file named Thomas_30h_Librispeech360_en.csv in the results folder.
Compute the entropies with the model trained on LibriSpeech train-clean-100:
python src/compute_entropies_whisper.py -c configs/test.yaml -m checkpoints/Librispeech_100h_Librispeech360_en.pt
This will create a CSV file named Librispeech_100h_Librispeech360_en.csv in the results folder.
Then prepare the result CSVs for the statistical analysis:
python src/prepare_for_analysis.py -i results/Librispeech_360h.csv
This will create a CSV file named Librispeech_360h_analysis.csv in the results folder.
python src/prepare_for_analysis_hubert.py -i results/HuBERT-nat_entropy_ngram-2-merge-False_mmap.csv -c [CHILDES_PATH_PROVIDENCE]
This will create a CSV file named HuBERT-nat_entropy_ngram-2-merge-False_mmap_analysis.csv in the results folder.
python src/prepare_for_analysis_hubert.py -i results/HuBERT-tts_entropy_ngram-2-merge-False_mmap.csv -c [CHILDES_PATH_PROVIDENCE]
This will create a CSV file named HuBERT-tts_entropy_ngram-2-merge-False_mmap_analysis.csv in the results folder.
python src/prepare_for_analysis.py -i results/Thomas_30h_Librispeech360_en.csv
This will create a CSV file named Thomas_30h_Librispeech360_en_analysis.csv in the results folder.
python src/prepare_for_analysis.py -i results/Librispeech_100h_Librispeech360_en.csv
This will create a CSV file named Librispeech_100h_Librispeech360_en_analysis.csv in the results folder.
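You can quickly inspect any of the generated CSVs before moving on to the analysis. A minimal sketch with pandas; no column names are assumed here, since they depend on the preparation scripts.

import pandas as pd

# Load one of the prepared CSVs listed above and look at its structure.
df = pd.read_csv("results/Librispeech_360h_analysis.csv")
print(df.shape)
print(df.columns.tolist())
print(df.head())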
You can reproduce the figures of the paper with this notebook: analysis/plots.Rmd
You can reproduce the mixed linear models of the paper with this notebook: analysis/models.Rmd
We provide the MLU, IPSyn and VOCD scores already computed on the Providence corpus; the CSV file is extra/chi.kideval.csv.
Before computing the correlations with the entropy metric, you will need to merge chi.kideval.csv with the CSVs produced in the previous experiments.
For the experiments 1A, 2A and 2B, you can prepare the CSVs for computing the correlations using this command:
python src/merge_metrics.py -i [CSV_FOR_ANALYSIS]
Where [CSV_FOR_ANALYSIS] is the path to the CSV results already prepared for analysis.
For the experiments 1B and 1C, you can prepare the CSVs for computing the correlations using this command:
python src/merge_metrics_hubert.py -i [CSV_FOR_ANALYSIS]
Once done, you can use the notebook analysis/correlations.Rmd to compute the correlations.
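If you prefer a quick check outside R, here is a minimal sketch of a rank correlation in Python; the file name and the column names (entropy, mlu) are hypothetical placeholders and must be replaced by the actual columns of the CSV produced by the merging step.

import pandas as pd
from scipy.stats import spearmanr

# Hypothetical file and column names: adapt to the output of src/merge_metrics.py.
df = pd.read_csv("results/merged_metrics.csv")
rho, pvalue = spearmanr(df["entropy"], df["mlu"])
print(f"Spearman rho = {rho:.3f} (p = {pvalue:.3g})")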