A forced aligner for the Tsimane language. This repository also contains other resources for Tsimane, such as a phonemizer and a phonetic dictionary, which can be used for purposes other than alignment.
Clone this GitHub repository:
git clone https://github.com/yaya-sy/TsimaneForcedAligner.git
and move into it:
cd TsimaneForcedAligner
If you want to download the Bible corpus, create the conda environment:
conda env create -f environment.yml
and activate it:
conda activate tsimane-scraper
We release the file data/timemarks.txt, which contains audio timemarks for each verse of the Bible corpus. It is a tab-separated file with the following columns:
filename verse_line_id onset offset
Lines with onset = offset = 0.0 are unaligned verses and can be ignored.
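As a rough illustration of how this file could be consumed, here is a minimal Python sketch that reads tab-separated timemark lines and drops the unaligned verses. The names `Timemark` and `read_timemarks` are hypothetical and not part of this repository. The sketch also assumes that onset and offset are plain decimal numbers, and it skips any line it cannot parse (for example a header row, if the file has one).

```python
from typing import Iterable, Iterator, NamedTuple


class Timemark(NamedTuple):
    """One aligned verse (hypothetical helper type, not part of the repo)."""
    filename: str
    verse_line_id: str
    onset: float
    offset: float


def read_timemarks(lines: Iterable[str]) -> Iterator[Timemark]:
    """Yield aligned verses from tab-separated timemark lines."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 4:
            continue  # blank or malformed line
        filename, verse_line_id, onset, offset = fields
        try:
            start, end = float(onset), float(offset)
        except ValueError:
            continue  # non-numeric times, e.g. a header row
        if start == 0.0 and end == 0.0:
            continue  # unaligned verse: onset = offset = 0.0
        yield Timemark(filename, verse_line_id, start, end)
```

To use it on the released file, open it with `open("data/timemarks.txt", encoding="utf-8")` and pass the file object to `read_timemarks`.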
You can download the Bible corpus with the script scripts/download_bible.py:
python scripts/download_bible.py --page live.bible.is/bible/CASNTM/MRK/1 --output-directory data
Note that the source code of the web page or the links may change, so this scraper may become obsolete.
To align a corpus you need:
- a speech corpus: a folder containing your audio files and their corresponding texts (each audio file and its text must have the same filename).
- an acoustic model: we release an acoustic model pretrained on the Bible corpus, which can be used to align a new corpus. It is located in models/all_non_merged_glottal.zip
- a phonetic dictionary: a vocabulary of the language that maps each word to its phonetic realization. A phonetic dictionary built from the Tsimane Bible corpus is available in data/vocabularies/bible_vocabulary.dict. You can also phonemize your own vocabulary with the script scripts/phonemizer.py
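Because audio files and texts must share the same filenames, it can help to check the corpus folder before aligning. The following is a minimal sketch of such a check; `find_unpaired` is a hypothetical helper, not part of this repository, and the extension sets are assumptions that you should adapt to your own corpus.

```python
from pathlib import Path

# Assumed extensions; adjust them to the formats used in your corpus.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac"}
TEXT_EXTENSIONS = {".txt", ".lab"}


def find_unpaired(corpus_dir):
    """Return (audio files without a text, texts without an audio file).

    Files are matched on their path relative to corpus_dir, minus the extension.
    """
    root = Path(corpus_dir)
    audio, text = set(), set()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        key = path.relative_to(root).with_suffix("").as_posix()
        suffix = path.suffix.lower()
        if suffix in AUDIO_EXTENSIONS:
            audio.add(key)
        elif suffix in TEXT_EXTENSIONS:
            text.add(key)
    return sorted(audio - text), sorted(text - audio)
```

Both returned lists are empty when every audio file has a matching text and vice versa.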
To align your speech corpus, you will need to install the Montreal Forced Aligner.
After installation, you can align your corpus:
mfa align <your-speech-corpus> <your-phonetic-dictionary> models/tsimane_acoustic_model.zip <output-folder> --clean --overwrite --temp_directory aligners/wnh_tsimane --num_jobs 1
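If you prefer to launch the alignment from Python, the command above can be assembled as an argument list and passed to `subprocess`. This is only a sketch: `build_align_command` and `run_alignment` are hypothetical helpers, and the only flags used are the ones shown in the command above, which may differ between versions of the Montreal Forced Aligner.

```python
import subprocess


def build_align_command(corpus, dictionary, acoustic_model, output,
                        temp_directory="aligners/wnh_tsimane", num_jobs=1):
    """Build the `mfa align` argument list with the flags shown above."""
    return [
        "mfa", "align",
        str(corpus), str(dictionary), str(acoustic_model), str(output),
        "--clean", "--overwrite",
        "--temp_directory", str(temp_directory),
        "--num_jobs", str(num_jobs),
    ]


def run_alignment(*args, **kwargs):
    """Run the alignment; raises CalledProcessError if mfa exits with an error."""
    subprocess.run(build_align_command(*args, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.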