SOFA (Singing-Oriented Forced Aligner) is a forced alignment tool designed specifically for singing voice.
On singing data, SOFA has the following advantages over MFA (Montreal Forced Aligner):
- Easier installation
- Better performance
- Faster inference speed
- Use git clone to download the code from this repository.
- Install conda.
- Create a conda environment, requiring Python version 3.8:
    conda create -n SOFA python=3.8 -y
    conda activate SOFA
- Go to the PyTorch official website to install torch.
- (Optional, to improve wav file reading speed) Go to the PyTorch official website to install torchaudio.
- Install the other Python libraries:
    pip install -r requirements.txt
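After installation, a quick sanity check can confirm that the environment is usable. The following is a minimal sketch and not part of SOFA: check_environment is a hypothetical helper that only reports whether the interpreter is Python 3.8 and whether torch and the optional torchaudio can be located.

```python
import importlib.util
import sys


def check_environment(required=("torch",), optional=("torchaudio",), python=(3, 8)):
    """Report whether the interpreter and packages match the install steps.

    Hypothetical helper for illustration; SOFA does not ship this function.
    """
    def missing(names):
        # find_spec returns None when a top-level module cannot be located.
        return [name for name in names if importlib.util.find_spec(name) is None]

    return {
        "python_ok": tuple(sys.version_info[:2]) == tuple(python),
        "missing_required": missing(required),
        "missing_optional": missing(optional),
    }


if __name__ == "__main__":
    print(check_environment())
```

An empty missing_required list means torch was found; a non-empty missing_optional list only means wav reading may be slower.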
- Download the model file. You can find the trained models in the pretrained model sharing category of the discussion section, with the file extension .ckpt.
- Place the dictionary file in the /dictionary folder. The default dictionary is opencpop-extension.txt.
- Prepare the data for forced alignment and place it in a folder (by default the /segments folder), in the following format:
    - segments
        - singer1
            - segment1.lab
            - segment1.wav
            - segment2.lab
            - segment2.wav
            - ...
        - singer2
            - segment1.lab
            - segment1.wav
            - ...
  Ensure that the .wav files and their corresponding .lab files are in the same folder.
  The .lab file is the transcription for the .wav file with the same name. The file extension for the transcription can be changed using the --in_format parameter.
  After the transcription is converted into a phoneme sequence by the g2p module, it is fed into the model for alignment.
  For example, when using the DictionaryG2P module and the opencpop-extension dictionary by default, if the content of the transcription is: gan shou ting zai wo fa duan de zhi jian, the g2p module will convert it based on the dictionary into the phoneme sequence: g an sh ou t ing z ai w o f a d uan d e zh ir j ian. For how to use other g2p modules, see g2p module usage instructions.
- Command-line inference
  Use python infer.py to perform inference.
  Parameters that need to be specified:
    - --ckpt: (must be specified) The path to the model weights;
    - --folder: The folder where the data to be aligned is stored (default is segments);
    - --in_format: The file extension of the transcription (default is lab);
    - --out_formats: The annotation formats of the inferred files; multiple formats can be specified, separated by commas (default is TextGrid,htk,trans);
    - --save_confidence: Output confidence scores;
    - --dictionary: The dictionary file (default is dictionary/opencpop-extension.txt).
  Example:
    python infer.py --ckpt checkpoint_path --folder segments_path --dictionary dictionary_path --out_formats output_format1,output_format2...
- Retrieve the Final Annotation
  The final annotation is saved in a folder named after the annotation format you have chosen. This folder is located in the same directory as the wav files used for inference.
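The dictionary-based conversion described in the data preparation step can be sketched as follows. This is an illustration only, not the actual DictionaryG2P code, which may behave differently. It assumes each dictionary line holds a word, a tab, then space-separated phonemes. The names parse_dictionary, dictionary_g2p and SAMPLE_ENTRIES are hypothetical, and the sample entries are a hand-written subset rather than the contents of opencpop-extension.txt.

```python
def parse_dictionary(lines):
    """Parse lines of the assumed form 'word<TAB>ph1 ph2 ...' into a dict."""
    dictionary = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, phonemes = line.split("\t")
        dictionary[word] = phonemes.split()
    return dictionary


def dictionary_g2p(text, dictionary):
    """Convert a space-separated transcription into a flat phoneme list."""
    phonemes = []
    for word in text.split():
        if word not in dictionary:
            raise KeyError(f"'{word}' is not in the dictionary")
        phonemes.extend(dictionary[word])
    return phonemes


# Illustrative entries only (assumption); the real dictionary is much larger.
SAMPLE_ENTRIES = [
    "gan\tg an", "shou\tsh ou", "ting\tt ing", "zai\tz ai", "wo\tw o",
    "fa\tf a", "duan\td uan", "de\td e", "zhi\tzh ir", "jian\tj ian",
]

if __name__ == "__main__":
    sample_dictionary = parse_dictionary(SAMPLE_ENTRIES)
    text = "gan shou ting zai wo fa duan de zhi jian"
    print(" ".join(dictionary_g2p(text, sample_dictionary)))
```

Raising an error on unknown words is a design choice of this sketch: a missing dictionary entry surfaces immediately instead of silently producing a shorter phoneme sequence.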
- Using a custom g2p instead of a dictionary
- Matching mode: activate it by specifying -m during inference. It finds the most probable contiguous segment within the given phoneme sequence, rather than requiring all the phonemes to be used.
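When inference is launched from another script, the documented options can be assembled programmatically. The sketch below is a hypothetical helper, build_infer_command, that uses only the flags named in this README. It assumes --save_confidence and -m are switches that take no value, which the README does not state explicitly.

```python
import shlex


def build_infer_command(ckpt, folder="segments", dictionary=None, in_format=None,
                        out_formats=None, save_confidence=False, matching=False):
    """Assemble an argument list for infer.py from the documented options."""
    cmd = ["python", "infer.py", "--ckpt", ckpt, "--folder", folder]
    if dictionary:
        cmd += ["--dictionary", dictionary]
    if in_format:
        cmd += ["--in_format", in_format]
    if out_formats:
        # Multiple output formats are passed as one comma-separated value.
        cmd += ["--out_formats", ",".join(out_formats)]
    if save_confidence:
        cmd.append("--save_confidence")  # assumed to be a value-less switch
    if matching:
        cmd.append("-m")  # matching mode; assumed to be a value-less switch
    return cmd


if __name__ == "__main__":
    command = build_infer_command(
        "checkpoint_path", out_formats=["TextGrid", "htk"], matching=True
    )
    print(shlex.join(command))
```

The returned list can be passed to subprocess.run(command, check=True) from the repository root.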
- Follow the steps above to set up the environment. Installing torchaudio is recommended for faster binarization.
- Place the training data in the data folder in the following format:
    - data
        - full_label
            - singer1
                - wavs
                    - audio1.wav
                    - audio2.wav
                    - ...
                - transcriptions.csv
            - singer2
                - wavs
                    - ...
                - transcriptions.csv
        - weak_label
            - singer3
                - wavs
                    - ...
                - transcriptions.csv
            - singer4
                - wavs
                    - ...
                - transcriptions.csv
        - no_label
            - audio1.wav
            - audio2.wav
            - ...
  Regarding the format of transcriptions.csv, see: qiuqiao#5
  Where:
  transcriptions.csv only needs to have the correct relative path to the wavs folder;
  the transcriptions.csv in weak_label does not need to have a ph_dur column.
- Modify binarize_config.yaml as needed, then execute python binarize.py.
- Download the pre-trained model you need from releases, modify train_config.yaml as needed, then execute python train.py -p path_to_your_pretrained_model.
- For training visualization: tensorboard --logdir=ckpt/
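Before binarizing, it can help to confirm that the data folder follows the layout shown above. The following is a minimal sketch under that layout. check_training_layout is a hypothetical helper, not part of SOFA, and it only checks that each singer folder under full_label and weak_label contains a wavs folder and a transcriptions.csv file. It does not validate the CSV contents.

```python
import tempfile
from pathlib import Path


def check_training_layout(data_dir):
    """Return a list of layout problems found under full_label and weak_label."""
    problems = []
    data_dir = Path(data_dir)
    for label_type in ("full_label", "weak_label"):
        label_dir = data_dir / label_type
        if not label_dir.is_dir():
            continue  # a missing label type is not treated as an error here
        for singer in sorted(p for p in label_dir.iterdir() if p.is_dir()):
            if not (singer / "wavs").is_dir():
                problems.append(f"{singer}: missing wavs folder")
            if not (singer / "transcriptions.csv").is_file():
                problems.append(f"{singer}: missing transcriptions.csv")
    return problems


if __name__ == "__main__":
    # Build a tiny throwaway layout to demonstrate the check.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "full_label" / "singer1" / "wavs").mkdir(parents=True)
        (root / "weak_label" / "singer3" / "wavs").mkdir(parents=True)
        for problem in check_training_layout(root):
            print(problem)
```

An empty list means every singer folder has both expected entries.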
To measure the performance of a model, it is useful to calculate some objective evaluation metrics between the predictions (force-aligned labels) and the targets (manual labels), especially in a k-fold cross-validation.
Some useful metrics are:
- Boundary Edit Distance: the total moving distance from the predicted boundaries to the target boundaries.
- Boundary Edit Ratio: the boundary edit distance divided by the total duration of target intervals.
- Boundary Error Rate: the proportion of misplaced boundaries to all target boundaries under a given tolerance of distance.
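The three metrics can be sketched directly from the definitions above. This is an illustration, not the code of evaluate.py. It assumes that predictions and targets are equal-length, sorted lists of boundary times in seconds, that the target list includes the first start and the final end (so the total duration is the last value minus the first), and that a boundary counts as misplaced when it is strictly more than the tolerance away from its target.

```python
def boundary_edit_distance(pred, target):
    """Total distance the predicted boundaries must move to match the targets."""
    if len(pred) != len(target):
        raise ValueError("pred and target must contain the same number of boundaries")
    return sum(abs(p - t) for p, t in zip(pred, target))


def boundary_edit_ratio(pred, target):
    """Boundary edit distance divided by the total duration of the target intervals."""
    duration = target[-1] - target[0]  # assumes target spans first start to last end
    return boundary_edit_distance(pred, target) / duration


def boundary_error_rate(pred, target, tolerance):
    """Proportion of boundaries further than `tolerance` from their target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must contain the same number of boundaries")
    errors = sum(1 for p, t in zip(pred, target) if abs(p - t) > tolerance)
    return errors / len(target)


if __name__ == "__main__":
    pred = [0.0, 0.52, 1.0, 1.5]
    target = [0.0, 0.5, 1.03, 1.5]
    print(boundary_edit_distance(pred, target))
    print(boundary_edit_ratio(pred, target))
    for tolerance in (0.01, 0.025, 0.05):
        print(tolerance, boundary_error_rate(pred, target, tolerance))
```

In this example two of the four boundaries are off by about 20 ms and 30 ms, so the error rate falls as the tolerance grows.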
To evaluate your model on a specific dataset, first run inference to get all predictions. Put your predictions and targets in different folders, with the same filenames and relative paths, and containing the same phone sequences except for spaces. The script currently supports only the TextGrid format.
Run the following command:
python evaluate.py <PRED_DIR> <TARGET_DIR> -r -s
where PRED_DIR is a directory containing all predictions and TARGET_DIR is a directory containing all targets.
Options:
- -r, --recursive: compare the files in subdirectories recursively
- -s, --strict: use strict mode (raise errors instead of skipping if the phones are not identical)
- --ignore: ignore some phone marks (default: AP,SP,<AP>,<SP>,,pau,cl)
The script will calculate:
- The boundary edit ratio
- The boundary error rate, under 10ms, 20ms and 50ms tolerance
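Since predictions and targets must share filenames and relative paths, it can be useful to check the pairing before running the evaluation. The sketch below is a hypothetical helper, pair_textgrids, and not part of evaluate.py. It assumes files use the .TextGrid extension and mirrors the idea of the recursive option by optionally descending into subdirectories.

```python
import tempfile
from pathlib import Path


def pair_textgrids(pred_dir, target_dir, recursive=False):
    """Match prediction files to target files that share the same relative path.

    Returns (pairs, missing): matched (pred, target) paths, and the relative
    paths of predictions that have no target counterpart.
    """
    pred_dir, target_dir = Path(pred_dir), Path(target_dir)
    pattern = "**/*.TextGrid" if recursive else "*.TextGrid"
    pairs, missing = [], []
    for pred_file in sorted(pred_dir.glob(pattern)):
        relative = pred_file.relative_to(pred_dir)
        target_file = target_dir / relative
        if target_file.is_file():
            pairs.append((pred_file, target_file))
        else:
            missing.append(relative)
    return pairs, missing


if __name__ == "__main__":
    # Build two tiny throwaway folders to demonstrate the pairing.
    with tempfile.TemporaryDirectory() as tmp:
        pred, target = Path(tmp, "pred"), Path(tmp, "target")
        (pred / "singer1").mkdir(parents=True)
        (target / "singer1").mkdir(parents=True)
        (pred / "singer1" / "segment1.TextGrid").write_text("")
        (target / "singer1" / "segment1.TextGrid").write_text("")
        pairs, missing = pair_textgrids(pred, target, recursive=True)
        print(len(pairs), "pair(s),", len(missing), "missing")
```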