This project is a benchmark for evaluating Cantonese grapheme-to-phoneme (G2P) systems, based on the Jyutping romanization system. The data comes from word.hk, plus 100 samples from Dufu-Analysis and 500 colloquial Chinese samples transcribed from CanCLID/zoengjyutgaai_saamgwokjinji.
The dataset consists of word-character pairs along with their ground truth phonemes for G2P model evaluation. Each line of a text file contains a word, an underscore (_), and the target character whose phoneme is to be predicted. This format focuses on single-character phoneme prediction within words, which makes it suitable for assessing how G2P models handle a specific character in context.
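The line format described above can be read with a few lines of Python. This is only a minimal sketch: the helper name parse_line and the sample line are illustrative assumptions, not part of the benchmark code, and the sketch does not cover how the ground truth phonemes are stored.

```python
def parse_line(line):
    """Split one benchmark line into (word, target_character).

    Assumes the layout 'word_character' described above; parse_line is a
    hypothetical helper, not a function from this repository.
    """
    word, separator, target = line.strip().partition("_")
    if not separator or len(target) != 1:
        raise ValueError(f"expected 'word_character', got {line!r}")
    if target not in word:
        raise ValueError(f"target {target!r} does not occur in word {word!r}")
    return word, target


# Illustrative line (not taken from the dataset).
print(parse_line("銀行_行"))  # ('銀行', '行')
```

If a target character occurs more than once in a word, this sketch does not say which occurrence is meant; the benchmark files themselves are the reference for that.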
The benchmark evaluates Cantonese G2P systems using two primary metrics, Accuracy and Phoneme Error Rate (PER).
Accuracy:
- Definition: The percentage of instances where the specified character within a word is correctly converted from graphemes to phonemes.
- Purpose: This metric measures how often the G2P model accurately predicts the phoneme for the target character in the context of the word.
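As a rough sketch of the accuracy definition above: the function below counts a prediction as correct when it equals any of the accepted pronunciations for that instance. Accepting any alternative label is an assumption here, and the function name and data layout are illustrative rather than taken from run.py.

```python
def accuracy(predictions, references):
    """Percentage of instances whose prediction matches an accepted label.

    predictions: list of Jyutping strings, one per instance.
    references:  list of lists of accepted Jyutping strings (assumed layout).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    correct = sum(pred in labels for pred, labels in zip(predictions, references))
    return 100.0 * correct / len(predictions)


# Illustrative values: the first prediction matches a label, the second does not.
print(accuracy(["hong4", "sam2"], [["hong4", "hang4"], ["sai2"]]))  # 50.0
```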
Phoneme Error Rate (PER):
- Definition: The proportion of phoneme components that are incorrectly predicted.
- Calculation Details:
- Syllable Decomposition: Each Jyutping syllable is broken down into four components: onset, nucleus, coda, and tone.
- Hamming Distance: PER is calculated by computing the Hamming distance between the predicted and ground truth quadruples (onset, nucleus, coda, tone).
- For example, if the ground truth is (s, a, i, 2) and the prediction is (s, a, m, 2), the Hamming distance is 1 (since only the coda differs).
- Multiple Labels Handling: If multiple valid pronunciations (alternative labels) exist for a character, the PER is computed using the label that minimizes the Hamming distance to the prediction.
- Purpose: PER provides a fine-grained evaluation by identifying specific phoneme components where errors occur, offering insights into the model's phonological performance.
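The PER calculation described above can be sketched as follows. The regular expression is a simplified Jyutping pattern written for this illustration, and normalizing by four components per instance is an assumption drawn from the definition ("proportion of phoneme components"); the benchmark's own implementation may differ in both respects.

```python
import re

# Simplified Jyutping pattern (an assumption for this sketch):
# optional onset, optional nucleus, optional coda, then a tone digit 1-6.
_SYLLABLE = re.compile(
    r"^(ng|gw|kw|[bpmfdtnlgkhwzcsj])?"
    r"(aa|oe|eo|yu|[aeiou])?"
    r"(ng|[ptkmniu])?"
    r"([1-6])$"
)
# Syllabic nasals such as m4 or ng5 are treated as a bare nucleus.
_SYLLABIC_NASAL = re.compile(r"^(m|ng)([1-6])$")


def decompose(syllable):
    """Return (onset, nucleus, coda, tone); absent parts are empty strings."""
    nasal = _SYLLABIC_NASAL.match(syllable)
    if nasal:
        return ("", nasal.group(1), "", int(nasal.group(2)))
    match = _SYLLABLE.match(syllable)
    if match is None:
        raise ValueError(f"not a recognised Jyutping syllable: {syllable!r}")
    onset, nucleus, coda, tone = match.groups()
    return (onset or "", nucleus or "", coda or "", int(tone))


def hamming(a, b):
    """Number of positions at which two quadruples differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(prediction, labels):
    """Distance to the closest accepted label (multiple labels handling)."""
    predicted = decompose(prediction)
    return min(hamming(predicted, decompose(label)) for label in labels)


def per(predictions, references):
    """Wrong components divided by total components (4 per instance)."""
    if not predictions:
        return 0.0
    errors = sum(min_distance(p, labels) for p, labels in zip(predictions, references))
    return errors / (4 * len(predictions))


print(decompose("sai2"))            # ('s', 'a', 'i', 2)
print(per(["sam2"], [["sai2"]]))    # 0.25
```

With the example from the list above, sai2 against sam2 differs only in the coda, so one of four components is wrong.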
Previously, the Levenshtein distance was considered for evaluating G2P performance but was found to be unsuitable for this benchmark due to:
- Dependency on Romanization System:
- The Levenshtein distance operates on the Jyutping romanization strings, which can bias the results based on spelling conventions rather than actual phonetic differences.
- Different romanization systems might represent the same sounds with different letters or letter combinations, affecting the distance calculation.
- Positional Pronunciation Variations:
- In Cantonese, certain letters represent sounds that change depending on their position within a syllable.
- Example: The letters p, t, and k are aspirated when they appear at the beginning (onset) of a syllable but are unreleased when they appear at the end (coda).
- Levenshtein distance does not account for these positional differences: it treats a letter as the same symbol wherever it occurs, so it can misjudge errors when the letters match but their pronunciations differ by position.
By using Accuracy and Phoneme Error Rate (PER) based on phonetic components, the benchmark provides a more accurate and meaningful evaluation of G2P systems that reflects true phonological performance rather than orthographic or romanization discrepancies.
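To make the contrast concrete, the sketch below compares a character-level Levenshtein distance on two Jyutping strings with the component-level Hamming distance on the same pair. The syllable pair and the hand-written quadruples are illustrative choices, not benchmark data.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def hamming(a, b):
    """Number of positions at which two quadruples differ."""
    return sum(x != y for x, y in zip(a, b))


# coeng4 vs cin4: two components differ (nucleus oe/i and coda ng/n),
# but the digraph spellings cost three character edits.
string_distance = levenshtein("coeng4", "cin4")
component_distance = hamming(("c", "oe", "ng", 4), ("c", "i", "n", 4))
print(string_distance, component_distance)  # 3 2
```

The extra edit comes purely from the two-letter spellings oe and ng, which is the romanization dependency described above.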
# pull submodules
git submodule update --init --recursive
# install dependencies
pip install -r requirements.txt
# install g2pW-Cantonese dependencies
pip install -r g2pW-Cantonese/requirements.txt
# run the benchmark
python run.py
To submit your G2P system, subclass the G2PModel class in models and implement the _predict method. Then add your model to the models list in run.py. Finally, execute run.py to generate the results.
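A submission might look roughly like the sketch below. The real G2PModel lives in models and its _predict signature is not documented here, so the stand-in base class, the (word, target) arguments, and the LookupG2P class are all hypothetical; check the actual base class before copying this shape.

```python
class G2PModel:
    """Hypothetical stand-in for the base class in models (signature assumed)."""

    def _predict(self, word, target):
        raise NotImplementedError

    def predict(self, word, target):
        return self._predict(word, target)


class LookupG2P(G2PModel):
    """Toy model: looks the target character up in a fixed table, ignoring context."""

    def __init__(self, table):
        self.table = dict(table)

    def _predict(self, word, target):
        # Return an empty string when the character is unknown.
        return self.table.get(target, "")


# Illustrative table entry; a real system would use the word as context.
model = LookupG2P({"行": "hong4"})
print(model.predict("銀行", "行"))  # hong4
```

In the repository, the instance would then be added to the models list in run.py instead of being called directly.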