Zipformer recipe for CommonVoice #1546

Merged
merged 44 commits into from
Apr 9, 2024
Changes from 41 commits
Commits
b2d1975
init commit
JinZr Mar 11, 2024
ddefabc
added scripts
JinZr Mar 11, 2024
4a1d4be
added scripts for char-based lang prep
JinZr Mar 12, 2024
d35cedc
text_norm updated
JinZr Mar 12, 2024
4cae6b6
text_norm updated
JinZr Mar 12, 2024
9820bf9
updated
JinZr Mar 12, 2024
a9df06c
Update prepare.sh
JinZr Mar 12, 2024
d45e4c6
Update prepare.sh
JinZr Mar 12, 2024
d887bf8
updated scripts for text
JinZr Mar 12, 2024
204a3b2
arg type fixed
JinZr Mar 12, 2024
750e2ac
Update prepare.sh
JinZr Mar 12, 2024
a39aa8a
scripts updated
JinZr Mar 13, 2024
09a358a
Update preprocess_commonvoice.py
JinZr Mar 13, 2024
b30a4d6
updated scripts for text norm
JinZr Mar 13, 2024
eaceb69
Update preprocess_commonvoice.py
JinZr Mar 13, 2024
7d34116
minor fixes
JinZr Mar 13, 2024
4413713
added char based training scripts
JinZr Mar 13, 2024
9bf88ac
Update train_char.py
JinZr Mar 13, 2024
5699202
Update train_char.py
JinZr Mar 13, 2024
303eb99
Update train_char.py
JinZr Mar 13, 2024
921d34a
Update train_char.py
JinZr Mar 13, 2024
c1eb2ad
Update train_char.py
JinZr Mar 13, 2024
58041c1
Update train_char.py
JinZr Mar 13, 2024
e979bf5
Update train_char.py
JinZr Mar 13, 2024
ed3d25b
added scripts for processing validated data
JinZr Mar 13, 2024
53fb384
scripts updated
JinZr Mar 14, 2024
e9f86df
Update asr_datamodule.py
JinZr Mar 14, 2024
7d01eb4
misc fix
JinZr Mar 15, 2024
d77b035
misc. fix
JinZr Mar 15, 2024
030365f
misc. update
JinZr Mar 15, 2024
06bca2f
misc. update
JinZr Mar 15, 2024
678ad2b
Update preprocess_commonvoice.py
JinZr Mar 15, 2024
bea63ca
Update asr_datamodule.py
JinZr Mar 15, 2024
3560e22
Merge branch 'master' into dev/cv-zipformer
JinZr Mar 15, 2024
e62e16e
updated with scripts for streaming decode
JinZr Mar 15, 2024
d9a0ab5
fixed formatting issue
JinZr Mar 15, 2024
6993183
Update preprocess_commonvoice.py
JinZr Mar 15, 2024
4237127
added results on `zh-HK`
JinZr Mar 20, 2024
c274003
Merge branch 'k2-fsa:master' into dev/cv-zipformer
JinZr Mar 20, 2024
1d92107
Merge branch 'k2-fsa:master' into dev/cv-zipformer
JinZr Mar 23, 2024
8d05389
Update egs/commonvoice/ASR/RESULTS.md
JinZr Apr 8, 2024
8347436
Update egs/commonvoice/ASR/pruned_transducer_stateless7/train.py
JinZr Apr 8, 2024
b9d34fb
Update egs/commonvoice/ASR/local/word_segment_yue.py
JinZr Apr 8, 2024
05e48ca
misc. update
JinZr Apr 8, 2024
83 changes: 74 additions & 9 deletions egs/commonvoice/ASR/RESULTS.md
@@ -1,4 +1,73 @@
## Results

### Commonvoice Cantonese (zh-HK) Char training results (Zipformer)

See #1546 for more details.

Number of model parameters: 72526519, i.e., 72.53 M

The best CER for CommonVoice 16.1 (cv-corpus-16.1-2023-12-06/zh-HK) is shown below:

| | Dev | Test | Note |
|----------------------|-------|------|--------------------|
| greedy_search | 1.17 | 1.22 | --epoch 24 --avg 5 |
| modified_beam_search | 0.98 | 1.11 | --epoch 24 --avg 5 |
| fast_beam_search | 1.08 | 1.27 | --epoch 24 --avg 5 |
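
For readers unfamiliar with the metric, CER is the character-level edit distance divided by the number of reference characters. The following is a minimal sketch assuming the standard Levenshtein definition; the helper names `edit_distance` and `cer` are hypothetical and this is not the recipe's own scoring code.

```python
from typing import List, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,  # deletion
                    cur[j - 1] + 1,  # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = cur
    return prev[-1]


def cer(pairs: List[Tuple[str, str]]) -> float:
    """Corpus-level CER in percent over (reference, hypothesis) pairs."""
    errors = sum(edit_distance(r, h) for r, h in pairs)
    total = sum(len(r) for r, _ in pairs)
    return 100.0 * errors / total
```

Under this definition, one dropped character in a three-character reference gives a CER of about 33.3%.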

When doing cross-corpus validation on MDCC (without a blank penalty),
the best CER is below:

| | Dev | Test | Note |
|----------------------|-------|------|--------------------|
| greedy_search | 42.40 | 42.03 | --epoch 24 --avg 5 |
| modified_beam_search | 39.73 | 39.19 | --epoch 24 --avg 5 |
| fast_beam_search | 42.14 | 41.98 | --epoch 24 --avg 5 |

When doing cross-corpus validation on MDCC (with the blank penalty set to 2.2),
the best CER is below:

| | Dev | Test | Note |
|----------------------|-------|------|----------------------------------------|
| greedy_search | 39.19 | 39.09 | --epoch 24 --avg 5 --blank-penalty 2.2 |
| modified_beam_search | 37.73 | 37.65 | --epoch 24 --avg 5 --blank-penalty 2.2 |
| fast_beam_search | 37.73 | 37.74 | --epoch 24 --avg 5 --blank-penalty 2.2 |
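
The blank penalty used above can be read as a constant subtracted from the blank token's score before the next token is chosen, which makes the decoder less inclined to emit blank. A minimal sketch of that idea for one greedy step, assuming blank is token 0 and using plain Python lists in place of tensors (`greedy_step` and `BLANK_ID` are hypothetical names, not the decoding script's API):

```python
from typing import List

BLANK_ID = 0  # assumption: the blank symbol has index 0


def greedy_step(logits: List[float], blank_penalty: float = 0.0) -> int:
    """Pick the best token after lowering the blank score by the penalty."""
    adjusted = list(logits)  # copy so the caller's scores stay untouched
    adjusted[BLANK_ID] -= blank_penalty
    return max(range(len(adjusted)), key=adjusted.__getitem__)
```

With scores `[3.0, 1.5, 0.2]`, the step picks blank without a penalty, while a penalty of 2.2 lowers the blank score to 0.8 and token 1 wins instead.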

To reproduce the above result, use the following commands for training:

```bash
export CUDA_VISIBLE_DEVICES="0,1"
./zipformer/train_char.py \
--world-size 2 \
--num-epochs 30 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp \
--cv-manifest-dir data/zh-HK/fbank \
--language zh-HK \
--use-validated-set 1 \
--context-size 1 \
--max-duration 1000
```

and the following commands for decoding:

```bash
for method in greedy_search modified_beam_search fast_beam_search; do
./zipformer/decode_char.py \
--epoch 24 \
--avg 5 \
--decoding-method $method \
--exp-dir zipformer/exp \
--cv-manifest-dir data/zh-HK/fbank \
--context-size 1 \
--language zh-HK
done
```
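
The `--epoch 24 --avg 5` pair is commonly read as "average the last five epoch checkpoints ending at epoch 24". The sketch below shows that simple-mean reading with lists of floats standing in for tensors. It is an assumption for illustration: the actual scripts work on PyTorch checkpoints and may use a running-average model instead, and `epochs_to_average` and `average_checkpoints` are hypothetical helpers.

```python
from typing import Dict, List

StateDict = Dict[str, List[float]]  # stand-in for a dict of tensors


def epochs_to_average(epoch: int, avg: int) -> List[int]:
    """Epochs covered by --epoch/--avg under the simple-mean reading."""
    return list(range(epoch - avg + 1, epoch + 1))


def average_checkpoints(states: List[StateDict]) -> StateDict:
    """Element-wise mean of every parameter across the checkpoints."""
    n = len(states)
    averaged: StateDict = {}
    for name in states[0]:
        columns = zip(*(state[name] for state in states))
        averaged[name] = [sum(column) / n for column in columns]
    return averaged
```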

Detailed experimental results and the pre-trained model are available at:
<https://huggingface.co/zrjin/icefall-asr-commonvoice-zh-HK-zipformer-2024-03-20>


### GigaSpeech BPE training results (Pruned Stateless Transducer 7)

#### [pruned_transducer_stateless7](./pruned_transducer_stateless7)
@@ -13,8 +82,8 @@ Results are:

| | Dev | Test |
|----------------------|-------|-------|
-| greedy search | 9.96 | 12.54 |
-| modified beam search | 9.86 | 12.48 |
+| greedy_search | 9.96 | 12.54 |
+| modified_beam_search | 9.86 | 12.48 |

To reproduce the above result, use the following commands for training:

@@ -55,10 +124,6 @@ and the following commands for decoding:
Pretrained model is available at
<https://huggingface.co/yfyeung/icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17>

-The tensorboard log for training is available at
-<https://tensorboard.dev/experiment/j4pJQty6RMOkMJtRySREKw/>
-
-
### Commonvoice (fr) BPE training results (Pruned Stateless Transducer 7_streaming)

#### [pruned_transducer_stateless7_streaming](./pruned_transducer_stateless7_streaming)
@@ -73,9 +138,9 @@ Results are:

| decoding method | Test |
|----------------------|-------|
-| greedy search | 9.95 |
-| modified beam search | 9.57 |
-| fast beam search | 9.67 |
+| greedy_search | 9.95 |
+| modified_beam_search | 9.57 |
+| fast_beam_search | 9.67 |

Note: The best result comes from a model trained on the full LibriSpeech and GigaSpeech and then fine-tuned on the full CommonVoice.

32 changes: 26 additions & 6 deletions egs/commonvoice/ASR/local/compute_fbank_commonvoice_splits.py
@@ -1,5 +1,6 @@
 #!/usr/bin/env python3
-# Copyright 2023 Xiaomi Corp. (Yifan Yang)
+# Copyright 2023-2024 Xiaomi Corp. (Yifan Yang,
+# Zengrui Jin,)
 #
 # See ../../../../LICENSE for clarification regarding multiple authors
 #
@@ -17,7 +18,6 @@

 import argparse
 import logging
-from datetime import datetime
 from pathlib import Path

 import torch
@@ -30,6 +30,8 @@
     set_caching_enabled,
 )

+from icefall.utils import str2bool
+
 # Torch's multithreaded behavior needs to be disabled or
 # it wastes a lot of CPU and slow things down.
 # Do this outside of main() in case it needs to take effect
@@ -41,6 +43,13 @@
 def get_args():
     parser = argparse.ArgumentParser()

+    parser.add_argument(
+        "--subset",
+        type=str,
+        default="train",

Review comment (Collaborator): Could you add `choices` here? Otherwise, it is not clear which values are valid.

+        help="""Dataset parts to compute fbank. """,
+    )
+
     parser.add_argument(
         "--language",
         type=str,
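
The review comment on `--subset` asks for `choices`, so that invalid values are rejected at parse time and the valid ones show up in `--help`. One way this could look is sketched below. The `SUBSETS` list is an assumption for illustration only; the partitions the recipe actually supports may differ.

```python
import argparse

# Assumption: illustrative partition names, not taken from the recipe.
SUBSETS = ["train", "dev", "test", "validated", "invalidated", "other"]


def get_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--subset",
        type=str,
        default="train",
        choices=SUBSETS,  # argparse exits with an error on any other value
        help="Dataset part to compute fbank for.",
    )
    return parser
```

With this in place, `--subset bogus` fails immediately with a usage message instead of surfacing later as a missing manifest file.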
@@ -66,28 +75,35 @@ def get_args():
         "--num-splits",
         type=int,
         required=True,
-        help="The number of splits of the train subset",
+        help="The number of splits of the subset",
     )

     parser.add_argument(
         "--start",
         type=int,
         default=0,
-        help="Process pieces starting from this number (inclusive).",
+        help="Process pieces starting from this number (included).",
     )

     parser.add_argument(
         "--stop",
         type=int,
         default=-1,
-        help="Stop processing pieces until this number (exclusive).",
+        help="Stop processing pieces until this number (excluded).",
     )

+    parser.add_argument(
+        "--perturb-speed",
+        type=str2bool,
+        default=False,
+        help="""Perturb speed with factor 0.9 and 1.1 on train subset.""",
+    )
+
     return parser.parse_args()


 def compute_fbank_commonvoice_splits(args):
-    subset = "train"
+    subset = args.subset
     num_splits = args.num_splits
     language = args.language
     output_dir = f"data/{language}/fbank/cv-{language}_{subset}_split_{num_splits}"
@@ -130,6 +146,10 @@ def compute_fbank_commonvoice_splits(args):
             keep_overlapping=False, min_duration=None
         )

+        if args.perturb_speed:
+            logging.info(f"Doing speed perturb")
+            cut_set = cut_set + cut_set.perturb_speed(0.9) + cut_set.perturb_speed(1.1)
+
         logging.info("Computing features")
         cut_set = cut_set.compute_and_store_features_batch(
             extractor=extractor,
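
The `--perturb-speed` branch above triples the data: the original cuts plus one copy at 0.9x speed and one at 1.1x. The sketch below models only the bookkeeping, with a hypothetical `Cut` dataclass in place of lhotse's cut objects. The `_sp<factor>` id suffix and the `duration / factor` rule are assumptions modeled on the usual speed-perturbation convention, not a reproduction of lhotse's implementation.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Cut:
    """Hypothetical stand-in for a cut: just an id and a duration."""

    id: str
    duration: float  # seconds


def perturb_speed(cuts: List[Cut], factor: float) -> List[Cut]:
    """Playing audio `factor` times faster divides its duration by `factor`."""
    return [
        replace(c, id=f"{c.id}_sp{factor}", duration=c.duration / factor)
        for c in cuts
    ]


def augment(cuts: List[Cut]) -> List[Cut]:
    """Original cuts followed by the 0.9x and 1.1x copies."""
    return cuts + perturb_speed(cuts, 0.9) + perturb_speed(cuts, 1.1)
```

Under these assumptions a 9.9 s cut yields copies of roughly 11.0 s and 9.0 s, so the total amount of audio passed to feature extraction is about three times the original.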
1 change: 1 addition & 0 deletions egs/commonvoice/ASR/local/prepare_char.py
1 change: 1 addition & 0 deletions egs/commonvoice/ASR/local/prepare_lang.py
1 change: 1 addition & 0 deletions egs/commonvoice/ASR/local/prepare_lang_fst.py
46 changes: 37 additions & 9 deletions egs/commonvoice/ASR/local/preprocess_commonvoice.py
@@ -21,7 +21,7 @@
 from pathlib import Path
 from typing import Optional

-from lhotse import CutSet, SupervisionSegment
+from lhotse import CutSet
 from lhotse.recipes.utils import read_manifests_if_cached


@@ -52,14 +52,20 @@ def normalize_text(utt: str, language: str) -> str:
         return re.sub(r"[^A-ZÀÂÆÇÉÈÊËÎÏÔŒÙÛÜ' ]", "", utt).upper()
     elif language == "pl":
         return re.sub(r"[^a-ząćęłńóśźżA-ZĄĆĘŁŃÓŚŹŻ' ]", "", utt).upper()
-    elif language == "yue":
-        return (
-            utt.replace(" ", "")
-            .replace(",", "")
-            .replace("。", " ")
-            .replace("?", "")
-            .replace("!", "")
-            .replace("?", "")
+    elif language in ["yue", "zh-HK"]:
+        # Mozilla Common Voice uses both "yue" and "zh-HK" for Cantonese
+        # Not sure why they decided to do this...
+        # Non-en/zh-yue tokens are manually removed here
+
+        # fmt: off
+        tokens_to_remove = [",", "。", "?", "!", "?", "!", "‘", "、", ",", "\.", ":", ";", "「", "」", "“", "”", "~", "—", "ㄧ", "《", "》", "…", "⋯", "·", "﹒", ".", ":", "︰", "﹖", "(", ")", "-", "~", ";", "", "⠀", "﹔", "/", "A", "B", "–", "‧"]
+
+        # fmt: on
+        utt = utt.upper().replace("\\", "")
+        return re.sub(
+            pattern="|".join([f"[{token}]" for token in tokens_to_remove]),
+            repl="",
+            string=utt,
         )
     else:
         raise NotImplementedError(
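
The new Cantonese branch builds its pattern by wrapping each token in `[...]`. That works for ordinary punctuation, but an empty token would become `[]`, which Python's `re` does not parse as an empty class (a `]` directly after `[` is taken literally). A possible alternative, sketched here with hypothetical helper names and not the PR's code, is to escape each token and skip empty ones:

```python
import re
from typing import Iterable


def build_removal_pattern(tokens: Iterable[str]) -> re.Pattern:
    """Alternation of literally-escaped tokens; empty tokens are skipped."""
    escaped = [re.escape(token) for token in tokens if token]
    return re.compile("|".join(escaped))


def normalize(utt: str, tokens: Iterable[str]) -> str:
    """Upper-case the utterance and delete every listed token."""
    return build_removal_pattern(tokens).sub("", utt.upper())
```

With `re.escape`, tokens such as `.` or `-` need no manual backslashes, and the pattern matches exactly the listed strings.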
@@ -130,6 +136,28 @@ def preprocess_commonvoice(
             supervisions=m["supervisions"],
         ).resample(16000)

+        if partition == "validated":
+            logging.warning(
+                """
+                The 'validated' partition contains the data of both 'train', 'dev'
+                and 'test' partitions. We filter out the 'dev' and 'test' partition
+                here.
+                """
+            )
+            dev_ids = src_dir / f"cv-{language}_dev_ids"
+            test_ids = src_dir / f"cv-{language}_test_ids"
+            assert (
+                dev_ids.is_file()
+            ), f"{dev_ids} does not exist, please check stage 1 of the prepare.sh"
+            assert (
+                test_ids.is_file()
+            ), f"{test_ids} does not exist, please check stage 1 of the prepare.sh"
+            dev_ids = dev_ids.read_text().strip().split("\n")
+            test_ids = test_ids.read_text().strip().split("\n")
+            cut_set = cut_set.filter(
+                lambda x: x.supervisions[0].id not in dev_ids + test_ids
+            )
+
         # Run data augmentation that needs to be done in the
         # time domain.
         logging.info(f"Saving to {raw_cuts_path}")
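
The filter above concatenates `dev_ids + test_ids` and scans the resulting list for every cut. A set gives the same result with constant-time lookups, which may matter on a large `validated` partition. A minimal sketch with hypothetical helper names (`parse_ids`, `drop_held_out`), using plain strings in place of cuts:

```python
from typing import Iterable, List, Set


def parse_ids(text: str) -> Set[str]:
    """One id per line; blank lines and surrounding whitespace are ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def drop_held_out(ids: Iterable[str], held_out: Set[str]) -> List[str]:
    """Keep only ids that are not in the dev/test set, preserving order."""
    return [i for i in ids if i not in held_out]
```

In the recipe, the same idea would amount to building `held_out = set(dev_ids) | set(test_ids)` once and testing membership against it inside the lambda.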