Zipformer recipe for Cantonese dataset MDCC (#1537)

* init commit
* Create README.md
* handle code switching cases
* misc. fixes
* added manifest statistics
* init commit for the zipformer recipe
* added scripts for exporting model
* added RESULTS.md
* added scripts for streaming related stuff
* doc str fixed
k2-fsa · Mar 13, 2024 · c3f6f28
1 parent 81f518e
Showing 43 changed files with 4,655 additions and 4 deletions.
diff --git a/egs/aishell/ASR/README.md b/egs/aishell/ASR/README.md
@@ -19,7 +19,9 @@ The following table lists the differences among them.
 | `transducer_stateless_modified`    | Conformer | Embedding + Conv1d | with modified transducer from `optimized_transducer`                     |
 | `transducer_stateless_modified-2`  | Conformer | Embedding + Conv1d | with modified transducer from `optimized_transducer` + extra data      |
 | `pruned_transducer_stateless3`     | Conformer (reworked) | Embedding + Conv1d | pruned RNN-T + reworked model with random combiner + using aidatatang_20zh as extra data|
-| `pruned_transducer_stateless7`     | Zipformer | Embedding | pruned RNN-T + zipformer encoder + stateless decoder with context-size 1 |
+| `pruned_transducer_stateless7`     | Zipformer | Embedding | pruned RNN-T + zipformer encoder + stateless decoder with context-size set to 1 |
+| `zipformer`                           | Upgraded Zipformer | Embedding + Conv1d | The latest recipe with context-size set to 1 |
+
 
 The decoder in `transducer_stateless` is modified from the paper
 [Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/).

diff --git a/egs/aishell/ASR/prepare.sh b/egs/aishell/ASR/prepare.sh
@@ -360,7 +360,7 @@ if [ $stage -le 11 ] && [ $stop_stage -ge 11 ]; then
 fi
 
 if [ $stage -le 12 ] && [ $stop_stage -ge 12 ]; then
-  log "Stage 11: Train RNN LM model"
+  log "Stage 12: Train RNN LM model"
   python ../../../icefall/rnn_lm/train.py \
     --start-epoch 0 \
     --world-size 1 \

diff --git a/egs/mdcc/ASR/README.md b/egs/mdcc/ASR/README.md
@@ -0,0 +1,19 @@
+# Introduction
+
+The Multi-Domain Cantonese Corpus (MDCC) consists of 73.6 hours of clean read speech paired with
+transcripts, collected from Cantonese audiobooks from Hong Kong. It comprises the philosophy,
+politics, education, culture, lifestyle and family domains, covering a wide range of topics.
+
+The manuscript can be found at: https://arxiv.org/abs/2201.02419
+
+# Transducers
+
+
+
+|                                       | Encoder             | Decoder            | Comment                     |
+|---------------------------------------|---------------------|--------------------|-----------------------------|
+| `zipformer`                           | Upgraded Zipformer | Embedding + Conv1d | The latest recipe with context-size set to 1 |
+
+The decoder is modified from the paper
+[Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/).
+We place an additional Conv1d layer right after the input embedding layer.
diff --git a/egs/mdcc/ASR/RESULTS.md b/egs/mdcc/ASR/RESULTS.md
@@ -0,0 +1,41 @@
+## Results
+
+#### Zipformer
+
+See <https://github.com/k2-fsa/icefall/pull/1537>
+
+[./zipformer](./zipformer)
+
+##### normal-scaled model, number of model parameters: 74470867, i.e., 74.47 M
+
+| decoding method        | test | valid | comment                                 |
+|------------------------|------|-------|-----------------------------------------|
+| greedy search          | 7.45 | 7.51  | --epoch 45 --avg 35                     |
+| modified beam search   | 6.68 | 6.73  | --epoch 45 --avg 35                     |
+| fast beam search       | 7.22 | 7.28  | --epoch 45 --avg 35                     |
+
+The training command:
+
+```
+export CUDA_VISIBLE_DEVICES="0,1,2,3"
+
+./zipformer/train.py \
+  --world-size 4 \
+  --start-epoch 1 \
+  --num-epochs 50 \
+  --use-fp16 1 \
+  --exp-dir ./zipformer/exp \
+  --max-duration 1000 
+```
+
+The decoding command:
+
+```
+ ./zipformer/decode.py \
+   --epoch 45 \
+   --avg 35 \
+   --exp-dir ./zipformer/exp \
+   --decoding-method greedy_search  # or modified_beam_search
+```
+
+The pretrained model is available at: https://huggingface.co/zrjin/icefall-asr-mdcc-zipformer-2024-03-11/
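To put the three decoding methods in the RESULTS.md table in proportion, the relative error-rate reduction over greedy search can be computed directly from the reported numbers. The snippet below is a minimal sketch; the helper name `relative_reduction` is hypothetical and not part of the recipe, and the table itself does not name the metric, so the values are treated simply as error rates in percent.

```python
# Error rates copied from the RESULTS.md table: (test, valid).
results = {
    "greedy search": (7.45, 7.51),
    "modified beam search": (6.68, 6.73),
    "fast beam search": (7.22, 7.28),
}


def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative error-rate reduction over `baseline`, in percent.

    Hypothetical helper for illustration; not part of the icefall recipe.
    """
    return (baseline - improved) / baseline * 100.0


baseline_test, baseline_valid = results["greedy search"]
for method, (test, valid) in results.items():
    print(
        f"{method:22s} "
        f"test: {relative_reduction(baseline_test, test):5.2f}%  "
        f"valid: {relative_reduction(baseline_valid, valid):5.2f}%"
    )
```

By this measure, modified beam search lowers the test error rate by roughly 10% relative to greedy search, while fast beam search gives a smaller gain.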
diff --git a/egs/mdcc/ASR/local/compile_hlg.py b/egs/mdcc/ASR/local/compile_hlg.py
@@ -0,0 +1 @@
+../../../librispeech/ASR/local/compile_hlg.py
diff --git a/egs/mdcc/ASR/local/compile_hlg_using_openfst.py b/egs/mdcc/ASR/local/compile_hlg_using_openfst.py
@@ -0,0 +1 @@
+../../../librispeech/ASR/local/compile_hlg_using_openfst.py
diff --git a/egs/mdcc/ASR/local/compile_lg.py b/egs/mdcc/ASR/local/compile_lg.py
@@ -0,0 +1 @@
+../../../librispeech/ASR/local/compile_lg.py
diff --git a/egs/mdcc/ASR/local/compute_fbank_mdcc.py b/egs/mdcc/ASR/local/compute_fbank_mdcc.py
@@ -0,0 +1,157 @@
+#!/usr/bin/env python3
+# Copyright    2021-2024  Xiaomi Corp.   (authors: Fangjun Kuang,
+#                                                  Zengrui Jin,)
+#
+# See ../../../../LICENSE for clarification regarding multiple authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+"""
+This file computes fbank features of the MDCC dataset.
+It looks for manifests in the directory data/manifests.
+
+The generated fbank features are saved in data/fbank.
+"""
+
+import argparse
+import logging
+import os
+from pathlib import Path
+
+import torch
+from lhotse import (
+    CutSet,
+    Fbank,
+    FbankConfig,
+    LilcomChunkyWriter,
+    WhisperFbank,
+    WhisperFbankConfig,
+)
+from lhotse.recipes.utils import read_manifests_if_cached
+
+from icefall.utils import get_executor, str2bool
+
+# Torch's multithreaded behavior needs to be disabled or
+# it wastes a lot of CPU and slows things down.
+# Do this outside of main() in case it needs to take effect
+# even when we are not invoking the main (e.g. when spawning subprocesses).
+torch.set_num_threads(1)
+torch.set_num_interop_threads(1)
+
+
+def compute_fbank_mdcc(
+    num_mel_bins: int = 80,
+    perturb_speed: bool = False,
+    whisper_fbank: bool = False,
+    output_dir: str = "data/fbank",
+):
+    src_dir = Path("data/manifests")
+    output_dir = Path(output_dir)
+    num_jobs = min(15, os.cpu_count() or 1)  # os.cpu_count() may return None
+
+    dataset_parts = (
+        "train",
+        "valid",
+        "test",
+    )
+    prefix = "mdcc"
+    suffix = "jsonl.gz"
+    manifests = read_manifests_if_cached(
+        dataset_parts=dataset_parts,
+        output_dir=src_dir,
+        prefix=prefix,
+        suffix=suffix,
+    )
+    assert manifests is not None
+
+    assert len(manifests) == len(dataset_parts), (
+        len(manifests),
+        len(dataset_parts),
+        list(manifests.keys()),
+        dataset_parts,
+    )
+    if whisper_fbank:
+        extractor = WhisperFbank(
+            WhisperFbankConfig(num_filters=num_mel_bins, device="cuda")
+        )
+    else:
+        extractor = Fbank(FbankConfig(num_mel_bins=num_mel_bins))
+
+    with get_executor() as ex:  # Initialize the executor only once.
+        for partition, m in manifests.items():
+            if (output_dir / f"{prefix}_cuts_{partition}.{suffix}").is_file():
+                logging.info(f"{partition} already exists - skipping.")
+                continue
+            logging.info(f"Processing {partition}")
+            cut_set = CutSet.from_manifests(
+                recordings=m["recordings"],
+                supervisions=m["supervisions"],
+            )
+            if "train" in partition and perturb_speed:
+                logging.info("Doing speed perturb")
+                cut_set = (
+                    cut_set + cut_set.perturb_speed(0.9) + cut_set.perturb_speed(1.1)
+                )
+            cut_set = cut_set.compute_and_store_features(
+                extractor=extractor,
+                storage_path=f"{output_dir}/{prefix}_feats_{partition}",
+                # when an executor is specified, make more partitions
+                num_jobs=num_jobs if ex is None else 80,
+                executor=ex,
+                storage_type=LilcomChunkyWriter,
+            )
+            cut_set.to_file(output_dir / f"{prefix}_cuts_{partition}.{suffix}")
+
+
+def get_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--num-mel-bins",
+        type=int,
+        default=80,
+        help="""The number of mel bins for Fbank""",
+    )
+    parser.add_argument(
+        "--perturb-speed",
+        type=str2bool,
+        default=False,
+        help="Enable 0.9 and 1.1 speed perturbation for data augmentation. Default: False.",
+    )
+    parser.add_argument(
+        "--whisper-fbank",
+        type=str2bool,
+        default=False,
+        help="Use WhisperFbank instead of Fbank. Default: False.",
+    )
+    parser.add_argument(
+        "--output-dir",
+        type=str,
+        default="data/fbank",
+        help="Output directory. Default: data/fbank.",
+    )
+    return parser.parse_args()
+
+
+if __name__ == "__main__":
+    formatter = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"
+
+    logging.basicConfig(format=formatter, level=logging.INFO)
+
+    args = get_args()
+    compute_fbank_mdcc(
+        num_mel_bins=args.num_mel_bins,
+        perturb_speed=args.perturb_speed,
+        whisper_fbank=args.whisper_fbank,
+        output_dir=args.output_dir,
+    )
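The skip logic in `compute_fbank_mdcc()` depends only on whether the cut manifest for a partition already exists in the output directory. The following sketch reproduces just the file-naming scheme from the script so that the pending partitions can be listed without running feature extraction. The helper names `expected_outputs` and `pending_partitions` are hypothetical; the defaults mirror the values hard-coded in the script above.

```python
from pathlib import Path
from typing import Dict, List, Sequence, Tuple


def expected_outputs(
    output_dir: str = "data/fbank",
    prefix: str = "mdcc",
    suffix: str = "jsonl.gz",
    parts: Sequence[str] = ("train", "valid", "test"),
) -> Dict[str, Tuple[Path, Path]]:
    """Map each partition to (cut manifest path, feature storage path).

    Hypothetical helper; it only mirrors the f-strings used in the script.
    """
    out = Path(output_dir)
    return {
        part: (
            out / f"{prefix}_cuts_{part}.{suffix}",  # checked by the skip logic
            out / f"{prefix}_feats_{part}",  # value passed as storage_path
        )
        for part in parts
    }


def pending_partitions(output_dir: str = "data/fbank") -> List[str]:
    """Return the partitions whose cut manifest does not exist yet."""
    return [
        part
        for part, (cuts_path, _) in expected_outputs(output_dir).items()
        if not cuts_path.is_file()
    ]


if __name__ == "__main__":
    for part, (cuts_path, feats_path) in expected_outputs().items():
        print(part, cuts_path, feats_path)
    print("pending:", pending_partitions())
```

Because only the cut manifest is checked, deleting `mdcc_cuts_<partition>.jsonl.gz` is what causes a partition to be recomputed on the next run.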
diff --git a/egs/mdcc/ASR/local/display_manifest_statistics.py b/egs/mdcc/ASR/local/display_manifest_statistics.py
@@ -0,0 +1,144 @@
+#!/usr/bin/env python3
+# Copyright    2021-2024  Xiaomi Corp.        (authors: Fangjun Kuang,
+#                                                       Zengrui Jin,)
+#
+# See ../../../../LICENSE for clarification regarding multiple authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+This file displays duration statistics of utterances in a manifest.
+You can use the displayed values to choose the minimum/maximum duration
+for removing short and long utterances during training.
+
+See the function `remove_short_and_long_utt()` in transducer/train.py
+for usage.
+"""
+
+
+from lhotse import load_manifest_lazy
+
+
+def main():
+    # path = "./data/fbank/mdcc_cuts_train.jsonl.gz"
+    # path = "./data/fbank/mdcc_cuts_valid.jsonl.gz"
+    path = "./data/fbank/mdcc_cuts_test.jsonl.gz"  # uncomment one path above to switch
+
+    cuts = load_manifest_lazy(path)
+    cuts.describe(full=True)
+
+
+if __name__ == "__main__":
+    main()
+
+"""
+data/fbank/mdcc_cuts_train.jsonl.gz (with speed perturbation)
+_________________________________________
+_ Cuts count:               _ 195360    _
+_________________________________________
+_ Total duration (hh:mm:ss) _ 173:44:59 _
+_________________________________________
+_ mean                      _ 3.2       _
+_________________________________________
+_ std                       _ 2.1       _
+_________________________________________
+_ min                       _ 0.2       _
+_________________________________________
+_ 25%                       _ 1.8       _
+_________________________________________
+_ 50%                       _ 2.7       _
+_________________________________________
+_ 75%                       _ 4.0       _
+_________________________________________
+_ 99%                       _ 11.0      _
+_________________________________________
+_ 99.5%                     _ 12.4      _
+_________________________________________
+_ 99.9%                     _ 14.8      _
+_________________________________________
+_ max                       _ 16.7      _
+_________________________________________
+_ Recordings available:     _ 195360    _
+_________________________________________
+_ Features available:       _ 195360    _
+_________________________________________
+_ Supervisions available:   _ 195360    _
+_________________________________________
+
+data/fbank/mdcc_cuts_valid.jsonl.gz 
+________________________________________ 
+_ Cuts count:               _ 5663     _ 
+________________________________________ 
+_ Total duration (hh:mm:ss) _ 05:03:12 _ 
+________________________________________ 
+_ mean                      _ 3.2      _ 
+________________________________________ 
+_ std                       _ 2.0      _ 
+________________________________________ 
+_ min                       _ 0.3      _ 
+________________________________________ 
+_ 25%                       _ 1.8      _ 
+________________________________________ 
+_ 50%                       _ 2.7      _ 
+________________________________________ 
+_ 75%                       _ 4.0      _ 
+________________________________________ 
+_ 99%                       _ 10.9     _ 
+________________________________________
+_ 99.5%                     _ 12.3     _
+________________________________________
+_ 99.9%                     _ 14.4     _
+________________________________________
+_ max                       _ 14.8     _
+________________________________________
+_ Recordings available:     _ 5663     _
+________________________________________
+_ Features available:       _ 5663     _
+________________________________________
+_ Supervisions available:   _ 5663     _
+________________________________________
+
+data/fbank/mdcc_cuts_test.jsonl.gz
+________________________________________ 
+_ Cuts count:               _ 12492    _ 
+________________________________________ 
+_ Total duration (hh:mm:ss) _ 11:00:31 _ 
+________________________________________ 
+_ mean                      _ 3.2      _ 
+________________________________________ 
+_ std                       _ 2.0      _ 
+________________________________________ 
+_ min                       _ 0.2      _ 
+________________________________________ 
+_ 25%                       _ 1.8      _ 
+________________________________________ 
+_ 50%                       _ 2.7      _ 
+________________________________________ 
+_ 75%                       _ 4.0      _ 
+________________________________________ 
+_ 99%                       _ 10.5     _ 
+________________________________________ 
+_ 99.5%                     _ 12.1     _ 
+________________________________________
+_ 99.9%                     _ 14.0     _
+________________________________________
+_ max                       _ 14.8     _
+________________________________________
+_ Recordings available:     _ 12492    _
+________________________________________
+_ Features available:       _ 12492    _
+________________________________________
+_ Supervisions available:   _ 12492    _
+________________________________________
+
+"""
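The statistics above can be cross-checked against the 73.6 hours quoted in the README. The training manifest was measured with speed perturbation, so it holds three copies of every cut (original, 0.9x and 1.1x). The sketch below undoes that, assuming each perturbed copy's duration scales exactly by `1 / factor` and ignoring rounding; the helper `hms_to_seconds` is hypothetical.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' string to seconds (hypothetical helper)."""
    hours, minutes, seconds = (int(x) for x in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Totals copied from the statistics printed above.
train_perturbed = hms_to_seconds("173:44:59")  # includes 0.9x and 1.1x copies
valid = hms_to_seconds("05:03:12")
test = hms_to_seconds("11:00:31")

# Slowing down by 0.9 lengthens a cut by 1/0.9; speeding up by 1.1
# shortens it by 1/1.1. The original copy contributes a factor of 1.
duration_factor = 1 + 1 / 0.9 + 1 / 1.1

train_original = train_perturbed / duration_factor
total_hours = (train_original + valid + test) / 3600

# The perturbed manifest should hold exactly three copies of each cut.
train_cuts_original, remainder = divmod(195360, 3)

print(f"unperturbed train duration: {train_original / 3600:.2f} h")
print(f"train + valid + test:       {total_hours:.2f} h")
print(f"unperturbed train cuts:     {train_cuts_original} (remainder {remainder})")
```

Under these assumptions the three splits add up to roughly 73.6 hours, which agrees with the figure in the README.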
diff --git a/egs/mdcc/ASR/local/prepare_char.py b/egs/mdcc/ASR/local/prepare_char.py
@@ -0,0 +1 @@
+../../../aishell/ASR/local/prepare_char.py