Init commit for swbd (#1146)
JinZr authored Oct 7, 2023
1 parent 109354b commit 82199b8
Showing 51 changed files with 6,622 additions and 0 deletions.
44 changes: 44 additions & 0 deletions .github/scripts/run-swbd-conformer-ctc-2023-08-26.sh
@@ -0,0 +1,44 @@
#!/usr/bin/env bash

set -e

log() {
  # This function is from espnet
  local fname=${BASH_SOURCE[1]##*/}
  echo -e "$(date '+%Y-%m-%d %H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}

cd egs/swbd/ASR

repo_url=https://huggingface.co/zrjin/icefall-asr-swbd-conformer-ctc-2023-8-26

log "Downloading pre-trained model from $repo_url"
git lfs install
git clone $repo_url
repo=$(basename $repo_url)


log "Display test files"
tree $repo/
ls -lh $repo/test_wavs/*.wav

pushd $repo/exp
ln -s epoch-98.pt epoch-99.pt
popd

ls -lh $repo/exp/*.pt

for method in ctc-decoding 1best; do
  log "$method"

  ./conformer_ctc/pretrained.py \
    --method $method \
    --checkpoint $repo/exp/epoch-99.pt \
    --tokens $repo/data/lang_bpe_500/tokens.txt \
    --words-file $repo/data/lang_bpe_500/words.txt \
    --HLG $repo/data/lang_bpe_500/HLG.pt \
    --G $repo/data/lm/G_4_gram.pt \
    $repo/test_wavs/1089-134686-0001.wav \
    $repo/test_wavs/1221-135766-0001.wav \
    $repo/test_wavs/1221-135766-0002.wav
done
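
The loop above only exercises `ctc-decoding` and `1best`, although the script already fetches `G_4_gram.pt`. A rescoring run could be added along the same lines; this is only a sketch, assuming `pretrained.py` keeps the `whole-lattice-rescoring` method of the LibriSpeech `conformer_ctc` recipe it is derived from (verify with `./conformer_ctc/pretrained.py --help`):

```bash
# Sketch only: --method whole-lattice-rescoring is assumed to be supported,
# as in the LibriSpeech conformer_ctc recipe this one is based on.
./conformer_ctc/pretrained.py \
  --method whole-lattice-rescoring \
  --checkpoint $repo/exp/epoch-99.pt \
  --tokens $repo/data/lang_bpe_500/tokens.txt \
  --words-file $repo/data/lang_bpe_500/words.txt \
  --HLG $repo/data/lang_bpe_500/HLG.pt \
  --G $repo/data/lm/G_4_gram.pt \
  $repo/test_wavs/1089-134686-0001.wav
```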
84 changes: 84 additions & 0 deletions .github/workflows/run-swbd-conformer-ctc.yml
@@ -0,0 +1,84 @@
# Copyright 2023 Xiaomi Corp. (author: Zengrui Jin)

# See ../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name: run-swbd-conformer_ctc

on:
  push:
    branches:
      - master
  pull_request:
    types: [labeled]

concurrency:
  group: run-swbd-conformer_ctc-${{ github.ref }}
  cancel-in-progress: true

jobs:
  run-swbd-conformer_ctc:
    if: github.event.label.name == 'onnx' || github.event.label.name == 'ready' || github.event_name == 'push' || github.event.label.name == 'swbd'
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest]
        python-version: [3.8]

      fail-fast: false

    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - name: Setup Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'
          cache-dependency-path: '**/requirements-ci.txt'

      - name: Install Python dependencies
        run: |
          grep -v '^#' ./requirements-ci.txt | xargs -n 1 -L 1 pip install
          pip uninstall -y protobuf
          pip install --no-binary protobuf protobuf==3.20.*
      - name: Cache kaldifeat
        id: my-cache
        uses: actions/cache@v2
        with:
          path: |
            ~/tmp/kaldifeat
          key: cache-tmp-${{ matrix.python-version }}-2023-05-22

      - name: Install kaldifeat
        if: steps.my-cache.outputs.cache-hit != 'true'
        shell: bash
        run: |
          .github/scripts/install-kaldifeat.sh
      - name: Inference with pre-trained model
        shell: bash
        env:
          GITHUB_EVENT_NAME: ${{ github.event_name }}
          GITHUB_EVENT_LABEL_NAME: ${{ github.event.label.name }}
        run: |
          sudo apt-get -qq install git-lfs tree
          export PYTHONPATH=$PWD:$PYTHONPATH
          export PYTHONPATH=~/tmp/kaldifeat/kaldifeat/python:$PYTHONPATH
          export PYTHONPATH=~/tmp/kaldifeat/build/lib:$PYTHONPATH
          .github/scripts/run-swbd-conformer-ctc-2023-08-26.sh
2 changes: 2 additions & 0 deletions egs/swbd/ASR/.gitignore
@@ -0,0 +1,2 @@
switchboard_word_alignments.tar.gz
./swb_ms98_transcriptions/
25 changes: 25 additions & 0 deletions egs/swbd/ASR/README.md
@@ -0,0 +1,25 @@
# Switchboard

The Switchboard-1 Telephone Speech Corpus (LDC97S62) consists of approximately 260 hours of speech and was originally collected by Texas Instruments in 1990-1, under DARPA sponsorship. The first release of the corpus was published by NIST and distributed by the LDC in 1992-3. Since that release, a number of corrections have been made to the data files as presented on the original CD-ROM set and all copies of the first pressing have been distributed.

Switchboard is a collection of about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven robot operator system handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion and recording the speech from the two subjects into separate channels until the conversation was finished. About 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic.

(The above introduction is from the [LDC Switchboard-1 Release 2 webpage](https://catalog.ldc.upenn.edu/LDC97S62).)


## Performance Record
| | eval2000 | rt03 |
|--------------------------------|------------|--------|
| `conformer_ctc` | 33.37 | 35.06 |

See [RESULTS](/egs/swbd/ASR/RESULTS.md) for details.

## Credit

The training script for `conformer_ctc` comes from the LibriSpeech `conformer_ctc` recipe in icefall.

Many of the data-processing scripts come from first-generation Kaldi and the ESPnet project; I have adapted them to work with Lhotse and icefall.

Some of the text-normalization scripts come from stale pull requests by [Piotr Żelasko](https://github.com/pzelasko) and [Nagendra Goel](https://github.com/ngoel17).

`sclite_scoring.py` comes from the GigaSpeech recipe and is used for post-processing and GLM-style scoring, which is admittedly not the most elegant solution.
113 changes: 113 additions & 0 deletions egs/swbd/ASR/RESULTS.md
@@ -0,0 +1,113 @@
## Results
### Switchboard BPE training results (Conformer-CTC)

#### 2023-09-04

The best WER for Switchboard, as of 2023-09-04, is given below.

Results using the attention decoder:

| | eval2000-swbd | eval2000-callhome | eval2000-avg |
|--------------------------------|-----------------|---------------------|--------------|
| `conformer_ctc` | 9.48 | 17.73 | 13.67 |

Decoding results and models can be found here:
https://huggingface.co/zrjin/icefall-asr-swbd-conformer-ctc-2023-8-26

#### 2023-06-27

The best WER for Switchboard, as of 2023-06-27, is given below.

Results using HLG decoding + n-gram LM rescoring + attention decoder rescoring:

| | eval2000 | rt03 |
|--------------------------------|------------|--------|
| `conformer_ctc` | 30.80 | 32.29 |

Scale values used in n-gram LM rescoring and attention rescoring for the best WERs are:

##### eval2000

| ngram_lm_scale | attention_scale |
|----------------|-----------------|
| 0.9 | 1.1 |

##### rt03

| ngram_lm_scale | attention_scale |
|----------------|-----------------|
| 0.9 | 1.9 |

To reproduce the above result, use the following commands for training:

```bash
cd egs/swbd/ASR
./prepare.sh
export CUDA_VISIBLE_DEVICES="0,1"
./conformer_ctc/train.py \
--max-duration 120 \
--num-workers 8 \
--enable-musan False \
--world-size 2 \
--num-epochs 100
```

and the following command for decoding:

```bash
./conformer_ctc/decode.py \
--epoch 99 \
--avg 10 \
--max-duration 50
```
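
The `ngram_lm_scale` and `attention_scale` values in the tables above weight the n-gram LM and attention-decoder scores against the CTC/HLG scores during rescoring, and the reported pair is the best-performing one. To make the rescoring method explicit rather than rely on the default, a call along these lines should work, assuming this recipe's `decode.py` keeps the `--method` flag of the LibriSpeech `conformer_ctc` recipe it is based on (verify with `./conformer_ctc/decode.py --help`):

```bash
# Sketch only: the --method flag and its "attention-decoder" value are
# assumed to match the LibriSpeech conformer_ctc recipe.
./conformer_ctc/decode.py \
  --epoch 99 \
  --avg 10 \
  --method attention-decoder \
  --max-duration 50
```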

#### 2023-06-26

The best WER for Switchboard, as of 2023-06-26, is given below.

Results using HLG decoding + n-gram LM rescoring + attention decoder rescoring:

| | eval2000 | rt03 |
|--------------------------------|------------|--------|
| `conformer_ctc` | 33.37 | 35.06 |

Scale values used in n-gram LM rescoring and attention rescoring for the best WERs are:

##### eval2000

| ngram_lm_scale | attention_scale |
|----------------|-----------------|
| 0.3 | 2.5 |

##### rt03

| ngram_lm_scale | attention_scale |
|----------------|-----------------|
| 0.7 | 1.3 |

To reproduce the above result, use the following commands for training:

```bash
cd egs/swbd/ASR
./prepare.sh
export CUDA_VISIBLE_DEVICES="0,1"
./conformer_ctc/train.py \
--max-duration 120 \
--num-workers 8 \
--enable-musan False \
--world-size 2
```

and the following command for decoding:

```bash
./conformer_ctc/decode.py \
--epoch 55 \
--avg 1 \
--max-duration 50
```

For your reference, the nbest oracle WERs are:

| | eval2000 | rt03 |
|--------------------------------|------------|--------|
| `conformer_ctc` | 25.64 | 26.84 |
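
The n-best oracle WER is the error rate of the best hypothesis in each n-best list, i.e. a lower bound for any n-best rescoring method. A run along these lines should reproduce it, assuming the recipe keeps the `nbest-oracle` decoding method from the LibriSpeech `conformer_ctc` recipe (check the available values with `./conformer_ctc/decode.py --help`):

```bash
# Sketch only: --method nbest-oracle is assumed to be available,
# as in the LibriSpeech conformer_ctc recipe.
./conformer_ctc/decode.py \
  --epoch 55 \
  --avg 1 \
  --method nbest-oracle \
  --max-duration 50
```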