Init commit for swbd (#1146)
JinZr authored Oct 7, 2023
1 parent 109354b commit 82199b8
Showing 51 changed files with 6,622 additions and 0 deletions.
44 changes: 44 additions & 0 deletions .github/scripts/run-swbd-conformer-ctc-2023-08-26.sh
@@ -0,0 +1,44 @@
#!/usr/bin/env bash

set -e

log() {
  # This function is from espnet
  local fname=${BASH_SOURCE[1]##*/}
  echo -e "$(date '+%Y-%m-%d %H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}

cd egs/swbd/ASR

repo_url=https://huggingface.co/zrjin/icefall-asr-swbd-conformer-ctc-2023-8-26

log "Downloading pre-trained model from $repo_url"
git lfs install
git clone $repo_url
repo=$(basename $repo_url)


log "Display test files"
tree $repo/
ls -lh $repo/test_wavs/*.wav

pushd $repo/exp
ln -s epoch-98.pt epoch-99.pt
popd

ls -lh $repo/exp/*.pt

for method in ctc-decoding 1best; do
  log "$method"

  ./conformer_ctc/pretrained.py \
    --method $method \
    --checkpoint $repo/exp/epoch-99.pt \
    --tokens $repo/data/lang_bpe_500/tokens.txt \
    --words-file $repo/data/lang_bpe_500/words.txt \
    --HLG $repo/data/lang_bpe_500/HLG.pt \
    --G $repo/data/lm/G_4_gram.pt \
    $repo/test_wavs/1089-134686-0001.wav \
    $repo/test_wavs/1221-135766-0001.wav \
    $repo/test_wavs/1221-135766-0002.wav
done
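
The loop above only exercises `ctc-decoding` and `1best`, although the script already fetches `G_4_gram.pt`. A rescoring run could be added along the same lines; this is only a sketch, assuming `pretrained.py` keeps the `whole-lattice-rescoring` method of the LibriSpeech `conformer_ctc` recipe it is derived from (verify with `./conformer_ctc/pretrained.py --help`):

```bash
# Sketch only: --method whole-lattice-rescoring is assumed to be supported,
# as in the LibriSpeech conformer_ctc recipe this one is based on.
./conformer_ctc/pretrained.py \
  --method whole-lattice-rescoring \
  --checkpoint $repo/exp/epoch-99.pt \
  --tokens $repo/data/lang_bpe_500/tokens.txt \
  --words-file $repo/data/lang_bpe_500/words.txt \
  --HLG $repo/data/lang_bpe_500/HLG.pt \
  --G $repo/data/lm/G_4_gram.pt \
  $repo/test_wavs/1089-134686-0001.wav
```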
84 changes: 84 additions & 0 deletions .github/workflows/run-swbd-conformer-ctc.yml
@@ -0,0 +1,84 @@
# Copyright 2023 Xiaomi Corp. (author: Zengrui Jin)

# See ../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name: run-swbd-conformer_ctc

on:
  push:
    branches:
      - master
  pull_request:
    types: [labeled]

concurrency:
  group: run-swbd-conformer_ctc-${{ github.ref }}
  cancel-in-progress: true

jobs:
  run-swbd-conformer_ctc:
    if: github.event.label.name == 'onnx' || github.event.label.name == 'ready' || github.event_name == 'push' || github.event.label.name == 'swbd'
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest]
        python-version: [3.8]

      fail-fast: false

    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - name: Setup Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'
          cache-dependency-path: '**/requirements-ci.txt'

      - name: Install Python dependencies
        run: |
          grep -v '^#' ./requirements-ci.txt | xargs -n 1 -L 1 pip install
          pip uninstall -y protobuf
          pip install --no-binary protobuf protobuf==3.20.*
      - name: Cache kaldifeat
        id: my-cache
        uses: actions/cache@v2
        with:
          path: |
            ~/tmp/kaldifeat
          key: cache-tmp-${{ matrix.python-version }}-2023-05-22

      - name: Install kaldifeat
        if: steps.my-cache.outputs.cache-hit != 'true'
        shell: bash
        run: |
          .github/scripts/install-kaldifeat.sh
      - name: Inference with pre-trained model
        shell: bash
        env:
          GITHUB_EVENT_NAME: ${{ github.event_name }}
          GITHUB_EVENT_LABEL_NAME: ${{ github.event.label.name }}
        run: |
          sudo apt-get -qq install git-lfs tree
          export PYTHONPATH=$PWD:$PYTHONPATH
          export PYTHONPATH=~/tmp/kaldifeat/kaldifeat/python:$PYTHONPATH
          export PYTHONPATH=~/tmp/kaldifeat/build/lib:$PYTHONPATH
          .github/scripts/run-swbd-conformer-ctc-2023-08-26.sh
2 changes: 2 additions & 0 deletions egs/swbd/ASR/.gitignore
@@ -0,0 +1,2 @@
switchboard_word_alignments.tar.gz
./swb_ms98_transcriptions/
25 changes: 25 additions & 0 deletions egs/swbd/ASR/README.md
@@ -0,0 +1,25 @@
# Switchboard

The Switchboard-1 Telephone Speech Corpus (LDC97S62) consists of approximately 260 hours of speech and was originally collected by Texas Instruments in 1990-1, under DARPA sponsorship. The first release of the corpus was published by NIST and distributed by the LDC in 1992-3. Since that release, a number of corrections have been made to the data files as presented on the original CD-ROM set and all copies of the first pressing have been distributed.

Switchboard is a collection of about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven robot operator system handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion and recording the speech from the two subjects into separate channels until the conversation was finished. About 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic.

(The above introduction is from the [LDC Switchboard-1 Release 2 webpage](https://catalog.ldc.upenn.edu/LDC97S62).)


## Performance Record
| | eval2000 | rt03 |
|--------------------------------|------------|--------|
| `conformer_ctc` | 33.37 | 35.06 |

See [RESULTS](/egs/swbd/ASR/RESULTS.md) for details.

## Credit

The training script for `conformer_ctc` comes from the LibriSpeech `conformer_ctc` recipe in icefall.

Many of the data-processing scripts come from first-generation Kaldi and the ESPnet project; I have adapted them to work with Lhotse and icefall.

Some of the text-normalization scripts come from stale pull requests by [Piotr Żelasko](https://github.com/pzelasko) and [Nagendra Goel](https://github.com/ngoel17).

`sclite_scoring.py` comes from the GigaSpeech recipe and is used for post-processing and GLM-style scoring, which is admittedly not the most elegant solution.
113 changes: 113 additions & 0 deletions egs/swbd/ASR/RESULTS.md
@@ -0,0 +1,113 @@
## Results
### Switchboard BPE training results (Conformer-CTC)

#### 2023-09-04

The best WER for Switchboard, as of 2023-09-04, is given below.

Results using the attention decoder:

| | eval2000-swbd | eval2000-callhome | eval2000-avg |
|--------------------------------|-----------------|---------------------|--------------|
| `conformer_ctc` | 9.48 | 17.73 | 13.67 |

Decoding results and models can be found here:
https://huggingface.co/zrjin/icefall-asr-swbd-conformer-ctc-2023-8-26

#### 2023-06-27

The best WER for Switchboard, as of 2023-06-27, is given below.

Results using HLG decoding + n-gram LM rescoring + attention decoder rescoring:

| | eval2000 | rt03 |
|--------------------------------|------------|--------|
| `conformer_ctc` | 30.80 | 32.29 |

Scale values used in n-gram LM rescoring and attention rescoring for the best WERs are:

##### eval2000

| ngram_lm_scale | attention_scale |
|----------------|-----------------|
| 0.9 | 1.1 |

##### rt03

| ngram_lm_scale | attention_scale |
|----------------|-----------------|
| 0.9 | 1.9 |

To reproduce the above result, use the following commands for training:

```bash
cd egs/swbd/ASR
./prepare.sh
export CUDA_VISIBLE_DEVICES="0,1"
./conformer_ctc/train.py \
--max-duration 120 \
--num-workers 8 \
--enable-musan False \
--world-size 2 \
--num-epochs 100
```

and the following command for decoding:

```bash
./conformer_ctc/decode.py \
--epoch 99 \
--avg 10 \
--max-duration 50
```
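
The `ngram_lm_scale` and `attention_scale` values in the tables above weight the n-gram LM and attention-decoder scores against the CTC/HLG scores during rescoring, and the reported pair is the best-performing one. To make the rescoring method explicit rather than rely on the default, a call along these lines should work, assuming this recipe's `decode.py` keeps the `--method` flag of the LibriSpeech `conformer_ctc` recipe it is based on (verify with `./conformer_ctc/decode.py --help`):

```bash
# Sketch only: the --method flag and its "attention-decoder" value are
# assumed to match the LibriSpeech conformer_ctc recipe.
./conformer_ctc/decode.py \
  --epoch 99 \
  --avg 10 \
  --method attention-decoder \
  --max-duration 50
```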

#### 2023-06-26

The best WER for Switchboard, as of 2023-06-26, is given below.

Results using HLG decoding + n-gram LM rescoring + attention decoder rescoring:

| | eval2000 | rt03 |
|--------------------------------|------------|--------|
| `conformer_ctc` | 33.37 | 35.06 |

Scale values used in n-gram LM rescoring and attention rescoring for the best WERs are:

##### eval2000

| ngram_lm_scale | attention_scale |
|----------------|-----------------|
| 0.3 | 2.5 |

##### rt03

| ngram_lm_scale | attention_scale |
|----------------|-----------------|
| 0.7 | 1.3 |

To reproduce the above result, use the following commands for training:

```bash
cd egs/swbd/ASR
./prepare.sh
export CUDA_VISIBLE_DEVICES="0,1"
./conformer_ctc/train.py \
--max-duration 120 \
--num-workers 8 \
--enable-musan False \
--world-size 2
```

and the following command for decoding:

```bash
./conformer_ctc/decode.py \
--epoch 55 \
--avg 1 \
--max-duration 50
```

For your reference, the nbest oracle WERs are:

| | eval2000 | rt03 |
|--------------------------------|------------|--------|
| `conformer_ctc` | 25.64 | 26.84 |
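
The n-best oracle WER is the error rate of the best hypothesis in each n-best list, i.e. a lower bound for any n-best rescoring method. A run along these lines should reproduce it, assuming the recipe keeps the `nbest-oracle` decoding method from the LibriSpeech `conformer_ctc` recipe (check the available values with `./conformer_ctc/decode.py --help`):

```bash
# Sketch only: --method nbest-oracle is assumed to be available,
# as in the LibriSpeech conformer_ctc recipe.
./conformer_ctc/decode.py \
  --epoch 55 \
  --avg 1 \
  --method nbest-oracle \
  --max-duration 50
```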