Skip to content

Latest commit

 

History

History
executable file
·
142 lines (109 loc) · 5.58 KB

README2.md

File metadata and controls

executable file
·
142 lines (109 loc) · 5.58 KB

IndicWav2Vec2

All the data preparation scripts are present in data_prep_scripts dir.

The sequence of data preperation pipeline involve

  • Downloading the data
  • Pass the data through VAD pipeline (using vad.py)
  • Pass the obtained data through SNR Filtering pipeline (using snr_filter.py)
  • Finally chunk the data (using chunking.py) All the 4 steps are automatically handled by process_data.sh

Click here for more extended documentation on how to execute these individual steps.

Manifest Creation

For creating language-wise pretraining manifest

$ python path/to/lang_wise_manifest_creation.py /path/to/wave/files --dest /manifest/path --ext $ext --valid-percent $valid

For /path/to/wav/files/ we expect the directory to have one folder per language under the parent directory

In our pretraing, we use a --valid-percent as 0.03

For creating a combined validation file for all languages, we concatenate all individual *_valid.tsv files to create a valid.tsv file.

import pandas as pd
import glob

filenames = glob.glob("*_valid.tsv")

combined = []
for f in filename:
    df = pd.read_csv(f, skiprows=1, names=['f', 'd'], sep='\t')
    combined.append(df)

df_combined = pd.concat(combined, axis=0, ignore_index=True)
df_combined.to_csv('valid.tsv', index=True, header=False, sep='\t')

We then add the /path/to/wav/files/ to the first line of the valid.tsv file

Pretraining

For pretraining the model we do multi-node training and schedule the runs with slurm.

Following is the invocation script for training IndicWav2Vec base starting from Wav2Vec2.0 English base ckeckpoint

$ sbatch --job-name <NAME> --gres gpu:<N_GPU_PER_NODE> --cpus-per-task <N_CPUS> \
    --nodes <N_NODES> --ntasks-per-node <N_TASKS> \
    --wrap "srun --output train.log.node%t --error train.stderr.node%t.%j \
        $(which fairseq-hydra-train) \
        task.data=/path/to/manifest/directory \
        common.wandb_project=<wandb project name> \
        task._name=temp_sampled_audio_pretraining \
        +task.sampling_alpha=0.7 \
        common.log_interval=200 \
        common.log_format=tqdm \
        dataset.max_tokens=3000000 \
        common.user_dir=/path/to/custom_task/directory \
        checkpoint.save_dir=/path/to/save/model/checkpoints \
        checkpoint.restore_file=/path/to wav2vec2-english-base/checkpoint.pt \
        +optimization.update_freq='[2]' \
        optimization.clip_norm=0.5 \
        checkpoint.reset_optimizer=true \
        distributed_training.distributed_world_size=<total GPUs> \
        distributed_training.distributed_port=$PORT \
        --config-dir /path/to/configs/directory \
        --config-name wav2vec2_base_librispeech"

For Large model we override the above configuration with

checkpoint.restore_file=/path/to wav2vec2-english-large/checkpoint.pt \
+optimization.update_freq='[6]' \
lr_scheduler.warmup_updates=0 \
--config-name wav2vec2_large_librivox"

Configs for both the models are provided in the configs directory

Fine-tuning process

Dataprocess and Manifest creation

The scripts for the same can be found here link

Fine-tune

Following is the invocation script for finetuning IndicWav2Vec large on a particular language

sbatch --job-name <NAME> --gres gpu:<N_GPU_PER_NODE> --cpus-per-task <N_CPUS> \
    --nodes <N_NODES> --ntasks-per-node <N_TASKS> \
    --wrap "srun --output finetune.log.node%t --error finetune.stderr.node%t.%j \
        $(which fairseq-hydra-train) \
        task.data=/path/to/finetune/manifest/directory/for/a/particular/language \
        common.wandb_project=<wandb project name> \
        model.w2v_path=/path/to/pretrained/model_large.pt \
        common.log_interval=50 \
        common.log_format=tqdm \
        dataset.max_tokens=1000000 \
        checkpoint.save_dir=/path/to/save/model/fine_tune_checkpoints \
        +optimization.update_freq='[1]' \
        distributed_training.distributed_world_size=<total GPUs> \
        --config-dir /path/to/configs/directory \
        --config-name ai4b_xlsr"

For IndicWav2Vec Base model we override the above configuration with

model.w2v_path=/path/to/pretrained/model_base.pt \
--config-name ai4b_base"

Configs for both the models are provided in the finetune_configs directory

Training Language Model

Scripts for installing, preparing data and training language model is present in lm_training folder.

Inference/Evaluation

Evaluation Scripts with complete documentation are present in w2v_inference folder.

Link to arXiv

The link to arXiv can be found here

Cite

Please cite as:

@inproceedings{javed2021building,
    title = {Towards Building ASR Systems for the Next Billion Users},
    author = {Tahir Javed and Sumanth Doddapaneni and Abhigyan Raman and Kaushal Santosh Bhogale and Gowtham Ramesh and Anoop Kunchukuttan and Pratyush Kumar and Mitesh M. Khapra},
    booktitle = "Proceedings of the AAAI Conference on Artificial Intelligence",
    year = "2022 (to appear)",
}

License

IndicWav2Vec is MIT-licensed. The license applies to all the pretrained, fine-tuned and language models