Code and pretrained models for ACL 2023 main conference short paper: "Modality Adaption or Regularization? A Case Study on End-to-End Speech Translation".
conda create -n tab python=3.6
conda activate tab
conda install pytorch==1.8.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
# cudatoolkit=10.2 is also acceptable.
mkdir ~/TAB && cd ~/TAB && mkdir data checkpoints
git clone https://github.com/hannlp/tab Fairseq-S2T
cd Fairseq-S2T
pip install --upgrade pip
pip install --editable ./
pip install sacrebleu==1.4.13 pandas sacremoses sentencepiece configargparse gpustat tensorboard editdistance
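Optionally, you can verify the environment before moving on. This is just a quick sanity check, not something the repo requires:

```python
# sanity_check.py -- optional environment check, not part of the original repo
import torch
import torchaudio
import fairseq

print("torch:", torch.__version__)            # expect 1.8.0
print("torchaudio:", torchaudio.__version__)  # expect 0.8.0
print("fairseq:", fairseq.__version__)        # the editable install from this repo
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```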
cd ~/TAB/Fairseq-S2T/egs/mustc/st
ln -s path/to/mustc_root/ende ~/TAB/data/mustc.en-de
bash run.sh --tgt_lang de
# It may take a bit of time.
Once you have pre-processed the data, the directory should look like this:
ls ~/TAB/data/mustc.en-de/st
config_share.yaml  dev.en  dev.tsv  fbank80.zip  spm_unigram10000_st_share.model
spm_unigram10000_st_share.txt  spm_unigram10000_st_share.vocab  train.tsv  tst-COMMON.en  tst-COMMON.tsv
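To get a feel for what pre-processing produced, you can peek at one of the generated manifests. The sketch below is illustrative only: it assumes the usual fairseq speech-to-text TSV layout (columns such as `id`, `audio`, `n_frames`, `tgt_text`), which this repo may extend.

```python
# inspect_manifest.py -- illustrative only; exact column names depend on the data builder
import csv
import os
import pandas as pd
import sentencepiece as spm

st_root = os.path.expanduser("~/TAB/data/mustc.en-de/st")

# Load the training manifest (tab-separated, no quoting) and show its structure.
df = pd.read_csv(os.path.join(st_root, "train.tsv"), sep="\t", quoting=csv.QUOTE_NONE)
print(df.shape)
print(df.columns.tolist())
print(df.head(3))

# Load the shared SentencePiece model used for the text side.
sp = spm.SentencePieceProcessor()
sp.Load(os.path.join(st_root, "spm_unigram10000_st_share.model"))
print("vocab size:", sp.GetPieceSize())
```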
- Download the pre-trained ASR and MT models and modify the corresponding parameters in `conf/tab.yaml`, such as `load-pretrained-xxx-from`, so that they point to those models (a small pre-flight sketch for checking this follows the training command below).
- Some other parameters from the paper can also be adjusted, including `use-auxiliary-branch`, `consistency-type`, `consistency-weight`, `replacement-probability-strategy`, `replacement-probability`, and `uncertainty-gamma`.
- Run the following command to start training. Note that `gpu_num * max_tokens * update_freq` should equal 320k, and `exp_tag` can be set to mark and distinguish different experiments.
bash run.sh --stage 1 --stop_stage 1 --tgt_lang de --train_config tab --exp_tag uncertainty_gamma0.5_alpha5 --gpu_num 4 --max_tokens 20000 --update_freq 4
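Before launching, it may help to double-check the two constraints just mentioned. The sketch below is not part of the repo and makes two assumptions: that `conf/tab.yaml` is plain YAML readable with PyYAML, and that the pre-trained checkpoint paths sit under keys matching the `load-pretrained-*-from` pattern.

```python
# preflight_check.py -- rough sketch; the config key pattern is an assumption
import os
import yaml

# Effective batch: gpu_num * max_tokens * update_freq should be 320k.
gpu_num, max_tokens, update_freq = 4, 20000, 4
effective = gpu_num * max_tokens * update_freq
print("effective batch:", effective)  # 4 * 20000 * 4 = 320000
assert effective == 320000, "adjust gpu_num/max_tokens/update_freq to reach 320k"

# Check that every pre-trained checkpoint referenced in the config exists on disk.
with open("conf/tab.yaml") as f:
    cfg = yaml.safe_load(f)

for key, value in cfg.items():
    if key.startswith("load-pretrained") and key.endswith("-from"):
        path = os.path.expanduser(str(value))
        print(key, "->", path, "(found)" if os.path.isfile(path) else "(MISSING)")
```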
The trained models will be saved at ~/TAB/checkpoints/mustc.en-de/st/{DATE}_tab_{EXP_TAG}
bash run.sh --stage 2 --stop_stage 2 --tgt_lang de --exp_name {DATE}_tab_{EXP_TAG} --gpu_num 1 --test_subset dev,tst-COMMON
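If you want to re-score a decoding result outside of `run.sh`, the sacrebleu package installed above can be used directly. The file names below are placeholders, not the actual outputs written by the evaluation stage:

```python
# score_bleu.py -- placeholder file names; point them at the real decoding outputs
import sacrebleu

# One detokenized sentence per line; hypotheses aligned with references.
with open("hyp.detok.txt") as f:
    hyps = [line.strip() for line in f]
with open("ref.detok.txt") as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
print("BLEU = {:.2f}".format(bleu.score))
```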
To keep things simple, we have not included the scripts for the pre-training stage. If you have any questions, please do not hesitate to contact me at [email protected].
If you find this repository helpful, please cite it as:
@inproceedings{han-etal-2023-modality,
title = "Modality Adaption or Regularization? A Case Study on End-to-End Speech Translation",
author = "Han, Yuchen and
Xu, Chen and
Xiao, Tong and
Zhu, Jingbo",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-short.115",
pages = "1340--1348",
abstract = "Pre-training and fine-tuning is a paradigm for alleviating the data scarcity problem in end-to-end speech translation (E2E ST). The commonplace {''}modality gap{''} between speech and text data often leads to inconsistent inputs between pre-training and fine-tuning. However, we observe that this gap occurs in the early stages of fine-tuning, but does not have a major impact on the final performance. On the other hand, we find that there has another gap, which we call the {''}capacity gap{''}: high resource tasks (such as ASR and MT) always require a large model to fit, when the model is reused for a low resource task (E2E ST), it will get a sub-optimal performance due to the over-fitting. In a case study, we find that the regularization plays a more important role than the well-designed modality adaption method, which achieves 29.0 for en-de and 40.3 for en-fr on the MuST-C dataset.",
}