
# MALA-ASR_SLIDESPEECH

## Guides

MaLa-ASR is an LLM-based ASR model that can integrate textual keywords extracted from presentation slides to improve recognition of conference content.
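The exact prompt template lives in the recipe's code; the following is a minimal, hypothetical sketch of how slide keywords can be folded into the LLM prompt (the template text and function name are illustrative, not the recipe's actual strings):

```python
# Hypothetical sketch: serialize slide keywords into the decoding prompt
# so the LLM is biased toward in-domain terms. Not the recipe's exact template.
def build_prompt(keywords):
    hotwords = ", ".join(keywords)
    return (
        "USER: Transcribe the speech. "
        f"Keywords that may appear: {hotwords}. "
        "ASSISTANT:"
    )

print(build_prompt(["WavLM", "projector", "SlideSpeech"]))
```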

## Model Architecture

We use the official WavLM-Large model as our speech encoder, the public Vicuna 7B as our large language model decoder, and a simple linear projector consisting of a 1-D convolution layer followed by two linear layers. Refer to the paper for more details.
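As a rough PyTorch sketch of that projector: the 1024-dim WavLM-Large output and 4096-dim Vicuna-7B input are real, but the hidden size, kernel, and stride below are assumptions, not the recipe's values.

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """1-D conv (temporal downsampling) followed by two linear layers."""
    def __init__(self, enc_dim=1024, llm_dim=4096, hidden=2048, k=5, stride=5):
        super().__init__()
        self.conv = nn.Conv1d(enc_dim, enc_dim, kernel_size=k, stride=stride)
        self.fc1 = nn.Linear(enc_dim, hidden)
        self.fc2 = nn.Linear(hidden, llm_dim)

    def forward(self, x):  # x: (batch, time, enc_dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time', enc_dim)
        return self.fc2(torch.relu(self.fc1(x)))          # (batch, time', llm_dim)

feats = torch.randn(2, 100, 1024)      # stand-in for WavLM-Large features
print(LinearProjector()(feats).shape)  # torch.Size([2, 20, 4096])
```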

## Performance and checkpoints

We only train the linear projector in this recipe (see the freezing sketch after the table).

| Encoder | Projector | LLM | dev | test |
|---|---|---|---|---|
| WavLM-large | Linear (~15.74M) | vicuna-7b-v1.5 | 8.91 | 9.14 |
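In PyTorch terms, training only the projector amounts to freezing the encoder and LLM weights. A minimal sketch, assuming three separate modules (the names and the learning rate are placeholders, not SLAM-LLM's actual API):

```python
import torch

def projector_only_parameters(encoder, projector, llm):
    """Freeze the speech encoder and the LLM; leave only the projector trainable."""
    for frozen in (encoder, llm):
        for p in frozen.parameters():
            p.requires_grad = False
    return [p for p in projector.parameters() if p.requires_grad]

# Usage sketch (hypothetical module names, illustrative learning rate):
# optimizer = torch.optim.AdamW(projector_only_parameters(enc, proj, llm), lr=1e-4)
```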

## Data preparation

Refer to the official SLIDESPEECH CORPUS for data preparation.

## Decode with checkpoints

```bash
bash decode_MaLa-ASR_withkeywords_L95.sh
```

Before running the shell script, modify the paths in it, including `speech_encoder_path`, `llm_path`, `output_dir`, `ckpt_path`, and `decode_log`.

## Train a new model

Use a self-supervised model (such as WavLM) as the encoder:

```bash
bash finetune_MaLa-ASR_withkeywords_L95.sh
```

## Citation

You can refer to the paper for more results.

```bibtex
@inproceedings{yang2024malaasr,
  title={MaLa-ASR: Multimedia-Assisted LLM-Based ASR},
  author={Guanrou Yang and Ziyang Ma and Fan Yu and Zhifu Gao and Shiliang Zhang and Xie Chen},
  booktitle={Proc. INTERSPEECH},
  year={2024},
}
```