MaLa-ASR is an LLM-based ASR model that can integrate textual keywords extracted from presentation slides to improve recognition of conference content.
We use the official WavLM-Large model as the speech encoder, the public Vicuna 7B as the LLM decoder, and a simple-structured projector consisting of a 1-D convolution layer and two linear layers. Refer to the paper for more details.
We only train the linear projector in this recipe.
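The projector described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the kernel size, stride, and hidden width are assumptions, chosen so that the parameter count lands on the ~15.74M reported in the table below.

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Sketch of the trainable projector: a 1-D convolution for temporal
    downsampling, then two linear layers mapping WavLM-Large features
    (1024-d) into the Vicuna 7B embedding space (4096-d).
    kernel_size, stride, and hidden_dim are assumptions."""

    def __init__(self, in_dim=1024, hidden_dim=2048, out_dim=4096):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, in_dim, kernel_size=5, stride=2, padding=2)
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        # x: (batch, time, 1024) -> conv expects (batch, channels, time)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)  # (batch, time', 4096)

proj = LinearProjector()
n_params = sum(p.numel() for p in proj.parameters())
print(f"{n_params / 1e6:.2f}M")
```

With these assumed shapes the sketch has 15.74M parameters, matching the table; the conv halves the frame rate before projection into the LLM embedding space.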
| Encoder | Projector | LLM | dev | test |
|---|---|---|---|---|
| WavLM-large | Linear (~15.74M) | vicuna-7b-v1.5 | 8.91 | 9.14 |
Refer to the official SLIDESPEECH CORPUS for the dataset.
```shell
bash decode_MaLa-ASR_withkeywords_L95.sh
```
Modify the paths, including `speech_encoder_path`, `llm_path`, `output_dir`, `ckpt_path`, and `decode_log`, in the script before you run it.
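The variables above can be set at the top of the script, for example (all paths are placeholders; substitute your own checkpoint and output locations):

```shell
# Placeholder paths -- adjust to your environment.
speech_encoder_path=/path/to/WavLM-Large.pt      # WavLM-Large checkpoint
llm_path=/path/to/vicuna-7b-v1.5                 # Vicuna 7B model directory
output_dir=/path/to/output                       # where results are written
ckpt_path=/path/to/projector_ckpt                # trained projector checkpoint
decode_log=$output_dir/decode_log                # decoding log file
```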
```shell
bash finetune_MaLa-ASR_withkeywords_L95.sh
```
You can refer to the paper for more results.
```bibtex
@inproceedings{yang2024malaasr,
  title={MaLa-ASR: Multimedia-Assisted LLM-Based ASR},
  author={Guanrou Yang and Ziyang Ma and Fan Yu and Zhifu Gao and Shiliang Zhang and Xie Chen},
  booktitle={Proc. INTERSPEECH},
  year={2024},
}
```