| Field | Value |
|---|---|
| title | AMG-AVSR: Adaptive Modality Guidance for Audio-Visual Speech Recognition via Progressive Feature Enhancement |
| booktitle | Proceedings of the 16th Asian Conference on Machine Learning |
| year | 2025 |
| volume | 260 |
| series | Proceedings of Machine Learning Research |
| month | 0 |
| publisher | PMLR |
| url | |
| openreview | sXkQhSX3Ib |
| abstract | Audio-Visual Speech Recognition (AVSR) identifies spoken words by analyzing both lip movements and auditory signals. Compared to Automatic Speech Recognition (ASR), AVSR demonstrates greater robustness in noisy environments due to the support of dual modalities. However, the inherent differences between these modalities present a challenge: effectively accounting for their disparities while leveraging their complementary information to extract features useful for AVSR. To address this, we propose the AMG-AVSR model, which utilizes a two-stage curriculum learning strategy and incorporates a feature compression and recovery mechanism. By leveraging the characteristics of the two modalities in different scenarios to guide each other, the model extracts refined features from audio-visual data, thereby enhancing recognition performance in both clean and noisy environments. Compared to the baseline model AV-HuBERT, AMG-AVSR demonstrates superior performance on the LRS2 dataset in both noisy and clean environments, achieving a word error rate (WER) of 2.9% under clean speech conditions. Under various noisy conditions, AMG-AVSR shows a significant reduction in WER compared to previous methods. |
| layout | inproceedings |
| issn | 2640-3498 |
| id | zhao25a |
| tex_title | {AMG-AVSR}: {A}daptive Modality Guidance for Audio-Visual Speech Recognition via Progressive Feature Enhancement |
| firstpage | 952 |
| lastpage | 967 |
| page | 952-967 |
| order | 952 |
| cycles | false |
| bibtex_editor | Nguyen, Vu and Lin, Hsuan-Tien |
| editor | |
| bibtex_author | Zhao, Zhishuo and Guo, Dongyue and Ou, Wenjie and Liu, Hong and Lin, Yi |
| author | |
| date | 2025-01-14 |
| address | |
| container-title | Proceedings of the 16th Asian Conference on Machine Learning |
| genre | inproceedings |
| issued | |
| extras | |