xls-r-analysis-sqa

1. Overview

This repository hosts the models for the paper "Analysis of XLS-R for Speech Quality Assessment".

1.1. Performance On Unseen Datasets

Comparison of model performance on each unseen corpus individually (NISQA, IUB) and on both combined (Unseen). The metric is RMSE; lower is better.
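For reference, the RMSE between predicted and ground-truth MOS values can be computed as in the minimal sketch below. The function name `rmse` and the sample values are illustrative assumptions and are not taken from the repository's code.

```python
import math

def rmse(predictions, targets):
    """Root-mean-square error between two equal-length sequences of scores."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one value is required")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical MOS predictions and labels, for illustration only.
predicted_mos = [3.0, 4.0, 2.0]
true_mos = [3.5, 4.0, 1.5]
print(f"RMSE = {rmse(predicted_mos, true_mos):.4f}")
```

Because errors are squared before averaging, a few large misses raise the score more than many small ones.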

V1 Results

Model	NISQA	IUB	Unseen
XLS-R 300M Layer24 Bi-LSTM [1]	0.5907	0.5067	0.5323
DNSMOS [2]	0.8718	0.5452	0.6565
MFCC Transformer	0.8280	0.7775	0.7924
XLS-R 300M Layer5 Transformer	0.6256	0.5049	0.5425
XLS-R 300M Layer21 Transformer	0.5694	0.5025	0.5227
XLS-R 300M Layer5+21 Transformer	0.5683	0.4886	0.5129
XLS-R 1B Layer10 Transformer	0.5456	0.5815	0.5713
XLS-R 1B Layer41 Transformer	0.5657	0.4656	0.4966
XLS-R 1B Layer10+41 Transformer	0.5748	0.5288	0.5425
XLS-R 2B Layer10 Transformer	0.6277	0.4899	0.5334
XLS-R 2B Layer41 Transformer	0.5724	0.4897	0.5150
XLS-R 2B Layer10+41 Transformer	0.6036	0.4743	0.5150
Human	0.6738	0.6573	0.6629

V2 Results

UPDATE: the code now uses version 2 of the models. Version 1 mistakenly used the final model checkpoint; version 2 uses the checkpoint with the minimum validation loss.
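The difference between the two versions comes down to which checkpoint is kept. A minimal sketch of both selection rules follows; the checkpoint names and loss values are hypothetical and do not come from the actual training runs.

```python
# Hypothetical (checkpoint name, validation loss) pairs logged during training.
checkpoints = [
    ("epoch_01.pt", 0.412),
    ("epoch_02.pt", 0.305),
    ("epoch_03.pt", 0.298),
    ("epoch_04.pt", 0.331),
]

def select_final(checkpoints):
    """Keep the last checkpoint regardless of its loss (as described for version 1)."""
    return checkpoints[-1]

def select_best(checkpoints):
    """Keep the checkpoint with the minimum validation loss (as described for version 2)."""
    return min(checkpoints, key=lambda item: item[1])

print("final:", select_final(checkpoints))
print("best: ", select_best(checkpoints))
```

In this made-up log the final checkpoint is not the best one, which is the situation the version 2 selection rule guards against.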

Model	NISQA	IUB	Unseen
XLS-R 300M Layer24 Bi-LSTM [1]	0.5907	0.5067	0.5323
DNSMOS [2]	0.8718	0.5452	0.6565
MFCC Transformer	0.9291	0.7415	0.8003
XLS-R 300M Layer5 Transformer	0.6494	0.5117	0.5550
XLS-R 300M Layer21 Transformer	0.5852	0.4838	0.5152
XLS-R 300M Layer5+21 Transformer	0.5861	0.4768	0.5108
XLS-R 1B Layer10 Transformer	0.6217	0.4763	0.5225
XLS-R 1B Layer41 Transformer	0.5615	0.4646	0.4946
XLS-R 1B Layer10+41 Transformer	0.6024	0.4624	0.5068
XLS-R 2B Layer10 Transformer	0.5227	0.4447	0.4686
XLS-R 2B Layer41 Transformer	0.5295	0.4926	0.5035
XLS-R 2B Layer10+41 Transformer	0.5191	0.4573	0.4760
Human	0.6738	0.6573	0.6629

[1] B. Tamm, H. Balabin, R. Vandenberghe and H. Van hamme, "Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications," Proc. Interspeech 2022, 2022, pp. 4083-4087, doi: 10.21437/Interspeech.2022-10147.

[2] C. K. A. Reddy, V. Gopal and R. Cutler, "DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 2021, pp. 6493-6497, doi: 10.1109/ICASSP39728.2021.9414878.

1.2. Visualization of MOS Predictions

MOS predictions on two unseen datasets: NISQA (top) and IU Bloomington (bottom). Our proposed model based on embeddings extracted from the 10th layer of the pre-trained XLS-R 2B outperforms DNSMOS and the MFCC baseline. The human ACRs are also visualized for the IUB corpus.

1.3. Example Audio Segments

🔊

Excellent (MOS = 4.808)

Audio Sample	Model	Prediction	Error
iub-excellent.mp4	DNSMOS	3.699	-1.109
	MFCC Transformer	3.497	-1.311
	XLS-R 2B Layer10 Transformer	3.935	-0.873

🔊

Good (MOS = 4.104)

Audio Sample	Model	Prediction	Error
iub-good.mp4	DNSMOS	3.269	-0.835
	MFCC Transformer	2.498	-1.606
	XLS-R 2B Layer10 Transformer	3.793	-0.311

🔊

Fair (MOS = 3.168)

Audio Sample	Model	Prediction	Error
iub-fair.mp4	DNSMOS	3.309	+0.141
	MFCC Transformer	3.931	+0.763
	XLS-R 2B Layer10 Transformer	3.080	-0.088

🔊

Poor (MOS = 2.240)

Audio Sample	Model	Prediction	Error
iub-poor.mp4	DNSMOS	2.704	+0.464
	MFCC Transformer	1.927	-0.313
	XLS-R 2B Layer10 Transformer	2.284	+0.044

🔊

Bad (MOS = 1.416)

Audio Sample	Model	Prediction	Error
iub-bad.mp4	DNSMOS	2.553	+1.137
	MFCC Transformer	1.806	+0.390
	XLS-R 2B Layer10 Transformer	2.312	+0.896
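In the tables above, the Error column is consistent with the prediction minus the human MOS, so a negative value means the model rates the sample lower than the listeners did. The short sketch below recomputes the column for the "Excellent" sample from the values listed above; the variable and function names are illustrative.

```python
# Values copied from the "Excellent (MOS = 4.808)" table above.
human_mos = 4.808
predictions = {
    "DNSMOS": 3.699,
    "MFCC Transformer": 3.497,
    "XLS-R 2B Layer10 Transformer": 3.935,
}

def signed_error(prediction, mos):
    """Signed error: negative means the model predicts lower quality than listeners."""
    return round(prediction - mos, 3)

errors = {model: signed_error(pred, human_mos) for model, pred in predictions.items()}
for model, error in errors.items():
    print(f"{model}: {error:+.3f}")
```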

2. Installation

First, clone the repository.

git clone https://github.com/lcn-kul/xls-r-analysis-sqa.git

Next, install the requirements to a virtual environment of your choice.

cd xls-r-analysis-sqa/
pip3 install -r requirements.txt

Finally, this code uses truncated XLS-R models. These can be obtained by downloading them from our HuggingFace repositories (recommended, follow [these instructions]) or by downloading the full pre-trained models (follow [these instructions]) and running the script truncate_w2v2.py.

Warning: the truncated XLS-R models total 15 GB, and roughly double that once cloned, since the .git directory is of similar size.
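Given the warning above, it may be worth checking free disk space before downloading. The sketch below uses Python's standard `shutil.disk_usage`; the 30 GB threshold simply doubles the 15 GB figure, and the helper name and the decimal-gigabyte interpretation are assumptions.

```python
import shutil

BYTES_PER_GB = 10**9  # assumption: decimal gigabytes

def has_enough_space(path=".", required_gb=30):
    """Return True if the filesystem containing `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * BYTES_PER_GB

if has_enough_space("."):
    print("Enough free space for the truncated models and their .git data.")
else:
    print("Less than 30 GB free; consider another download location.")
```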

3. Usage

A working example is provided in test_e2e_sqa.py.

4. Citation

@INPROCEEDINGS{10248049,
  author={Tamm, Bastiaan and Vandenberghe, Rik and Van Hamme, Hugo},
  booktitle={2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}, 
  title={Analysis of XLS-R for Speech Quality Assessment}, 
  year={2023},
  volume={},
  number={},
  pages={1-5},
  doi={10.1109/WASPAA58266.2023.10248049}
}
