
Performance degradation for multi-person meetings #29

Closed
PES2g opened this issue Jan 31, 2019 · 3 comments
PES2g commented Jan 31, 2019

In my experiments, the model's performance is fine for conversational telephone speech, but it degrades seriously in multi-person meeting scenarios such as ICSI, where the confusion error can reach 30%. The paper only reports DER on NIST SRE 2000 CALLHOME. Since the paper uses ICSI as part of the training set, did you test the model's performance on ICSI?

@PES2g PES2g added the question Further information is requested label Jan 31, 2019
@wq2012 wq2012 self-assigned this Jan 31, 2019
wq2012 commented Jan 31, 2019

We didn't run any evaluations on ICSI, because we didn't find any benchmark on this dataset, so there was no good baseline to compare with.

About the poor performance you are seeing on ICSI, here are a few possible reasons that come to mind:

  1. Quality of the embeddings. Diarization performance depends heavily on the quality of the speaker embeddings. If you want to see good performance on ICSI, your speaker embedding training set should contain similar data (acoustic environment, microphones, accents, etc.).
  2. Training of UIS-RNN. The UIS-RNN training data should also contain some data similar to ICSI. If all of your training data are significantly different from ICSI, UIS-RNN is expected to fail, because it is a supervised model. The main purpose of UIS-RNN is in-domain training and deployment.
  3. Some hyperparameters may need to be slightly adjusted to perform well on a new dataset; the default parameters are tuned only for CALLHOME (see the sketch after this list).
  4. The current open sourced UIS-RNN is an incomplete version, due to Add support for estimation of crp_alpha #4. We are still working on this.
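On point 3, here is a minimal sketch of where those hyperparameters live, assuming the pip-installable `uisrnn` package from this repo and its `parse_arguments()` / `fit()` / `predict()` API; the specific values and the toy data are illustrative only, not tuned recommendations for ICSI:

```python
import numpy as np
import uisrnn

# parse_arguments() returns three argparse namespaces holding the
# defaults that were tuned for CALLHOME.
model_args, training_args, inference_args = uisrnn.parse_arguments()

# Knobs that may need retuning on a new domain (values are illustrative):
model_args.observation_dim = 256     # must match your d-vector dimension
model_args.crp_alpha = 1.0           # CRP concentration: new-speaker tendency
training_args.learning_rate = 1e-4
training_args.train_iteration = 100  # kept tiny so this toy script finishes

model = uisrnn.UISRNN(model_args)

# Toy stand-in data: real usage would pass in-domain (e.g., ICSI-like)
# d-vector sequences and their per-segment speaker labels.
train_sequence = np.random.rand(60, model_args.observation_dim)
train_cluster_id = np.array(['A'] * 20 + ['B'] * 20 + ['A'] * 20)

model.fit(train_sequence, train_cluster_id, training_args)

test_sequence = np.random.rand(30, model_args.observation_dim)
predicted_cluster_ids = model.predict(test_sequence, inference_args)
```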

PES2g commented Feb 1, 2019

Thanks for your detailed explanation.

In my experiment, I used part of the ICSI data as training data for UIS-RNN.

But for the embeddings, the amount of audio in ICSI is small compared to the speaker embedding model's training dataset, so I fine-tuned the speaker embedding model on ICSI and then used verification accuracy (a sketch of such a check appears below) to judge the fine-tuned embeddings; the improvement was limited. So for a specific new scenario, a large amount of similar speaker utterances is needed to train the speaker embedding model.
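For reference, this is a minimal sketch of one common way to compute such a verification accuracy, assuming the embeddings are plain numpy vectors and using cosine similarity; the function names and the 0.5 threshold are hypothetical:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verification_accuracy(trial_pairs, labels, threshold=0.5):
    # trial_pairs: list of (emb_a, emb_b) numpy vectors.
    # labels: 1 if the pair is same-speaker, 0 otherwise.
    # threshold: similarity above which a pair is accepted as same-speaker.
    correct = 0
    for (a, b), label in zip(trial_pairs, labels):
        accepted = cosine_similarity(a, b) >= threshold
        correct += int(accepted == label)
    return correct / len(trial_pairs)
```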

Finally, I'd like to know whether you have any intuition about how much data (how many hours) is enough for training UIS-RNN.

wq2012 commented Feb 1, 2019

For the five-fold experiments on SRE 2000 disk-8 (CALLHOME), each fold uses 400 utterances for training and 100 utterances for testing. Each utterance is about 1 minute long (some are longer).

So in this setup the training data is about 400 minutes.
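If it helps, the split arithmetic looks like this as a minimal sketch with hypothetical utterance IDs and scikit-learn's KFold (not necessarily how our experiments were scripted): 500 utterances total, so each fold trains on 400 utterances, roughly 400 minutes, and tests on 100.

```python
from sklearn.model_selection import KFold

utterance_ids = ['utt_%03d' % i for i in range(500)]  # stand-in for CALLHOME
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kfold.split(utterance_ids)):
    # 400 training utterances * ~1 minute each ≈ 400 minutes per fold.
    print('fold %d: %d train, %d test' % (fold, len(train_idx), len(test_idx)))
```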

@wq2012 wq2012 closed this as completed Apr 12, 2019