
Performance degradation for multi-person meetings #29

Closed
PES2g opened this issue Jan 31, 2019 · 3 comments
PES2g commented Jan 31, 2019

In my experiments, the model's performance is fine for conversational telephone speech, but it degrades seriously in multi-person meeting scenarios such as ICSI, where the confusion error can reach 30%. The paper only reports DER on NIST SRE 2000 CALLHOME. Since the paper uses ICSI as part of the training set, did you test the model's performance on ICSI?

@PES2g PES2g added the question Further information is requested label Jan 31, 2019
@wq2012 wq2012 self-assigned this Jan 31, 2019
wq2012 commented Jan 31, 2019

We didn't run any evaluations on ICSI, because we didn't find any benchmark on this dataset, so there was no good baseline to compare with.

About the poor performance you are seeing on ICSI, here are a few possible reasons that come to mind:

  1. Quality of the embeddings. Diarization performance depends heavily on the quality of the speaker embeddings. If you want to see good performance on ICSI, your speaker embedding training set should contain similar data (acoustic environment, microphones, accents, etc.).
  2. Training of UIS-RNN. The UIS-RNN training data should also contain some data similar to ICSI. If all of your training data are significantly different from ICSI, UIS-RNN is expected to fail, because it is a supervised model. The main purpose of UIS-RNN is in-domain training and deployment.
  3. Some hyperparameters may need to be slightly adjusted to perform well on a new dataset; the default parameters are tuned only for CALLHOME (see the sketch after this list).
  4. The current open sourced UIS-RNN is an incomplete version, due to Add support for estimation of crp_alpha #4. We are still working on this.
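On point 3, here is a minimal sketch of where those hyperparameters live, assuming the pip-installable `uisrnn` package from this repo and its `parse_arguments()` / `fit()` / `predict()` API; the specific values and the toy data are illustrative only, not tuned recommendations for ICSI:

```python
import numpy as np
import uisrnn

# parse_arguments() returns three argparse namespaces holding the
# defaults that were tuned for CALLHOME.
model_args, training_args, inference_args = uisrnn.parse_arguments()

# Knobs that may need retuning on a new domain (values are illustrative):
model_args.observation_dim = 256     # must match your d-vector dimension
model_args.crp_alpha = 1.0           # CRP concentration: new-speaker tendency
training_args.learning_rate = 1e-4
training_args.train_iteration = 100  # kept tiny so this toy script finishes

model = uisrnn.UISRNN(model_args)

# Toy stand-in data: real usage would pass in-domain (e.g., ICSI-like)
# d-vector sequences and their per-segment speaker labels.
train_sequence = np.random.rand(60, model_args.observation_dim)
train_cluster_id = np.array(['A'] * 20 + ['B'] * 20 + ['A'] * 20)

model.fit(train_sequence, train_cluster_id, training_args)

test_sequence = np.random.rand(30, model_args.observation_dim)
predicted_cluster_ids = model.predict(test_sequence, inference_args)
```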

PES2g commented Feb 1, 2019

Thanks for your detailed explanation.

In my experiment, I used part of the ICSI data as training data for UIS-RNN.

But for the embeddings, the amount of audio in ICSI is small compared to the speaker embedding model's training dataset, so I fine-tuned the speaker embedding model on ICSI and then used verification accuracy (a sketch of such a check appears below) to judge the fine-tuned embeddings; the improvement was limited. So for a specific new scenario, a large amount of similar speaker utterances is needed to train the speaker embedding model.
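For reference, this is a minimal sketch of one common way to compute such a verification accuracy, assuming the embeddings are plain numpy vectors and using cosine similarity; the function names and the 0.5 threshold are hypothetical:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verification_accuracy(trial_pairs, labels, threshold=0.5):
    # trial_pairs: list of (emb_a, emb_b) numpy vectors.
    # labels: 1 if the pair is same-speaker, 0 otherwise.
    # threshold: similarity above which a pair is accepted as same-speaker.
    correct = 0
    for (a, b), label in zip(trial_pairs, labels):
        accepted = cosine_similarity(a, b) >= threshold
        correct += int(accepted == label)
    return correct / len(trial_pairs)
```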

Finally, I'd like to know whether you have any intuition about how much data (how many hours) is enough for training UIS-RNN.

wq2012 commented Feb 1, 2019

For the five-fold experiments on SRE 2000 disk-8 (CALLHOME), each fold uses 400 utterances for training and 100 utterances for testing. Each utterance is about 1 minute long (some are longer).

So in this setup the training data is about 400 minutes.
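If it helps, the split arithmetic looks like this as a minimal sketch with hypothetical utterance IDs and scikit-learn's KFold (not necessarily how our experiments were scripted): 500 utterances total, so each fold trains on 400 utterances, roughly 400 minutes, and tests on 100.

```python
from sklearn.model_selection import KFold

utterance_ids = ['utt_%03d' % i for i in range(500)]  # stand-in for CALLHOME
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kfold.split(utterance_ids)):
    # 400 training utterances * ~1 minute each ≈ 400 minutes per fold.
    print('fold %d: %d train, %d test' % (fold, len(train_idx), len(test_idx)))
```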

@wq2012 wq2012 closed this as completed Apr 12, 2019