In my experiments, the model performs fine on conversational telephone speech. But its performance degrades seriously in multi-person meeting scenarios such as ICSI, where the confusion error can reach 30%. The paper only reports DER for NIST SRE 2000 CALLHOME. Since the paper uses ICSI as part of the training set, did you test the model's performance on ICSI?
We didn't run any evaluations on ICSI because we couldn't find any benchmark on this dataset, so there is no good baseline to compare with.
About the poor performance you are seeing on ICSI, here are a few possible reasons I have in mind:
Quality of the embeddings. The diarization performance significantly depends on the quality of speaker embeddings. If you want to see good performance on ICSI, your speaker embedding training set should contain similar data (acoustic environment, microphone, accents, etc.).
Training of UIS-RNN. The training data of UIS-RNN should also contain some data similar to ICSI. If all your training data are significantly different from ICSI, UIS-RNN is expected to fail, because it is supervised. UIS-RNN is mainly intended for in-domain training and deployment.
Hyperparameters. Some hyperparameters may need to be slightly changed to perform well on a new dataset. The default parameters perform well only on CALLHOME.
In my experiment, I used part of the ICSI data as training data for UIS-RNN.
For the embeddings, however, the amount of audio in ICSI is small compared to the speaker embedding training set, so I fine-tuned the speaker embedding on ICSI and used verification accuracy to judge the fine-tuned embedding. I found that the improvement was limited. So for a specific new scenario, a large number of similar speaker utterances is needed to train the speaker embedding.
Finally, I would like to know whether you have any intuition about how much data (how many hours) is enough for training UIS-RNN.
For the five-fold experiments on SRE 2000 disk-8 (CALLHOME), each fold uses 400 utterances for training and 100 utterances for testing. Each utterance is about 1 minute long (some are longer).
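To turn those numbers into hours, the split can be sketched as below. This is only an illustration under stated assumptions: the total of 500 utterances is inferred from 400 + 100, the random shuffling and the helper name `five_fold_splits` are hypothetical, and the 1-minute duration is the approximate figure quoted above, so the result is a rough lower bound.

```python
import random

def five_fold_splits(num_utterances=500, num_folds=5, seed=0):
    """Return (train_ids, test_ids) index lists, one pair per fold."""
    indices = list(range(num_utterances))
    random.Random(seed).shuffle(indices)  # assumed: random assignment to folds
    fold_size = num_utterances // num_folds
    folds = [indices[i * fold_size:(i + 1) * fold_size]
             for i in range(num_folds)]
    splits = []
    for k in range(num_folds):
        test_ids = folds[k]
        train_ids = [idx for j, fold in enumerate(folds) if j != k
                     for idx in fold]
        splits.append((train_ids, test_ids))
    return splits

splits = five_fold_splits()
train_ids, test_ids = splits[0]

# Assumed average duration of about 1 minute per utterance.
approx_train_hours = len(train_ids) * 1.0 / 60.0
print(len(train_ids), len(test_ids), round(approx_train_hours, 1))
```

Under these assumptions, each fold trains UIS-RNN on roughly 6.7 hours of audio, or somewhat more given that some utterances run longer than a minute.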