How can I make one audio always produce exactly the same result? #40
-
Hello author, thank you for open-sourcing the UTMOSv2 model code. The integration of the visual model and SSL features in this work is an excellent approach. However, I have a question. Unlike UTMOSv1, when I tested UTMOSv2 (using the Hugging Face demo with the default parameter settings, link to Hugging Face demo), I noticed that the model's outputs vary: even with the same audio and the same domain, it produces different predictions. I am aware of the random slicing in the dataset pipeline, so I specifically limited the audio to exactly 3 seconds (without changing any parameter settings), yet the model still predicts different scores. Why is that? PS: I haven't gone through all the internal code. Also, as the title says, how can I get a stable score prediction?
-
Problem solved. There is no randomness inside the model itself; the nondeterminism comes from the random slicing. I found that modifying the `utmosv2.dataset._utils.select_random_start` function achieves non-random slicing: simply change `return y[start : start + length]` to `return y[:length]`.
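For reference, here is a minimal sketch of that change. Only the module path, function name, and the two return statements come from this thread; the exact signature and the commented-out original logic are assumptions, so treat this as illustrative rather than the actual repository code:

```python
import numpy as np

# Hypothetical sketch of utmosv2/dataset/_utils.py::select_random_start.
# The real signature may differ; only the slicing change below is what the
# reply above describes.
def select_random_start(y: np.ndarray, length: int) -> np.ndarray:
    # Original (non-deterministic) behaviour: pick a random start offset,
    # so the same audio can yield different crops and different scores.
    # start = np.random.randint(0, max(len(y) - length, 1))
    # return y[start : start + length]

    # Deterministic variant: always slice from the beginning of the waveform,
    # so the same input audio always produces the same crop.
    return y[:length]
```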