
Can I simplify Inference in this case? #1527

Closed
zuowanbushiwo opened this issue Nov 3, 2023 · 4 comments

Comments


zuowanbushiwo commented Nov 3, 2023

Hi hbredin,

Thanks for your open-source work! I read the article "pyannote.audio speaker diarization pipeline at VoxSRC 2023". The full speaker diarization pipeline contains 3 parts: local end-to-end neural speaker segmentation on 10-second windows with a 1-second stride, neural speaker embedding of each speaker in each window, and agglomerative hierarchical clustering.

If I have a 2-minute audio clip and I'm sure there are only 3 speakers in the entire clip, can I remove the speaker embedding and clustering steps and instead run a single global end-to-end neural speaker segmentation pass over the whole recording? Would this cause speaker confusion, or are there other adverse effects?

Thanks!


github-actions bot commented Nov 3, 2023

Thank you for your issue.
We found the following entries in the FAQ which you may find helpful:

Feel free to close this issue if you found an answer in the FAQ.

If your issue is a feature request, please read this first and update your request accordingly, if needed.

If your issue is a bug report, please provide a minimum reproducible example as a link to a self-contained Google Colab notebook containing everything needed to reproduce the bug:

  • installation
  • data preparation
  • model download
  • etc.

Providing an MRE will increase your chance of getting an answer from the community (either maintainers or other power users).

Companies relying on pyannote.audio in production may contact me via email regarding:

  • paid scientific consulting around speaker diarization and speech processing in general;
  • custom models and tailored features (via the local tech transfer office).

This is an automated reply, generated by FAQtory


hbredin commented Nov 5, 2023

> If I have a 2-minute audio clip and I'm sure there are only 3 speakers in the entire clip, can I remove the speaker embedding and clustering steps and instead run a single global end-to-end neural speaker segmentation pass over the whole recording? Would this cause speaker confusion, or are there other adverse effects?

I'd suggest you try and report back :-)

@zuowanbushiwo changed the title from "Can I simplify inter in this case?" to "Can I simplify Inference in this case?" on Nov 6, 2023
zuowanbushiwo (Author) commented:

Hi hbredin,
My test results with this approach look OK. I use only the segmentation part, with the entire recording as input.
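For readers wondering what the "segmentation-only" simplification looks like in practice, here is a minimal toy sketch. It is an illustration of the idea, not pyannote.audio's actual API: the activation matrix, frame step, and threshold values are hypothetical stand-ins for what a global segmentation pass would produce. The point is that with a single pass over the whole file, the model's speaker axis is already globally consistent, so simple thresholding yields a diarization without embeddings or clustering.

```python
import numpy as np

# Hypothetical frame step in seconds (a stand-in for the model's real frame rate).
FRAME_STEP = 0.5

def activations_to_turns(activations, threshold=0.5, frame_step=FRAME_STEP):
    """Convert a (num_frames, num_speakers) activation matrix into
    (start_sec, end_sec, speaker_index) turns by per-speaker thresholding."""
    turns = []
    for spk in range(activations.shape[1]):
        active = activations[:, spk] >= threshold
        start = None
        for i, is_active in enumerate(active):
            if is_active and start is None:
                start = i                       # speaker turn begins
            elif not is_active and start is not None:
                turns.append((start * frame_step, i * frame_step, spk))
                start = None                    # speaker turn ends
        if start is not None:                   # turn still open at end of file
            turns.append((start * frame_step, len(active) * frame_step, spk))
    return sorted(turns)

# Toy activations: speaker 0 talks first, speaker 1 second, with brief overlap.
acts = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.6],
    [0.2, 0.9],
    [0.1, 0.8],
])
print(activations_to_turns(acts))
# [(0.0, 1.5, 0), (1.0, 2.5, 1)]
```

Note that this only works because a single global pass keeps one consistent speaker axis for the whole recording; with the sliding-window pipeline, speaker indices are only locally consistent per window, which is exactly why the embedding and clustering stages exist.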


hbredin commented Nov 16, 2023

Closing as it reads like the initial question has been answered.

@hbredin hbredin closed this as completed Nov 16, 2023