
Fix instantaneous speaker count exceeding max_speakers or detected number of clusters #1351

Merged: 11 commits merged into pyannote:develop on Nov 16, 2023

Conversation

@flyingleafe (Contributor) commented Apr 27, 2023

The number of labels in the diarization result was based on the output of the speaker_count method, which accounted for neither the max_speakers variable nor the number of speakers actually detected after clustering. This introduced spurious extra speakers into the diarization result: the output could contain more labels than the provided max_speakers, or more labels than the number of clusters found by the clustering algorithm. This PR fixes that.

Additionally, the extra binarization step was removed from the speaker_count method: it now accepts segmentations that are already binarized, so binarization is no longer performed twice in the pipeline.
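The two changes can be sketched roughly as follows. This is an illustrative standalone function, not the actual pyannote API: the function signature, argument names, and array shapes are assumptions. The idea is that speaker_count receives already-binarized segmentations (frames x speakers, values 0/1) and caps the per-frame count by both the user-provided max_speakers and the number of detected clusters.

```python
import numpy as np

def speaker_count(binarized, max_speakers=np.inf, num_clusters=np.inf):
    """Instantaneous speaker count from already-binarized segmentations.

    binarized: (num_frames, num_speakers) array of 0/1 speaker activations.
    The per-frame count is capped by both the user-provided max_speakers
    and the number of clusters found by the clustering step.
    """
    count = np.sum(binarized, axis=1)      # active speakers per frame
    cap = min(max_speakers, num_clusters)  # tightest upper bound
    return np.minimum(count, cap)

# Example: segmentation claims 3 simultaneous speakers in one frame,
# but clustering only found 2 clusters, so the count is capped at 2.
binarized = np.array([[1, 0, 0],
                      [1, 1, 0],
                      [1, 1, 1]])
print(speaker_count(binarized, num_clusters=2))  # [1 2 2]
```

Without the cap, the last frame would report 3 speakers even though only 2 labels can ever appear in the output, which is exactly the bug described above.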

codecov bot commented Apr 27, 2023

Codecov Report

Patch coverage has no change and project coverage change: -0.21 ⚠️

Comparison is base (a581536) 33.00% compared to head (e289d18) 32.79%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1351      +/-   ##
===========================================
- Coverage    33.00%   32.79%   -0.21%     
===========================================
  Files           65       65              
  Lines         4109     4135      +26     
===========================================
  Hits          1356     1356              
- Misses        2753     2779      +26     
Impacted Files Coverage Δ
pyannote/audio/pipelines/clustering.py 0.00% <0.00%> (ø)
pyannote/audio/pipelines/speaker_diarization.py 0.00% <0.00%> (ø)
pyannote/audio/pipelines/utils/diarization.py 0.00% <0.00%> (ø)


@hbredin (Member) commented Jun 23, 2023

There is a lot going on in this PR.

Can we just focus on fixing speaker_count to honor max_speakers constraint?

@flyingleafe (Contributor, Author) replied:

@hbredin Okay, no problem, I will move the k-means fallback work into a separate branch.

@flyingleafe force-pushed the fix-max-speakers-count branch 2 times, most recently from 6b118f9 to 6391cd6 on July 4, 2023 09:42
@flyingleafe (Contributor, Author) commented:

@hbredin I only left the first commit. Now the PR only does the following:

  • speaker_count accepts binarized segmentations (one of your TODOs, and very straightforward: binarization is no longer done twice)
  • speaker_count cannot exceed max_speakers or number of centroids.

The other change (falling back to k-means when agglomerative clustering finds no good cluster assignment) is indeed a separate part of the problem; I can open a separate PR for it later, in its own branch, with an argument for why I think it is necessary.

@hbredin (Member) commented Jul 4, 2023

SpeakerDiarizationMixin.speaker_count is also used in Resegmentation pipeline, which is therefore broken by this PR.

Two options:

  1. update Resegmentation pipeline to also use this new API
  2. narrow down this PR to only fix the max_speakers thing.

@flyingleafe (Contributor, Author) replied:

@hbredin Good catch; I went with option 1.

BTW, I am willing to contribute some basic smoke tests for the pipelines in a separate PR. Currently, test coverage of the entire pyannote.audio.pipelines module is 0%, which is why such silly bugs are not caught immediately by CI.

Comment on lines 533 to 537
# quick sanity check
assert (
    num_different_speakers >= min_speakers
    and num_different_speakers <= max_speakers
)
@hbredin (Member) commented:

Does that assert ever fail?

If it does, we should handle it or this will raise an error.

@flyingleafe (Contributor, Author) replied:

@hbredin Good catch. With the k-means fallback removed from this PR (I use it in my application anyway), this assert can definitely fail, and surely will in some circumstances due to the min_cluster_size parameter. The best we can do here, I think, is to issue a warning.
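A warning in place of the assert could look something like this. This is a sketch only; the function name and message are illustrative, not the actual pyannote code.

```python
import warnings

def check_speaker_count(num_different_speakers, min_speakers, max_speakers):
    """Warn, instead of asserting, when the detected number of speakers
    falls outside the [min_speakers, max_speakers] range requested by the
    user. This can legitimately happen, e.g. because of the
    min_cluster_size parameter of agglomerative clustering.
    """
    if not min_speakers <= num_different_speakers <= max_speakers:
        warnings.warn(
            f"Detected number of speakers ({num_different_speakers}) is "
            f"outside the given bounds [{min_speakers}, {max_speakers}]. "
            "This can happen if the bounds are too narrow for the data."
        )
```

Unlike the assert, this keeps the pipeline running while still surfacing the constraint violation to the user.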

# during counting, we could possibly overcount the number of instantaneous
# speakers due to segmentation errors, so we cap the maximum instantaneous number
# of speakers by the number of detected clusters
count.data = np.minimum(count.data, num_different_speakers)
@hbredin (Member) commented Jul 6, 2023:

This needs further investigation on my side (e.g. to check that this change does not degrade the overall performance on my usual benchmarks) which might take some time...

Indeed, there might be cases where the embedding/clustering step would (wrongly) decide that there is only 1 speaker in the recording, while the segmentation step actually (correctly) detects 2 overlapping speakers for certain frames.

@flyingleafe (Contributor, Author) replied:

@hbredin I removed the assert. In the meantime, did you have time to run the benchmarks? I can do so myself if you tell me which data you use and how exactly you run them.

@hbredin (Member) commented Jul 16, 2023

> @hbredin I removed the assert. In the meanwhile, did you have time to run benchmarks? I can do so myself if you tell which data you use and how exactly you run those.

I just ran a few benchmarks (with neither min_speakers nor max_speakers constraint) and this change actually slightly degrades the performance (though the difference is definitely not statistically significant). More precisely, it increases missed detection rate more than it reduces false alarm rate.

That being said, I do think you have a point in the case where max_speakers is provided by the user: the pipeline does need to honor this constraint. Can you please update the PR to only cap count.data when it goes above max_speakers?

Also, I think having the change in both Pipeline.speaker_count() and Pipeline.apply() is a bit redundant. You should remove it from Pipeline.speaker_count() and only do the change just before Pipeline.reconstruct().
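The two requests above (cap the count only when it exceeds max_speakers, and apply the cap in one place, just before Pipeline.reconstruct()) could be sketched as follows. The helper name and call site are assumptions for illustration, not the actual pyannote code.

```python
import numpy as np

def cap_instantaneous_count(count_data, max_speakers):
    """Cap the per-frame speaker count at max_speakers, leaving frames
    already within the bound untouched. Intended to be applied exactly
    once, just before the reconstruction step of the pipeline.
    """
    return np.minimum(count_data, max_speakers)

# Frames where segmentation claims 4 simultaneous speakers are capped at 3;
# all other frames pass through unchanged.
count = np.array([0, 1, 3, 2, 4])
print(cap_instantaneous_count(count, max_speakers=3))  # [0 1 3 2 3]
```

Doing this in a single place avoids the redundancy of capping in both speaker_count() and apply(), and without a user-provided max_speakers the count is left entirely alone, which matches the benchmark observation above.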

@flyingleafe (Contributor, Author) replied:

@hbredin Sorry for taking so long to get back to this; your last comment has been addressed.

@hbredin (Member) left a review:

A few comments and a question.

Two review threads on pyannote/audio/pipelines/speaker_diarization.py (outdated; resolved)
@flyingleafe (Contributor, Author) replied:

@hbredin Sorry for the long delay; my primary work was really heavy these last couple of months. I have addressed your last comments.

@hbredin hbredin merged commit bbc8044 into pyannote:develop Nov 16, 2023
3 checks passed
@hbredin (Member) commented Nov 16, 2023

🎉 Merged! Thanks @flyingleafe!
