
Massive slowdown in 3.0.0 version from 2.1.1 #1523

Closed
mochan-b opened this issue Nov 2, 2023 · 14 comments

Comments

@mochan-b

mochan-b commented Nov 2, 2023

I've been using version 2.1.1 and it would process 1 hour of audio in around 15 minutes or so.

In version 3.0.1, it's been more than 1.5 hours and it's still not done with 1 hour of audio.

In both cases I've been using pipeline.to(torch.device("cuda"))

So, for 2.1.1 I'm not using the PyPI release but a version pip-installed from GitHub that has the to function added.

I've only done some basic investigation. It seems like GPU utilization is lower and CPU utilization is limited to a single core.

Before, there was more CPU utilization across more cores.

Profiling shows most of the time is spent in torchaudio functions. There is a new message about backends in torchaudio. Could that be a cause?

Is there anything else I can look at to narrow down where the performance problem lies?

My code is essentially this. I tried both mp3 and wav files for audio_file.

pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization',
                                    use_auth_token=HUGGING_FACE_API_KEY)
pipeline.to(torch.device("cuda"))
diarization = pipeline(audio_file)

github-actions bot commented Nov 2, 2023

Thank you for your issue. You might want to check the FAQ if you haven't done so already.

Feel free to close this issue if you found an answer in the FAQ.

If your issue is a feature request, please read this first and update your request accordingly, if needed.

If your issue is a bug report, please provide a minimum reproducible example as a link to a self-contained Google Colab notebook containing everything needed to reproduce the bug:

  • installation
  • data preparation
  • model download
  • etc.

Providing an MRE will increase your chance of getting an answer from the community (either maintainers or other power users).

Companies relying on pyannote.audio in production may contact me via email regarding:

  • paid scientific consulting around speaker diarization and speech processing in general;
  • custom models and tailored features (via the local tech transfer office).

This is an automated reply, generated by FAQtory

@hbredin
Member

hbredin commented Nov 2, 2023

Try loading the audio first maybe?

from pyannote.audio import Audio
io = Audio(mono='downmix', sample_rate=16000)
waveform, sample_rate = io(audio_file)

diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

@grazder

grazder commented Nov 2, 2023

I've got the same problem. I tried replacing the ONNX providers for the embedding model, which gave me full GPU utilization, but the pipeline started working slower. So in my opinion the bottleneck here isn't the model; maybe it's cropping the audio, or resampling, or something else.

I load the audio first and also tried resampling it before feeding the audio into the model.

You can also increase the batch sizes, which gave a speed-up.
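
For example, something roughly like this (a sketch, untested; I'm assuming the pretrained pipeline exposes embedding_batch_size the way the SpeakerDiarization pipeline constructor does, so check your installed version):

import torch
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization',
                                    use_auth_token=HUGGING_FACE_API_KEY)
# assumed attribute (name taken from the SpeakerDiarization constructor);
# raise it as far as your GPU memory allows
pipeline.embedding_batch_size = 32
pipeline.to(torch.device("cuda"))
diarization = pipeline(audio_file)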

@martinkallstrom

martinkallstrom commented Nov 2, 2023

I also have the same problem and am working on it as we speak. Resampling from 48000 to 16000 Hz gives a little speed bump but nothing remarkable (~10% faster). Trying to move the waveform to the GPU with waveform = waveform.to(torch.device("cuda")) results in an exception in the speaker verification step:

  File "/workspace/reason-api/diarize.py", line 75, in __from_audio
    self.annotation, self.embeddings = pipeline(audio, return_embeddings=True, min_speakers=1, max_speakers=3)
  File "/workspace/reason-api/venv/lib/python3.10/site-packages/pyannote/audio/core/pipeline.py", line 325, in __call__
    return self.apply(file, **kwargs)
  File "/workspace/reason-api/venv/lib/python3.10/site-packages/pyannote/audio/pipelines/speaker_diarization.py", line 512, in apply
    embeddings = self.get_embeddings(
  File "/workspace/reason-api/venv/lib/python3.10/site-packages/pyannote/audio/pipelines/speaker_diarization.py", line 344, in get_embeddings
    embedding_batch: np.ndarray = self._embedding(
  File "/workspace/reason-api/venv/lib/python3.10/site-packages/pyannote/audio/pipelines/speaker_verification.py", line 609, in __call__
    input_feed={"feats": masked_feature.numpy()[None]},
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

@grazder

grazder commented Nov 2, 2023

Yeah, you don't need to transfer the waveform to CUDA, because the model runs inference with the ONNX runtime, which requires a NumPy array as input.

@martinkallstrom

What then could be the cause of the slow inference? Diarization of 38 seconds of audio takes ~10 s, even when run repeatedly and measured after the models should already be loaded onto the GPU. That is a surprisingly high real-time factor.

@grazder

grazder commented Nov 2, 2023

I can't identify the exact reason right now, but I see problems with the batching being done in a for loop.

Here

and here

for f, (feature, imask) in enumerate(zip(features, imasks)):

I also just tried adding .cuda() to the input waveform,
and added .cpu() here:

input_feed={"feats": masked_feature.numpy()[None]},

And I got a huge increase in speed: 204 seconds of audio now runs inference in 2 seconds instead of the previous 28 seconds.
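
Roughly this change, in the same spot as the line quoted above (just a sketch of what I changed locally, not a proper patch):

# move the masked features back to host memory before the ONNX runtime call,
# since onnxruntime only accepts NumPy arrays
input_feed={"feats": masked_feature.cpu().numpy()[None]},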

@grazder

grazder commented Nov 2, 2023

So IMO we lose speed in the feature generation on the CPU.

@grazder

grazder commented Nov 2, 2023

@hbredin Are you planning to change this batch-processing scheme? For the model, I think you would need to adapt the original wespeaker-voxceleb-resnet34-LM so that it takes masks as input.

For feature generation, I don't see any options for masking or batching in the docs.

You could try a different feature generation method that supports batching, I guess.

@grazder

grazder commented Nov 2, 2023

I also found out that a simple .cuda() cast won't work here:

features = self.compute_fbank(waveforms)

This doesn't give as much of a speed boost as casting the audio to the GPU at the input to the pipeline, so there are problems somewhere else.

@mochan-b
Author

mochan-b commented Nov 3, 2023

Try loading the audio first maybe?

from pyannote.audio import Audio
io = Audio(mono='downmix', sample_rate=16000)
waveform, sample_rate = io(audio_file)

diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

Thank you, that solved my problem.

On my 5-minute test file, it was taking 95 seconds with 3.0.1; with 2.1.1, it was taking 9 seconds.

After passing the down-sampled mono waveform, it is now processing it in 6.7 seconds.
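
For reference, my code now looks roughly like this (HUGGING_FACE_API_KEY and audio_file are placeholders):

import torch
from pyannote.audio import Audio, Pipeline

pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization',
                                    use_auth_token=HUGGING_FACE_API_KEY)
pipeline.to(torch.device("cuda"))

# decode, downmix, and resample the audio once up front, then pass the
# in-memory waveform to the pipeline instead of the file path
io = Audio(mono='downmix', sample_rate=16000)
waveform, sample_rate = io(audio_file)

diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})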

@mochan-b mochan-b closed this as completed Nov 3, 2023
@jocastrocUnal

Try loading the audio first maybe?

from pyannote.audio import Audio
io = Audio(mono='downmix', sample_rate=16000)
waveform, sample_rate = io(audio_file)

diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

This was my solution also. Thanks

@hbredin
Member

hbredin commented Nov 9, 2023

FYI: #1537

@hbredin
Member

hbredin commented Nov 16, 2023

Latest version no longer relies on ONNX runtime.
Please update to pyannote.audio 3.1 and pyannote/speaker-diarization-3.1 (and open new issues if needed).
