Replies: 1 comment
@jongwook do you think this could be helpful for Whisper?
-
Hi all,
Has anyone benchmarked Whisper inference? For example, have you run Whisper on a suitably large dataset (say, 1 to 5 hours of audio) under NVIDIA Nsight Systems?
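For reference, here is a minimal sketch of the kind of run I mean; the model size ("small") and audio path ("audio.wav") are placeholders:

```python
# profile_whisper.py: a minimal Whisper run to trace under Nsight Systems.
# "small" and "audio.wav" are placeholders; any model size and clip work.
import whisper

model = whisper.load_model("small")

# beam_size > 1 exercises the beam search decoding path we want to observe
result = model.transcribe("audio.wav", beam_size=5)
print(result["text"])
```

Running it with, e.g., `nsys profile -o whisper_report python profile_whisper.py` should show whether there are gaps in the GPU timeline while the Python-side search runs between decoder forward passes.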
I recently wrote a GPU-accelerated beam search decoder for a customer using a similarly architected (Transformer-based) speech recognition model. Their original pipeline was particularly slow because beam search ran on the CPU in Python, taking about 50% of total inference time (!). Whisper also does its beam search on the CPU in Python, so I have a hunch it may have a similar bottleneck.
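One cheap way to test that hunch without Nsight is PyTorch's built-in profiler; a sketch, assuming the same placeholder model and audio file as above (CUDA activity naturally requires a GPU):

```python
import whisper
from torch.profiler import profile, ProfilerActivity

model = whisper.load_model("small")  # placeholder size, as above

# Aggregate CPU time (which includes the Python beam search loop) vs.
# CUDA time (encoder/decoder forward passes) over one transcription.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model.transcribe("audio.wav", beam_size=5)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```

If self CPU time dwarfs CUDA time during decoding, that would support the CPU beam search hypothesis.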
I am working on including it in https://github.com/nvidia-riva/riva-asrlib-decoder/ so that others can benefit, and Whisper comes to mind as a potential beneficiary. My implementation is missing patience, which I know Whisper's Python CPU implementation has, but that can be added.
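For anyone unfamiliar: as I understand Whisper's implementation, patience (Freitag & Al-Onaizan, 2017) relaxes the beam search stopping criterion, so the search keeps collecting finished hypotheses until roughly beam_size * patience of them are complete instead of stopping at beam_size. A minimal sketch of that stopping rule (the function name and signature are illustrative, not Whisper's API):

```python
def beam_search_is_finished(finished_hypotheses, beam_size, patience=1.0):
    """Illustrative stopping rule only, not Whisper's actual API.

    With patience p, the search keeps going until round(beam_size * p)
    finished hypotheses have been collected; patience=1.0 reduces to
    ordinary beam search termination after beam_size finished hypotheses.
    """
    max_candidates = round(beam_size * patience)
    return len(finished_hypotheses) >= max_candidates
```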
If anyone could let me know (1) whether beam search is a bottleneck in Whisper, (2) if not, what the bottlenecks are, and (3) whether this repo is open to this sort of contribution, that would definitely help me prioritize. Many thanks.