Performance benchmark of different GPUs #918
-
Nice test. I have a test too, but a primitive one: How Good is RTX 3060 for ML AI Deep Learning Tasks and Comparison With GTX 1050 Ti and i7 10700F CPU
-
Are the scores for the 2080 Ti flipped? Or what could explain FP16 being significantly faster than FP32 on that GPU, when on all the others FP32 is faster or about the same?
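For anyone digging into this, one way to check whether a card has the tensor cores that make FP16 fast is to query its compute capability (tensor cores arrived with capability 7.0; the 2080 Ti is 7.5, while Pascal cards are 6.x). A minimal sketch, assuming a CUDA build of PyTorch:

import torch

# Query the CUDA compute capability of device 0.
# Fast FP16 matmul via tensor cores arrived with capability 7.0 (Volta);
# the 2080 Ti reports (7, 5), Pascal cards report (6, x).
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("Likely has fast FP16 (tensor cores):", major >= 7)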
-
Hi there, would you expect something like an RTX 4070 to perform better than a Tesla T4? Thanks in advance :)
-
NVIDIA Jetson Xavier AGX 32GB
Medium FP16: 535.19s
More practical model sizes for this hardware:
Tiny FP16: 82.46s
-
I made a comprehensive test in this video for those who are interested: 28.) Automatic1111 Web UI - PC - Free
-
Great work.
-
Has anyone tested with the 8GB 4060 Ti?
-
A MacBook Pro 14 M2 Max with 32GB memory got this result. Code:

import torch
import time
import whisper
import os

# Run Whisper on the CPU; fp16 stays False because FP16 inference
# is not supported on CPU.
device = torch.device('cpu')

model_list = ['medium', 'large-v2']
fp16_bool = [False]
path = './benchmark/'
file_list = os.listdir(path)

for i in model_list:
    for k in fp16_bool:
        model = whisper.load_model(name=i, device=device)
        duration_sum = 0
        for j in file_list:
            audio = whisper.load_audio(path + j, sr=16000)
            start = time.time()
            result = model.transcribe(audio, language='en', task='transcribe', fp16=k)
            end = time.time()
            duration_sum += end - start
        print("{} model with fp16 {} costs {:.2f}s".format(i, k, duration_sum))
        del model
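For anyone who wants to try the GPU on Apple Silicon instead of the CPU, one variant worth testing is selecting PyTorch's MPS backend when it is available. This is only a sketch and an assumption on my part; Whisper's MPS support has historically been incomplete:

import torch

# Prefer Apple's Metal Performance Shaders backend when PyTorch exposes it,
# falling back to CPU otherwise. Keep fp16=False when running on CPU.
if torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')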
-
Hi @chesha1, by any chance have you done any benchmarking with the Tesla P40? Prices have come down for that card, but I have seen comments stating that the P40 lacks FP16 support. I was just wondering how the P40 would perform.
-
Has anybody tested it on the AMD Radeon PRO W7000 series (https://www.amd.com/en/graphics/workstations) and ROCm 7?
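For what it's worth, ROCm builds of PyTorch expose AMD GPUs through the same CUDA-named API, so the benchmark script should run unmodified. A quick sanity check, assuming a ROCm build of torch is installed:

import torch

# On ROCm builds, torch.cuda.is_available() returns True and
# torch.version.hip is set; the 'cuda' device maps to the AMD GPU.
print("GPU available:", torch.cuda.is_available())
print("HIP version:", torch.version.hip)
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))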
-
Had the following results on an L4 with your script. Debian 12, CUDA 12.0, torch 2.0.1; not sure how comparable that is. Average power draw was 60W. Wish they'd release an L16.
-
Does the performance improve when using Windows without PyTorch? It seems that using DirectX on Windows can allow non-CUDA GPUs to work, with better VRAM use.
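If anyone wants to test that route, Microsoft's torch-directml package exposes a DirectML-backed device to PyTorch. Whether Whisper runs on it end to end is a separate question, so treat this as a sketch, assuming pip install torch-directml:

import torch
import torch_directml

# torch_directml.device() returns a torch.device backed by DirectML,
# which works on non-CUDA GPUs (AMD, Intel) under Windows.
dml = torch_directml.device()
x = torch.randn(4, 4).to(dml)
print(x.device, (x @ x).sum().item())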
-
I benchmarked a Hugging Face Space on free-tier hardware, as well as an i9-13900K and an RTX 3090, all on the same whisper version for comparison. Here is the code I used:

import gradio as gr
import torch
import time
import whisper
import os, sys

def bench(device_name):
    # Pick the requested device, or auto-detect CUDA when none is given.
    if device_name == "":
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    else:
        device = torch.device(device_name)

    # Record the environment so results are comparable across machines.
    # (Guarding the CUDA query keeps this from crashing on CPU-only runs.)
    if torch.cuda.is_available():
        string = torch.cuda.get_device_name(0)
    else:
        string = "CPU"
    string += "\nPython " + sys.version
    string += "\nTorch " + torch.__version__
    string += "\nWhisper " + whisper.__version__ + "\n"
    print(string)

    model_list = ['tiny.en', 'base.en', 'medium.en', 'large-v2']
    fp16_bool = [True, False]
    path = 'benchmark/'
    file_list = os.listdir(path)

    for i in model_list:
        for k in fp16_bool:
            model = whisper.load_model(name=i, device=device)
            duration_sum = 0
            for j in file_list:
                audio = whisper.load_audio(path + j, sr=16000)
                start = time.time()
                result = model.transcribe(audio, language='en', task='transcribe', fp16=k)
                end = time.time()
                duration_sum += end - start
            line = "{} model with fp16 {} costs {:.2f}s".format(i, k, duration_sum)
            print(line)
            string += line + "\n"
            del model
    return string

iface = gr.Interface(fn=bench, inputs="text", outputs="text")
iface.launch()
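One caveat with this timing loop (and the variants above): the first transcribe() call for each model also pays CUDA context and warm-up costs. A hedged tweak is to add one untimed pass right after load_model, for example:

# Warm-up (sketch): run one untimed transcription so CUDA initialization
# doesn't inflate the first measured file. Place after whisper.load_model().
_ = model.transcribe(whisper.load_audio(path + file_list[0], sr=16000), fp16=k)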
-
As a disclaimer, I believe "medium" now defaults to medium-v3, but either way, this is my result: AMD Radeon RX 6900 XT
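One way to check which checkpoint an alias actually resolves to is Whisper's _MODELS table. It is a private internal detail and may change between versions, so this is only a sketch:

import whisper

# _MODELS maps model aliases to checkpoint download URLs, so the URL
# reveals which weights an alias like 'medium' currently points to.
print(whisper._MODELS["medium"])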
-
Hopefully I will publish 5090 tests very soon. Stay tuned.
-
I ran some experiments to measure the time cost of transcription on different GPUs. The results may help you choose which type of GPU to buy or rent.
Environment: PyTorch 1.13, CUDA 11.6, Ubuntu 18.04
Dataset: LJSpeech, from LJ001-0001.wav to LJ001-0150.wav; total length is 971.26s
For transcribing audio files in parallel, please refer to "model.transcribe() modified to perform batch inference on audio files" (#662). That will dramatically speed up your transcription.
The experiment code is below:
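(A sketch of the benchmark loop, following the variants quoted back in the replies above:)

import torch
import time
import whisper
import os

# Use the GPU when available; fall back to CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model_list = ['tiny.en', 'base.en', 'medium.en', 'large-v2']
fp16_bool = [True, False]
path = './benchmark/'  # directory holding LJ001-0001.wav .. LJ001-0150.wav
file_list = os.listdir(path)

for i in model_list:
    for k in fp16_bool:
        model = whisper.load_model(name=i, device=device)
        duration_sum = 0
        for j in file_list:
            audio = whisper.load_audio(path + j, sr=16000)
            start = time.time()
            result = model.transcribe(audio, language='en', task='transcribe', fp16=k)
            end = time.time()
            duration_sum += end - start
        # Total wall-clock time over the whole dataset for this configuration.
        print("{} model with fp16 {} costs {:.2f}s".format(i, k, duration_sum))
        del model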