Performance benchmark of different GPUs #918
-
Nice test. I have a test too, but a primitive one: How Good is RTX 3060 for ML AI Deep Learning Tasks and Comparison With GTX 1050 Ti and i7 10700F CPU
-
Are the scores for the 2080 Ti flipped? Or what could explain FP16 being significantly faster than FP32 on that GPU, when on all the others FP32 is faster or about the same?
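For anyone digging into this, one way to check whether a card has the tensor cores that make FP16 fast is to query its compute capability (tensor cores arrived with capability 7.0; the 2080 Ti is 7.5, while Pascal cards are 6.x). A minimal sketch, assuming a CUDA build of PyTorch:

import torch

# Query the CUDA compute capability of device 0.
# Fast FP16 matmul via tensor cores arrived with capability 7.0 (Volta);
# the 2080 Ti reports (7, 5), Pascal cards report (6, x).
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("Likely has fast FP16 (tensor cores):", major >= 7)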
-
Hi there, would you expect something like an RTX 4070 to perform better than a Tesla T4? Thanks in advance :)
-
NVIDIA Jetson Xavier AGX 32GB
Medium FP16: 535.19s
More practical model sizes for this hardware:
Tiny FP16: 82.46s
-
I made a comprehensive test in this video for those who are interested: 28.) Automatic1111 Web UI - PC - Free
-
Great work.
-
Has anyone tested with the 8GB 4060 Ti?
-
A MacBook Pro 14 M2 Max with 32GB memory got this result. Code:

import torch
import time
import whisper
import os

# Run Whisper on the CPU; fp16 stays False because FP16 inference
# is not supported on CPU.
device = torch.device('cpu')

model_list = ['medium', 'large-v2']
fp16_bool = [False]
path = './benchmark/'
file_list = os.listdir(path)

for i in model_list:
    for k in fp16_bool:
        model = whisper.load_model(name=i, device=device)
        duration_sum = 0
        for j in file_list:
            audio = whisper.load_audio(path + j, sr=16000)
            start = time.time()
            result = model.transcribe(audio, language='en', task='transcribe', fp16=k)
            end = time.time()
            duration_sum += end - start
        print("{} model with fp16 {} costs {:.2f}s".format(i, k, duration_sum))
        del model
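For anyone who wants to try the GPU on Apple Silicon instead of the CPU, one variant worth testing is selecting PyTorch's MPS backend when it is available. This is only a sketch and an assumption on my part; Whisper's MPS support has historically been incomplete:

import torch

# Prefer Apple's Metal Performance Shaders backend when PyTorch exposes it,
# falling back to CPU otherwise. Keep fp16=False when running on CPU.
if torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')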
-
Hi @chesha1, by any chance have you done any benchmarking with the Tesla P40? Prices have come down for that card, but I have seen comments stating that the P40 lacks FP16 support. I was just wondering how the P40 would perform.
-
Has anybody tested it on the AMD Radeon PRO W7000 series (https://www.amd.com/en/graphics/workstations) and ROCm 7?
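For what it's worth, ROCm builds of PyTorch expose AMD GPUs through the same CUDA-named API, so the benchmark script should run unmodified. A quick sanity check, assuming a ROCm build of torch is installed:

import torch

# On ROCm builds, torch.cuda.is_available() returns True and
# torch.version.hip is set; the 'cuda' device maps to the AMD GPU.
print("GPU available:", torch.cuda.is_available())
print("HIP version:", torch.version.hip)
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))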
-
Had the following results on an L4 with your script. Debian 12, CUDA 12.0, torch 2.0.1; not sure how comparable that is. Average power draw was 60W. Wish they'd release an L16.
-
Does the performance improve when using Windows without PyTorch? It seems that using DirectX on Windows can allow non-CUDA GPUs to work, with better VRAM use.
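If anyone wants to test that route, Microsoft's torch-directml package exposes a DirectML-backed device to PyTorch. Whether Whisper runs on it end to end is a separate question, so treat this as a sketch, assuming pip install torch-directml:

import torch
import torch_directml

# torch_directml.device() returns a torch.device backed by DirectML,
# which works on non-CUDA GPUs (AMD, Intel) under Windows.
dml = torch_directml.device()
x = torch.randn(4, 4).to(dml)
print(x.device, (x @ x).sum().item())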
-
I benchmarked a Hugging Face Space on free-tier hardware, as well as an i9-13900K and an RTX 3090, all on the same whisper version for comparison. Here is the code I used:

import gradio as gr
import torch
import time
import whisper
import os, sys

def bench(device_name):
    # Pick the requested device, or auto-detect CUDA when none is given.
    if device_name == "":
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    else:
        device = torch.device(device_name)

    # Record the environment so results are comparable across machines.
    # (Guarding the CUDA query keeps this from crashing on CPU-only runs.)
    if torch.cuda.is_available():
        string = torch.cuda.get_device_name(0)
    else:
        string = "CPU"
    string += "\nPython " + sys.version
    string += "\nTorch " + torch.__version__
    string += "\nWhisper " + whisper.__version__ + "\n"
    print(string)

    model_list = ['tiny.en', 'base.en', 'medium.en', 'large-v2']
    fp16_bool = [True, False]
    path = 'benchmark/'
    file_list = os.listdir(path)

    for i in model_list:
        for k in fp16_bool:
            model = whisper.load_model(name=i, device=device)
            duration_sum = 0
            for j in file_list:
                audio = whisper.load_audio(path + j, sr=16000)
                start = time.time()
                result = model.transcribe(audio, language='en', task='transcribe', fp16=k)
                end = time.time()
                duration_sum += end - start
            line = "{} model with fp16 {} costs {:.2f}s".format(i, k, duration_sum)
            print(line)
            string += line + "\n"
            del model
    return string

iface = gr.Interface(fn=bench, inputs="text", outputs="text")
iface.launch()
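One caveat with this timing loop (and the variants above): the first transcribe() call for each model also pays CUDA context and warm-up costs. A hedged tweak is to add one untimed pass right after load_model, for example:

# Warm-up (sketch): run one untimed transcription so CUDA initialization
# doesn't inflate the first measured file. Place after whisper.load_model().
_ = model.transcribe(whisper.load_audio(path + file_list[0], sr=16000), fp16=k)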
-
As a disclaimer, I believe "medium" now defaults to medium-v3, but either way, this is my result: AMD Radeon RX 6900 XT
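One way to check which checkpoint an alias actually resolves to is Whisper's _MODELS table. It is a private internal detail and may change between versions, so this is only a sketch:

import whisper

# _MODELS maps model aliases to checkpoint download URLs, so the URL
# reveals which weights an alias like 'medium' currently points to.
print(whisper._MODELS["medium"])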
-
Hopefully I will publish 5090 tests very soon. Stay tuned.
-
I ran some experiments to measure the time cost of transcription on different GPUs. The results may help you choose which type of GPU to buy or rent.
Environment: PyTorch 1.13, CUDA 11.6, Ubuntu 18.04
Dataset: LJSpeech, from LJ001-0001.wav to LJ001-0150.wav; total length is 971.26s
For transcribing audio files in parallel, please refer to "model.transcribe() modified to perform batch inference on audio files" (#662). That will dramatically speed up your transcription.
The experiment code is below:
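(A sketch of the benchmark loop, following the variants quoted back in the replies above:)

import torch
import time
import whisper
import os

# Use the GPU when available; fall back to CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model_list = ['tiny.en', 'base.en', 'medium.en', 'large-v2']
fp16_bool = [True, False]
path = './benchmark/'  # directory holding LJ001-0001.wav .. LJ001-0150.wav
file_list = os.listdir(path)

for i in model_list:
    for k in fp16_bool:
        model = whisper.load_model(name=i, device=device)
        duration_sum = 0
        for j in file_list:
            audio = whisper.load_audio(path + j, sr=16000)
            start = time.time()
            result = model.transcribe(audio, language='en', task='transcribe', fp16=k)
            end = time.time()
            duration_sum += end - start
        # Total wall-clock time over the whole dataset for this configuration.
        print("{} model with fp16 {} costs {:.2f}s".format(i, k, duration_sum))
        del model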