Scalable STT on Kubernetes: Performance Analysis of Whisper-v3-Large & Canary-1B

Speech to Text Analysis Whitepaper

This project evaluates various Speech-to-Text (STT) configurations for both streaming and batch (offline) transcription, testing different model variants, container environments, model serving frameworks, deployment platforms, and hardware configurations.

OpenAI Whisper, Faster-Whisper, NVIDIA NeMo ASR, and Wav2Vec 2.0 are all speech-to-text (ASR) models, but they differ in architecture, performance, hardware requirements, and use cases. Here's how they compare (a minimal transcription example follows the table):

| | OpenAI Whisper | Faster-Whisper | NeMo | Wav2Vec 2.0 |
|---|---|---|---|---|
| Developer | OpenAI | SYSTRAN (optimized Whisper reimplementation) | NVIDIA | Meta (Facebook) |
| Model Type | Transformer-based | Transformer-based | Conformer-based | Self-supervised Transformer |
| Pretraining Method | Supervised (large-scale labeled data) | Same as Whisper (optimized inference, not retrained) | Supervised & semi-supervised | Self-supervised (unsupervised learning from audio) |
| Multilingual | Yes (99+ languages) | Yes (99+ languages) | Yes (for some models) | Mostly English (some multilingual versions) |
| Hardware | CPU/GPU | Optimized for GPU | Optimized for NVIDIA GPUs | CPU/GPU |
| Streaming | No | No | Yes | No |
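
For orientation, here is what a minimal batch transcription looks like with the openai-whisper Python package; the model name and audio path are illustrative, not prescriptions from this repo:

```python
import whisper

# Load a Whisper checkpoint; "large-v3" assumes a GPU with enough VRAM
# (a single NVIDIA L4 is sufficient).
model = whisper.load_model("large-v3")

# Transcribe a local audio file (path is illustrative).
result = model.transcribe("audio/sample.wav")
print(result["text"])
```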

The following matrix tracks the deployment and integration status of the four ASR models (OpenAI Whisper, Faster-Whisper, NVIDIA NeMo ASR, and Wav2Vec 2.0) across container environments, model servers, inference modes, evaluation metrics, and pipeline work (X = done, TODO = pending):

| | OpenAI Whisper | Faster-Whisper | NeMo | Wav2Vec 2.0 |
|---|---|---|---|---|
| Ubuntu Dockerfile | X | TODO | TODO | TODO |
| UBI Dockerfile | X | TODO | TODO | TODO |
| ModelKit | TODO | TODO | TODO | TODO |
| RHEL OS | X | TODO | TODO | TODO |
| OCP | X | TODO | TODO | TODO |
| Embedded Server | X | TODO | TODO | TODO |
| Built-in Server | X | TODO | TODO | TODO |
| Decoupled Server | TODO | TODO | TODO | TODO |
| NVIDIA Triton | TODO | TODO | TODO | TODO |
| vLLM | TODO | TODO | TODO | TODO |
| Ray Serve | TODO | TODO | TODO | TODO |
| Batch | X | TODO | TODO | TODO |
| Streaming | TODO | TODO | TODO | TODO |
| Word Error Rate (WER) | X | TODO | TODO | TODO |
| Match Error Rate (MER) | X | TODO | TODO | TODO |
| Word Information Lost (WIL) | X | TODO | TODO | TODO |
| Word Information Preserved (WIP) | X | TODO | TODO | TODO |
| Character Error Rate (CER) | X | TODO | TODO | TODO |
| Pipeline Build | TODO | TODO | TODO | TODO |
| Summary | TODO | TODO | TODO | TODO |

Performance Metrics Evaluated:

Infrastructure: What types of hardware were tested?

  1. RHEL EC2 instance: g6.xlarge (1 x NVIDIA L4) or g6.12xlarge (4 x NVIDIA L4)
  2. OpenShift instance: TBD

Scale:

  1. Maximum concurrent inference endpoints
  2. Queries per second (QPS); a load-test sketch follows this list
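
A minimal load-test sketch for the QPS number, assuming a hypothetical HTTP transcription endpoint (the URL, route, and file-upload field are illustrative, not this repo's actual API):

```python
import concurrent.futures
import time

import requests

ENDPOINT = "http://stt-service:8080/transcribe"  # hypothetical endpoint
AUDIO_FILE = "audio/sample.wav"                  # illustrative path
N_REQUESTS = 100
CONCURRENCY = 8

def send_request(_: int) -> float:
    """POST one audio file and return its response latency in seconds."""
    start = time.perf_counter()
    with open(AUDIO_FILE, "rb") as f:
        requests.post(ENDPOINT, files={"file": f})
    return time.perf_counter() - start

wall_start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(send_request, range(N_REQUESTS)))
elapsed = time.perf_counter() - wall_start

print(f"QPS: {N_REQUESTS / elapsed:.2f}")
print(f"mean latency: {sum(latencies) / len(latencies):.2f}s")
```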

Cost: How much does it cost to infer?

Resources: How many resources does the model consume during inference?

  1. Container size
  2. GPU
  3. CPU
  4. VRAM

Speed: How fast does the model transcribe? Timings are taken with the time command (a timing sketch follows this list), which prints:

  1. real - wall-clock time (actual elapsed time) from when the command started to when it finished.
  2. user - total CPU time spent in user mode, i.e., the time the CPU spent executing the process's own code (excluding kernel operations).
  3. sys - total CPU time spent in kernel mode, i.e., time spent executing system calls on behalf of the process (e.g., file I/O, memory allocation). When a GPU is used, much of the work is offloaded, so this number is typically lower.
  4. responseLatency - TODO
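
The same three numbers can be captured from inside Python when the transcription call is embedded in a larger script; a minimal sketch, with the transcription call left as a placeholder:

```python
import os
import time

wall_start = time.perf_counter()
cpu_start = os.times()  # snapshot of this process's user/sys CPU time

# ... run the transcription here, e.g. model.transcribe("audio/sample.wav") ...

wall_end = time.perf_counter()
cpu_end = os.times()

print(f"real {wall_end - wall_start:.2f}s")              # wall-clock time
print(f"user {cpu_end.user - cpu_start.user:.2f}s")      # user-mode CPU time
print(f"sys  {cpu_end.system - cpu_start.system:.2f}s")  # kernel-mode CPU time
```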

Precision: Floating-point precision comparison for transcription (a Faster-Whisper sketch follows the table):

| Precision | Accuracy | Speed | Memory Usage | Hardware Support | ASR Models Using It |
|---|---|---|---|---|---|
| FP8 (8-bit floating point) | Lowest (accuracy degradation) | Fastest | Lowest | NVIDIA H100, A100 (TensorRT, CUDA 12) | Not widely used yet; experimental for some ASR models |
| FP16 (half precision, 16-bit floating point) | Slightly reduced vs. FP32 | Fast (GPU-optimized) | Lower than FP32 | Most modern GPUs (NVIDIA Tensor Cores, AMD ROCm) | Faster-Whisper, NeMo ASR, Canary, Wav2Vec |
| FP32 (full precision, 32-bit floating point) | Highest (best transcription accuracy) | Slowest | Highest | Universal (CPU & GPU) | Whisper, NeMo ASR, Canary, Wav2Vec |
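
With Faster-Whisper, the precision trade-off above is selected through the compute_type argument; a minimal sketch (model size and audio path are illustrative):

```python
from faster_whisper import WhisperModel

# FP16 on GPU: roughly half the memory of FP32 with little accuracy loss.
# Use compute_type="float32" for full precision, or device="cpu" for CPU-only.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio/sample.wav")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```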

Accuracy: How accurate is the model? JiWER is a simple and fast Python package for evaluating an automatic speech recognition system (a usage sketch follows this list). It supports the following measures:

  1. Word Error Rate (WER) – Measures the percentage of words that were incorrectly predicted compared to the reference text.

    • S = Substitutions
    • D = Deletions
    • I = Insertions
    • N = Number of words in the reference transcript
    • Lower is better.

    WER = (S + D + I) / N

  2. Match Error Rate (MER) – Represents the fraction of words that need to be transformed (inserted, deleted, or substituted) to match the reference text. Unlike WER, it considers the total number of words in both the reference and hypothesis.

    • S = Substitutions
    • D = Deletions
    • I = Insertions
    • C = Correctly recognized words
    • Unlike WER, MER includes the total correct words in the denominator.
    • Lower is better.

    MER = (S + D + I) / (S + D + I + C)

  3. Word Information Lost (WIL) – Estimates how much word-level information is lost due to errors. It penalizes deletions and substitutions while being less sensitive to insertions.

    • The complement of WIP: all word-level information that was not preserved.
    • Lower is better.

    WIL = 1 - WIP

  4. Word Information Preserved (WIP) – The complement of WIL, this measures how much word-level information is correctly preserved in the hypothesis relative to the reference.

    • C = Correctly recognized words
    • N_ref = number of words in the reference; N_hyp = number of words in the hypothesis
    • Higher is better.

    WIP = (C / N_ref) * (C / N_hyp)

  5. Character Error Rate (CER) – Similar to WER but at the character level, CER measures the percentage of incorrectly predicted characters compared to the reference text, making it useful for evaluating text with short words or heavy misspellings.

    • S, D, I, and N are counted over characters rather than words (N = number of characters in the reference).
    • Lower is better.
    • Useful for languages with compound words or agglutinative structures.

    CER = (S + D + I) / N
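
A minimal JiWER sketch computing all five measures for a single reference/hypothesis pair (the strings are illustrative):

```python
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# Word-level measures.
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
print(f"MER: {jiwer.mer(reference, hypothesis):.3f}")
print(f"WIL: {jiwer.wil(reference, hypothesis):.3f}")
print(f"WIP: {jiwer.wip(reference, hypothesis):.3f}")

# Character-level measure.
print(f"CER: {jiwer.cer(reference, hypothesis):.3f}")
```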

Getting Started

Environments

  1. Request environments from demo.redhat.com
  2. RHEL AI (GA) VM
    • Activity: Practice / Enablement
    • Purpose: Trying out a technical solution
    • Region: us-east-2
    • GPU Selection by Node Type: g6.xlarge 1 x L4 OR g6.12xlarge 4 x L4
  3. AWS with OpenShift Open Environment
    • Activity: Practice / Enablement
    • Purpose: Trying out a technical solution
    • Region: us-east-2
    • OpenShift Version: 4.17
    • Control Plane Count: 1
    • Control Plane Instance Type: m6a.4xlarge

RHEL AI VM

  1. SSH to your RHEL AI VM
  2. Clone the git repo: `git clone https://github.com/redhat-na-ssa/whitepaper-stt-evaluation-on-kubernetes.git`
  3. Move to your cloned git folder: `cd whitepaper-stt-evaluation-on-kubernetes/`

OCP AI

  1. TBP

Observations

Related resources
