Skip to content

Theodb/ASR-real-time

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Real-Time Speech Transcription System

This repository contains a real-time speech transcription system using the Whisper model. The system is designed to transcribe speech with minimal delay and errors while a user is speaking. Although the real-time transcription functionality is operational, the evaluation metrics and multi-threaded capabilities are still under development.

Table of Contents

Current Status

  • Real-Time Transcription: The system can transcribe speech in real-time from a microphone or simulated dataset input. It works by constantly recording audio in a thread and concatenating the raw bytes over multiple recordings.
  • Evaluation Metrics: is under development.
  • Multi-Threading: Multi-threaded processing for multiple audio files is not yet implemented.
  • Challenges:
    • Difficulty using the latest libraries on lab GPUs due to CUDA compatibility issues.
    • Difficulties to do quantization with the libraries of my labGPU
    • faster_whisper cannot be imported on the server; CUDA issues are pending resolution.
    • On macOS, a C++ implementation of Whisper exists and may be utilized in future updates.

Methodology

Real-Time Transcription

The system leverages the Whisper model for speech recognition, supporting both live microphone input and simulated dataset input for testing.

Key features:

  • Model Selection: Supports various Whisper models.
  • Device Compatibility: Automatically detects and utilizes available hardware (NVIDIA GPU with CUDA, Apple Silicon with MPS, or CPU).
  • Language Support: Configurable language setting for transcription (default is French).
  • Input Sources:
    • Microphone: Captures live audio for real-time transcription.
    • Dataset Simulation: Uses the French Fleurs dataset to simulate audio input, facilitating testing without live speech.

Requirements

  • Python 3.7 or higher
  • Compatible hardware:
    • NVIDIA GPU with CUDA support (optional)
    • Apple Silicon (M1/M2) with MPS support (optional)
    • CPU (slower performance)
  • Python packages listed in requirements.txt

Usage

Run the real-time transcription script:

python main.py --model large-v3 --language fr --input_type microphone

Arguments:

  • --model: Model name to use (default: 'large-v3'). Choices include:
    • tiny, base, small, medium, large, large-v3
    • Distilled models: distil-large-v2, distil-large-v3
  • --language: Language code for transcription (e.g., 'fr' for French).
  • --input_type: Input source, either 'microphone' or 'dataset'.

Optional Arguments:

  • --energy_threshold: Energy level for the microphone to detect (default: 1000).
  • --record_timeout: Duration in seconds for each recording chunk (default: 2 seconds).
  • --phrase_timeout: Seconds of silence to consider the end of a phrase (default: 3 seconds).
  • --default_microphone: (Linux only) Default microphone name for SpeechRecognition.

Dataset Simulation

To simulate input using the French Fleurs dataset:

python main.py --model large-v3 --language fr --input_type dataset

References

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages