This repository contains an end-to-end pipeline for training and evaluating a speech recognition model based on Wav2Vec2. It uses the Hugging Face Transformers library, among other tools, to preprocess data, build a phoneme vocabulary, and fine-tune a model that recognizes phonemes from audio input.
The pipeline is modular, enabling customization for different datasets and configurations. It includes preprocessing, phoneme vocabulary generation, training, evaluation, and inference.
This project specifically focuses on IPA-based phoneme recognition.
- Preprocessing: Efficient audio and text processing for training datasets.
- Vocabulary Generation: Automatic phoneme vocabulary creation based on training data.
- Model Training: Fine-tuning Wav2Vec2 with custom phoneme vocabularies.
- Evaluation Metrics: Word Error Rate (WER) and Character Error Rate (CER) computation.
- Dataset Caching: Intermediate datasets cached for reuse.
- Visualization: Phoneme distribution plotting.
- Scalability: Supports multi-GPU training and large datasets.
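The vocabulary-generation step amounts to collecting every distinct phoneme that appears in the training transcripts and assigning each one an integer id, plus reserved padding/unknown tokens. A minimal sketch (function name and the `[PAD]`/`[UNK]` token names are illustrative, not necessarily the project's exact conventions):

```python
def build_phoneme_vocab(transcripts):
    """Build a phoneme-to-id mapping from space-separated IPA transcripts.

    [UNK] and [PAD] are appended last, mirroring the usual Wav2Vec2-CTC
    tokenizer setup where the pad token doubles as the CTC blank.
    """
    # Collect every distinct phoneme across all transcripts.
    phonemes = sorted({p for t in transcripts for p in t.split()})
    vocab = {p: i for i, p in enumerate(phonemes)}
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab
```

For example, `build_phoneme_vocab(["h ə l oʊ", "h aɪ"])` yields a seven-entry mapping: five unique phonemes plus the two reserved tokens.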
- `collator.py`: Custom data collator for handling CTC padding during training.
- `metrics.py`: Functions to compute WER and CER for model evaluation.
- `train.py`: Main script for configuring and training the model.
- `config.py`: Handles loading YAML configurations for the pipeline.
- `helpers.py`: Utility functions for phoneme processing and plotting.
- `wav2vec2_ctc.py`: Defines model initialization and configuration.
- `run_training.py`: Entry point for the training pipeline.
- `loader.py`: Dataset loader with support for multi-threaded file processing.
- `preprocess.py`: Functions for vocabulary creation and phoneme mapping.
- `vocab.py`: Utilities to save and load vocabularies.
To install the project, run:

```
pip install -e .
```
- Python 3.8+
- CUDA-enabled GPU (optional but recommended)
Update the `configs/config.yaml` file to specify dataset paths, model parameters, and training hyperparameters.
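A hypothetical `config.yaml` might look like the following (the field names here are illustrative, not the project's actual schema — check `config.py` for the keys it expects):

```yaml
dataset:
  train_dir: data/train      # path to training audio + transcripts
  eval_dir: data/eval        # path to evaluation split
model:
  pretrained_name: facebook/wav2vec2-base
training:
  learning_rate: 3.0e-5
  batch_size: 16
  num_epochs: 30
  fp16: true                 # mixed-precision training
```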
To start training the model:

```
python scripts/run_training.py
```
Evaluate the trained model using:

```
python scripts/eval.py
```
The model is fine-tuned with CTC loss using the Hugging Face `Trainer` API, with gradient accumulation, mixed-precision training, and early stopping. A custom data collator (`collator.py`) ensures proper padding of input features and target labels during batch processing.
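The core of CTC collation is padding variable-length inputs and labels to a common length per batch, with label padding set to `-100` so padded positions are ignored by the loss. A simplified, framework-free sketch (real collators, including presumably the one in `collator.py`, pad via the processor and return tensors):

```python
def collate_ctc(batch, input_pad=0.0, label_pad=-100):
    """Pad a batch of {'input_values', 'labels'} examples to uniform length.

    Inputs are padded with `input_pad` (silence); labels with `label_pad`,
    the conventional ignore-index so CTC loss skips padded label positions.
    """
    max_in = max(len(ex["input_values"]) for ex in batch)
    max_lab = max(len(ex["labels"]) for ex in batch)
    inputs, labels = [], []
    for ex in batch:
        iv, lb = ex["input_values"], ex["labels"]
        inputs.append(iv + [input_pad] * (max_in - len(iv)))
        labels.append(lb + [label_pad] * (max_lab - len(lb)))
    return {"input_values": inputs, "labels": labels}
```

Padding inputs and labels independently matters because audio frames and phoneme labels have unrelated lengths.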
WER and CER are computed with the `evaluate` library (`metrics.py`), providing detailed insight into model performance.
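Independently of the `evaluate` library, WER reduces to word-level Levenshtein distance normalized by the reference length (CER is the same computation over characters). A minimal pure-Python sketch, for intuition only:

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between the processed prefix of ref
    # and hyp[:j]; a single rolling row of the standard DP table.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[j] = min(d[j] + 1,          # deletion
                       d[j - 1] + 1,      # insertion
                       prev_diag + cost)  # substitution or match
            prev_diag = cur
    return d[-1] / len(ref)
```

For example, `word_error_rate("the cat sat", "the cat")` is `1/3`: one deletion against a three-word reference.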
- Vocabulary: Edit `vocab.py` for custom phoneme mappings.
- Data: Update `config.yaml` to point to new dataset folders.
- Model: Modify `wav2vec2_ctc.py` for model-specific tweaks.
- Implement data augmentation for robust training.
- Add support for real-time inference.
- Extend the pipeline for multilingual datasets.
This project is licensed under the MIT License.
Contributions, issues, and feature requests are welcome! Please create a pull request or open an issue for discussion.