This project demonstrates forced alignment with Wav2Vec2: given an audio recording and its phoneme transcription, it determines the time interval that each phoneme occupies in the audio. It is designed for researchers, linguists, and developers working in speech processing and phoneme alignment.
- Batch Processing: Efficient forced alignment for multiple audio files.
- Time-Aligned Segments: Outputs alignment results as JSON files.
- Dataset Flexibility: Supports datasets with `.wav` audio files and `.txt` phoneme transcriptions in IPA format.
- Visualization: Includes tools for visualizing alignment results.
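Under the hood, a Wav2Vec2 acoustic model produces frame-level label probabilities, and the aligner finds the most likely monotonic path through them for a given transcription (CTC forced alignment). The sketch below illustrates that general idea with `torchaudio`; it is a simplified stand-in rather than this repository's actual code, and it assumes `torchaudio >= 2.1` (for `torchaudio.functional.forced_align`), a character-level checkpoint, and a hypothetical input file path:

```python
import torch
import torchaudio

# Load a pretrained Wav2Vec2 model. This checkpoint emits character labels;
# a phoneme-level checkpoint would be swapped in for IPA transcriptions.
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()  # index 0 is the CTC blank token

# Load audio and resample to the model's expected rate (16 kHz).
waveform, sr = torchaudio.load("dataset/wav/example.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

# Frame-level label log-probabilities from the acoustic model.
with torch.inference_mode():
    emissions, _ = model(waveform)  # shape: (1, frames, num_labels)
log_probs = torch.log_softmax(emissions, dim=-1)

# Encode the transcript as label ids ('|' marks word boundaries).
transcript = "HELLO|WORLD"
targets = torch.tensor([[labels.index(c) for c in transcript]], dtype=torch.int32)

# Most likely monotonic frame-to-token alignment under the CTC criterion.
frame_labels, scores = torchaudio.functional.forced_align(log_probs, targets, blank=0)
```

The resulting frame indices are then collapsed into token spans and converted to seconds using the model's frame rate (roughly 20 ms per frame for Wav2Vec2); in this repository, that logic presumably lives in `aligner.py`.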
- Clone this repository:

  ```bash
  git clone https://github.com/yourusername/forced-alignment.git
  cd forced-alignment
  ```

- Install required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Place your dataset in the required structure:

  ```
  dataset/
  ├── wav/         # Audio files (.wav)
  └── phonemized/  # Phoneme files (.txt)
  ```

- Run the alignment script (an example config sketch follows this list):

  ```bash
  python main.py --config config.yaml
  ```

- Visualize a single piece of data using the `visualize_data.py` script:

  ```bash
  python visualize_data.py
  ```
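`main.py` reads its settings from `config.yaml`. The exact keys are defined by the script itself; the snippet below is only an illustrative sketch, and every key name in it is a hypothetical assumption:

```yaml
# Hypothetical config sketch; the real keys are whatever main.py expects.
dataset_dir: dataset/    # folder containing wav/ and phonemized/
output_dir: segments/    # where aligned JSON files are written
sample_rate: 16000       # Wav2Vec2 models typically expect 16 kHz audio
batch_size: 8            # number of files aligned per batch
```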
The repository is organized as follows:

```
.
├── aligner.py           # Core alignment logic
├── dataloader.py        # Dataset handling and preprocessing
├── main.py              # Main script to run the alignment process
├── plot_utils.py        # Utility functions for plotting and visualization
├── visualize_data.py    # Script for visualizing alignment results
├── requirements.txt     # Project dependencies
└── dataset/             # Input dataset folder (user-provided)
```
The dataset should follow this structure:
```
dataset/
├── wav/           # Audio files in .wav format
└── phonemized/    # Phoneme transcriptions in .txt format
```
Each `.txt` file should correspond to an audio file and contain its phoneme sequence in IPA format.
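For example, a transcription of "hello world" might read as follows (illustrative only; the exact phoneme inventory and token separator depend on the phonemizer used):

```
h ə l oʊ w ɜː l d
```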
Aligned segments are saved in JSON format under the specified output folder (`segments/` by default). Each JSON file contains the alignment results for its corresponding audio file.
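Each entry typically maps a phoneme to its start and end time in the audio. The snippet below is a hypothetical illustration; the actual field names and units are determined by `aligner.py`:

```json
[
  {"phoneme": "h", "start": 0.12, "end": 0.18},
  {"phoneme": "ə", "start": 0.18, "end": 0.24},
  {"phoneme": "l", "start": 0.24, "end": 0.31}
]
```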
Run the alignment process with:

```bash
python main.py
```

Visualize a single piece of data with:

```bash
python visualize_data.py
```