This repository provides some useful scripts for preparing WSJ data.
cd tools
make
# convert NIST SPHERE files to WAV
bash wsj0/1_sph2wav.sh # remember to change wsj0_dir and save_dir
# add noise
python wsj0/2_prep_noisy_data.py -h
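
For readers who want to see what "add noise" usually means in practice, here is a minimal sketch of mixing a noise signal into clean speech at a target signal-to-noise ratio (SNR). It is not taken from wsj0/2_prep_noisy_data.py and may differ from what that script does. The function name mix_at_snr and its arguments are hypothetical, and the sketch assumes both signals are lists of float samples at the same sample rate.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so that the mixture has the given SNR in dB.

    Hypothetical helper for illustration only; it assumes both inputs are
    sequences of float samples recorded at the same sample rate.
    """
    if not clean or not noise:
        raise ValueError("clean and noise must be non-empty")

    # Repeat the noise if it is shorter than the clean signal, then trim it.
    repeats = -(-len(clean) // len(noise))  # ceiling division
    noise = (list(noise) * repeats)[: len(clean)]

    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(x * x for x in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise is silent; cannot scale it to a target SNR")

    # Solve snr_db = 10 * log10(clean_power / (gain**2 * noise_power)) for gain.
    gain = math.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]


if __name__ == "__main__":
    rng = random.Random(0)
    # One second of a 440 Hz tone at 16 kHz stands in for clean speech.
    clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]
    noisy = mix_at_snr(clean, noise, snr_db=5.0)
    print(len(noisy))
```

The sketch uses only the standard library so it runs anywhere. With real audio you would more likely hold the samples in NumPy arrays and apply the same gain formula.
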
Several public datasets are available, covering noise, room impulse responses (RIRs), and pre-simulated noisy speech.
You can use any noise corpus, but the noise and the clean speech must have the same sample rate. Otherwise, use tools/resample.py
to down-sample the clean speech or the noise so the rates match. Some open-source noise corpora we can use: