The script conda_create_environment.sh is available to create an environment which runs the following code (it is recommended to run it line by line in case of problems).
Data Download
If you encounter:
FileNotFoundError: [Errno 2] No such file or directory: 'ffprobe'
it probably means you have to install ffmpeg on your machine. A possible installation: sudo apt install ffmpeg
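If you are unsure whether ffmpeg is visible from your environment, a quick optional check (this snippet is just a convenience, not part of the baseline scripts) is:

```python
import shutil

# Prints the resolved path of each tool, or None if it is not on the PATH.
print(shutil.which("ffprobe"))
print(shutil.which("ffmpeg"))
```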
Training
If training appears too slow, check with top and with nvidia-smi that you are effectively using a GPU and not the CPU.
If running python train_sed.py uses the CPU by default, you may have PyTorch installed without CUDA support. Check with IPython by running the line torch.rand((1)).cuda(): if you encounter an error, install a CUDA-enabled PyTorch from https://pytorch.org/ and check again until you can run torch.rand((1)).cuda() successfully.
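The same check can be scripted; here is a minimal sketch (only a convenience, the canonical fix is reinstalling a CUDA-enabled PyTorch from the link above):

```python
import torch

# True only if PyTorch was built with CUDA support and a GPU is visible.
print(torch.cuda.is_available())

# Raises an error on a CPU-only install; on a working setup it allocates
# a small tensor on the GPU.
print(torch.rand((1)).cuda())
```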
You can download the development dataset using the script generate_dcase_task4_2022.py.
The development dataset is composed of two parts:
- real-world data (DESED dataset): this part of the dataset is composed of strong labels, weak labels, unlabeled, and validation data coming from Audioset.
- synthetically generated data: this part of the dataset is composed of synthetic soundscapes generated using [Scaper][scaper].
Run the command python generate_dcase_task4_2022.py --basedir="../../data" to download the dataset (the user can change basedir to the desired data folder).
If the user has already downloaded part of the dataset, there is no need to re-download the whole set. It is possible to download only part of the full dataset, if needed, using the following options:
- only_strong (download only the strong labels of the DESED dataset)
- only_real (download the weak labels, unlabeled and validation data of the DESED dataset)
- only_synth (download only the synthetic part of the dataset)
For example, if the user has already downloaded the real and synthetic parts of the set, the dataset can be completed with the strong labels of the DESED dataset using the following command:
python generate_dcase_task4_2022.py --only_strong
If the user wants to download only the synthetic part of the dataset, it can be done with the following command:
python generate_dcase_task4_2022.py --only_synth
Once the dataset is downloaded, the user should find a folder named missing_files, containing the list of files from the real-world dataset (desed_real) which could not be downloaded. You need to send your list of missing files to the task organisers to get the complete dataset (in priority to Francesca Ronchini and Romain Serizel).
The dataset is composed of 4 different splits of training data:
- Synthetic training set with strong annotations
- Strong labeled training set (only for the SED Audioset baseline)
- Weak labeled training set
- Unlabeled in domain training set
This set is composed of 10000 clips generated with the [Scaper][scaper] soundscape synthesis and augmentation library. The clips are generated such that the distribution per event is close to that of the validation set.
The strong annotations are provided in a tab-separated csv file in the following format:
[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]
For example: YOTsn73eqbfc_10.000_20.000.wav 0.163 0.665 Alarm_bell_ringing
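For convenience, here is a small sketch of loading such a file with pandas; the file name is hypothetical and the absence of a header row is an assumption, so adapt it to the actual metadata files:

```python
import pandas as pd

# Load a tab-separated strong-annotation file (file name is hypothetical;
# drop `names=` if the file already contains a header row).
strong = pd.read_csv(
    "synthetic_train.tsv",
    sep="\t",
    names=["filename", "onset", "offset", "event_label"],
)
print(strong.head())
```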
This set is composed of 3470 audio clips coming from Audioset.
This set is used at training only for the SED Audioset baseline.
The strong annotations are provided in a tab-separated csv file in the following format:
[filename (string)][tab][onset (in seconds) (float)][tab][offset (in seconds) (float)][tab][event_label (string)]
For example: Y07fghylishw_20.000_30.000.wav 0.163 0.665 Dog
This set contains 1578 clips (2244 class occurrences), a small subset of the training set for which the weak annotations have been manually verified.
The weak annotations are provided in a tab-separated csv file in the following format:
[filename (string)][tab][event_labels (strings)]
For example: Y-BJNMHMZDcU_50.000_60.000.wav Alarm_bell_ringing,Dog
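A similar sketch for the weak annotations (again, the file name is hypothetical and the header assumption may need adapting), splitting the comma-separated labels into lists:

```python
import pandas as pd

# Load a tab-separated weak-annotation file (file name is hypothetical).
weak = pd.read_csv(
    "weak_train.tsv",
    sep="\t",
    names=["filename", "event_labels"],
)
# Each clip can have several labels separated by commas.
weak["event_labels"] = weak["event_labels"].str.split(",")
print(weak.head())
```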
This set contains 14412 clips. The clips are selected such that the distribution per class (based on Audioset annotations) is close to the distribution in the labeled set. However, given the uncertainty of Audioset labels, this distribution might not match exactly.
The dataset uses FUSS, FSD50K, desed_soundbank and desed_real.
For more information regarding the dataset, please refer to the previous year's DCASE Challenge website.
We provide three baselines for the task:
- SED baseline
- baseline using a pre-trained embedding extractor DNN
- baseline using Audioset data (real-world strongly labeled data)
For now, only the SED baseline is available (the missing baselines will be published soon).
The SED baseline can be run from scratch using the following command:
python train_sed.py
NOTE: multi-GPU training is currently not supported.
Note: python train_sed.py --gpus 0 will use the CPU. GPU indexes start from 1 here.
Common issues
If you encounter:
pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0] But your machine only has: []
or
OSError: libc10_cuda.so: cannot open shared object file: No such file or directory
It probably means you have installed the CPU-only version of PyTorch or an incorrect cudatoolkit version. Please install the correct version from https://pytorch.org/
Note that the default training config will use GPU 0. Alternatively, we provide a pre-trained checkpoint along with tensorboard logs. The baseline can be tested on the development set of the dataset using the following command:
python train_sed.py --test_from_checkpoint /path/to/downloaded.ckpt
The tensorboard logs can be visualized using the command tensorboard --logdir="path/to/exp_folder".
In this year's DCASE Task 4 Challenge, we also use energy consumption (kWh), measured via CodeCarbon, as an additional metric to rank the submitted systems.
We encourage the participants to provide, for each submitted system (or at least the best one), the following energy consumption figures in kWh using CodeCarbon:
- whole system training
- devtest inference
- evaluation set inference
You can refer to CodeCarbon on how to do this (super simple! 😉) or to the baseline code (see local/sed_trained.py) for some hints on how we do this for the baseline system.
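As a hedged sketch (not the baseline's actual code; the output directory and inference function below are placeholders), tracking energy with CodeCarbon essentially amounts to wrapping the run with an EmissionsTracker:

```python
from codecarbon import EmissionsTracker

def run_inference():
    # Placeholder for your devtest / evaluation inference loop.
    pass

tracker = EmissionsTracker(output_dir="./my_system_codecarbon")  # placeholder dir
tracker.start()
run_inference()
tracker.stop()
# CodeCarbon writes an emissions.csv in output_dir; the energy_consumed
# column reports the total energy in kWh.
```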
Important
In addition to this, we kindly suggest that participants also provide the energy consumption in kWh (using the same hardware as for the devtest and evaluation set inference above) of:
- devtest inference for baseline system using:
python train_sed.py --test_from_checkpoint /path/to/downloaded.ckpt
You can find the energy consumed in kWh in ./exp/2022_baseline/devtest_codecarbon/devtest_tot_kwh.txt
- evaluation set inference for baseline system using:
python train_sed.py --eval_from_checkpoint /path/to/downloaded.ckpt
You can find the energy consumed in kWh in ./exp/2022_baseline/evaluation_codecarbon/eval_tot_kwh.txt
Why do we require this?
Energy consumption depends on hardware, and each participant uses different hardware.
To compensate for this difference, we use the baseline inference kWh energy consumption as a common reference. Because of this, it is important that the inference energy consumption figures for both the submitted system and the baseline are computed on the same hardware and under a similar load.
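For illustration only (this is not an official ranking formula), measuring the baseline on your own machine lets you report a hardware-independent figure such as the ratio below; the numbers are hypothetical:

```python
# Hypothetical figures measured on the same hardware under similar load.
baseline_devtest_kwh = 0.030    # baseline inference energy on your machine
submission_devtest_kwh = 0.045  # your system's inference energy

# A hardware-independent way to compare: energy relative to the baseline.
print(submission_devtest_kwh / baseline_devtest_kwh)  # 1.5x the baseline
```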
Dataset | PSDS-scenario1 | PSDS-scenario2 | Intersection-based F1 | Collar-based F1 |
---|---|---|---|---|
Dev-test | 0.336 | 0.536 | 64.1% | 40.1% |
Energy Consumption (GPU: NVIDIA A100 40Gb)
Dataset | Training | Dev-Test |
---|---|---|
kWh | 1.717 | 0.030 |
Collar-based = event-based. More information about the metrics can be found on the DCASE Challenge webpage.
The results are from the student predictions.
NOTES:
All baseline scripts assume that your data is in the ../../data folder, i.e. the data folder in the DESED_task directory. If your data is in another folder, you will have to change the data paths in the corresponding keys of the YAML configuration file confs/sed.yaml.
Note that train_sed.py will create (at its very first run) additional folders with resampled data (from 44kHz to 16kHz), so the user needs write permissions on the folder where the data are saved.
Hyperparameters can be changed in the YAML file (e.g. lower or higher batch size).
A different configuration YAML (for example sed_2.yaml) can be used in each run via the --conf_file="confs/sed_2.yaml" argument.
The default directory for checkpoints and logging can be changed via the --log_dir="./exp/2021_baseline" argument.
Training can be resumed using the following command:
python train_sed.py --resume_from_checkpoint /path/to/file.ckpt
In order to make a "fast" run, which could be useful for development and debugging, you can use the following command:
python train_sed.py --fast_dev_run
It uses very few batches and epochs so it won't give any meaningful result.
Architecture
The baseline is the same as the DCASE 2021 Task 4 baseline, based on a Mean-Teacher model [1].
The Mean-Teacher approach combines two models with the same architecture: a student model and a teacher model. The student model is the one used at inference, while the goal of the teacher is to help the student model during training. The teacher's weights are the exponential moving average of the student model's weights. The models are a combination of a convolutional neural network (CNN) and a recurrent neural network (RNN) followed by an attention layer. The output of the RNN gives the strong predictions while the output of the attention layer gives the weak predictions [2].
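As an illustrative sketch (not the repository's exact implementation; the decay value is an assumption), the teacher update after each training step is an exponential moving average of the student weights:

```python
import torch

def update_teacher(student, teacher, ema_factor=0.999):
    # Exponential moving average of the student weights into the teacher;
    # ema_factor is an assumed value, not necessarily the baseline's setting.
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(ema_factor).add_(s_param, alpha=1.0 - ema_factor)

# Toy usage with two identical small models.
student = torch.nn.Linear(4, 2)
teacher = torch.nn.Linear(4, 2)
update_teacher(student, teacher)
```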
Figure 1 shows an illustration of the baseline model.
Figure 1: baseline Mean-Teacher model. Adapted from [2].
Mixup is used as a data augmentation technique for weak and synthetic data by mixing data within a batch (50% chance of applying it) [3]; see the sketch below.
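A minimal sketch of batch-level mixup as in [3]; the alpha value and the way labels are mixed are assumptions and may differ from the baseline implementation:

```python
import torch

def mixup_batch(features, targets, alpha=0.2):
    # Mix each example with a randomly permuted example from the same batch.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(features.shape[0])
    mixed_features = lam * features + (1.0 - lam) * features[perm]
    mixed_targets = lam * targets + (1.0 - lam) * targets[perm]
    return mixed_features, mixed_targets

# Toy usage: 8 clips with 10-dim features and 3 (soft) labels each.
feats, targs = mixup_batch(torch.randn(8, 10), torch.rand(8, 3))
```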
For more information regarding the baseline model, the reader is referred to [1] and [2].
The SED baseline using the strongly annotated part of Audioset can be run from scratch using the following command:
python train_sed.py --strong_real
The command will automatically include the strongly labeled recorded data coming from Audioset in the training process.
Alternatively, also in this case, we provide a pre-trained checkpoint. The baseline can be tested on the development set of the dataset using the following command:
python train_sed.py --test_from_checkpoint /path/to/downloaded.ckpt
Dataset | PSDS-scenario1 | PSDS-scenario2 | Intersection-based F1 | Collar-based F1 |
---|---|---|---|---|
Dev-test | 0.351 | 0.552 | 64.3% | 42.9% |
Energy Consumption (GPU: NVIDIA A100 40Gb)
Dataset | Training | Dev-Test |
---|---|---|
kWh | 2.418 | 0.027 |
Collar-based = event-based. More information about the metrics can be found on the DCASE Challenge webpage.
The results are computed from the student predictions.
All the comments related to the possibility of resuming the training and to the fast development run in the [SED baseline][sed_baseline] also apply in this case.
We added a new baseline which exploits pre-trained models such as PANNs and AST to increase the performance.
In this baseline, the frame-level or whole-clip-level features are used in a late-fusion fashion with the existing CRNN baseline classifier. See desed_task/nnet/CRNN.py for details. The whole-clip features are concatenated with the CNN-extracted features in the baseline CRNN classifier.
Regarding the frame-level features, since they have a different sequence length w.r.t. the CNN features, we use a trainable RNN-based encoder to encode them to a fixed-dimensional output (obtaining again a whole-clip-level embedding). This embedding is then concatenated in the same way as the whole-clip features.
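A minimal sketch of the late-fusion idea (shapes, layer names, and the projection layer are illustrative assumptions; the actual implementation is in desed_task/nnet/CRNN.py): a whole-clip embedding is projected, repeated over time, and concatenated with the CNN features before the RNN.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    # Illustrative late fusion of a whole-clip embedding with CNN features.
    def __init__(self, cnn_channels=128, emb_dim=768):
        super().__init__()
        self.proj = nn.Linear(emb_dim, cnn_channels)

    def forward(self, cnn_feats, clip_emb):
        # cnn_feats: (batch, frames, cnn_channels); clip_emb: (batch, emb_dim)
        emb = self.proj(clip_emb).unsqueeze(1)        # (batch, 1, cnn_channels)
        emb = emb.expand(-1, cnn_feats.shape[1], -1)  # repeat over frames
        return torch.cat([cnn_feats, emb], dim=-1)    # fed to the RNN

fused = LateFusion()(torch.randn(2, 156, 128), torch.randn(2, 768))
print(fused.shape)  # torch.Size([2, 156, 256])
```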
We provide different ways to integrate such pre-trained models. See the configuration file ./confs/pretrained.yaml:
pretrained:
  model: ast
  e2e: False
  freezed: True
  url: https://zenodo.org/record/3987831/files/Cnn14_16k_mAP%3D0.438.pth?download=1
  dest: ./pretrained_models/Cnn14_16k_mAP%3D0.438.pth
  extracted_embeddings_dir: ./embeddings
You can choose ast or panns. You can also choose whether to keep the pre-trained model frozen (freezed: True) or train it along with the CRNN architecture. If you want to keep it frozen, we already provide the pre-extracted embeddings for you. This is useful if you want to train with a big batch size, because you won't have to store the rather heavy PANNs or AST models on your GPU.
Here are the links to the pre-extracted embeddings for AST and PANNs:
https://zenodo.org/record/6541454#.YnzHq2YzbDI (unlabeled ast)
https://zenodo.org/record/6539466#.YnvtWmYzbAM (ast synth train, ast synth val, ast weak val)
https://zenodo.org/record/6518380#.YnvWZGYzbAM (panns, ast weak train, ast devtest)
You can download and unpack them in your preferred directory.
Do not forget to then set extracted_embeddings_dir: YOUR_PATH in the configuration above.
The script expects a folder structure like this:
YOUR_PATH
 |--- ast
        |---- devtest.hdf5
        |---- synth_train.hdf5
        |---- unlabeled_train.hdf5
        |---- weak_train.hdf5
        |---- weak_val.hdf5
        |---- synth_val.hdf5
 |--- panns
        |---- devtest.hdf5
        |---- synth_train.hdf5
        |---- unlabeled_train.hdf5
        |---- weak_train.hdf5
        |---- weak_val.hdf5
        |---- synth_val.hdf5
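If you want to sanity-check the downloaded files, the HDF5 containers can be inspected with h5py; the path below is illustrative, and the internal dataset names are whatever the archives actually ship with:

```python
import h5py

# List the top-level groups/datasets of one pre-extracted embedding file
# (path is illustrative).
with h5py.File("./embeddings/ast/devtest.hdf5", "r") as f:
    for name, item in f.items():
        print(name, item)
```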
You can also select whether to do late fusion with global, whole-clip features or with frame-level features (from PANNs or AST) in ./confs/pretrained.yaml:
nb_filters: [ 16, 32, 64, 128, 128, 128, 128 ]
pooling: [ [ 2, 2 ], [ 2, 2 ], [ 1, 2 ], [ 1, 2 ], [ 1, 2 ], [ 1, 2 ], [ 1, 2 ] ]
dropout_recurrent: 0
use_embeddings: True
embedding_size: 768 # use 2048 for PANNs global and frame, 527 for AST global and 768 for AST frame
embedding_type: frame # or global
The training can be started simply with
python train_pretrained.py
By default this uses AST with frame-level embeddings. The pre-trained model is frozen and the script expects the pre-extracted AST embeddings in a local folder ./embeddings, as described in the YAML config details above. Thus you need to download the AST embeddings from the Zenodo links above, unless you set freezed: False. However, the latter requires significant GPU memory.
Also in this case, we provide a pre-trained checkpoint. The baseline can be tested on the development set of the dataset using the following command:
python train_pretrained.py --test_from_checkpoint /path/to/downloaded.ckpt
Dataset | PSDS-scenario1 | PSDS-scenario2 | Intersection-based F1 | Collar-based F1 |
---|---|---|---|---|
Dev-test | 32.24% | 72.22% | 90.34 | 37.16 |
Energy Consumption (GPU: NVIDIA A100 40Gb). Note that we used pre-extracted embeddings, so the power consumption of the pre-trained model is not accounted for.
Dataset | Training | Dev-Test |
---|---|---|
kWh | 4.41 | 0.036 |
Collar-based = event-based. More information about the metrics can be found on the DCASE Challenge webpage.
The results are computed from the teacher predictions.
All the comments related to the possibility of resuming the training and to the fast development run in the [SED baseline][sed_baseline] also apply in this case.
Architecture
The architecture of the SED Audioset baseline is the same as the [SED baseline][sed_baseline].
[zenodo_pretrained_ast_embedding_model]:
[google_sourcesep_repo]: https://github.com/google-research/sound-separation/tree/master/datasets/yfcc100m
[sdk_installation_instructions]: https://cloud.google.com/sdk/docs/install
[zenodo_evaluation_dataset]: https://zenodo.org/record/4892545#.YMHH_DYzadY
[scaper]: https://github.com/justinsalamon/scaper
[sed_baseline]: https://github.com/DCASE-REPO/DESED_task/tree/master/recipes/dcase2022_task4_baseline#sed-baseline
[1] L. Delphin-Poulat & C. Plapous, technical report, DCASE 2019.
[2] Turpault, Nicolas, et al. "Sound event detection in domestic environments with weakly labeled data and soundscape synthesis."
[3] Zhang, Hongyi, et al. "mixup: Beyond empirical risk minimization." arXiv preprint arXiv:1710.09412 (2017).
[4] Thomee, Bart, et al. "YFCC100M: The new data in multimedia research." Communications of the ACM 59.2 (2016)
[5] Wisdom, Scott, et al. "Unsupervised sound separation using mixtures of mixtures." arXiv preprint arXiv:2006.12701 (2020).
[6] Turpault, Nicolas, et al. "Improving sound event detection in domestic environments using sound separation." arXiv preprint arXiv:2007.03932 (2020).
[7] Ronchini, Francesca, et al. "The impact of non-target events in synthetic soundscapes for sound event detection." arXiv preprint arXiv:2109.14061 (DCASE2021)
[8] Ronchini, Francesca, et al. "A benchmark of state-of-the-art sound event detection systems evaluated on synthetic soundscapes." arXiv preprint arXiv:2202.01487