- Sponsors
- About
- Installation
- Data pipeline
- What's available
- What's coming up
- On contributing
- References
## Sponsors

This work would not be possible without cloud resources provided by Google's TPU Research Cloud (TRC) program. I also thank the TRC support team for quickly resolving whatever issues I had: you're awesome!
Want to become a sponsor? Feel free to reach out!
## About

A home for audio ML in JAX. It provides common feature extraction, popular learnable frontends, and pretrained supervised and self-supervised models. Unlike popular frameworks, the objective is not to become an end-to-end, end-all-be-all DL framework, but to act as a starting point for doing things the jax way, through reference implementations and recipes built on the jax / flax / optax stack.
PS: I'm quite new to using Jax and its functional-at-heart design, so I admit the code can be a bit untidy in places. Expect changes, restructuring, and, as the official Jax repository itself warns, sharp edges!
## Installation

```sh
pip install audax
```
To install from the latest source, use the following commands:
```sh
git clone https://github.com/SarthakYadav/audax.git
cd audax
pip install -r requirements.txt
pip install .
```
A Colab installation walkthrough can be found here.
## Data pipeline

- All training is done on custom TFRecords. I initially tried using tensorflow-datasets, but decided against it.
- The tfrecords comprise examples with the audio file stored as an encoded `PCM_16` `flac` buffer, along with label info and duration, resulting in smaller `tfrecord` files and faster I/O compared to storing audio as a sequence of floats. A minimal serialization sketch is shown after this list.
- A step-by-step guide to setting up data can be found in recipes/data_prep, including a sample script to convert data into tfrecords.
- More info can be found in `audax.training_utils.data_v2`.
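For illustration, here is a minimal sketch of how one such example could be serialized. This is not the actual data_prep script, and the feature keys (`audio`, `label`, `duration`) are assumptions based on the description above:

```python
# Hypothetical sketch: serialize one clip as a tf.train.Example with the
# audio encoded as a PCM_16 flac buffer, plus label and duration.
import io

import soundfile as sf
import tensorflow as tf


def serialize_example(waveform, sample_rate, label):
    # encode the float waveform as 16-bit PCM flac instead of raw floats
    buf = io.BytesIO()
    sf.write(buf, waveform, sample_rate, format="FLAC", subtype="PCM_16")
    features = tf.train.Features(feature={
        "audio": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[buf.getvalue()])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
        "duration": tf.train.Feature(
            float_list=tf.train.FloatList(value=[len(waveform) / sample_rate])),
    })
    return tf.train.Example(features=features).SerializeToString()


# usage sketch:
# with tf.io.TFRecordWriter("train-000.tfrec") as writer:
#     writer.write(serialize_example(waveform, 16000, label))
```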
## What's available

At the time of writing, `jax.scipy.signal` does not have a native Short-time Fourier Transform (`stft`) implementation. Rather than emulating the `scipy.signal` implementation, which has a lot more bells and whistles and is more feature-packed, the `stft` implementation in `audax.core` is designed so that it can be built upon to extract `spectrogram` and `melspectrogram` features like those found in torchaudio, which are quite popular. The result is a simple implementation of `stft`, `spectrogram` and `melspectrogram` that is compatible with its torchaudio counterparts, as shown in the figure below.

Currently, `spectrogram` and `melspectrogram` features are supported. Visit the `audax.core` readme for more info.
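To make that layered design concrete, here is a minimal, self-contained sketch of the idea, not the actual `audax.core` code; parameter names follow torchaudio conventions:

```python
# Minimal sketch of an stft -> spectrogram stack in JAX. Signatures are
# illustrative, not necessarily audax.core's API.
import jax.numpy as jnp


def stft(x, n_fft=400, hop_length=160):
    """x: (samples,) mono waveform -> (num_frames, n_fft // 2 + 1) complex."""
    window = jnp.hanning(n_fft)
    num_frames = 1 + (x.shape[-1] - n_fft) // hop_length
    # gather overlapping windowed frames: (num_frames, n_fft)
    idx = hop_length * jnp.arange(num_frames)[:, None] + jnp.arange(n_fft)[None, :]
    frames = x[idx] * window
    return jnp.fft.rfft(frames, n=n_fft)


def spectrogram(x, power=2.0, **stft_kwargs):
    # a spectrogram is a pointwise transform of the stft output, so it
    # composes cleanly with jax.jit / jax.vmap; a melspectrogram further
    # applies a mel filterbank matrix to this result
    return jnp.abs(stft(x, **stft_kwargs)) ** power
```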
Apart from features, `jax.vmap`-compatible mixup and SpecAugment implementations (no TimeStretch as of now, unfortunately) are also provided; a rough mixup sketch follows.
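As an illustration of the `jax.vmap`-friendly style (the function and argument names here are assumptions, not audax's actual API), batch mixup can be written as a per-pair function mapped over the leading axis:

```python
# Hypothetical sketch of vmap-compatible mixup: mix each example with a
# permuted partner using a per-example coefficient lam. Assumes soft or
# one-hot labels y so that labels can be mixed linearly as well.
import jax
import jax.numpy as jnp


def mixup_one(x1, y1, x2, y2, lam):
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2


def mixup_batch(key, x, y, alpha=0.2):
    perm_key, beta_key = jax.random.split(key)
    perm = jax.random.permutation(perm_key, x.shape[0])
    lam = jax.random.beta(beta_key, alpha, alpha, (x.shape[0],))
    # map the per-pair function over the batch dimension
    return jax.vmap(mixup_one)(x, y, x[perm], y[perm], lam)
```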
Several prominent neural network architecture reference implementations are provided, with more to come. The current release has:
- ResNets [1]
- EfficientNet [2]
- ConvNeXT [3]
Pretrained models can be found in the respective recipes; expect more to be added soon.
Two popular learnable feature extraction frontends are available in `audax.frontends`: `LEAF` [4] and `SincNet` [5]. Sample recipes, as well as pretrained models (AudioSet for now), can be found in recipes/leaf; a rough sketch of how such a frontend composes with a backbone is shown below.
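To show where a learnable frontend sits in a model, here is a hedged flax sketch; the module and argument names are illustrative, not the actual `audax.frontends` API. The frontend replaces fixed mel filterbanks and is trained jointly with the encoder:

```python
# Illustrative only: compose a learnable frontend (LEAF / SincNet style)
# with a classification backbone as flax modules.
import flax.linen as nn


class FrontendClassifier(nn.Module):
    frontend: nn.Module   # learnable filterbank operating on raw waveforms
    encoder: nn.Module    # e.g. a ResNet / EfficientNet / ConvNeXT backbone
    num_classes: int

    @nn.compact
    def __call__(self, waveform):
        feats = self.frontend(waveform)   # (batch, frames, channels) features
        embedding = self.encoder(feats)
        return nn.Dense(self.num_classes)(embedding)
```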
- `COLA` [6] models on AudioSet for the various aforementioned architectures can be found in recipes/cola.
- A working implementation of `SimCLR` [7, 8] can be found in recipes/simclr, and pretrained models will be added soon (experiments ongoing!). A simplified sketch of the contrastive objective these recipes rely on follows this list.
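For orientation, both recipes revolve around a contrastive objective over paired embeddings. Below is a simplified, symmetric InfoNCE-style sketch in JAX; the full NT-Xent of [7] additionally counts intra-view pairs as negatives, and the names and temperature value here are assumptions, not the recipe code:

```python
# Hedged sketch of a symmetric InfoNCE-style contrastive loss over two
# views z1, z2 of the same batch, each of shape (B, D).
import jax.numpy as jnp
import optax


def contrastive_loss(z1, z2, temperature=0.1):
    # L2-normalize so the logits are scaled cosine similarities
    z1 = z1 / jnp.linalg.norm(z1, axis=-1, keepdims=True)
    z2 = z2 / jnp.linalg.norm(z2, axis=-1, keepdims=True)
    logits = z1 @ z2.T / temperature      # (B, B); positives on the diagonal
    labels = jnp.arange(z1.shape[0])
    # symmetrized cross-entropy: view 1 -> view 2 and view 2 -> view 1
    loss12 = optax.softmax_cross_entropy_with_integer_labels(logits, labels)
    loss21 = optax.softmax_cross_entropy_with_integer_labels(logits.T, labels)
    return jnp.mean(loss12 + loss21) / 2
```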
## What's coming up

- Pretrained `COLA` models and linear probe experiments. (VERY SOON!)
- Better documentation and walk-throughs.
- Pretrained `SimCLR` models.
- Recipes for Speaker Recognition on VoxCeleb.
- More `AudioSet` pretrained checkpoints for architectures already added.
- Reference implementations for more neural architectures, esp. Transformer-based networks.
## On contributing

- At the time of writing, I've been the sole person involved in the development of this work, and quite frankly, I would love to have help!
- Happy to hear from open source contributors, both newbies and experienced, about their experience and needs.
- Always open to hearing about possible ways to clean up and better structure the code.
## References

[1] He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
[2] Tan, M. and Le, Q., 2019, May. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105-6114). PMLR.
[3] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T. and Xie, S., 2022. A ConvNet for the 2020s. arXiv preprint arXiv:2201.03545.
[4] Zeghidour, N., Teboul, O., Quitry, F. and Tagliasacchi, M., 2021. LEAF: A learnable frontend for audio classification. In International Conference on Learning Representations.
[5] Ravanelli, M. and Bengio, Y., 2018, December. Speaker recognition from raw waveform with sincnet. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 1021-1028). IEEE.
[6] Saeed, A., Grangier, D. and Zeghidour, N., 2021, June. Contrastive learning of general-purpose audio representations. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3875-3879). IEEE.
[7] Chen, T., Kornblith, S., Norouzi, M. and Hinton, G., 2020, November. A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 1597-1607). PMLR.
[8] Chen, T., Kornblith, S., Swersky, K., Norouzi, M. and Hinton, G., 2020. Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33 (pp. 22243-22255).