audax

Sponsors

This work would not be possible without cloud resources provided by Google's TPU Research Cloud (TRC) program. I also thank the TRC support team for quickly resolving whatever issues I had: you're awesome!

Want to become a sponsor? Feel free to reach out!

About

A home for audio ML in JAX. It has common features, popular learnable frontends, and pretrained supervised and self-supervised models. Unlike popular frameworks, the objective is not to become an end-to-end, be-all-end-all DL framework, but to serve as a starting point for doing things the JAX way, through reference implementations and recipes built on the jax / flax / optax stack.

PS: I'm quite new to JAX and its functional-at-heart design, so I admit the code can be a bit untidy in places. Expect changes, restructuring, and, as the official JAX repository itself warns, sharp edges!

Installation

pip install audax

To install from the latest source, use the following commands:

git clone https://github.com/SarthakYadav/audax.git
cd audax
pip install -r requirements.txt
pip install .

A Colab installation walkthrough can be found here.

Data pipeline

  • All training is done on custom TFRecords. I initially tried tensorflow-datasets, but decided against it.
  • TFRecords consist of examples with the audio stored as an encoded PCM_16 FLAC buffer, along with label info and duration, resulting in smaller TFRecord files and faster I/O compared to storing audio as a sequence of floats.
  • A step-by-step guide to setting up data can be found in recipes/data_prep, including a sample script to convert data into TFRecords.
  • More info can be found in audax.training_utils.data_v2.
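To make the layout concrete, here's a minimal sketch of writing one such record. The feature names ("audio", "label", "duration") are assumptions for illustration only; the actual schema lives in recipes/data_prep.

```python
# Hypothetical sketch of the TFRecord layout described above; the feature
# names ("audio", "label", "duration") are assumed, not audax's actual schema.
import io

import numpy as np
import soundfile as sf
import tensorflow as tf

def serialize_example(waveform, sample_rate, label, duration):
    # Encode the waveform as a PCM_16 FLAC buffer instead of raw floats.
    buf = io.BytesIO()
    sf.write(buf, waveform, sample_rate, format="FLAC", subtype="PCM_16")
    feature = {
        "audio": tf.train.Feature(bytes_list=tf.train.BytesList(value=[buf.getvalue()])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        "duration": tf.train.Feature(float_list=tf.train.FloatList(value=[duration])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# One second of silence at 16 kHz, written to a single shard.
waveform = np.zeros(16000, dtype=np.int16)
with tf.io.TFRecordWriter("train-00000.tfrec") as writer:
    writer.write(serialize_example(waveform, 16000, label=0, duration=1.0))
```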

What's available

Audio feature extraction

At the time of writing, jax.scipy.signal does not have a native short-time Fourier transform (STFT) implementation.

Instead of emulating the scipy.signal implementation, which comes with a lot more bells and whistles, the STFT implementation in audax.core is designed so that it can be built upon to extract spectrogram and melspectrogram features like those found in torchaudio, which are quite popular. The result is a simple implementation of stft, spectrogram, and melspectrogram, each compatible with its torchaudio counterpart, as shown in the figure below.

Figure: audax vs. torchaudio feature comparison.

Currently, spectrogram and melspectrogram features are supported. Visit audax.core.readme for more info.
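To illustrate the core idea, below is a generic framed-FFT sketch of an STFT power spectrogram. This is not audax's actual API (see audax.core.readme for the real functions); it only shows why a simple, pure-function STFT composes well with JAX transforms.

```python
# Minimal, illustrative STFT power spectrogram in JAX; NOT the audax API.
import jax
import jax.numpy as jnp

def stft_spectrogram(x, n_fft=400, hop_length=160):
    # Frame a 1-D signal, apply a Hann window, and take the squared
    # magnitude of the real FFT of each frame.
    window = jnp.hanning(n_fft)
    n_frames = 1 + (x.shape[-1] - n_fft) // hop_length
    idx = jnp.arange(n_fft)[None, :] + hop_length * jnp.arange(n_frames)[:, None]
    frames = x[idx] * window                            # (n_frames, n_fft)
    return jnp.abs(jnp.fft.rfft(frames, n=n_fft)) ** 2  # (n_frames, n_fft//2 + 1)

# Being a pure function, it maps over a batch with jax.vmap and jit-compiles.
batch = jnp.zeros((8, 16000))
specs = jax.vmap(stft_spectrogram)(batch)               # (8, n_frames, 201)
```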

Apart from features, jax.vmap-compatible mixup and SpecAugment implementations are also provided (no TimeStretch as of now, unfortunately).
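For a flavour of what "jax.vmap compatible" means here, the sketch below shows a generic mixup (not the audax implementation): a pure function over one pair of examples that vmap lifts to the batch, with one PRNG key per pair.

```python
# Generic mixup sketch (not audax's implementation): mix two examples and
# their labels with a Beta-distributed coefficient.
import jax
import jax.numpy as jnp

def mixup(rng, x1, y1, x2, y2, alpha=0.2):
    lam = jax.random.beta(rng, alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2

xs = jnp.zeros((8, 16000))
ys = jnp.zeros((8, 10))            # one-hot labels
keys = jax.random.split(jax.random.PRNGKey(0), 8)
perm = jnp.roll(jnp.arange(8), 1)  # pair each example with its neighbour
mixed_x, mixed_y = jax.vmap(mixup, in_axes=(0, 0, 0, 0, 0, None))(
    keys, xs, ys, xs[perm], ys[perm], 0.2)
```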

Network architectures

Several prominent neural network architecture reference implementations are provided, with more to come. The current release includes:

  • ResNets [1]
  • EfficientNet [2]
  • ConvNeXt [3]

Pretrained models can be found in the respective recipes; expect more to be added soon.

Learnable frontends

Two popular learnable feature extraction frontends are available in audax.frontends: LEAF [4] and SincNet [5]. Sample recipes, as well as pretrained models (AudioSet for now), can be found in recipes/leaf.
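Usage, sketched under assumptions: the import path and module name below are guesses for illustration, not audax's confirmed API; check recipes/leaf for the actual entry points. As flax modules, the frontends should follow the usual init/apply pattern:

```python
# Hedged sketch: the import path and constructor are assumptions,
# not audax's confirmed API; see recipes/leaf for the real usage.
import jax
import jax.numpy as jnp
from audax.frontends.leaf import Leaf  # assumed import path

frontend = Leaf()                                     # default config assumed
audio = jnp.zeros((1, 16000))                         # (batch, samples) at 16 kHz
params = frontend.init(jax.random.PRNGKey(0), audio)  # standard flax init
features = frontend.apply(params, audio)              # learned filterbank output
```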

Self-supervised models

  • COLA [6] models on AudioSet for the various aforementioned architectures can be found in recipes/cola; a sketch of the COLA objective follows this list.
  • A working implementation of SimCLR [7, 8] can be found in recipes/simclr; pretrained models will be added soon (experiments ongoing!).
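For reference, COLA scores anchor/positive segment embeddings with a learned bilinear similarity and trains with cross-entropy against in-batch negatives. A minimal sketch of that objective (not the recipe's actual code):

```python
# Minimal sketch of the COLA-style objective (not the recipe's actual code):
# bilinear similarity between anchor and positive embeddings, with the other
# batch elements serving as negatives.
import jax.numpy as jnp
import optax

def cola_loss(anchors, positives, w):
    # anchors, positives: (batch, dim) embeddings; w: (dim, dim) bilinear weight.
    logits = anchors @ w @ positives.T     # (batch, batch) similarity matrix
    labels = jnp.arange(anchors.shape[0])  # the i-th positive matches the i-th anchor
    return optax.softmax_cross_entropy_with_integer_labels(logits, labels).mean()
```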

What's coming up

  • Pretrained COLA models and linear probe experiments. (VERY SOON!)
  • Better documentation and walk-throughs.
  • Pretrained SimCLR models.
  • Recipes for speaker recognition on VoxCeleb.
  • More AudioSet pretrained checkpoints for architectures already added.
  • Reference implementations for more neural architectures, especially Transformer-based networks.

On contributing

  • At the time of writing, I've been the sole person involved in the development of this work and, quite frankly, would love to have help!
  • Happy to hear from open-source contributors, both newbies and experienced, about their experience and needs.
  • Always open to hearing about possible ways to clean up and better structure the code.

References

[1] He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
[2] Tan, M. and Le, Q., 2019, May. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105-6114). PMLR.
[3] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T. and Xie, S., 2022. A ConvNet for the 2020s. arXiv preprint arXiv:2201.03545.
[4] Zeghidour, N., Teboul, O., Quitry, F. and Tagliasacchi, M., 2021. LEAF: A learnable frontend for audio classification. In International Conference on Learning Representations.
[5] Ravanelli, M. and Bengio, Y., 2018, December. Speaker recognition from raw waveform with sincnet. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 1021-1028). IEEE.
[6] Saeed, A., Grangier, D. and Zeghidour, N., 2021, June. Contrastive learning of general-purpose audio representations. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3875-3879). IEEE.
[7] Chen, T., Kornblith, S., Norouzi, M. and Hinton, G., 2020, November. A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 1597-1607). PMLR.
[8] Chen, T., Kornblith, S., Swersky, K., Norouzi, M. and Hinton, G.E., 2020. Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33, pp. 22243-22255.
