
This repository contains the deep learning with audio examples and the course materials for DOM-E5129 - Intelligent Computational Media. Documentation on different Deep Learning Audio systems as well as instructions on using some of them. Tools for loading, playing and plotting audio. Some working simple classifiers Non-working sample-level/raw …


SopiMlab/DeepLearningWithAudio18


Deep Learning with Audio

DOM-E5129 - Intelligent Computational Media

State of audio generation in Deep Learning (December 2018)

Speech and music (MIDI) generation are doing well, but the methods that work well with images don’t translate well to the audio domain. Converting sounds to spectrograms with various signal processing algorithms makes it possible to use image models, but the results tend to be underwhelming and the sound quality is poor.
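To make the spectrogram route concrete, here is a minimal sketch of turning a waveform into an image-like array with a short-time Fourier transform. It uses only NumPy; the function name, frame length, hop size and the 8 kHz sample rate are illustrative assumptions, not anything used by this repository. Note that the magnitude spectrogram discards phase, which then has to be estimated again to get sound back out.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Return a (frames, frame_len // 2 + 1) array of STFT magnitudes."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Keep only the magnitude; the phase is thrown away here.
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440 Hz sine at an assumed 8 kHz sample rate.
sr = 8000
t = np.arange(sr) / sr
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))

# The strongest frequency bin should sit next to 440 Hz.
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sr / 256
```

The resulting 2D array can be fed to an image model like any single-channel picture, with time on one axis and frequency on the other.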

A blog post going deeper into why this is the case.

WaveNet (September 2016) was a massive breakthrough in audio generation. It creates waveforms sample by sample, which seems to be the reason why it generates so much better results. It’s a convolutional neural network, a type of model that wasn’t usually used for generation before. It is mainly used to create natural speech, but there were some tests with music generation too. It is one of the applications that has seen widespread real-world use.
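The sample-by-sample idea can be sketched in a few lines. The code below is not WaveNet itself: it shows a single causal dilated convolution (the building block described in the WaveNet paper), the receptive field you get from stacking them, and an autoregressive loop in which every new sample is predicted from the previous ones. The kernel size of 2 and the dilations 1, 2, 4, ..., 512 are taken from the paper rather than from this README, and `toy_model` is a hypothetical stand-in for a trained network.

```python
import numpy as np

def receptive_field(kernel_size, dilations):
    """Number of samples (including the current one) one output depends on."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], treating x as 0 before t = 0."""
    y = np.zeros(len(x))
    for k, w in enumerate(weights):
        shift = k * dilation
        if shift >= len(x):
            continue  # this tap only ever sees the zero padding
        y[shift:] += w * x[: len(x) - shift]
    return y

def generate(predict_next, seed, n_samples, context):
    """Autoregressive loop: each new sample is predicted from the previous ones."""
    out = list(seed)
    for _ in range(n_samples):
        out.append(predict_next(np.array(out[-context:])))
    return np.array(out)

# Dilations 1, 2, 4, ..., 512 with kernel size 2 (assumed from the WaveNet paper).
rf = receptive_field(2, [2 ** i for i in range(10)])

# Impulse response of one causal layer: the output never looks into the future.
impulse = np.zeros(8)
impulse[0] = 1.0
response = causal_dilated_conv(impulse, [1.0, 0.5], dilation=2)

# Hypothetical stand-in "model": an exact two-tap recurrence for a sine wave.
omega = 2 * np.pi * 0.05
toy_model = lambda ctx: 2 * np.cos(omega) * ctx[-1] - ctx[-2]
wave = generate(toy_model, seed=[0.0, np.sin(omega)], n_samples=50, context=2)
```

Because every output sample needs its own pass through the loop, naive generation is slow, which is the problem the continuation paper listed below addresses.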

Two Minute Paper video about WaveNet

Continuation Paper that makes generation a lot faster (November 2017)

It's part of Google Duplex, the restaurant reservation Assistant (May 2018)

GANs are a good example of how much more slowly the audio domain is progressing than computer vision and image generation. The original GAN paper came out in 2014, and there have been multiple amazing applications of it in recent years, yet it took until 2018 for anyone to combine the WaveNet sample generation approach with a GAN.

Failed attempt from January 2017

Successful version from January 2018

One of the most promising works is “A Universal Music Translation Network” (May 2018) by Facebook Research. It can take a piece of music played one way and translate it to another style: Piano -> Harpsichord, Band -> Orchestra, Whistling -> Orchestra. It uses a clever system of convolving input into a shared musical “language” that it can then translate to different styles or instruments with separately trained models. Unfortunately, the code for the project is not available, and the model was trained for 6 days with 8 GPUs.

One huge problem with all of these systems is that the published results are very idealised: when only the best results are picked, they give a misleading picture of what is actually possible. A good early example is GRUV, all the way from 2015. It seems to generate music, but it actually just memorizes it (down to the lyrics). A more likely outcome in the current situation is presented in this video (three full days of training with just some plausible stuttering backing vocals to show for it).

With massive datasets, it seems very likely that impressive results are just clever sampling from the dataset.

The only reasonable and accessible system seems to be Magenta. It has a great set of trained models for different types of musical improvisation. It is also designed to work in the browser for fun, easily accessible toys. The problem is that it’s mainly MIDI-based, which massively limits the possibilities. Magenta also includes NSynth, a system that can combine instruments in fascinating ways, and you can actually use it as an instrument (March 2018).

Almost all of the applications listed here take intense amounts of training. Most of the big papers are training with 10-32 GPUs for around a week.

So any attempted practical application of these systems is likely to be unsuccessful at the current time.
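To put the training budgets quoted above into perspective, the small calculation below converts them into GPU-hours. The figures are the ones from this text (10-32 GPUs for about a week, and 8 GPUs for 6 days for the music translation network); the single-GPU estimate assumes perfectly linear scaling, which is an optimistic simplification.

```python
def gpu_hours(gpus, days):
    """Total GPU-hours for a run on `gpus` devices lasting `days` days."""
    return gpus * days * 24

low = gpu_hours(10, 7)    # lower end of "10-32 GPUs for around a week"
high = gpu_hours(32, 7)   # upper end of the same range
umtn = gpu_hours(8, 6)    # Universal Music Translation Network figures

# Days the cheapest "big paper" run would take on one GPU,
# assuming perfectly linear scaling (an optimistic assumption).
single_gpu_days = low / 24

print(low, high, umtn, single_gpu_days)
```

Even the lower end works out to more than two months on a single GPU.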

Promising or interesting works

Strange and interesting offshoot work

Datasets

Datasets are another huge problem currently. There aren’t many large, high-quality audio datasets. Especially for non-music, non-speech sounds, it feels pretty dead.

  • Google AudioSet

    • It is really big and categorized, but the problem is that it’s just 10-second clips of YouTube videos, with the type of sound somewhere in there. One clip might even contain multiple types of sound. Good for classification, terrible for generation. Also, there are some legal problems with extracting just the audio from these videos.
    • The VEGAS dataset is a human-curated subset of AudioSet that is less noisy and generally better for sound generation tasks.
  • ESC-50

    • A dataset of 50 different environmental sounds. Its main use is benchmarking classification, but it’s one of the only sources of quality environmental sounds currently. The problem is that it’s very small, with 40 sounds per category, which makes it tricky to use for generation.
  • The NSynth Dataset

    • Absolutely massive set of 300 000 sound files. It’s basically notes played on different instruments. It’s done with MIDI instruments, so it’s not the most interesting in that sense, but it’s easily big enough for generation too.
  • Speech Commands Dataset and SC Zero to Nine Speech Commands

    • There are multiple datasets for speech commands, and they tend to be large and high-quality. Human speech is just not the most interesting thing to generate, but it’ll likely be the baseline for any future systems.
  • Kaggle audio datasets

    • There are some strange things here and more must be coming, but the quality varies wildly.
  • There are also many sources of sound effects, for example, but considering the amount you need, collecting them from different sources would be a major undertaking. One fun one is the BBC sound effect archive.
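The size gap between these datasets is easier to see as hours of audio. The clip counts below come from the descriptions above (50 categories with 40 sounds each for ESC-50, 300 000 files for NSynth). The clip lengths of 5 seconds for ESC-50 and 4 seconds for NSynth are assumptions based on the datasets' own documentation and are not stated in this text.

```python
def total_hours(n_clips, clip_seconds):
    """Total audio duration in hours for equally long clips."""
    return n_clips * clip_seconds / 3600

# ESC-50: 50 categories x 40 sounds; 5-second clips are an assumption.
esc50_clips = 50 * 40
esc50_hours = total_hours(esc50_clips, 5)

# NSynth: 300 000 notes; 4-second clips are an assumption.
nsynth_hours = total_hours(300_000, 4)

print(f"ESC-50: {esc50_clips} clips, {esc50_hours:.1f} h")
print(f"NSynth: {nsynth_hours:.1f} h")
```

Under these assumptions ESC-50 holds under three hours of audio in total, while NSynth holds several hundred.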

Other notes

  • The audio sample approach is so unexplored that many frameworks don’t even have a Conv1DTranspose implementation, so people make their own by running the data through Conv2DTranspose.
  • The only audio tutorial for TensorFlow is based on spectrograms and only does speech recognition.
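The Conv2DTranspose workaround mentioned above relies on a simple observation: a 1D signal is a 2D image with a height of 1. The sketch below demonstrates the idea in plain NumPy for a single channel with no padding. It is not any framework's API; `conv2d_transpose` and `conv1d_transpose` are hypothetical helper names. In a framework, the same trick would amount to adding a dummy axis before the 2D layer and removing it afterwards.

```python
import numpy as np

def conv2d_transpose(x, kernel, stride):
    """Single-channel transposed 2D convolution without padding.

    x: (H, W) input, kernel: (kh, kw), stride: (sh, sw).
    """
    h, w = x.shape
    kh, kw = kernel.shape
    sh, sw = stride
    out = np.zeros(((h - 1) * sh + kh, (w - 1) * sw + kw))
    # Each input value "stamps" a scaled copy of the kernel into the output.
    for i in range(h):
        for j in range(w):
            out[i * sh : i * sh + kh, j * sw : j * sw + kw] += x[i, j] * kernel
    return out

def conv1d_transpose(x, kernel, stride):
    """1D transposed convolution built on the 2D one via a dummy height axis."""
    y = conv2d_transpose(x[np.newaxis, :], kernel[np.newaxis, :], (1, stride))
    return y[0]  # drop the dummy axis again

# Upsample three samples by a factor of two with a length-2 kernel.
x = np.array([1.0, 2.0, 3.0])
upsampled = conv1d_transpose(x, np.array([1.0, 1.0]), stride=2)
print(upsampled)
```

With stride 2 and a kernel of ones, each input sample is simply repeated twice, which is the kind of learned upsampling a generator network uses to go from a short latent sequence to a long waveform.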

Other interesting links

  • Creative.ai
    • An organization dedicated to creating interesting creative applications of AI in as many different fields as possible.
  • Keras-GAN on GitHub
    • Repository of most of the biggest image GANs, implemented in Keras.
  • SeedBank
    • A collection of interactive machine learning examples running on Google Colab (with free GPUs).
