This project translates speech into images in two steps. First, automatic speech recognition (ASR) converts the speech signal into text; then StackGAN generates high-quality images from the resulting text descriptions.
Automatic speech recognition (ASR) consists of transcribing audio speech segments into text. ASR can be treated as a sequence-to-sequence problem, where the audio can be represented as a sequence of feature vectors and the text as a sequence of characters, words, or subword tokens.
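To make the sequence-to-sequence framing concrete, here is a minimal sketch of how a waveform could become a sequence of feature vectors and a transcript a sequence of character ids. It is an illustration only, not this project's preprocessing: the helper names (`frame_signal`, `magnitude_spectrum`), the frame length, the hop size, the synthetic sine input, and the character vocabulary are all assumptions chosen for the example, and a real pipeline would normally use an optimized STFT rather than this naive DFT.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Apply a Hann window and return the magnitudes of the non-negative DFT bins."""
    n = len(frame)
    windowed = [x * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


# Hypothetical input: 0.1 s of a 1 kHz sine sampled at 8 kHz.
sample_rate = 8000
wave = [math.sin(2.0 * math.pi * 1000 * t / sample_rate) for t in range(800)]

# Source sequence: one feature vector (spectrum) per audio frame.
features = [magnitude_spectrum(f) for f in frame_signal(wave, frame_len=64, hop=32)]

# Target sequence: one integer id per character (assumed lowercase vocabulary).
vocab = " abcdefghijklmnopqrstuvwxyz"
char_to_id = {c: i for i, c in enumerate(vocab)}
targets = [char_to_id[c] for c in "speech to text"]
```

The model then learns a mapping from `features` (length set by the audio duration) to `targets` (length set by the transcript), which is why the two sequences may have very different lengths.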
The dataset we use is LJSpeech, which is drawn from the LibriVox project. It consists of short audio clips of a single speaker reading passages from 7 non-fiction books. Our model is similar to the original Transformer (both encoder and decoder) proposed in the paper "Attention is All You Need".
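For readers unfamiliar with the Transformer, its core operation is scaled dot-product attention, softmax(QK^T / sqrt(d_k))V. The sketch below is a plain-Python version of that general formula from the paper, written for readability; it is not the code used in this project, and the function names and toy inputs are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (one head, no masking)."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


# Toy example: the query matches the first key more closely than the second,
# so the output leans towards the first value vector.
context = attention(queries=[[1.0, 0.0]],
                    keys=[[1.0, 0.0], [0.0, 1.0]],
                    values=[[10.0, 0.0], [0.0, 10.0]])
```

In an encoder-decoder ASR model, the decoder would use this operation to attend over the encoded audio frames while emitting each output token.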
- weights (your model weights will be saved here)
- test (generated images from our stage I GAN)
- results_stage2 (generated images from stage II of the GAN)
Download the bird image dataset from: https://data.caltech.edu/records/65de6-vp158
Download the char-CNN-RNN text embeddings for birds from: https://github.com/hanzhanggit/StackGAN
- char-CNN-RNN-embeddings.pickle — Dataframe for the pre-trained embeddings of the text.
- filenames.pickle — Dataframe containing the filenames of the images.
- class_info.pickle — Dataframe containing the info of classes for each image.
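The three files above could be loaded along these lines. This is a hedged sketch: the helper names and the `data_dir` argument are hypothetical, the `latin1` encoding assumes the pickles were written under Python 2 (drop it if they load without it), and the length check assumes the three files hold one entry per image in the same order.

```python
import pickle
from pathlib import Path


def load_pickle(path, encoding="latin1"):
    """Load one pickle file; latin1 is an assumption for Python 2 era pickles."""
    with open(path, "rb") as handle:
        return pickle.load(handle, encoding=encoding)


def load_text_data(data_dir):
    """Load embeddings, filenames, and class info from a directory (hypothetical layout)."""
    data_dir = Path(data_dir)
    embeddings = load_pickle(data_dir / "char-CNN-RNN-embeddings.pickle")
    filenames = load_pickle(data_dir / "filenames.pickle")
    class_info = load_pickle(data_dir / "class_info.pickle")

    # Assumed invariant: one embedding entry and one class entry per image file.
    if not (len(embeddings) == len(filenames) == len(class_info)):
        raise ValueError("embeddings, filenames, and class info differ in length")
    return embeddings, filenames, class_info
```

A call such as `load_text_data("path/to/train")` would then return the three objects together, failing early if the files are out of sync.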
- Stage 1
  - Text Encoder Network
    - Text description to a 1024 dimensional text embedding
    - Learning Deep Representations of Fine-Grained Visual Descriptions [Arxiv Link]
  - Conditioning Augmentation Network
    - Adds randomness to the network
    - Produces more image-text pairs
  - Generator Network
  - Discriminator Network
  - Embedding Compressor Network
  - Outputs a 64x64 image
- Stage 2
  - Text Encoder Network
  - Conditioning Augmentation Network
  - Generator Network
  - Discriminator Network
  - Embedding Compressor Network
  - Outputs a 256x256 image
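The Conditioning Augmentation step listed in both stages can be sketched as a reparameterized Gaussian sample, c = mean + exp(log_sigma) * eps with eps drawn from N(0, 1). In StackGAN the mean and log standard deviation are produced by a learned layer applied to the text embedding; here they are hard-coded toy values, and the function name, vector size, and seed are assumptions made for illustration rather than this repository's code.

```python
import math
import random


def conditioning_augmentation(mean, log_sigma, rng=random):
    """Sample a conditioning vector c = mean + exp(log_sigma) * eps, eps ~ N(0, 1)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0)
            for m, s in zip(mean, log_sigma)]


# Toy values standing in for the outputs of the learned layer (assumption).
rng = random.Random(0)
mean = [0.5, -1.0, 2.0]
log_sigma = [0.0, 0.0, 0.0]

# The same text embedding yields a different conditioning vector on each draw,
# which is how the network gets extra randomness and more image-text pairs.
sample_a = conditioning_augmentation(mean, log_sigma, rng)
sample_b = conditioning_augmentation(mean, log_sigma, rng)
```

Each sampled vector would be concatenated with a noise vector and passed to the generator, so one caption can map to many plausible images.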
- Attention is All You Need [Arxiv Link]
- Very Deep Self-Attention Networks for End-to-End Speech Recognition [Arxiv Link]
- Speech-Transformer [IEEE Xplore]
- StackGAN: Text to photo-realistic image synthesis [Arxiv Link]