
Speech Paper Voting Round 2 #12

Open

EverlynAsiko opened this issue Feb 2, 2022 · 4 comments

@EverlynAsiko
In this issue you can either:

  • Add papers that you think are interesting to read and discuss (please stick to the format).
  • Vote: cast your vote with a 👍 reaction on the paper's comment.

Example: Voting Paper #1

I have added some papers collected from the Papers to read sheet.

@EverlynAsiko

Listen, Attend and Spell

Link to paper

Abstract

We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits characters as outputs. The network produces character sequences without making any independence assumptions between the characters. This is the key improvement of LAS over previous end-to-end CTC models. On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.
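To make the listener concrete, here is a minimal sketch (not the authors' implementation) of a pyramidal BiLSTM encoder in PyTorch. The hidden size and class name are illustrative assumptions; the 40-dim filter-bank input and the three pyramidal layers (an 8x reduction in time resolution) follow the paper's setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidalListener(nn.Module):
    """Sketch of the LAS listener: a BiLSTM followed by pyramidal
    BiLSTM layers that halve the time axis by concatenating adjacent
    frames, so the speller attends over far fewer encoder states."""

    def __init__(self, input_dim=40, hidden_dim=256, pyramid_layers=3):
        super().__init__()
        self.base = nn.LSTM(input_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # Each pyramidal layer consumes two concatenated BiLSTM frames.
        self.pyramid = nn.ModuleList(
            nn.LSTM(4 * hidden_dim, hidden_dim,
                    batch_first=True, bidirectional=True)
            for _ in range(pyramid_layers)
        )

    def forward(self, feats):
        # feats: (batch, time, input_dim) filter-bank spectra
        h, _ = self.base(feats)                 # (B, T, 2*hidden)
        for lstm in self.pyramid:
            B, T, D = h.shape
            if T % 2:                           # pad odd-length sequences
                h = F.pad(h, (0, 0, 0, 1))
                T += 1
            h = h.reshape(B, T // 2, 2 * D)     # halve the time axis
            h, _ = lstm(h)
        return h                                # (B, ~T/8, 2*hidden)

# 100 input frames shrink to 13 encoder states for the attention decoder.
enc = PyramidalListener()(torch.randn(2, 100, 40))
print(enc.shape)  # torch.Size([2, 13, 512])
```

The shrinking matters because the speller's attention must scan every encoder state at every output step; attending over T/8 states instead of T makes both training and decoding far cheaper.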

@EverlynAsiko changed the title from "Speech Paper Voting #1" to "Speech Paper Voting Round 1" on Feb 2, 2022
@EverlynAsiko changed the title from "Speech Paper Voting Round 1" to "Speech Paper Voting Round 2" on Feb 16, 2022
@EverlynAsiko

Automatic speech recognition: a survey

Link to paper

Abstract

Recently, great strides have been made in the field of automatic speech recognition (ASR) by using various deep learning techniques. In this study, we present a thorough comparison of the cutting-edge techniques currently being used in this area, with a special focus on the various deep learning methods. This study explores different feature extraction methods and state-of-the-art classification models, and their impact on ASR performance. As deep learning techniques are very data-dependent, the different speech datasets that are available online are also discussed in detail. Finally, the various online toolkits, resources, and language models that can be helpful in building an ASR system are presented. In this study, we have captured every aspect that can impact the performance of an ASR system; hence, we believe that this work is a good starting point for academics interested in ASR research.
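Since the survey foregrounds feature extraction, here is a short hedged sketch of computing log-mel filter-bank features with torchaudio. The file path is a placeholder, and the window/hop sizes assume 16 kHz audio:

```python
import torch
import torchaudio

# Log-mel filter-bank features: the standard front-end for the neural
# ASR models discussed in this thread.
waveform, sample_rate = torchaudio.load("utterance.wav")  # placeholder path

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=400,        # 25 ms analysis window at 16 kHz
    hop_length=160,   # 10 ms frame shift
    n_mels=40,        # 40 mel bands, matching the LAS input above
)(waveform)

log_mel = torch.log(mel + 1e-6)   # (channels, n_mels, frames)
print(log_mel.shape)
```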

@JRMeyer

JRMeyer commented Feb 16, 2022

Sequence Transduction with Recurrent Neural Networks

The transducer described in this paper extends CTC by defining a distribution over output sequences of all lengths, and by jointly modelling both input-output and output-output dependencies.

Link to paper

Abstract

Many machine learning tasks can be expressed as the transformation — or transduction — of input sequences into output sequences: speech recognition, machine translation, protein secondary structure prediction and text-to-speech to name but a few. One of the key challenges in sequence transduction is learning to represent both the input and output sequences in a way that is invariant to sequential distortions such as shrinking, stretching and translating. Recurrent neural networks (RNNs) are a powerful sequence learning architecture that has proven capable of learning such representations. However, RNNs traditionally require a pre-defined alignment between the input and output sequences to perform transduction. This is a severe limitation since finding the alignment is the most difficult aspect of many sequence transduction problems. Indeed, even determining the length of the output sequence is often challenging. This paper introduces an end-to-end, probabilistic sequence transduction system, based entirely on RNNs, that is in principle able to transform any input sequence into any finite, discrete output sequence. Experimental results for phoneme recognition are provided on the TIMIT speech corpus.
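To make the "all alignments" idea concrete, here is a minimal NumPy sketch (an illustration, not the paper's code) of the transducer forward recursion. It assumes a precomputed lattice log_probs[t, u, k], which in a real model would come from the joint network combining the transcription (encoder) and prediction networks; summing over every interleaving of label emissions and blanks is exactly how the transducer defines a distribution over output sequences of all lengths.

```python
import numpy as np

def rnnt_log_likelihood(log_probs, labels, blank=0):
    """Forward algorithm over the RNN-T lattice.

    log_probs: (T, U+1, K) log-distribution over K symbols (index 0 =
               blank) at each (time step t, labels-emitted-so-far u).
    labels:    the length-U target sequence.
    Returns log P(labels | input), summed over all alignments.
    """
    T, U_plus_1, _ = log_probs.shape
    U = len(labels)
    assert U_plus_1 == U + 1

    alpha = np.full((T, U + 1), -np.inf)   # alpha[t, u]: log-prob of
    alpha[0, 0] = 0.0                      # emitting u labels by time t
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive by emitting label u (moving up in the lattice)...
            emit = (alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]]
                    if u > 0 else -np.inf)
            # ...or by emitting blank (advancing one time step).
            stay = (alpha[t - 1, u] + log_probs[t - 1, u, blank]
                    if t > 0 else -np.inf)
            alpha[t, u] = np.logaddexp(emit, stay)

    # Finish with a final blank after the last label.
    return alpha[T - 1, U] + log_probs[T - 1, U, blank]

# Toy check: 4 frames, target [1, 2], vocabulary {blank, 1, 2}.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3, 3))
log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
print(rnnt_log_likelihood(log_probs, [1, 2]))
```

In practice one would train with an optimized implementation of this recursion (e.g. torchaudio's RNNTLoss) rather than an O(T·U) Python loop, but the lattice above is the whole idea: unlike CTC, the label distribution at each node depends on the labels already emitted.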
