Skip to content

Commit

Permalink
Update README.md - Added Abstract
Browse files Browse the repository at this point in the history
  • Loading branch information
Srija616 authored Jun 11, 2024
1 parent 4f9f75c commit e9951b1
Showing 1 changed file with 20 additions and 1 deletion.
21 changes: 20 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -1 +1,20 @@
# IndicOOV
# IndicOOV

## Abstract
Publicly available TTS datasets for low-resource languages like
Hindi and Tamil typically contain 10-20 hours of data, lead-
ing to poor vocabulary coverage. This limitation becomes evi-
dent in downstream applications where domain-specific vocab-
ulary coupled with frequent code-mixing with English, results
in many OOV words. To highlight this problem, we create a
benchmark containing OOV words from several real-world ap-
plications. Indeed, state-of-the-art Hindi and Tamil TTS sys-
tems perform poorly on this OOV benchmark, as indicated by
intelligibility tests. To improve the model’s OOV performance,
we propose a low-effort and economically viable strategy to ob-
tain more training data. Specifically, we propose using volun-
teers as opposed to high quality voice artists to record words
containing character bigrams unseen in the training data. We
show that using such inexpensive data, the model’s performance
improves on OOV words, while not affecting voice quality and
in-domain performance.

0 comments on commit e9951b1

Please sign in to comment.