EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation
Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo Gerkmann
Abstract
We release the EARS (Expressive Anechoic Recordings of Speech) dataset, a high-quality speech dataset comprising 100 h of anechoic recordings at 48 kHz from over 100 English speakers with high demographic diversity. The dataset covers a wide range of speaking styles, including reading tasks, emotional and freeform speech, conversational speech, and non-verbal sounds. We benchmark methods for speech enhancement and dereverberation on the dataset.

Dataset download links and an automatic evaluation server can be found online.
EARS Dataset
The EARS dataset is characterized by its scale, diversity, and high recording quality. In Table 1, we compare its characteristics to those of other speech datasets.
EARS contains 100 h of anechoic speech recordings at 48 kHz from over 100 English speakers with high demographic diversity. The dataset spans the full range of human speech, including reading tasks in seven styles (regular, loud, whisper, fast, slow, high pitch, and low pitch), emotional reading and unconstrained freeform speech in 22 different emotional styles, conversational speech, and non-verbal sounds such as laughter and coughing. We provide transcriptions of the reading portion and speaker metadata (gender, age, race, first language).
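
To make working with such a dataset concrete, the sketch below loads a single recording and looks up the speaker's metadata. The directory layout, file names, and metadata keys used here are illustrative assumptions, not the dataset's documented structure; only the 48 kHz sample rate is stated above.

```python
import json

import soundfile as sf  # third-party: pip install soundfile

# Hypothetical paths -- the actual EARS file layout may differ.
wav_path = "EARS/p001/reading_regular_01.wav"
meta_path = "EARS/speaker_metadata.json"

# Load one anechoic recording; EARS audio is recorded at 48 kHz.
audio, sample_rate = sf.read(wav_path)
assert sample_rate == 48_000, "EARS recordings are expected at 48 kHz"
print(f"{len(audio) / sample_rate:.1f} s of audio at {sample_rate} Hz")

# Look up the speaker's metadata (gender, age, race, first language).
with open(meta_path) as f:
    metadata = json.load(f)
print(metadata.get("p001", {}))
```
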
Audio Examples
Here we present a few audio examples from the EARS dataset.
Benchmarks

The EARS dataset enables various speech processing tasks to be evaluated in a controlled and comparable way. Here, we present benchmarks for speech enhancement and dereverberation.
EARS-WHAM
For the task of speech enhancement, we construct the EARS-WHAM dataset, which mixes speech from the EARS dataset with real noise recordings from the WHAM! dataset. More details can be found in the paper.
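
As a rough illustration of how such noisy-clean pairs can be constructed, the sketch below scales a noise signal to a target signal-to-noise ratio and adds it to clean speech. The SNR range and scaling scheme are illustrative assumptions, not the exact EARS-WHAM recipe, which is specified in the paper.

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech, scaled so the mixture has the requested SNR."""
    # Tile or trim the noise to match the length of the speech signal.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Choose a gain so that 10 * log10(P_speech / P_noise) equals snr_db.
    speech_power = np.mean(speech**2)
    noise_power = np.mean(noise**2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise


# Stand-ins for an EARS utterance and a WHAM! noise clip (1 s at 48 kHz);
# the SNR range below is an assumption, not the published one.
rng = np.random.default_rng(0)
speech = rng.standard_normal(48_000)
noise = rng.standard_normal(48_000)
noisy = mix_at_snr(speech, noise, snr_db=rng.uniform(0.0, 20.0))
```
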
Results
Table 2: Results on EARS-WHAM. Values indicate the mean of the metrics over the test set. The best results are highlighted in bold.
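
As a sketch of the aggregation behind such a table, the snippet below computes one common enhancement metric, SI-SDR, for each clean/enhanced pair and reports its mean over the test set. The metric choice and the stand-in signals are assumptions for illustration; Table 2's exact metric set is not reproduced here.

```python
import numpy as np


def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB."""
    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + 1e-12)
    target = alpha * reference
    distortion = estimate - target
    return 10 * np.log10((np.sum(target**2) + 1e-12) / (np.sum(distortion**2) + 1e-12))


# Average the metric over the test set, as the table caption describes.
# Random signals stand in for (clean, enhanced) pairs loaded from disk.
rng = np.random.default_rng(0)
scores = []
for _ in range(5):
    clean = rng.standard_normal(48_000)
    enhanced = clean + 0.1 * rng.standard_normal(48_000)
    scores.append(si_sdr(clean, enhanced))
print(f"mean SI-SDR: {np.mean(scores):.2f} dB")
```
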
Audio Examples
Here we present audio examples for the speech enhancement task. Below we show the noisy input, processed files for Conv-TasNet, CDiffuSE, Demucs, SGMSE+, and the clean ground truth.