From d6b530e85c58c2c842469575ff3c3a2a5f264d66 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alexander=20K=C3=B6nig?=
Date: Thu, 22 Aug 2024 12:11:21 +0200
Subject: [PATCH] added corpora of disordered speech

---
 .../1-CSD.csv | 18 ++++++++++++++++++
 .../1-CSD.csv | 16 ++++++++++++++++
 2 files changed, 34 insertions(+)
 create mode 100644 rfhg/static/resource_families/Corpora/Disordered speech corpora/1-Corpora of disordered speech in the CLARIN infrastructure/1-CSD.csv
 create mode 100644 rfhg/static/resource_families/Corpora/Disordered speech corpora/2-Other corpora of disordered speech/1-CSD.csv

diff --git a/rfhg/static/resource_families/Corpora/Disordered speech corpora/1-Corpora of disordered speech in the CLARIN infrastructure/1-CSD.csv b/rfhg/static/resource_families/Corpora/Disordered speech corpora/1-Corpora of disordered speech in the CLARIN infrastructure/1-CSD.csv
new file mode 100644
index 0000000..c816f7c
--- /dev/null
+++ b/rfhg/static/resource_families/Corpora/Disordered speech corpora/1-Corpora of disordered speech in the CLARIN infrastructure/1-CSD.csv
@@ -0,0 +1,18 @@
+Corpus;Corpus_URL;Language;Size;Annotation;Licence;Description;Buttons;Buttons_URL;Publication;Publication_URL;Note
+AphasiaBank;https://aphasia.talkbank.org/;Cantonese, Croatian, English, French, German, Greek, Hungarian, Italian, Japanese, Mandarin, Romanian, Spanish;380 MB transcripts, 827 GB media;CHAT and CA/CHAT;email request for access;This is a corpus of multimedia interactions for the study of communication in aphasia.#SEPAccess to the data in AphasiaBank is password protected and restricted to members of the AphasiaBank consortium group.#SEPData in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.;Browse;https://sla.talkbank.org/TBB/aphasia;;;CLARIN
+Croatian corpus of non-professional written language by typical speakers and speakers with language disorders RAPUT 1.0;http://hdl.handle.net/11356/1435;Croatian;6760 texts, 34469 sentences, 426187 tokens;MULTEXT-East tagset;CC-BY-SA 4.0;The corpus consists of texts produced by nonprofessional typical speakers and speakers with different language disorders (developmental language disorder, dyslexia, traumatic brain injury, aphasia, other).#SEPRoughly half of the corpus consists of texts of typical speakers, and the other half of speakers with language disorders.#SEPLanguage samples were elicited by six groups of tasks representing different writing styles (descriptive, expository, narrative, and letter) and different levels of formality.;Download;http://hdl.handle.net/11356/1435;Kuvač Kraljević et al. (2021);https://hrcak.srce.hr/file/370152;CLARIN
+ADHD and SLI corpus UvA database;https://hdl.handle.net/1839/00-2766F32F-4305-4F13-A02C-F4A8F5216425;Dutch;4 GB (67 recordings) of 26 Dutch children with ADHD, 19 Dutch children with SLI, 22 Dutch control children;Transcriptions (CHAT-format);CLARIN PUB (Transcriptions), CLARIN RESTRICTED (Recordings);This corpus aims to compare the language and executive functioning profiles of children with ADHD to children with Specific Language Impairment and children with Tourette’s Disorder.;Download;https://hdl.handle.net/1839/00-2766F32F-4305-4F13-A02C-F4A8F5216425;;;CLARIN
+Bilingual deaf children RU-Kentalis database;https://hdl.handle.net/1839/00-F6BC06C4-B2AD-4ED8-8527-AB81F4EF4E8F;Dutch;4 GB complete video recordings. 1 GB selected parts video recordings. 0.1 GB selected parts transcripts. 0.5 GB test and background data of 11 deaf children, longitudinal, 104 recordings;CHAT-like format for 104 recordings;CLARIN PUB (Transcriptions), CLARIN RESTRICTED (Recordings);The corpus is used for investigating the bilingual language and communication development of young deaf children in Sign Language of the Netherlands (SLN) and Dutch.;Download;https://hdl.handle.net/1839/00-F6BC06C4-B2AD-4ED8-8527-AB81F4EF4E8F;Klatter-Folmer et al. (2016);https://doi.org/10.1093/deafed/enj032;CLARIN
+SLI RU-Kentalis database;https://hdl.handle.net/1839/00-97AF29EA-877D-422A-BAF7-25FA269351A6;Dutch;2 GB;Praat transcripts;CLARIN PUB (Transcriptions), CLARIN RESTRICTED (Recordings);The corpus has been collected to investigate the expression of spatial relations by children with SLI and normally developing children in their spoken language production.;Download;https://hdl.handle.net/1839/00-712802F3-C245-4EF0-BE9D-D09714DEDE67;;;CLARIN
+Dutch Corpus of Pathological and Normal Speech (COPAS);http://hdl.handle.net/10032/tm-a2-n3;Dutch (Flemish);319 speakers of which 122 normal controls and 197 with a speech disorder. Corpus size: 1.3 GB;Orthographic transcription;Academic, bespoke;This corpus has been constructed within the framework of the project Speech Algorithms for Clinical and Educational applications (SPACE).;Download;http://hdl.handle.net/10032/tm-a2-n3;Middag et al. (2010);http://hdl.handle.net/1854/LU-1053399;CLARIN
+FluencyBank;https://fluency.talkbank.org/;Dutch, English, French, German;481 MB transcripts, 207 GB media;CHAT and CA/CHAT;email request for access;This corpus is intended for the study of fluency development.#SEPParticipants include typically-developing monolingual and bilingual children, children and adults who stutter (C/AWS) or who clutter (C/AWC), and second language learners.#SEPAccess to the research data in FluencyBank is password protected and restricted to members of the FluencyBank consortium group, although a subset of the corpus is publicly available.#SEPData in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.;Browse;https://sla.talkbank.org/TBB/fluency;;;CLARIN
+ASDBank;https://asd.talkbank.org/;Dutch, English, French, Greek, Mandarin, Spanish;42 MB transcripts, 401 MB media;CHAT and CA/CHAT;open access;This is a corpus of multimedia interactions for the study of communication in autism-spectrum disorder.#SEPData in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.;Browse;https://sla.talkbank.org/TBB/asd;;;CLARIN
+Deaf adults RU database;https://hdl.handle.net/1839/00-97AF29EA-877D-422A-BAF7-25FA269351A6;Dutch, Turkish, Moroccan;2 GB of 46 deaf Dutch adults, 38 hearing Turkish adults, 24 hearing Moroccan adults, 10 Dutch controls;;CLARIN PUB (Transcriptions), CLARIN RESTRICTED (Recordings);This corpus aims at the investigation of the acquisition of Dutch by deaf Dutch adults (late L1/early L2) and comparison to hearing Turkish and Moroccan-Arabic adults.;Download;https://hdl.handle.net/1839/00-97AF29EA-877D-422A-BAF7-25FA269351A6;Parriger (2012);https://pure.uva.nl/ws/files/1840998/113644_thesis.pdf;CLARIN
+TBIBank;https://tbi.talkbank.org/;English;63 MB transcripts, 98 GB media;CHAT and CA/CHAT;email request for access;This is a corpus of multimedia interactions for the study of communication in people with traumatic brain injury.#SEPAccess to the data in TBIBank is password protected and restricted to members of the TBIBank consortium group.#SEPData in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.;Browse;https://sla.talkbank.org/TBB/tbi;;;CLARIN
+PsychosisBank;https://psychosis.talkbank.org/;English (various dialects), Spanish;Not available;CHAT and CA/CHAT;email request for access;This is a corpus intended for the study of language in psychosis.#SEPThe site is noted as under construction.#SEPData in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.;;;;;CLARIN
+Alzheimer's Dementia Recognition through Spontaneous Speech (audio only): The ADReSSo Challenge;https://sla.talkbank.org/TBB/dementia;English, German, Mandarin, Spanish, Taiwanese;;CHAT and CA/CHAT;email request for access;This is a corpus of multimedia interactions for the study of communication in dementia.#SEPAccess to the data in DementiaBank is password protected and restricted to members of the DementiaBank consortium group.#SEPData in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.;Browse;https://sla.talkbank.org/TBB/dementia;;;CLARIN
+RHDBank;https://rhd.talkbank.org/;English, Spanish;30 MB transcripts, 28 GB media;CHAT and CA/CHAT;email request for access;This is a corpus of multimedia interactions for the study of communication in people with Right Hemisphere Damage (RHD).#SEPAccess to the data in RHDBank is password protected and restricted to members of the RHDBank consortium group.#SEPData in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.;Browse;https://sla.talkbank.org/TBB/rhd;;;CLARIN
+DemCorpus-Basilicata: Dementia Corpus;http://hdl.handle.net/20.500.11752/OPEN-989;Italian;08:50 hours;;Processed data available by request;This corpus consists of semi-spontaneous speech data produced by elderly residents of the Basilicata region in Italy.#SEPIn total, 40 individuals participated: the patient group consists of 20 participants with a diagnosis of dementia (9 cases of Alzheimer’s disease, 2 patients with mixed dementia, 5 patients with not-further-specified dementia, 3 patients with vascular dementia, and 1 patient with frontotemporal dementia).#SEPThe control group consists of 20 healthy individuals matched for age, gender, and geographical origin. Three linguistic tasks were administered to all participants: two narrative tasks (the first one was about an excursion or a trip, and the second was about Christmas festivities), and an image description task. This resulted in 8 hours and 50 minutes of recorded semi-spontaneous speech, which was then transcribed, segmented, and annotated using ELAN.;;;Martinelli et al. (2022);http://hdl.handle.net/20.500.11752/OPEN-989;CLARIN
+ItaASD: Italian speech corpus Autism Spectrum Disorder;http://hdl.handle.net/20.500.11752/OPEN-990;Italian;04:19 hours;Orthographic;;This is a corpus of semi-spontaneous speech produced by 34 children between 6 and 13 years of age, residents in the Campania region of Italy.#SEPHalf of the participating children were diagnosed with high-functioning Autism Spectrum Disorder, and the other half were neurotypical children matched for age, gender, and geographical origin.#SEPAll participants were administered three tasks: a complex image description task, a story-telling task, and a story-retelling task. This resulted in 4 hours and 19 minutes of recorded speech, which were then transcribed and annotated using ELAN.;;;;;CLARIN
+OPLON: Opportunities for active and healthy LONgevity;http://hdl.handle.net/20.500.11752/ILC-992;Italian;06:50 hours;;;This corpus consists of semi-spontaneous speech data collected from 96 elderly participants who were divided into two groups: the pathological and the control group.#SEPThe pathological group refers to three categories: (i) 16 participants with amnestic Mild Cognitive Impairment (MCI), (ii) 16 participants with multiple-domain MCI, and (iii) 16 participants with Early Dementia (probable Alzheimer Dementia, Fronto-Temporal Dementia, Mixed Dementia, and Lewy Body Dementia).#SEPThe control group includes 48 healthy individuals matched for gender, age, educational level, and geographical origin. The corpus was subjected to PoS Tagging and Dependency Parsing (CoNLL format).;;;;;CLARIN
+Polish Cued Speech Corpus of Hearing-Impaired Children;https://hdl.handle.net/1839/dbcd8568-d17d-4861-94bb-aa553e943399;Polish;20 children (11 girls and 9 boys);CHAT format;open access or through email request for access;This is a corpus of recordings of the DIA (Dutch Intelligibility Assessment).#SEPThe corpus also contains a variety of other samples like reading passages, isolated sentences and recordings of spontaneous speech.#SEPThe corpus contains samples of 187 speakers with a speech disorder and samples of 122 speakers without a speech disorder.;Download;https://hdl.handle.net/1839/dbcd8568-d17d-4861-94bb-aa553e943399;;;CLARIN
diff --git a/rfhg/static/resource_families/Corpora/Disordered speech corpora/2-Other corpora of disordered speech/1-CSD.csv b/rfhg/static/resource_families/Corpora/Disordered speech corpora/2-Other corpora of disordered speech/1-CSD.csv
new file mode 100644
index 0000000..0a461de
--- /dev/null
+++ b/rfhg/static/resource_families/Corpora/Disordered speech corpora/2-Other corpora of disordered speech/1-CSD.csv
@@ -0,0 +1,16 @@
+Corpus;Corpus_URL;Language;Size;Annotation;Licence;Description;Buttons;Buttons_URL;Publication;Publication_URL;Note
+Perceptual Voice Qualities Database;https://data.mendeley.com/datasets/9dz247gnyb/4;English;296 audio files of varying sizes;;CC 4.0;This corpus contains voice samples which have been rated by experienced voice professionals (at least 3 different raters with a minimum of 2 years’ clinical experience) in order to provide educators with standardized materials to better train pre-service clinical voice professionals.;Browse or download;https://data.mendeley.com/datasets/9dz247gnyb/4;Kempster (2007);https://pubs.asha.org/doi/10.1044/vvd17.2.11;Other
+TORGO;http://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html;English;The original TORGO database contains 18 GB of data;;CC-BY;This is a corpus of dysarthric articulation and consists of aligned acoustics and measured 3D articulatory features from speakers with either cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS), which are two of the most prevalent causes of speech disability, and matched controls.#SEPThis dataset contains 2000 samples for dysarthric males, dysarthric females, non-dysarthric males, and non-dysarthric females.;Browse;http://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html;Rudzicz et al. (2012);https://doi.org/10.1007/s10579-011-9145-0;Other
+University College London Archive of Stuttered Speech (UCLASS);https://www.uclass.psychol.ucl.ac.uk/;English;56 files;None;open access;"This corpus consists of data from a study by Howell, Davis, Bartrip, and Wormald (2004).#SEPThe study looked at the fluency-enhancing effects of speaking at the same time as a frequency-shifted version of the voice.#SEPThere were 14 speakers and four recordings per speaker, making 56 files in all. Recordings are in SFS format.#SEPThe four recordings for a speaker were for two texts and two readings of each text.";Download;https://www.uclass.psychol.ucl.ac.uk/uclassfsf.htm;Howell et al. (2004);https://www.uclass.psychol.ucl.ac.uk/Release2/hdbw.pdf;Other
+Speech Exemplar and Evaluation Database (SEED);https://osf.io/ygc8n/;English (American);;;Access by registration;This corpus includes recordings of single words and continuous speech samples that provide examples of speakers with and without speech disorders.;Browse;https://osf.io/ygc8n/;Atkins et al. (2020);https://www.tandfonline.com/doi/full/10.1080/02699206.2020.1743761;Other
+STAR Child speech-error database;https://www.seeingspeech.ac.uk/speechstar/child-speech-error-database/;English (Scottish);162 audio files;orthographic, phonemic, phonetic;CC BY-NC-ND;This is a collection of multiple audio-articulatory speech-disorder corpora.#SEPThe corpus consists of composite videos containing (i) midsagittal tongue movement, imaged with ultrasound tongue imaging (UTI), (ii) optional profile lip movement, recorded with a headset-mounted camera, and (iii) synchronised audio.#SEPRecordings in this database are of single words, or short phrases, produced by child speakers who were either reading orthographic stimuli from a screen, naming pictures, or repeating words produced by a researcher. Phonemic transcriptions are provided in order that those who are not familiar with the (rhotic) central Scottish accent can be aware of the speech sound targets.;Browse;https://www.seeingspeech.ac.uk/speechstar/child-speech-error-database/?type=errorType&;Lawson et al. (2023);https://guarant.cz/icphs2023/236.pdf;Other
+STAR Disordered child-speech sentences database;https://www.seeingspeech.ac.uk/speechstar/disordered-child-speech-sentences-database/;English (Scottish);18 speakers;orthographic, phonemic, phonetic;CC BY-NC-ND;This is a collection of multiple audio-articulatory speech-disorder corpora.#SEPDatabase items are composite videos containing (i) midsagittal tongue movement, imaged with ultrasound tongue imaging (UTI), (ii) optional profile lip movement, recorded with a headset-mounted camera, and (iii) synchronised audio.#SEPRecordings in this database are of sentences produced by child speakers (aged 6,1-13,4) who were either reading orthographic stimuli from a screen, or repeating sentences produced by a researcher. Diagnoses are based on clinicians' reports.;Browse;https://www.seeingspeech.ac.uk/speechstar/disordered-child-speech-sentences-database/;Lawson et al. (2023);https://guarant.cz/icphs2023/236.pdf;Other
+The Cleft Dataset;https://ultrasuite.github.io/data/cleft/;English (Scottish);11 speakers;Orthographic, phonetic;open access;This is a corpus of ultrasound and audio recordings of children with cleft lip and palate.;Download;https://ultrasuite.github.io/download/;Cleland et al. (2020);https://doi.org/10.1159/000499753;Other
+Ultraphonix;https://ultrasuite.github.io/data/uxssd/;English (Scottish);19 hours;Orthographic, phonetic;open access;This is a corpus of ultrasound and audio recordings from children with speech sound disorders. It contains data from 20 speakers (16 male, 4 female), aged 6-13 years.;Download;https://ultrasuite.github.io/download/;Eshky et al. (2018);https://doi.org/10.48550/arXiv.1907.00835;Other
+Ultrax 2020 Dataset;https://ultrasuite.github.io/data/ux2020/;English (Scottish);37 speakers;Orthographic, phonetic;open access;This is a corpus of ultrasound tongue imaging and audio data, gathered from children with speech sound disorders by speech and language therapists in hospital environments.#SEPIt contains data from 11 female and 26 male speakers, aged 5-12 years. There is one recording per child.#SEPThe following metadata are available for each recording: speech waveform, raw ultrasound data, ultrasound parameters, and prompt text with date/time of utterance recording.;Download;https://ultrasuite.github.io/download/;Eshky et al. (2018);https://doi.org/10.48550/arXiv.1907.00835;Other
+Ultrax Speech Sound Disorders;https://ultrasuite.github.io/data/uxssd/;English (Scottish);11 hours;Orthographic, phonetic;open access;This is a corpus of ultrasound and audio recordings from children with speech sound disorders.#SEPIt contains data from 8 speakers (2 female and 6 male), aged 5-10 years.;Download;https://ultrasuite.github.io/download/;Eshky et al. (2018);https://doi.org/10.48550/arXiv.1907.00835;Other
+Phonological Development Tools and Cross-Linguistic Phonology Project;https://phonodevelopment.sites.olt.ubc.ca/;English, French, Spanish, Mandarin, Cantonese, Slovenian;4 speakers for transcription resource;Phonemic and phonetic transcription;CC 4.0 Non-commercial;This corpus is used for investigating phonological development across languages, and to evaluate intervention outcomes given a nonlinear phonological approach and ultrasound intervention outcomes across speech disorders.;Browse;https://phonodevelopment.sites.olt.ubc.ca/;;;Other
+Plan-V Aphasia Corpus;https://planv-project.gr/;Greek (Modern);1.84 MB;Sentence, utterance, clause, POS;CC-BY 4.0;This corpus contains spoken discourse data collected from Greek-speaking People with Aphasia (PWA) and from neurotypical adults.;Download;https://inventory.clarin.gr/corpus/1284;Stamouli et al. (2023);https://doi.org/10.3389/fcomm.2023.919617;Other
+EWA DB Early Warning of Alzheimers speech database;https://catalogue.elra.info/en-us/repository/browse/ELRA-S0489/;Slovak;150 hours;;Non-commercial and commercial options;This corpus contains data from 3 clinical groups: Alzheimer's disease, Parkinson's disease, mild cognitive impairment, and a control group of healthy subjects.#SEPSpeech samples of each clinical group were obtained using the EWA smartphone application, which contains 4 different language tasks: sustained vowel phonation, diadochokinesis, object and action naming (30 objects and 30 actions), and picture description (two single pictures and three complex pictures).;;;;;Other
+Ahoslabi-esophageal speech database;https://catalog.elra.info/en-us/repository/browse/ELRA-S0413/;Spanish, Castilian;10.8 hours;;Non Commercial Use - ELRA END USER;This corpus primarily consists of recordings of 31 laryngectomees (27 males and 4 females) pronouncing 100 phonetically balanced sentences.#SEPEsophageal voices were recorded in a soundproof recording cubicle with a Neumann microphone.#SEPThe corpus also includes parallel recordings of the sentences by 9 healthy speakers (6 males and 3 females) to facilitate speech processing tasks that require small parallel corpora, such as voice conversion or synthetic speech adaptation. Apart from the sentences, the database also contains 4 sustained vowels and a small set of isolated words (14) which can be very valuable for research on esophageal speech analysis, diagnosis and evaluation.;;;Serrano García (2021);https://doi.org/10.1016/j.csl.2020.101168;Other
+The SSNCE Database of Tamil Dysarthric Speech;https://catalog.ldc.upenn.edu/LDC2021S04;Tamil;30 speakers;phonetic;"LDC";This is a corpus of Tamil Dysarthric Speech.#SEPThe corpus contains approximately eight hours of Tamil speech data, time-aligned transcripts and metadata collected from 30 speakers (20 dysarthric speakers and 10 non-dysarthric speakers).#SEPThe non-dysarthric speakers consisted of five female and five male subjects. The dysarthric speakers (7 female, 13 male) reported a diagnosis of cerebral palsy and ranged in age from 12 years old to 37 years old.#SEPIn total, each speaker recorded 365 utterances consisting of single words and of sentences that included a combination of common and uncommon Tamil phrases.#SEPThe corpus includes time-aligned phonetic transcripts for all collected speech data. Additional documentation includes phoneme mappings and speaker metadata. Audio data is presented as 16-bit 16kHz FLAC compressed linear pcm wav. Transcripts are presented as UTF-8 encoded plain text.;Download;https://catalog.ldc.upenn.edu/LDC2021S04;;;Other