Commit

cleaned up missed SEPs
kreetrapper committed Oct 2, 2024
1 parent 28a7ba3 commit 1960adc
Showing 10 changed files with 10 additions and 10 deletions.
2 changes: 1 addition & 1 deletion corpora/academic-corpora/acl-anth.json
@@ -2,7 +2,7 @@
"Name": "ACL Anthology Reference Corpus",
"URL": "https://hdl.handle.net/10.35111/rfeg-z495",
"Family": "Academic corpora",
"Description": "This corpus contains research papers in computational linguistics published between 1979 and 2015. The corpus data are in the XML format.#SEPThe corpus is available for online querying through the Sketch Engine (log-in required) and for download from a dedicated website.",
"Description": "This corpus contains research papers in computational linguistics published between 1979 and 2015. The corpus data are in the XML format.\nThe corpus is available for online querying through the Sketch Engine (log-in required) and for download from a dedicated website.",
"Languages": ["eng"],
"License": "CC BY SA",
"Size": ["75 million tokens"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/czec-soc.json
@@ -2,7 +2,7 @@
"Name": "Czech Sociological Review",
"URL": "https://hdl.handle.net/11372/LRT-2703",
"Family": "Academic corpora",
"Description": "This corpus contains research papers in sociology published between 1993 and 2016. The corpus data are in the TSV format.#SEPThe corpus is available for download from the LINDAT repository.",
"Description": "This corpus contains research papers in sociology published between 1993 and 2016. The corpus data are in the TSV format.\nThe corpus is available for download from the LINDAT repository.",
"Languages": ["ces"],
"License": "MIT",
"Size": ["3 million words"],
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/monitor-slo-trendi.json
@@ -12,5 +12,5 @@
"Concordancer (noSketchEngine)": "https://www.clarin.si/ske/#dashboard?corpname=trendi",
"Concordancer(KonText)": "https://www.clarin.si/kontext/query?corpname=trendi"
},
"Publication":"Kosem (2022)#SEPKosem et al. (2022)"
"Publication": ["Kosem (2022)", "Kosem et al. (2022)"]
}
2 changes: 1 addition & 1 deletion corpora/corpora-of-disordered-speech/aphasiabank.json
@@ -2,7 +2,7 @@
"Name": "AphasiaBank",
"URL": "https://aphasia.talkbank.org/",
"Family": "Corpora of Disordered Speech",
"Description": "This is a corpus of multimedia interactions for the study of communication in aphasia.#SEP Access to the data in AphasiaBank is password protected and restricted to members of the AphasiaBank consortium group.\nData in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.",
"Description": "This is a corpus of multimedia interactions for the study of communication in aphasia.\n Access to the data in AphasiaBank is password protected and restricted to members of the AphasiaBank consortium group.\nData in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.",
"Languages": ["yue", "hrv", "eng", "fra", "deu", "ell", "hun", "ita", "jpn", "cmn", "ron", "spa"],
"License": "email request for access",
"Size": ["380 MB transcripts", "827 GB media"],
2 changes: 1 addition & 1 deletion corpora/corpora-of-disordered-speech/raput.json
@@ -2,7 +2,7 @@
"Name": "Croatian corpus of non-professional written language by typical speakers and speakers with language disorders RAPUT 1.0",
"URL": "http://hdl.handle.net/11356/1435",
"Family": "Corpora of Disordered Speech",
"Description": "The corpus consists of texts produced by nonprofessional typical speakers and speakers with different language disorders (developmental language disorder, dyslexia, traumatic brain injury, aphasia, other).#SEPRoughly half of the corpus consists of texts of typical speakers, and the other half of speakers with language disorders.\nLanguage samples were elicited by six groups of tasks representing different writing styles (descriptive, expository, narrative, and letter) and different levels of formality.",
"Description": "The corpus consists of texts produced by nonprofessional typical speakers and speakers with different language disorders (developmental language disorder, dyslexia, traumatic brain injury, aphasia, other).\nRoughly half of the corpus consists of texts of typical speakers, and the other half of speakers with language disorders.\nLanguage samples were elicited by six groups of tasks representing different writing styles (descriptive, expository, narrative, and letter) and different levels of formality.",
"Languages": ["hrv"],
"License": "CC-BY-SA 4.0",
"Size": ["6760 texts", "34469 sentences", "426187 tokens"],
2 changes: 1 addition & 1 deletion corpora/historical-corpora/gysseling.json
@@ -2,7 +2,7 @@
"Name": "Corpus Gysseling",
"URL": "http://hdl.handle.net/10032/tm-a2-j4",
"Family": "Historical corpora",
"Description": "This corpus contains texts from the 13th century.#SEPThe texts were prepared and originally published in the 1970s and 1980s by the Ghent linguist <a href=\"https://en.wikipedia.org/wiki/Maurits_Gysseling\">Maurits Gysseling</a>.\nThe corpus is available for download from the Instituut voor de Nederlandse Taal and through a dedicated concordancer.",
"Description": "This corpus contains texts from the 13th century.\nThe texts were prepared and originally published in the 1970s and 1980s by the Ghent linguist <a href=\"https://en.wikipedia.org/wiki/Maurits_Gysseling\">Maurits Gysseling</a>.\nThe corpus is available for download from the Instituut voor de Nederlandse Taal and through a dedicated concordancer.",
"Languages": ["nld"],
"License": "INT Licence for researchers",
"Size": ["1.5 million words"],
2 changes: 1 addition & 1 deletion corpora/manually-annotated-corpora/rsdo-def.json
@@ -12,5 +12,5 @@
"Access": {
"Download": "http://hdl.handle.net/11356/1841"
},
"Publication": "Tran et al. (2023)#SEPPollak (2014)"
"Publication": ["Tran et al. (2023)", "Pollak (2014)"]
}
2 changes: 1 addition & 1 deletion corpora/manually-annotated-corpora/uni-dep.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion corpora/parallel-corpora/pages.json
@@ -2,7 +2,7 @@
"Name": "PaGeS",
"URL": "https://www.corpuspages.eu/corpus/about/about",
"Family": "Parallel corpora",
"Description": "This corpus is comprised of two major parts: the core corpus and the supplements.#SEPThe core corpus is comprised of original texts in German and Spanish and their respective translations, as well as a small percentage (approx. 6%) of German and Spanish texts translated from a third language. The core corpus includes samples from 178 works of fiction (novels and short stories) as well as samples from non-fiction (essays and popular texts).\nThe text have been manually verified at different levels and the automatic alignment of the bisegments, performed by <a href=\"https://sourceforge.net/projects/aligner/\">LF-Aligner</a>, has been manually reviewed. The German texts have been lemmatized and PoS-tagged with <a href=\"http://hdl.handle.net/11022/1007-0000-0000-8E4D-B\">Treetagger</a> (part of the <a href=\"https://www.clarin.eu/resource-families/tools-part-speech-tagging-and-lemmatization\">PoS taggers and lemmatizers Resource Family</a>) and the Spanish texts with <a href=\"https://nlp.lsi.upc.edu/freeling/node/1\">Freeling</a>. The tags of both have been mapped to the Universal PoS tags.\nThe supplements include so far: <a href=\"https://www.statmt.org/europarl/\">Europarl v7</a>, a corpus that collects the proceedings (Verbatim reports) of the European Parliament from 1996 to 2011 (also part of the <a href=\"https://www.clarin.eu/resource-families/parliamentary-corpora\">Parliamentary Corpora Resource Family</a>); and Ted-Talks (part of this family), a corpus that collects the German and Spanish translations of the transcriptions of Ted-Talks from 2006 to 2020.\nThe corpus is available for online browsing via a dedicated interface.",
"Description": "This corpus is comprised of two major parts: the core corpus and the supplements.\nThe core corpus is comprised of original texts in German and Spanish and their respective translations, as well as a small percentage (approx. 6%) of German and Spanish texts translated from a third language. The core corpus includes samples from 178 works of fiction (novels and short stories) as well as samples from non-fiction (essays and popular texts).\nThe text have been manually verified at different levels and the automatic alignment of the bisegments, performed by <a href=\"https://sourceforge.net/projects/aligner/\">LF-Aligner</a>, has been manually reviewed. The German texts have been lemmatized and PoS-tagged with <a href=\"http://hdl.handle.net/11022/1007-0000-0000-8E4D-B\">Treetagger</a> (part of the <a href=\"https://www.clarin.eu/resource-families/tools-part-speech-tagging-and-lemmatization\">PoS taggers and lemmatizers Resource Family</a>) and the Spanish texts with <a href=\"https://nlp.lsi.upc.edu/freeling/node/1\">Freeling</a>. The tags of both have been mapped to the Universal PoS tags.\nThe supplements include so far: <a href=\"https://www.statmt.org/europarl/\">Europarl v7</a>, a corpus that collects the proceedings (Verbatim reports) of the European Parliament from 1996 to 2011 (also part of the <a href=\"https://www.clarin.eu/resource-families/parliamentary-corpora\">Parliamentary Corpora Resource Family</a>); and Ted-Talks (part of this family), a corpus that collects the German and Spanish translations of the transcriptions of Ted-Talks from 2006 to 2020.\nThe corpus is available for online browsing via a dedicated interface.",
"Languages": ["German-Spanish"],
"License": "<a href=\"https://www.corpuspages.eu/corpus/about/privacyterms?lang=en\">Terms of Use</a>",
"Size": ["Main part: 38 million tokens; 1.1 million bisegments (alignments). Supplements: 80 million tokens"],
2 changes: 1 addition & 1 deletion corpora/sign-language-resources/bsl-lexicon.json
@@ -2,7 +2,7 @@
"Name": "BSL Lexicon CN",
"URL": "https://hdl.handle.net/1839/00-0000-0000-0008-1768-5",
"Family": "Sign language resources",
"Description": "This lexicon was derived from the <a href=\"https://bslcorpusproject.org/\">British Sign Language Corpus</a> and is part of the <a href=\"https://hdl.handle.net/1839/00-0000-0000-0001-494E-3\">ECHO</a> case study on sign languages.#SEPThe lexicon is available for download from the MPI Language Archive.",
"Description": "This lexicon was derived from the <a href=\"https://bslcorpusproject.org/\">British Sign Language Corpus</a> and is part of the <a href=\"https://hdl.handle.net/1839/00-0000-0000-0001-494E-3\">ECHO</a> case study on sign languages.\nThe lexicon is available for download from the MPI Language Archive.",
"Languages": ["British Sign Language (BSL)"],
"License": "Public",
"Size": [],
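
The pattern visible in this commit (a leftover `#SEP` marker becomes `\n` inside `Description`, and a `#SEP`-joined `Publication` string becomes a JSON list) could also be applied programmatically. The following is a minimal Python sketch under that assumption, not the script used for this commit; the function name `clean_record` and the sample record are hypothetical.

```python
import json

SEP = "#SEP"  # leftover separator marker seen in the old lines of this diff


def clean_record(record: dict) -> dict:
    """Return a shallow copy of `record` with '#SEP' markers resolved.

    Assumed rules, inferred from the diff only:
      - in "Description", '#SEP' is replaced by a newline;
      - a "Publication" string containing '#SEP' is split into a list.
    """
    cleaned = dict(record)

    description = cleaned.get("Description")
    if isinstance(description, str):
        cleaned["Description"] = description.replace(SEP, "\n")

    publication = cleaned.get("Publication")
    if isinstance(publication, str) and SEP in publication:
        cleaned["Publication"] = publication.split(SEP)

    return cleaned


# Hypothetical sample record; the Publication value mirrors monitor-slo-trendi.json.
sample = {
    "Description": "First sentence.#SEPSecond sentence.",
    "Publication": "Kosem (2022)#SEPKosem et al. (2022)",
}
result = clean_record(sample)
print(json.dumps(result, indent=2))
```

A literal replacement keeps any whitespace that followed the marker, which matches the `\n Access` result in aphasiabank.json.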
