Commit

reverted all fields to strings to mirror CSVs
kreetrapper committed Sep 6, 2024
1 parent 552e249 commit ef1d1dd
Showing 29 changed files with 31 additions and 31 deletions.
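Most License hunks in this commit replace a Creative Commons URL with a short string label. The following is a minimal Python sketch of that mapping, built only from the URL and label pairs visible in the hunks of this commit. The names `URL_TO_LABEL` and `license_label` are hypothetical and are not part of the repository.

```python
# URL -> label pairs as they appear in this commit's License hunks.
# The dictionary and function names are illustrative assumptions.
URL_TO_LABEL = {
    "https://creativecommons.org/licenses/by/4.0/": "CC-BY",
    "https://creativecommons.org/licenses/by-nc/4.0/": "CC-BY-NC",
    "https://creativecommons.org/licenses/by-sa/4.0/": "CC-BY-SA",
    "https://creativecommons.org/licenses/by-nc-sa/4.0/": "CC-BY-NC-SA 4.0",
    "https://creativecommons.org/publicdomain/zero/1.0/": "CC0 No Rights Reserved",
}


def license_label(value: str) -> str:
    """Return the short label for a known license URL, else the value unchanged."""
    return URL_TO_LABEL.get(value, value)
```

Values that are already plain labels, such as `CLARIN ACA`, pass through unchanged, so the function can be applied to every `License` field without special cases.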
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/comere.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus contains e-mails, forum posts, online chats, tweets and SMS.\nThe corpus is available for download from Ortolang.",
"Languages": ["fra"],
"License": "https://creativecommons.org/licenses/by/4.0/",
"License": "CC-BY",
"Size": ["80 million tokens"],
"Annotation": ["tokenised", "mostly untagged"],
"Infrastructure": "CLARIN",
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/contemp-blogs.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus contains blog posts.\nThe corpus is available for download from LINDAT.",
"Languages": ["ces"],
"License": "https://creativecommons.org/licenses/by/4.0/",
"License": "CC-BY",
"Size": ["1 million tokens"],
"Annotation": ["tokenised", "<a href=\"https://nlp.fi.muni.cz/projekty/cocb/\">sentence tagged</a>"],
"Infrastructure": "CLARIN",
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/didi.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus consists of Facebook posts gathered from 136 Facebook users from South Tyrol. All texts are anonymised.\nThe corpus is available for download from the EURAC Research CLARIN repository.",
"Languages": ["deu","ita","eng","lad"],
"License": "https://gitlab.inf.unibz.it/commul/var/eurac-licenses/-/raw/v1.0/EULA-CLARIN-ACA-BY-NC-NORED.md",
"License": "ACA-BY-NC-NORED 1.0",
"Size": ["600,000 tokens"],
"Annotation": ["tokenised", "PoS-tagged", "lemmatised"],
"Infrastructure": "CLARIN",
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/do-chat.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus contains online chats from 2000 to 2006\nThe corpus is available for download from the repository of CLARIN-D",
"Languages": ["deu"],
"License": "https://creativecommons.org/licenses/by/4.0/",
"License": "CC-BY",
"Size": ["1 million tokens"],
"Annotation": ["tokenised", "PoS-tagged", "lemmatised"],
"Infrastructure": "CLARIN",
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/ebay-petit.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus contains eBay listings from 2005, 2017, and 2018. The corpus is manually annotated.\nThe corpus is available for download from a dedicated webpage.",
"Languages": ["fra"],
"License": "https://creativecommons.org/licenses/by-nc-sa/4.0/",
"License": "CC-BY-NC-SA 4.0",
"Size": ["100,000 tokens"],
"Annotation": ["<a href=\"https://www.uni-potsdam.de/langage/la-bank/ebay.php\">see here</a>"],
"Infrastructure": "Other",
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/global-web-en.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus contains texts from web-pages in United States, Great Britain, Australia, India, and 16 other countries. About 60% of the texts come from blogs.\nThe corpus is available for download from META-SHARE (the Finnish Language Bank) and for online browsing through the concordancer Korp.",
"Languages": ["eng"],
"License": "FIXME CLARIN RES (download); CLARIN ACA (online)",
"License": "CLARIN RES (download); CLARIN ACA (online)",
"Size": ["1.8 billion words", "1.8 million texts"],
"Annotation": "",
"Infrastructure": "CLARIN",
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/hs-fi-news.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus contains the domestic news of the <a href=\"https://www.hs.fi/\">Helsingin Sanomat</a> website and their comments from 5 September 2011 to 4 September 2012.\nThe corpus has been syntactically parsed using TDT alpha.\nThe corpus is available for download from META-SHARE (the Finnish Language Bank) and for online browsing through the concordancer Korp.",
"Languages": ["fin"],
"License": "FIXME CLARIN ACA – NC",
"License": "CLARIN ACA – NC",
"Size": ["8 million tokens", "593,760 sentences", "93,602 texts"],
"Annotation": ["PoS-tagged", "lemmatised", "syntactically parsed"],
"Infrastructure": "CLARIN",
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/janes-blog.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus contains blog posts from RTV Slovenija and Publishwall.\nThe corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText",
"Languages": ["slv"],
"License": "https://creativecommons.org/licenses/by/4.0/",
"License": "CC-BY",
"Size": ["34 million tokens"],
"Annotation": ["tokenised", "sentence segmented", "MSD-tagged", "lemmatised"],
"Infrastructure": "CLARIN",
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/janes-forum.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus contains forum posts from Avtomobilizem.com, MedOver.net and RTV Slovenija.\nThe corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText.",
"Languages": ["slv"],
"License": "https://creativecommons.org/licenses/by/4.0/",
"License": "CC-BY",
"Size": ["47 million tokens"],
"Annotation": ["tokenised", "sentence segmented", "MSD-tagged", "lemmatised"],
"Infrastructure": "CLARIN",
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/janes-news.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus contains news comments from RTV Slovenija, Mladina and Reporter.\nThe corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText.",
"Languages": ["slv"],
"License": "https://creativecommons.org/licenses/by/4.0/",
"License": "CC-BY",
"Size": ["14 million tokens"],
"Annotation": ["tokenised", "sentence segmented", "MSD-tagged", "lemmatised"],
"Infrastructure": "CLARIN",
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/janes-tweet.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus contains tweets written by Slovenian Twitter users from 2013 to 2017.\nThe corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText.",
"Languages": ["slv"],
"License": "https://creativecommons.org/licenses/by/4.0/",
"License": "CC-BY",
"Size": ["139 million tokens"],
"Annotation": ["tokenised", "sentence segmented", "MSD-tagged", "lemmatised"],
"Infrastructure": "CLARIN",
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/janes-wiki.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus contains Slovenian Wikipedia user and talk pages.\nThe corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText.",
"Languages": ["slv"],
"License": "https://creativecommons.org/licenses/by/4.0/",
"License": "CC-BY",
"Size": ["5 million tokens"],
"Annotation": ["tokenised", "sentence segmented", "MSD-tagged", "lemmatised"],
"Infrastructure": "CLARIN",
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/litis.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus contains forum posts from portals delfi.lt and lrytas.lt from 2010 to 2014.\nThe corpus is available for download from the CLARIN-LT repository.",
"Languages": ["lit"],
"License": "FIXME CLARIN_ACA",
"License": "CLARIN_ACA",
"Size": ["190,000 comments"],
"Annotation": "",
"Infrastructure": "CLARIN",
4 changes: 2 additions & 2 deletions corpora/cmc-corpora/macocu.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "These corpora are a collection containing web texts and were built by crawling national internet top-level domains (specified below) and by extending the crawl dynamically to other domains as well. The crawler is available at <a href=\"https://github.com/macocu/MaCoCu-crawler\">MaCoCu GitHub channel</a>. Considerable effort was devoted into cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing <a href=\"https://corpus.tools/wiki/Justext\">boilerplate</a> and <a href=\"https://corpus.tools/wiki/Onion\">near-duplicated paragraphs</a>, discarding very short texts as well as texts that are not in the target language. Furthermore, samples from the largest 1,500 domains were manually checked and bad domains, such as machine-translated domains, were removed.\nThe dataset is characterized by extensive metadata which allows filtering the dataset based on text quality and <a href=\"https://github.com/bitextor/monotextor\">other criteria</a>, making the corpus highly useful for corpus linguistics studies, as well as for training language models and other language technologies. In XML format, each document is accompanied by the following metadata: title, crawl date, url, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. 
The text of each document is divided into paragraphs that are accompanied by metadata on the information whether a paragraph is a heading or not, metadata on the paragraph quality (labels, such as \"short\" or \"good\", assigned based on paragraph length, URL and stopword density via the <a href=\"https://corpus.tools/wiki/Justext\">jusText tool</a>) and fluency (score between 0 and 1, assigned with the <a href=\"https://github.com/bitextor/monocleaner\">Monocleaner tool</a>), the automatically identified language of the text in the paragraph, and information whether the paragraph contains sensitive information (identified via the <a href=\"https://github.com/bitextor/biroamer\">Biroamer tool</a>). As opposed to the previous version in the case of corpora in version 2.0, this version has more accurate metadata on languages of the texts, which was achieved by using <a href=\"https://github.com/CLD2Owners/cld2\">Google's Compact Language Detector 2 (CLD2)</a>, a high-performance language detector supporting many languages. Other tools, used for web corpora creation and curation, have been updated as well, resulting in an even cleaner, as well as larger corpus.\nThe corpus is available for download from the Slovenian repository CLARIN.SI and can be easily read with the <a href=\"https://pypi.org/project/prevert/\">prevert parser</a>.",
"Languages": ["sqi","bos","bul","cat","hrv","ell","isl","mkd","mlt","cnr","srp","tur","ukr","slv"],
"License": "https://creativecommons.org/publicdomain/zero/1.0/",
"License": "CC0 No Rights Reserved",
"Size": "",
"Annotation": ["annotated with extensive metadata"],
"Infrastructure": "CLARIN",
@@ -24,5 +24,5 @@
"Download (Turkish)": "http://hdl.handle.net/11356/1802",
"Download (Ukrainian)": "http://hdl.handle.net/11356/1838"
},
"Publication": "FIXME Bañón et al. (2022)"
"Publication": "Bañón et al. (2022)"
}
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/monitor-slo-trendi.json
@@ -12,5 +12,5 @@
"Concordancer (noSketchEngine)": "https://www.clarin.si/ske/#dashboard?corpname=trendi",
"Concordancer(KonText)": "https://www.clarin.si/kontext/query?corpname=trendi"
},
"Publication":"FIXME:Kosem (2022)#SEPKosem et al. (2022)"
"Publication":"Kosem (2022)#SEPKosem et al. (2022)"
}
4 changes: 2 additions & 2 deletions corpora/cmc-corpora/paisa.json
@@ -4,12 +4,12 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus contains approximately 380,000 documents coming from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia, approx. 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 documents come from blog services.\nThe corpus is available for download from the EURAC Research CLARIN repository.",
"Languages": ["ita"],
"License": "https://creativecommons.org/licenses/by-nc-sa/4.0/",
"License": "CC-BY-NC-SA 4.0",
"Size": ["380,000 pages", "250 million words"],
"Annotation": "",
"Infrastructure": "CLARIN",
"Access": {
"Download": "http://hdl.handle.net/20.500.12124/3"
},
"Publication": "FIXME https://aclanthology.org/W14-0406/"
"Publication": ""
}
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/pdrs.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus contains texts from the web obtained by crawling the .rs domain. Crawling has been done in September and October 2022 with BootCat. As search terms, appr. 2,800 word forms with a frequency between 5,000 and 500,000 in srWaC have been used. The texts are deduplicated, cyrillic texts have been transliterated into the Latin alphabet. The linguistic processing was done with the <a href=\"https://github.com/clarinsi/classla\">CLASSLA package</a> for tokenization, lemmatization and morpho-syntactic tagging (both MULTEXT-East and Universal Dependencies).\nIn addition, some 80% of the URLs are manually tagged for 10 different types of sources (\"area\"): media (media outlets with several posts daily), inform (topic-centered sites with infrequent posts - maximum 3 per day), company (presentations of companies), state (websites of government bodies on nationa, regional and local level), forum (forum posts), portal (topic-centered portals without daily coverage), science (scientific publications), shop (with descriptions of products), database (knowledge bases, dictionaries, databases and similar) and community (NGOs, fan clubs, associations and other). The corpus is distributed in the CoNLL-U format in batches of appr. 2x50 mio. tokens.\nThe corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through noSketchEngine and KonText concordancers.",
"Languages": ["srp"],
"License": "https://creativecommons.org/licenses/by/4.0/",
"License": "CC-BY",
"Size": ["715 million tokens"],
"Annotation": ["tokenised", "MSD-tagged (MULTEXT-East & UD)", "lemmatised", "annotated for text source"],
"Infrastructure": "CLARIN",
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/sfnet.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus contains written posts from the SFNET forum in Finnish from 2002 to 2003.\nThe PoS-tagging has been done with the FI-FDG Parser, which uses a computational implementation of Functional Dependency Grammar.\nThe corpus is available for download from META-SHARE (the Finnish Language Bank)",
"Languages": ["fin"],
"License": "FIXME CLARIN ACA – NC",
"License": "CLARIN ACA – NC",
"Size": ["100 million words"],
"Annotation": ["PoS-tagged", "sentence and word segmentation"],
"Infrastructure": "CLARIN",
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/suomi24.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "This corpus contains forum posts from the Suomi24 website from 2001 to 2016.\nThe corpus is available for download from the FIN-CLARIN repository and through the concordancer Korp.",
"Languages": ["fin"],
"License": "FIXME CLARIN ACA",
"License": "CLARIN ACA",
"Size": ["2.6 billion tokens"],
"Annotation": ["tokenised", "MSD-tagged"],
"Infrastructure": "CLARIN",
2 changes: 1 addition & 1 deletion corpora/cmc-corpora/ylilauta.json
@@ -4,7 +4,7 @@
"Family": "Computer-mediated communication corpora",
"Description": "The corpus contains text from discussions of the <a href=\"https://ylilauta.org/\">Ylilauta</a> online discussion board from 2012 to 2014.\nThe corpus has been syntactically annotated with the TDT alpha parser, while the named entities have been assigned using the FiNER tool.\nThe corpus is available for download from META-SHARE (the Finnish Language Bank) and for online browsing through the concordancer Korp.",
"Languages": ["fin"],
"License": "https://creativecommons.org/licenses/by-nc/4.0/",
"License": "CC-BY-NC",
"Size": ["26.9 million words"],
"Annotation": ["PoS-tagged", "lemmatised", "syntactically parsed", "named entities"],
"Infrastructure": "CLARIN",
2 changes: 1 addition & 1 deletion corpora/corpora-of-disordered-speech/ssnce-tamil.json
@@ -4,7 +4,7 @@
"Family": "Corpora of Disordered Speech",
"Description": "This is a corpus of Tamil Dysarthric Speech.\nThe corpus contains approximately eight hours of Tamil speech data, time-aligned transcripts and metadata collected from 30 speakers (20 dysarthric speakers and 10 non-dysarthric speakers).\nThe non-dysarthric speakers consisted of five female and five male subjects. The dysarthric speakers (7 female, 13 male) reported a diagnosis of cerebral palsy and ranged in age from 12 years old to 37 years ol.\nIn total, each speaker recorded 365 utterances consisting of single words and of sentences that included a combination of common and uncommon Tamil phrases.\nThe corpus includes time-aligned phonetic transcripts for all collected speech data. Additional documentation includes phoneme mappings and speaker metadata. Audio data is presented as 16-bit 16kHz FLAC compressed linear pcm wav. Transcripts are presented as UTF-8 encoded plain text.",
"Languages": ["tam"],
"License": "https://catalog.ldc.upenn.edu/license/the-ssnce-database-of-tamil-dysarthric-speech-agreement.pdf",
"License": "<a href=\"https://catalog.ldc.upenn.edu/license/the-ssnce-database-of-tamil-dysarthric-speech-agreement.pdf\">LDC</a>",
"Size": ["30 speakers"],
"Annotation": ["phonetic"],
"Infrastructure": "Other",
2 changes: 1 addition & 1 deletion corpora/historical-corpora/letter-sinebrychoff.json
@@ -6,7 +6,7 @@
"Languages": ["fin", "swe"],
"License": "CC-BY",
"Size": ["8.6 million words"],
"Annotation": ["FIXME Finnish subset: MSD-tagged, syntactically parsed; Swedish subset: no linguistic annotation"],
"Annotation": ["Finnish subset: MSD-tagged, syntactically parsed; Swedish subset: no linguistic annotation"],
"Infrastructure": "CLARIN",
"Access": {
"Concordancer": "http://kirjearkisto.siff.fi/Sinebrychoff/tabid/55/Default.aspx"
2 changes: 1 addition & 1 deletion corpora/reference-corpora/bnc.json
@@ -3,7 +3,7 @@
"URL": "http://hdl.handle.net/20.500.14106/2554",
"Family": "Reference corpora",
"Description": "This corpus includes English texts (fiction, magazines, newspapers, and academic writing) published between 1980 and 1993.\nThe corpus is encoded in TEI. Non-linguistic metadata include contextual and bibliographic information. Aside from written materials, the corpus also includes transcriptions of spoken language.\nThe corpus is available for online browsing through a dedicated concordancer and can be downloaded from the Oxford Text Archive (CLARIN-UK).",
"Languages": ["FIXME eng (British)"],
"Languages": ["English (British)"],
"License": "BNC User Licence (restricted for the downloadable version)",
"Size": ["100 million words"],
"Annotation": ["PoS-tagged", "lemmatized"],
2 changes: 1 addition & 1 deletion corpora/reference-corpora/conae.json
@@ -3,7 +3,7 @@
"URL": "http://urn.fi/urn:nbn:fi:lb-2019031901",
"Family": "Reference corpora",
"Description": "This corpus includes American English texts evenly divided into the spoken, fiction, magazine, newspaper, and academic genres (around 88 million words each) published between 1990 and 2012.\nThe corpus is available for download from the Finnish Language Bank as well as for online browsing through the concordancer Korp (FIN-CLARIN distribution).",
"Languages": ["FIXME eng (American)"],
"Languages": ["English (American)"],
"License": ["CLARIN ACA (online version)", "CLARIN RES (downloadable version)"],
"Size": ["440 million words", "190,000 texts"],
"Annotation": ["PoS-tagged", "lemmatized"],
2 changes: 1 addition & 1 deletion corpora/reference-corpora/dereko.json
@@ -4,7 +4,7 @@
"Family": "Reference corpora",
"Description": "This corpus includes German texts in <a href=\"https://www.ids-mannheim.de/digspra/kl/projekte/korpora/archiv-1/\">a wide variety of genres</a> published from 1947 onwards. Non-linguistic metadata include rich bibliographic information and partial layout information.\nPart of the corpus is available for download from a dedicated webpage (CLARIN-D distribution), while the entire corpus can be queried online through the COSMAS II platform.",
"Languages": ["deu"],
"License": "https://creativecommons.org/licenses/by-sa/4.0/",
"License": "CC-BY-SA",
"Size": ["31.7 billion words"],
"Annotation": ["MSD-tagged", "lemmatized"],
"Infrastructure": "CLARIN",

0 comments on commit ef1d1dd
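
Several hunks in this commit only strip a leading `FIXME` marker from a field value. As a rough sketch, assuming the corpus files sit under a directory such as `corpora/` (as the paths in this commit suggest), the following Python helper lists any lines that still carry the marker. The function name `find_fixme` is hypothetical, and the scan is a plain text search rather than a JSON parse.

```python
from pathlib import Path


def find_fixme(root: str) -> list[tuple[str, int, str]]:
    """List (path, line number, stripped line) for every line containing FIXME."""
    hits = []
    # Walk all .json files below root in a stable order.
    for path in sorted(Path(root).rglob("*.json")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if "FIXME" in line:
                hits.append((path.as_posix(), number, line.strip()))
    return hits
```

Calling `find_fixme("corpora")` from the repository root would return an empty list once no markers remain.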
