Commit

Languages -> Language
kreetrapper committed Oct 17, 2024
1 parent 9bc906f commit 24c9f1e
Showing 942 changed files with 945 additions and 945 deletions.
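
A rename of this shape (one key changed on one line in each of several hundred JSON files) is usually scripted. The following is a minimal sketch of one way to do it, not the method actually used for this commit, which the page does not show. It assumes the files live under a `corpora` directory, are UTF-8 encoded, and keep each key on its own line. The function names `rename_key` and `rename_in_tree` are hypothetical. It edits the text directly rather than round-tripping through a JSON parser, so indentation, key order, and non-ASCII characters stay byte-for-byte identical and each file shows exactly one addition and one deletion.

```python
import re
from pathlib import Path

# Match the key only at the start of a line (after optional indentation), so
# that the word "Languages" inside a description string is left untouched.
KEY_PATTERN = re.compile(r'^([ \t]*)"Languages"([ \t]*:)', flags=re.MULTILINE)


def rename_key(text: str) -> str:
    """Return text with the line-initial "Languages" key renamed to "Language"."""
    return KEY_PATTERN.sub(r'\1"Language"\2', text)


def rename_in_tree(root: Path) -> int:
    """Rewrite every *.json file under root in place; return how many changed."""
    changed = 0
    for path in sorted(root.rglob("*.json")):
        # Read and write bytes so that line endings are preserved exactly.
        original = path.read_bytes().decode("utf-8")
        updated = rename_key(original)
        if updated != original:
            path.write_bytes(updated.encode("utf-8"))
            changed += 1
    return changed
```

Calling `rename_in_tree(Path("corpora"))` from the repository root would apply the rename and return the number of files modified. Running it a second time returns 0, because no line-initial `"Languages"` key remains.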
2 changes: 1 addition & 1 deletion corpora/academic-corpora/ac-lit.json
@@ -3,7 +3,7 @@
"URL": "http://coralit.lt/en/node/18",
"Family": "Academic corpora",
"Description": "This corpus contains textbooks, scientific monographs, journal articles, abstracts, forewords, research reports, and master’s and PhD theses from the following disciplines:\n<ul><li>humanities (architecture, fine art studies, ethnology, folklore studies, philosophy, linguistics, literary theory, librarianship, history, theology),</li><li>social sciences (law, political science,\neconomics, psychology, education, management),</li><li>physical sciences (mathematics, astronomy, physics, chemistry, geography, geology and mineralogy, informatics),</li><li>biomedical sciences (medicine, dental surgery, biology, botany, agronomy, animal husbandry, pharmacy, veterinary science, forestry studies), and</li><li>technological sciences (energy studies, chemical technology, materials science, mechanics, metrology, building construction, transport technology, agricultural and\nenvironmental sciences, management and informatics).</li></ul>The materials were published between 1999 and 2009. The corpus is encoded in TEI 5.\nThe corpus is available for online querying through a dedicated website.",
- "Languages": ["lit"],
+ "Language": ["lit"],
"Licence": "",
"Size": ["9 million words"],
"Annotation": ["no linguistic annotation"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/aca-hum.json
@@ -3,7 +3,7 @@
"URL": "http://hdl.handle.net/10794/49",
"Family": "Academic corpora",
"Description": "This corpus contains academic texts from humanities disciplines published between 1997 and 2012. The corpus data are in the XML format and plain text.\nThe corpus is available for download from the SWECLARIN repository and for online querying through the concordancer Korp (SWECLARIN distribution).",
- "Languages": ["swe"],
+ "Language": ["swe"],
"Licence": "CC BY",
"Size": ["14.5 million tokens"],
"Annotation": [],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/aca-soc.json
@@ -3,7 +3,7 @@
"URL": "http://hdl.handle.net/10794/50",
"Family": "Academic corpora",
"Description": "This corpus contains academic texts from social sciences disciplines published between 1997 and 2012. The corpus data are in the XML format and plain text.\nThe corpus is available for download from the SWECLARIN repository and for online querying through the concordancer Korp (SWECLARIN distribution).",
- "Languages": ["swe"],
+ "Language": ["swe"],
"Licence": "CC BY",
"Size": ["10.8 million tokens"],
"Annotation": ["sentence segmentation"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/acl-anth.json
@@ -3,7 +3,7 @@
"URL": "https://hdl.handle.net/10.35111/rfeg-z495",
"Family": "Academic corpora",
"Description": "This corpus contains research papers in computational linguistics published between 1979 and 2015. The corpus data are in the XML format.\nThe corpus is available for online querying through the Sketch Engine (log-in required) and for download from a dedicated website.",
- "Languages": ["eng"],
+ "Language": ["eng"],
"Licence": "CC BY SA",
"Size": ["75 million tokens"],
"Annotation": ["PoS-tagged", "lemmatised", "author/text metadata"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/acnz.json
@@ -3,7 +3,7 @@
"URL": "https://www.wgtn.ac.nz/lals/resources/academicwordlist/information/corpus",
"Family": "Academic corpora",
"Description": "This corpus contains journal articles, book chapters, course workbooks, laboratory manuals, and course notes from the following disciplines: arts, commerce, law, and biology.\nThis corpus is not available.",
- "Languages": ["eng"],
+ "Language": ["eng"],
"Licence": "",
"Size": ["3.5 million words"],
"Annotation": [],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/chambers-lb.json
@@ -3,7 +3,7 @@
"URL": "http://hdl.handle.net/20.500.14106/2527",
"Family": "Academic corpora",
"Description": "This corpus contains research papers in the following disciplines:\n<ul><li>media/culture,</li><li>literature,</li><li>linguistics and language learning,</li><li>social anthropology,</li><li>law, economics,</li><li>sociology and social sciences,</li><li>philosophy,</li><li>history, and</li><li>communication.</li></ul>\nThe research papers were published between 1998 and 2006. This is a plain text corpus.\nThe corpus is available for download from the Oxford Text Archive.",
- "Languages": ["fra"],
+ "Language": ["fra"],
"Licence": "Oxford Text Archive licence (academic use)",
"Size": ["1 million words"],
"Annotation": ["No annotation"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/czec-soc.json
@@ -3,7 +3,7 @@
"URL": "https://hdl.handle.net/11372/LRT-2703",
"Family": "Academic corpora",
"Description": "This corpus contains research papers in sociology published between 1993 and 2016. The corpus data are in the TSV format.\nThe corpus is available for download from the LINDAT repository.",
- "Languages": ["ces"],
+ "Language": ["ces"],
"Licence": "MIT",
"Size": ["3 million words"],
"Annotation": [],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/eng-sci.json
@@ -3,7 +3,7 @@
"URL": "http://hdl.handle.net/11858/00-246C-0000-0023-8CF9-6",
"Family": "Academic corpora",
"Description": "This corpus contains journal articles in the following disciplines:\n<ul><li>computer science,</li><li>computational linguistics,</li><li>informatics,</li><li>digital construction,</li><li>microelectronics,</li><li>linguistics,</li><li>biology,</li><li>mechanical engineering, and</li><li>electrical engineering.</li></ul>\nThe articles were published in the 1970s, 1980s and the 2000s.\nThe corpus is available for online querying through CQPweb (CLARIN-D distribution).",
- "Languages": ["eng"],
+ "Language": ["eng"],
"Licence": "restricted",
"Size": ["35 million tokens"],
"Annotation": ["PoS-tagged", "lemmatised", "author/text metadata", "document structure"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/est-sci.json
@@ -3,7 +3,7 @@
"URL": "http://hdl.handle.net/11297/1-00-0000-0000-0000-0002-4",
"Family": "Academic corpora",
"Description": "This corpus contains scientific articles and PhD theses. The corpus data are in the P5 format.",
- "Languages": ["est"],
+ "Language": ["est"],
"Licence": "CLARIN ACA-NC",
"Size": ["5 million words"],
"Annotation": [],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/genia.json
@@ -3,7 +3,7 @@
"URL": "http://www.geniaproject.org/genia-corpus",
"Family": "Academic corpora",
"Description": "This corpus contains journal paper abstracts in biomedicine. The corpus data are in various formats, e.g., PTB.\nThe corpus is available for download from PORTULAN.",
- "Languages": ["eng"],
+ "Language": ["eng"],
"Licence": "free but unspecified",
"Size": ["437,000 words"],
"Annotation": ["PoS-tagged", "syntactically parsed", "annotated for terms, events, semantic relations and coreference", "text metadata"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/jezkor.json
@@ -3,7 +3,7 @@
"URL": "http://hdl.handle.net/11356/1755",
"Family": "Academic corpora",
"Description": "This corpus contains a collection of linguistic scientific writing in the Slovenian language. It consists of 43 monographs published between 2009 and 2022 by Fran Ramovš institute of Slovenian language and Založba ZRC, 267 papers published in the journal \"Jezikoslovni zapiski\" and 28 papers published in the journal \"Slovenski jezik\". Note that the texts were obtained directly from PDFs, so they contain various types of noise.\nThe corpus is linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla) on the levels of lemmatisation, MULTEXT-East Version 6 morphosyntactic descriptions, Universal Dependencies part-of-speech and morphological features, and named entities. It is distributed in CoNLL-U and vertical file format, one file for each text. Text metadata consists of the author(s), title and year of publication.\nThe corpus is available for download from the CLARIN.SI repository as well as for online browsing through the noSketch Engine and KonText concordancers.",
- "Languages": ["slv"],
+ "Language": ["slv"],
"Licence": "CC BY",
"Size": ["9.3 million tokens"],
"Annotation": ["PoS-tagged (UD)", "MSD-tagged (UD & MULTEXT-East)", "lemmatised", "annotated for named entities and author/text metadata"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/kas.json
@@ -3,7 +3,7 @@
"URL": "http://hdl.handle.net/11356/1448",
"Family": "Academic corpora",
"Description": "This corpus contains BA, MA, and PhD theses in humanities, social sciences, and natural sciences published between 2000 and 2018. The corpus data are in the TEI format.\nThe corpus is available for download from CLARIN.SI. Version 1.0 is also available for online querying through <a href=\"https://www.clarin.si/noske/run.cgi/corp_info?corpname=kas&struct_attr_stats=1&subcorpora=1\">noSketch Engine</a> and <a href=\"https://www.clarin.si/kontext/first_form?corpname=kas\">KonText</a> (CLARIN.SI distribution).",
- "Languages": ["slv"],
+ "Language": ["slv"],
"Licence": "CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0",
"Size": ["1.5 billion tokens"],
"Annotation": ["MSD-tagged", "lemmatised", "marked for bilingual and monolingual term candidates"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/kiap.json
@@ -3,7 +3,7 @@
"URL": "http://hdl.handle.net/11495/D989-605B-8F10-5",
"Family": "Academic corpora",
"Description": "This comparable corpus contains research articles in economics, linguistics, and medicine published between 1992 and 2003.\nThe corpus is available for online browsing through the concordancer Corpuscle (CLARINO distribution).",
- "Languages": ["eng","fra","nor"],
+ "Language": ["eng","fra","nor"],
"Licence": "CC-BY 4.0",
"Size": ["3.9 million tokens"],
"Annotation": ["PoS-tagged"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/lit-trans.json
@@ -3,7 +3,7 @@
"URL": "http://hdl.grnet.gr/11500/KEG-0000-0000-24F2-6",
"Family": "Academic corpora",
"Description": "This corpus contains journal articles in literary and translation studies. This is a plain text corpus.\nThe corpus is available for download from the CLARIN:EL repository.",
- "Languages": ["ell"],
+ "Language": ["ell"],
"Licence": "CC-BY-SA",
"Size": ["48,300 words"],
"Annotation": [],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/modern-greek.json
@@ -3,7 +3,7 @@
"URL": "http://hdl.grnet.gr/11500/KEG-0000-0000-2502-4",
"Family": "Academic corpora",
"Description": "This corpus contains scientific texts in linguistics and dialectology. This is a plain text corpus.\nThe corpus is available for download from the CLARIN:EL repository.",
- "Languages": ["ell"],
+ "Language": ["ell"],
"Licence": "CC-BY-SA",
"Size": ["113,000 words"],
"Annotation": [],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/muchmore.json
@@ -3,7 +3,7 @@
"URL": "http://muchmore.dfki.de/resources1.htm",
"Family": "Academic corpora",
"Description": "This corpus contains journal paper abstracts from medical disciplines. The corpus is encoded in MuchMore XML.\nThe corpus is available for download from a dedicated website.",
- "Languages": ["eng","deu"],
+ "Language": ["eng","deu"],
"Licence": "free but unspecified",
"Size": ["1 million tokens"],
"Annotation": ["PoS/MSD-tagged", "phrase chunking", "semantic class and relations", "document structure"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/open-slo.json
@@ -3,7 +3,7 @@
"URL": "http://hdl.handle.net/11356/1774",
"Family": "Academic corpora",
"Description": "This corpus contains a large collection of scientific writing in the Slovenian language gathered from the <a href=\"https://openscience.si\">Open Science Slovenia portal</a>. It consists of over 150 thousand monographs, articles, diploma, master's and doctoral theses, advanced textbooks, reviews etc. mostly published between 2000 and 2022 by Slovenian universities, research institutions, etc. Texts are accompanied by metadata, i.e. author, supervisor (for theses), year of publication, publisher (mostly faculties of the various universities), type of publication (according to SICRIS classification), keywords, and CERIF and UDC codes. The texts were obtained directly from PDFs, so it should be noted that they can contain various types of character noise. The texts are linguistically annotated with the <a href=\"https://github.com/clarinsi/classla\">CLASSLA pipeline</a> on the levels of lemmatisation, MULTEXT-East Version 6 morphosyntactic descriptions, Universal Dependencies part-of-speech and morphological features, and named entities. The corpus is distributed in CoNLL-U and vertical file formats, one file for each text. The text metadata is given as a TSV file.\nNote that there exist similar, but older and smaller corpora <a href=\"http://hdl.handle.net/11356/1448\">KAS 2.0</a> and <a href=\"http://hdl.handle.net/11356/1244\">KAS 1.0</a>. These contain only theses and only up to 2018, but are cleaner and with more metadata. The repository also archives a number of KAS-derived datasets; please search for \"KAS\" to find them.\nThe corpus is available for download from the CLARIN.SI repository as well as for online browsing through the noSketch Engine and KonText concordancers.",
- "Languages": ["slv"],
+ "Language": ["slv"],
"Licence": "CC BY-SA",
"Size": ["326 million tokens"],
"Annotation": ["PoS-tagged (UD)", "MSD-tagged (UD & MULTEXT-East)", "lemmatised", "annotated for named entities and author/text metadata"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/orossimo.json
@@ -3,7 +3,7 @@
"URL": "http://hdl.grnet.gr/11500/ATHENA-0000-0000-2410-5",
"Family": "Academic corpora",
"Description": "This corpus contains academic texts in the following disciplines:\n<ul><li>social sciences,</li><li>computer science,</li><li>economics,</li><li>linguistics,</li><li>photography,</li><li>law,</li><li>engineering,</li><li>history,</li><li>astronomy,</li><li>earth sciences and geology,</li><li>medicine and health, and</li><li>biology.</li></ul>\nThe corpus is encoded in XML (XCES).\nThe corpus is available for download from the CLARIN:EL repository.",
- "Languages": ["ell"],
+ "Language": ["ell"],
"Licence": "CC-BY",
"Size": ["2.5 million tokens"],
"Annotation": ["marked for term candidates", "mixed structural annotation"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/reading.json
@@ -3,7 +3,7 @@
"URL": "http://www.reading.ac.uk/internal/appling/corpus.htm",
"Family": "Academic corpora",
"Description": "This corpus contains PhD theses from the following disciplines: agriculture, psychology, food science, technology, meteorology, and history. The data are encoded in ASCII and HTML.\nThe corpus is not available because it is restricted at present to staff and researchers at the University of Reading, and it is only available 'on-site'. However, it is possible for people outside the University to make use of the corpus on a Research Attachment arrangement.",
- "Languages": ["eng"],
+ "Language": ["eng"],
"Licence": "restricted",
"Size": [],
"Annotation": [],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/roger.json
@@ -3,7 +3,7 @@
"URL": "https://roger-corpus.org/",
"Family": "Academic corpora",
"Description": "The corpus contains academic papers from eight disciplines, written by the Romanian students in native Romanian and English L2.\nThe corpus was collected over a three-year period (2018–2021) with the help of 27 collaborators from nine Romanian universities.\nThe corpus is available for online querying through a <a href=\"https://roger-corpus.org/index.php\">dedicated platform</a> developed at the <a href=\"https://codhus.projects.uvt.ro/\">CODHUS</a> research centre from the West University of Timisoara.",
- "Languages": ["eng","ron"],
+ "Language": ["eng","ron"],
"Licence": "CC BY-NC-ND",
"Size": ["3.3 million words"],
"Annotation": [],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/roysoc.json
@@ -3,7 +3,7 @@
"URL": "http://hdl.handle.net/21.11119/0000-0001-7E8B-6",
"Family": "Academic corpora",
"Description": "This corpus contains journal articles published in <a href=\"http://rstl.royalsocietypublishing.org/\">Philosophical Transactions of the Royal Society of London</a> between 1665 and 1869.\nThe corpus is available for online querying through CQPweb and for download from the CLARIN-D repository of the University of Saarland.",
- "Languages": ["English (late and early modern)"],
+ "Language": ["English (late and early modern)"],
"Licence": "CC BY",
"Size": ["32 million tokens"],
"Annotation": ["PoS-tagged", "lemmatised", "normalised", "author and document metadata"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/scientext.json
@@ -3,7 +3,7 @@
"URL": "https://scientext.hypotheses.org/corpus",
"Family": "Academic corpora",
"Description": "This corpus contains scientific texts and argumentative essays in humanities, experimental sciences, and applied/technical sciences.\nThe corpus is available for online querying through a dedicated webpage.",
- "Languages": ["fra","eng"],
+ "Language": ["fra","eng"],
"Licence": "CC BY",
"Size": ["20 million words"],
"Annotation": [],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/span-eng.json
@@ -3,7 +3,7 @@
"URL": "https://books.google.si/books?id=NZbWCgAAQBAJ&pg=PA178&lpg=PA178&dq=serac+corpus&source=bl&ots=A7F-vUMJsr&sig=ACfU3U1b8W_r944Bs8OviL9xauHtUoeqVg&hl=sl&sa=X&ved=2ahUKEwiRuq_5nczmAhXT5KYKHWUtBlcQ6AEwAHoECAUQAQ#v=onepage&q=serac%20corpus&f=false",
"Family": "Academic corpora",
"Description": "This corpus contains journal articles published between 2000 and 2010.\nThe corpus is unavailable.",
- "Languages": ["spa","eng"],
+ "Language": ["spa","eng"],
"Licence": "",
"Size": ["5.7 million words"],
"Annotation": [],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/ufal-papers.json
@@ -3,7 +3,7 @@
"URL": "https://hdl.handle.net/11234/1-1731",
"Family": "Academic corpora",
"Description": "This parallel corpus contains research paper abstracts in formal and applied linguistics. For each publication, the authors were obliged to provide both the original abstract in Czech or English, and its translation into English or Czech, respectively. The corpus data are in the TSV format.\nThe corpus is available for download from the LINDAT repository.",
- "Languages": ["ces","eng"],
+ "Language": ["ces","eng"],
"Licence": "CC BY",
"Size": ["2 million words"],
"Annotation": ["document aligned"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/uh-eng.json
@@ -3,7 +3,7 @@
"URL": "http://urn.fi/urn:nbn:fi:lb-2016102401",
"Family": "Academic corpora",
"Description": "This corpus contains MA and PhD theses published between 1999 and 2016.\nThe corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).",
- "Languages": ["eng"],
+ "Language": ["eng"],
"Licence": "CC BY",
"Size": ["200 million tokens"],
"Annotation": ["PoS-tagged", "syntactically parsed"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/uh-fin.json
@@ -3,7 +3,7 @@
"URL": "http://urn.fi/urn:nbn:fi:lb-2016090601",
"Family": "Academic corpora",
"Description": "This corpus contains MA and PhD theses published between 1999 and 2016.\nThe corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).",
- "Languages": ["fin"],
+ "Language": ["fin"],
"Licence": "CC BY",
"Size": ["12.5 million tokens"],
"Annotation": ["PoS-tagged", "lemmatised"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/uh-fra.json
@@ -3,7 +3,7 @@
"URL": "http://urn.fi/urn:nbn:fi:lb-2016102806",
"Family": "Academic corpora",
"Description": "This corpus contains MA and PhD theses published between 1999 and 2016.\nThe corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).",
- "Languages": ["fra"],
+ "Language": ["fra"],
"Licence": "CC BY",
"Size": ["580,000 tokens"],
"Annotation": [],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/uh-ger.json
@@ -3,7 +3,7 @@
"URL": "http://urn.fi/urn:nbn:fi:lb-2016102807",
"Family": "Academic corpora",
"Description": "This corpus contains MA and PhD theses published between 1999 and 2016.\nThe corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).",
- "Languages": ["deu"],
+ "Language": ["deu"],
"Licence": "CC BY",
"Size": ["560,000 tokens"],
"Annotation": ["No annotation"],
2 changes: 1 addition & 1 deletion corpora/academic-corpora/uh-rus.json
@@ -3,7 +3,7 @@
"URL": "http://urn.fi/urn:nbn:fi:lb-2016102808",
"Family": "Academic corpora",
"Description": "This corpus contains MA and PhD theses published between 1999 and 2016.\nThe corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).",
- "Languages": ["rus"],
+ "Language": ["rus"],
"Licence": "CC BY",
"Size": ["1.1 million words"],
"Annotation": ["No annotation"],