For those of you who are not used to long markdown files, GitHub automatically generates a table of contents for you! See more info on how to find it here.
Corpus = a collection of raw unlabeled texts
- Språkbanken Text -- a hub page for many Swedish corpora maintained by Språkbanken Text; the monolingual corpora come from newspapers, blog posts, and literature of different years (some from as early as the 18th century). Note that many of these corpora contain scrambled sentences.
- CC-100 -- documents extracted from Common Crawl, automatically classified and filtered. The Swedish part is 21 GB of raw text.
- mC4 -- the multilingual version of C4 (a colossal, cleaned version of Common Crawl's web crawl corpus); the Swedish part contains about 65 GB of raw text
- SOU corpus -- cleaned and further processed versions of Swedish Government Official Reports (Statens offentliga utredningar, SOU), covers the reports between 1994 and 2020
- SweDraCor -- a corpus of 68 TEI-encoded Swedish-language plays taken from the eDrama project
- Swedish Poetry -- poetry corpus
- LBPF -- Swedish prose fiction with modern spelling from Litteraturbanken
- SBS -- a collection of sentences from Swedish blog posts published between November 2010 and September 2012; contains scrambled sentences -- NOTE: links seem to be broken as of 2022-05-25
- Project Runeberg -- copyright-free Swedish literature
- Swedish Diachronic Corpus -- text corpora covering the time period from Old Swedish to the present day across various text genres
- OSCAR -- scrambled sentences extracted from Common Crawl and classified with a language detection model. Its Swedish portion comprises 48 GB of raw text with roughly 7.5M documents and 5B words (a loading sketch follows the corpus list below)
- Polyglot's processed Swedish Wikipedia
- OPUS -- The Open Parallel Corpus, a hub for parallel datasets for many pairs of languages, including to/from Swedish.
- Språkbanken Text -- a hub page for many Swedish corpora maintained by Språkbanken Text; the available parallel corpora are EuroParl (the Swedish part of the European Parliament Proceedings Parallel Corpus) and ASPAC (the Swedish part of The Amsterdam Slavic Parallel Aligned Corpus). Note that both corpora contain scrambled sentences.
- SMULTRON -- a parallel treebank that contains around 1000 sentences in English, German and Swedish
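Several of the corpora above are also mirrored on the Hugging Face Hub. As a rough sketch (not an official recipe), this is how the Swedish part of OSCAR can be streamed with the `datasets` package; the dataset and configuration names are assumptions about the Hub at the time of writing and may have changed:

```python
from datasets import load_dataset  # pip install datasets

# Stream the Swedish portion of OSCAR instead of downloading all ~48 GB up front.
# "oscar" / "unshuffled_deduplicated_sv" are assumed Hub names; check the Hub first.
oscar_sv = load_dataset("oscar", "unshuffled_deduplicated_sv", split="train", streaming=True)

for i, document in enumerate(oscar_sv):
    print(document["text"][:80].replace("\n", " "))
    if i == 2:  # peek at the first three documents only
        break
```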
Dataset = a collection of labeled texts
- Swedish Universal Dependencies treebanks -- can be used to train PoS taggers, lemmatizers and dependency parsers (a CoNLL-U parsing sketch follows this list)
- SweQUAD-MC -- a multiple choice question answering dataset
- Swedish-sentiment -- a sentiment analysis dataset of 10,000 texts with a roughly 50/50 split between positive and negative sentiments
- Swedish-Causality-Datasets -- two datasets, one for causality recognition and one for causality ranking, built from texts of the official reports of the Swedish Government
- Swedish-MWE-dataset -- a multiword expression dataset, containing 96 Swedish expressions annotated for their degrees of compositionality
- Swedish-NER
- by Andreas Klintberg -- a semi-manually annotated version of the Webbnyheter 2012 corpus from Språkbanken with 4 types of named entities: person, organization, location, miscellaneous.
- by Robert Lorentz
- The Written Works Corpus -- named entities for written works: ART, BOOK, GAME, MOVIE, MUSIC, PLAY, RADIO and TV. A more detailed description of the corpus is available here
- SIC -- a corpus of Swedish Internet texts, manually annotated with part-of-speech tags and named entities
- SUSC -- a corpus of seven novels by August Strindberg, annotated with part-of-speech tags, morphological analysis and lemmas
- SNEC -- The Strindberg National Edition Corpus, both plain text version and linguistically annotated CoNLL-U version -- NOTE: links seem to be broken as of 2022-05-25
- SuperLim -- a Swedish version of the GLUE benchmark
- OverLim -- the dataset contains some of the GLUE and SuperGLUE tasks automatically translated to Swedish, Danish, and Norwegian (Bokmål) using the OPUS-MT models for MarianMT; the translation quality was not manually checked
- XNLI -- an autotranslated (Google Translate) natural language inference (NLI) dataset, no info about human correction
- STS Benchmark -- a semantic textual similarity (STS) dataset, automatically translated version of the original STS Benchmark for English using Google's NMT API without human correction
- SwedSQuAD -- a machine-translated version of SQuAD (Stanford Question Answering Dataset), no info about human correction
- SUC 2.0 -- annotated with part-of-speech tags, morphological analysis and lemmas (all of which can be considered gold-standard data), as well as some structural and functional information
- SUC 3.0 -- an improved and extended version of SUC 2.0
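Since several of the resources above (e.g. the UD treebanks and the annotated CoNLL-U version of SNEC) are distributed in CoNLL-U format, here is a minimal sketch of reading such data with the third-party `conllu` package; the tiny sentence below is hand-written for illustration and not taken from any of the corpora:

```python
from conllu import parse  # pip install conllu

# A tiny hand-written CoNLL-U fragment (tab-separated columns), just for illustration.
data = (
    "# text = Hunden sover.\n"
    "1\tHunden\thund\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsover\tsova\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

for sentence in parse(data):
    for token in sentence:
        print(token["form"], token["lemma"], token["upos"], token["head"], token["deprel"])
```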
- Facebook's FastText vectors, 300-dimensional
- trained on Common Crawl + Wikipedia: vecs (a loading sketch follows this list)
- trained on language-specific Wikipedia only: vecs
- trained on Wikipedia with cross-lingual alignment: [vecs](https://fasttext.cc/docs/en/aligned-vectors.html)
- Diachronic embeddings from Språkbanken Text (based on word2vec and FastText)
- NLPL repository maintained by the Language Technology Group at the University of Oslo
- Swectors, 300-dimensional (the released vectors are Word2Vec)
- Polyglot embeddings: vecs
- Kyubyong Park's vectors
- Flair embeddings, 2048-dimensional, can be used only within the flair package from Zalando Research
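A rough sketch of loading the fastText vectors listed above, assuming the `fasttext` pip package; `cc.sv.300.bin` follows fastText's naming convention for the Common Crawl + Wikipedia vectors, so double-check the file name against the download page:

```python
import fasttext
import fasttext.util

# Download the 300-dimensional Swedish vectors (several GB) and query them.
# The language code and file name follow fastText's conventions and may change.
fasttext.util.download_model("sv", if_exists="ignore")  # fetches cc.sv.300.bin
model = fasttext.load_model("cc.sv.300.bin")

print(model.get_dimension())                     # 300
print(model.get_nearest_neighbors("kaffe")[:3])  # nearest neighbours of "kaffe" (coffee)
```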
The code for calculating the number of parameters (comes from this thread):
- PyTorch: `sum(p.numel() for p in model.parameters() if p.requires_grad)`
- TensorFlow: `np.sum([np.prod(v.get_shape().as_list()) for v in tf.trainable_variables()])`
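For instance, here is a minimal sketch applying the PyTorch one-liner to one of the checkpoints listed below, assuming the `transformers` package is installed and that `KB/bert-base-swedish-cased` is still the Hub ID for KBLab's model:

```python
from transformers import AutoModel  # pip install transformers torch

# Load one of the Swedish checkpoints listed below and count its trainable parameters.
# The Hub ID is an assumption about the current naming; adjust it if the model has moved.
model = AutoModel.from_pretrained("KB/bert-base-swedish-cased")

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params / 1e6:.1f}M trainable parameters")  # roughly 124M for this checkpoint
```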
And now to the models themselves, where the code snippets above were used to estimate the numbers of parameters.
- BERT* models from The National Library of Sweden/KBLab
- bert-base-swedish-cased: 12 layers, 768 hidden size, 12 heads, ~124M parameters
- albert-base-swedish-cased-alpha: 12 layers, 768 hidden size, 12 heads, ~14M parameters
- electra-small-swedish-cased
- generator: 12 layers, 256 hidden size, 4 heads, ~16M parameters
- discriminator: 12 layers, 256 hidden size, 4 heads, ~16M parameters
- BERT models from Arbetsförmedlingen (The Swedish Public Employment Service)
- bert-base-swedish-uncased: 12 layers, 768 hidden size, 12 heads, ~110M parameters
- bert-large-swedish-uncased: 24 layers, 1024 hidden size, 16 heads, ~335M parameters
- RoBERTa models
- trained on Swedish Wikipedia and OSCAR: model on HF Hub
- trained on mC4: model on HF Hub
- seems to be trained on OSCAR?: model on HF Hub
- GPT-2 models
- trained on the Wiki40B and OSCAR: model on HF Hub
- trained on the Wiki40B only: model on HF Hub
- GPT-SW3 model (3.5B parameters): model on HF Hub -- NOTE: The repository is empty as of 2022-08-23
- T5 models
- trained on OSCAR: model on HF Hub
- GPT-2 models:
- trained on Wiki40B: model on HF Hub
- mBERT -- multilingual BERT by Google Research
- mBART50 -- multilingual BART by FAIR
- mT5 -- multilingual T5 by Google Research
- Stanza's models -- trained on UD treebanks: one on Talbanken and another on LinES
- MaltParser
- OPUS-MT models: models on HF Hub
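As a final sketch, the OPUS-MT models can be used through the `transformers` pipeline API; `Helsinki-NLP/opus-mt-sv-en` is assumed here as the Swedish-to-English checkpoint, but there are many other `sv`-related pairs on the Hub:

```python
from transformers import pipeline  # pip install transformers sentencepiece

# Swedish -> English translation with an OPUS-MT (MarianMT) checkpoint.
# The model ID is an assumption; swap in whichever language pair you need.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-sv-en")

result = translator("Hunden sover på mattan.")
print(result[0]["translation_text"])  # e.g. "The dog sleeps on the carpet."
```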