NEWS

This file lists noteworthy changes between releases, for full list of changes, see git log and then ChangeLog.old.

Significant changes in 0.9.11

New words from open sources: 7000 new lexemes
minor fixes
huge thanks to Patreons and Github Sponsors for their support

Significant changes in 0.9.10

new words from wiktionaries and open name databases: 77,000 new lexemes, mostly proper nouns from the two government's open access name databases:
- all first names and surnames used in Finland from dvv.fi name registry
- all place names from the GML data by maanmittauslaitos
- get details from statistics page
minor fixes to tags and formats, disambiguation and words:
- apertium format has more subcategories
- more words from wiktionaries have better paradigms (mainly consonant-final nouns)
- few minor tweaks to prevent odd plurals and singulars for personal pronouns that do not have them in normal ue
test results show same compatibility as always, except:
- FTB-3.1 is down to 88 % from 90 % and
- UD vs. Finnish DTD is down to 92 % from 94 %
python stuff should only use hfst package and not (legacy?) libhfst
- newest hfst-python should again be installable from pip and other packaging sources
big thanks to Patreons and Github Sponsors for contiinued support

Significant changes in 0.9.9

slight updates to convenience bash scripts
- bash scripts default to large coverage analyser now, use -Z for old behaviour
Unimorph 4 compatible
added the name database from Finnish governments open data repository: approx. 20,000 new names and 20,000 existing names verified
Changed to semver and not so bi-yearly schedule, and to main branch instead of outdated git flow model
nearly 10,000 words moved from main lexicon to MWE; added MWE fragments that were previously not in main lexicon (e.g. "Records", "Air", "Las", "Agia", "Group", "Air", ...)
few thousands of words from fiwkikt, enwikt and joukahainen including new paradigms for cool and chic (i.e. loan consonant final adjectives), galanga root, cisgender, genetic scissors, gay drumming, hybrid influencing, spike protein and a lot of birds, mice and compounds
preliminary support for conda
homonyms dropped cross part-of-speech, only lemmas within same pos get homonym code and analysis now
Removed multi-words from main lexicon, if a lexeme has space in it all parts are analysed separately
canonic sort order for TSV files based on python sort (since bash sort is not portable across OSes or stable)
minor fixes for c++ demo and api
updated words from wiktionaries, joukahainen...
basic NER parser (~90 % of finer covered)
improvements in documentation based on feedback
big thanks to Patreons and GitHub sponsors for continued support

Significant changes in 20200511

Universal dependencies version 2.6 compatible
3021 new words, some related to 2020
preliminary support for pip / pypi / venv
new logo
next version will be in semantic versioning scheme, and few breaks in API are to be expected

Significant changes in 20191111

Universal dependencies version 2.5 is a reference for recall tests
11,343 new words
Fixed ordinals as adjectives
Minor overhaul of documentation
Fixed injection vuln. in python OOV handling
Fixed tokeniser regression related to initial punctuations
No other big changes and no API changes

Significant changes in 20190511

Universal dependencies version 2.4 is a reference for recall tests
2879 new words
No other big changes and no API changes

Significant changes in 20181111

Universal dependencies version 2.3 is a reference for recall tests
At least 18,380 new words: 340,931 insertions(+), 322,551 deletions(-)
- Imported enwikt data on top of re-importing fiwikt, joukahainen
New CG based on UD tags
Some universal dependencies guessed (analysers using dep guessing are slower and process sentences instead of words)
Default processing mode for many analysers is now sentence-based
Slightly extended python API (somewhat modeled like SpaCy but not quite)
Ability to download compiled FST models from release instead of self-compiling (beta)
Unimorph used as new Recall / Precision reference gold test set
Probably some fixes to recasing
Gradle support for java stuff
renamed origin unihu → finer

Significant changes in 20180511

Universal dependencies version 2.2 is now used as target
At least 226 new words: 239 additions and 13 deletions in lexeme database
Most changes are in development infra, so not visible to end users...
Started rewriting CG from the scratch
The APIs for programming language deprecate load(filename) and load(dir) forms of filename guessing functions in favour of forthcoming loadAnalyser(file), loadLemmatiser(file), loadUDPipe(file) etc. etc. functions
Working towards more general tokenise-analyse-disambiguate pipelines maybe, or just refactoring
lots more automated tests -> lots less human errors
By popular request: there are two analysers now, one with small dictionary and one with full, use the smaller one when you do not want to see birds or languages or tribes analysed. The smaller one replaces the old default, but the new tools will require you to select one explicitly anyways
fixes and workarounds: java and c++ can now be disabled partially or totally
adopted SG0 as possible verb form analysis from UD data
The end users are now provided with bash-scripts wrappers for all functionalities, whereas the typically python versions allow more control of parametres

Significant changes in 20170515

Universal Dependencies version 2 is now used, still mainly lemma, UPOS, features fields are analysed
At least 2,336 new words (based on diffstat: 38886 additions, 3655 deletions)
Preliminary support for various guessing models: python-based, finite-state and UDPipe. This means that it is possibly to get analyses for all tokens, albeit quality of guesses varies.
A minimal C++ library version has been made to match java and python bindings. C++-11 and libhfst are required.
The dix version can now be compiled with lttoolbox with a lot of memory
A restricted "gold" dictionary mode has been added. This is good for both end users with limited memory and end users who require higher quality lexemes (i.e., only research institute approved, no wiktionary words or other weird stuffs)
Documentations and automatic testing much reworked with the new modern toys from github: travis-ci, jekyll
Started weeding the ADP/ADV jungle...
Fixed a horrible bug in the corpus coverage testing that terribly under-estimated our coverage for corpora where hapax legomena etc. were ignored
Lot of documentation has been semi-automated, therefore many changes can be viewed at the new gh-pages site: https://flammie.github.io/omorfi/

Significant changes in 20161115

Started drafting more blacklists and known good lexemes subsets for people who struggle with rare words and productive compounding, derivation
Updated to Universal Dependencies version 1.4
A lot of new derivations by the way
Preliminary guessers
More loopy guessery things for punctuation and digit combos
Minor fixes to UD feature sorting
Homonym numbers used in some applications
Added timeouts where downstream tools support them, so tools don't seem like they are freezing at random
moved old documentations to github-pages
added preliminary hfst-pmatch-based tokeniser

Significant changes in 20160515

Universal Dependencies for Finnish is the new standard format we now follow:
- POS is now UPOS and classes were changed accordingly (new classes: AUX, PROPN, DET, CONJ, SCONJ, PUNCT, SYM, and VERB, NOUN, ADP, ADV as before)
- other features mostly match the feature field in UD documentation
- release cycle aims to be same six month cycle as with UD
- the automatic tests verify compatibility with UD; 92 % of lemmas, primary POS tags and morphological features are the same as Finnish UD corpus, 75 % same as Finnish FTB UD corpus
- analyser for reading and writing CONLL-U format
tokenisation as script and more hacks to token stripping in corner cases
continuous integration with travis-ci, currently only testing basic script programming conventions
added a lot of high coverage words and forms by hand
by popular request, some of the words can now be blacklisted, when you don't want that guy named Mutta to ambiguate your conjunction analyses or the odd new guinean bird to clash with some common verb
the "database" is now only keyed on lemma + homonym number; paradigm is extra information like anything else
a lot of work on morphological segmentation towards statistical machine translation; check proceedings of WMT shared tasks 2015 and 2016 to see why
started refactoring some python code into classes

Significant changes in 20150904

allomorphy can be tagged again to distinguish e.g. -iden and -itten when generating
FinnTreeBank-1 format provided by Miikka Silfverberg is available but not built by default since it lacks a test set
lexicalised inflections can have separate tag, e.g. kännissä can be lexical inessive distinguished from regular inessive
preliminary VISL CG-3 support, with original grammar by Fred Karlsson; convenience bash scripts available for disambiguated parsing
preliminary support for conllu and conllx analysis formats
paradigm categorisation is now verified by regular expressions
lots of paradigm fixes and some added words

Significant changes in 20150326

speed is up to >20,000 tokens per second from ~500
coverages are up to: europarl (99 %) gutenberg (97 %), JRC Acquis (94 %) and fiwiki (93 %)
moses factored model format supported
segmentation supported
Java API
Python hacks packaged to API and module
Rest of hand-written Xerox legacy data removed; all is script-generated
github migration since google code is EOL'd
file naming for automata changed to include omorfi prefix for all file names in case they are distributed separately.

Significant changes in 20141014

The regressions are also set on coverage over popular corpora: Europarl (98 %), FTB 3.1 (97 %), gutenberg (96 %), JRC Acquis (93 %) and fiwiki (90 %)
sti derivation tentatively added
number of new paradigms and paradigm moves, esp. in old and archaic styles
some new words manually added
apertium formats updated totally
interjection chaining
rest of hand-written lexc removed: everything in db and python code now
more strict building and testing altogether (no more dangling references or missing tags allowed)
morphological segmentation should be usable now
lots of other classifications and attributes added

Significant changes in 20130829

Default tag format is now FTB3.1. Recall is 90 % and the format is stable and easy to read by humans, which is now the main target for computational morphologies.
The omor tagsets are now permanently unstable and subject to change any day. To use them, python scripts have been provided.
Lots of proper nouns and semantics from Uni Hel projects
speller build support for new voikko versions
New regression tests for stuffs
Most of legacy lexc sources removed; they are now generated from TSV "databases".
The morphological classes now follow 3 main classes with some subclasses that are less morphological
Twol rules and flag diacritics have been eliminated
Lots of support scripts to verify and extend classifications
Lots of new word-forms, inflections and changes to derivations
Some python support scripts for omor formats

Significant changes in 20121226

Added fi.wiktionary.org as lexical source (much thanks to students of my unix tools course for scripting)
Added first batch of new proper nouns from a project in Univ. Helsinki
Lexc data is now rebuild from lexical sources as standard processing;
- requiring python3
Minor bug fixes to man pages, special boundaries (e.g. in arkki_tehti)

Significant changes in 20120401

Fixed some twol rules w.r.t. new features that blocked compiling
Autogenerate lexicons from csv data all the time
Moved to git and googlecode -> chopped most of the documentation and such
Fixed scripts a bit, added man pages
Made very crude tests to have at least something back in.

Significant changes in 20110505

whole new finntreebank tagset for forthcoming finntreebank work
uppercasing is noted in the analysis level
the word boundaries of lexicalised compounds may be available for more cases (depending on the tagset)
whole new lemmatizer tagset is available
some dozens of new words added and fixed
combine corpus analysis script with apertium's preprocessors
causative derivation chain added
bbreviations, adpositions, prefixes and suffixes are no longer pos but subcat analyses

Significant changes since 20100401

Include deverbal nouns in compounding system
Start marking compound and strong morpheme boundaries
New lexical data handling systems
Implement generator from analyser
Subcategorize lots of classes for CG and apertium
Write documentation in booklet format
New URI and digit string guessers
New tagging style colorterm for interactive use
Include weighting scheme in default build
Demote SUFFIX from POS reading to SUBCAT

Significant changes since 20100111

Added marginal enclitics kA, kAs
Added LEMMA= structure
re-organized source code to modules
Added tagging schemes, weighting schemes and suggestion algorithms

Significant changes since 0.5

completely new morphology built on traditional lexc-twolc model
easier route to add new lexical data via simple CSV format
lots of new lexical data from Joukahainen project as well as extended from kotus-sanalista semi-automatically and by hand.
titlecasing filter for regular words
š filter for old orthography variants
compounding much less haphazard concoction
parts of speech classified and included
pronouns, interjections, numerals, proper nouns
much closer to real full fledged morphology
movement from SFST to HFST toolset with lots of new cool toys (SFST support is retained in HFST)
towards full-scale automatic test suite

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NEWS

NEWS

NEWS

Significant changes in 0.9.11

Significant changes in 0.9.10

Significant changes in 0.9.9

Significant changes in 20200511

Significant changes in 20191111

Significant changes in 20190511

Significant changes in 20181111

Significant changes in 20180511

Significant changes in 20170515

Significant changes in 20161115

Significant changes in 20160515

Significant changes in 20150904

Significant changes in 20150326

Significant changes in 20141014

Significant changes in 20130829

Significant changes in 20121226

Significant changes in 20120401

Significant changes in 20110505

Significant changes since 20100401

Significant changes since 20100111

Significant changes since 0.5

Files

NEWS

Latest commit

History

NEWS

File metadata and controls

NEWS

Significant changes in 0.9.11

Significant changes in 0.9.10

Significant changes in 0.9.9

Significant changes in 20200511

Significant changes in 20191111

Significant changes in 20190511

Significant changes in 20181111

Significant changes in 20180511

Significant changes in 20170515

Significant changes in 20161115

Significant changes in 20160515

Significant changes in 20150904

Significant changes in 20150326

Significant changes in 20141014

Significant changes in 20130829

Significant changes in 20121226

Significant changes in 20120401

Significant changes in 20110505

Significant changes since 20100401

Significant changes since 20100111

Significant changes since 0.5