This repository has been archived by the owner on Oct 8, 2022. It is now read-only.
tokenizer for OpenCorpora project
ksurent/Lingua--RU--OpenCorpora--Tokenizer
NAME
    Lingua::RU::OpenCorpora::Tokenizer - tokenizer for OpenCorpora project

SYNOPSIS
        use Lingua::RU::OpenCorpora::Tokenizer;

        my $tokenizer = Lingua::RU::OpenCorpora::Tokenizer->new({});
        my $tokens = $tokenizer->tokens($text);
        my $bounds = $tokenizer->bounds($text);

DESCRIPTION
    This module tokenizes input texts in the Russian language.

    Note that it uses a probabilistic algorithm rather than trying to parse
    the language. It also uses some pre-calculated data freely provided by
    the OpenCorpora project.

    NOTE: OpenCorpora periodically provides updates for this data. Check out
    the "opencorpora-update-tokenizer" script that comes with this
    distribution.

    The algorithm is this:

    1. Split text into chars.
    2. Iterate over the chars from left to right.
    3. For every char get its context (see "CONTEXT").
    4. Find the likelihood for the context in the vectors file (see
       "VECTORS FILE") or use the default value - 0.5.

  CONTEXT
    See Lingua::RU::OpenCorpora::Tokenizer::Context.

  VECTORS FILE
    Contains a list of vectors with likelihood values showing the chance
    that a given vector is a token boundary.

    Built by the OpenCorpora project from a semi-automatically annotated
    corpus.

  HYPHENS FILE
    Contains a list of hyphenated Russian words. Used in vectors
    calculations.

    Built by the OpenCorpora project from a semi-automatically annotated
    corpus.

  EXCEPTIONS FILE
    Contains a list of char sequences that are not subject to tokenization.

    Built by the OpenCorpora project from a semi-automatically annotated
    corpus.

  PREFIXES FILE
    Contains a list of common prefixes for decompound words.

    Built by the OpenCorpora project from a semi-automatically annotated
    corpus.

    NOTE: all files are stored as GZip archives and are not supposed to be
    edited manually.

METHODS
  new($args)
    Constructs and initializes a new tokenizer object.

    Takes a hashref as an argument with the following keys:

    data_dir
        Path to a directory with OpenCorpora data. Optional. Defaults to
        the distribution directory (see File::ShareDir).

    prefixes, hyphens, exceptions, vectors
        Data objects. Optional. You can provide any of those (or none of
        them). Default is to create an object from the data that comes
        with the distribution.

  tokens($text [, $options])
    Takes text as input and splits it into tokens. Returns a reference to
    an array of tokens.

    You can also pass a hashref with options as a second argument. Current
    options:

    threshold
        Minimum likelihood value for a token boundary. Boundaries with a
        lower likelihood are excluded from consideration.

        Default value is 1, which makes the tokenizer split only when it's
        confident.

  tokens_bounds($text)
    Takes text as input and finds bounds of tokens in the text. It doesn't
    split the text into tokens, it just marks where tokens could be.
    Returns an arrayref of arrayrefs. Each inner arrayref consists of two
    elements: boundary position in text and likelihood.

  bounds($text)
    Convenience alias for "tokens_bounds()".

TO DO
    - get rid of gzipped files

KNOWN BUGS
    - version 0.07 introduced a small regression in F1 score (using
      OpenCorpora data)

SEE ALSO
    Lingua::RU::OpenCorpora::Tokenizer::Updater

    <http://mathlingvo.ru/nlpseminar/archive/s_49>

AUTHOR
    OpenCorpora.org team <http://opencorpora.org>

LICENSE
    This program is free software, you can redistribute it under the same
    terms as Perl itself.
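
EXAMPLE: HOW THE THRESHOLD OPTION RELATES BOUNDS TO TOKENS

The following is a minimal sketch, not the module's actual implementation, of how a list of bounds such as the one described under tokens_bounds() could be turned into tokens for a given threshold. It does not load the module. The helper name split_by_bounds and the sample bounds are hypothetical, and the sketch assumes that a boundary position is the zero-based index of the character that ends a candidate token, which the documentation above does not specify.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper, not part of Lingua::RU::OpenCorpora::Tokenizer.
# Cuts $text after every boundary whose likelihood reaches $threshold.
# ASSUMPTION: each bound is [ $pos, $likelihood ], where $pos is the
# zero-based index of the last character of a candidate token.
sub split_by_bounds {
    my ($text, $bounds, $threshold) = @_;
    $threshold = 1 unless defined $threshold;    # documented default

    my @tokens;
    my $start = 0;
    for my $bound (sort { $a->[0] <=> $b->[0] } @$bounds) {
        my ($pos, $likelihood) = @$bound;
        next if $likelihood < $threshold;        # boundary not confident enough
        next if $pos < $start || $pos >= length $text;
        push @tokens, substr($text, $start, $pos - $start + 1);
        $start = $pos + 1;
    }
    # Whatever follows the last accepted boundary is the final token.
    push @tokens, substr($text, $start) if $start < length $text;

    # Trim surrounding whitespace and drop tokens that become empty.
    my @clean = map { my $t = $_; $t =~ s/^\s+|\s+$//g; $t } @tokens;
    return [ grep { length } @clean ];
}

# Made-up bounds for illustration; real values would come from bounds($text).
my $text   = 'кто-то тут';
my $bounds = [ [2, 0.4], [3, 0.4], [5, 1], [9, 1] ];

my $strict = split_by_bounds($text, $bounds);         # threshold 1
my $loose  = split_by_bounds($text, $bounds, 0.4);    # threshold 0.4

binmode STDOUT, ':encoding(UTF-8)';
print join(' | ', @$strict), "\n";    # кто-то | тут
print join(' | ', @$loose),  "\n";    # кто | - | то | тут
```

With the default threshold of 1, only the two confident boundaries are used and the hyphenated word stays whole. Lowering the threshold to 0.4 also accepts the two uncertain boundaries around the hyphen, so the same text yields four tokens.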