GitHub - ksurent/Lingua--RU--OpenCorpora--Tokenizer: tokenizer for OpenCorpora project

This repository has been archived by the owner on Oct 8, 2022. It is now read-only.

NAME
    Lingua::RU::OpenCorpora::Tokenizer - tokenizer for OpenCorpora project

SYNOPSIS
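        use Lingua::RU::OpenCorpora::Tokenizer;

        # all constructor keys are optional (see "new($args)")
        my $tokenizer = Lingua::RU::OpenCorpora::Tokenizer->new({});

        my $tokens = $tokenizer->tokens($text);

        my $bounds = $tokenizer->bounds($text);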

DESCRIPTION
    This module tokenizes input text in the Russian language.

    Note that it uses a probabilistic algorithm rather than trying to parse
    the language. It also uses pre-calculated data freely provided by the
    OpenCorpora project.

    NOTE: OpenCorpora periodically provides updates for this data. Check out
    the "opencorpora-update-tokenizer" script that comes with this
    distribution.

    The algorithm works as follows:

    1. Split text into chars.
    2. Iterate over the chars from left to right.
    3. For every char get its context (see CONTEXT).
    4. Look up the likelihood for that context in the vectors file (see
    "VECTORS FILE"), or fall back to the default value of 0.5.
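
    The four steps above can be sketched in plain Perl. This is only an
    illustration, not the module's code: the context function
    (toy_context), the character classes and the %toy_vectors table are
    all hypothetical stand-ins for the real context (see CONTEXT) and the
    real vectors file. Only the overall loop and the 0.5 default come from
    the description above.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical character classes, used only to build a toy context.
sub char_class {
    my ($char) = @_;
    return 'end'   unless defined $char;
    return 'space' if $char =~ /\s/;
    return 'punct' if $char =~ /[[:punct:]]/;
    return 'word';
}

# Hypothetical context: class of the current char plus class of the next one.
sub toy_context {
    my ($chars, $pos) = @_;
    return join ':', char_class($chars->[$pos]), char_class($chars->[$pos + 1]);
}

# Made-up likelihoods; the real values come from the vectors file.
my %toy_vectors = (
    'word:word'   => 0,
    'word:space'  => 1,
    'word:punct'  => 1,
    'word:end'    => 1,
    'punct:space' => 1,
    'punct:end'   => 1,
);

sub toy_bounds {
    my ($text, $vectors) = @_;
    my @chars = split //, $text;                     # step 1
    my @bounds;
    for my $pos (0 .. $#chars) {                     # step 2
        my $context = toy_context(\@chars, $pos);    # step 3
        my $likelihood = exists $vectors->{$context} # step 4
            ? $vectors->{$context}
            : 0.5;                                   # default for unknown contexts
        push @bounds, [$pos, $likelihood];
    }
    return \@bounds;
}

# Print one "position, likelihood" pair per char.
printf "%d\t%s\n", @$_ for @{ toy_bounds('ab, c', \%toy_vectors) };
```

    Contexts that are missing from the table, such as a space followed by
    a word char in this toy setup, receive the default likelihood of 0.5.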

  CONTEXT
    See Lingua::RU::OpenCorpora::Tokenizer::Context.

  VECTORS FILE
    Contains a list of vectors with likelihood values showing the chance
    that a given vector is a token boundary.

    Built by the OpenCorpora project from a semi-automatically annotated
    corpus.

  HYPHENS FILE
    Contains a list of hyphenated Russian words. Used when calculating
    vectors.

    Built by the OpenCorpora project from a semi-automatically annotated
    corpus.

  EXCEPTIONS FILE
    Contains a list of char sequences that are not subject to tokenization.

    Built by the OpenCorpora project from a semi-automatically annotated
    corpus.

  PREFIXES FILE
    Contains a list of common prefixes of compound words.

    Built by the OpenCorpora project from a semi-automatically annotated
    corpus.

    NOTE: All files are stored as gzip archives and are not meant to be
    edited manually.

METHODS
  new($args)
    Constructs and initializes a new tokenizer object.

    Takes a hashref as an argument with the following keys:

    data_dir
        Path to a directory with OpenCorpora data. Optional. Defaults to
        the distribution directory (see File::ShareDir).

    prefixes, hyphens, exceptions, vectors
        Data objects. Optional. You can provide any of those (or none of
        them). Default is to create an object from the data that comes with
        the distribution.
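
    A minimal construction sketch, using only the documented "data_dir"
    key. The helper tokenizer_args and the use of the first command-line
    argument as the data directory are assumptions made for illustration.
    The module is loaded at run time inside eval so that the snippet still
    runs where the distribution is not installed.

```perl
use strict;
use warnings;

# Hypothetical helper: build the constructor hashref, adding "data_dir"
# only when a directory was actually given.
sub tokenizer_args {
    my ($data_dir) = @_;
    my %args;
    $args{data_dir} = $data_dir if defined $data_dir;
    return \%args;
}

# Assumption: an optional data directory is passed as the first argument.
my $tokenizer = eval {
    require Lingua::RU::OpenCorpora::Tokenizer;
    Lingua::RU::OpenCorpora::Tokenizer->new(tokenizer_args($ARGV[0]));
};

print defined $tokenizer
    ? "tokenizer ready\n"
    : "tokenizer unavailable: $@\n";
```

    When no directory is given the hashref is empty, so the constructor
    falls back to the data shipped with the distribution.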

  tokens($text [, $options])
    Takes text as input and splits it into tokens. Returns a reference to an
    array of tokens.

    You can also pass a hashref with options as a second argument. Current
    options:

    threshold
        Minimum likelihood value for a token boundary. Boundaries with a
        lower likelihood are excluded from consideration.

        Default value is 1, which makes the tokenizer split only when it
        is certain.
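
    The effect of the threshold can be illustrated with a small
    stand-alone sketch. It does not call the module: split_by_threshold is
    a hypothetical function, the boundary list is made-up data in the
    [position, likelihood] shape, and it assumes that a position marks the
    last char of a candidate token. Trimming whitespace from tokens is
    also an assumption.

```perl
use strict;
use warnings;

# Hypothetical splitter: cut the text after every position whose
# likelihood reaches the threshold.
sub split_by_threshold {
    my ($text, $bounds, $threshold) = @_;
    $threshold = 1 unless defined $threshold;    # documented default

    my @chars = split //, $text;
    my %likelihood_at = map { $_->[0] => $_->[1] } @$bounds;

    my @tokens;
    my $token = '';
    for my $pos (0 .. $#chars) {
        $token .= $chars[$pos];
        my $likelihood = $likelihood_at{$pos};
        next unless defined $likelihood && $likelihood >= $threshold;

        $token =~ s/^\s+|\s+$//g;                # assumption: trim whitespace
        push @tokens, $token if length $token;
        $token = '';
    }

    # Keep any text left after the last accepted boundary.
    $token =~ s/^\s+|\s+$//g;
    push @tokens, $token if length $token;

    return \@tokens;
}

# Made-up boundaries for the text "a-b c".
my $bounds = [ [0, 0.6], [1, 0.6], [2, 1], [4, 1] ];

print join('|', @{ split_by_threshold('a-b c', $bounds) }), "\n";
print join('|', @{ split_by_threshold('a-b c', $bounds, 0.5) }), "\n";
```

    With the default threshold only the boundaries with likelihood 1 are
    used, so "a-b" stays in one piece. Lowering the threshold to 0.5 also
    accepts the two uncertain boundaries around the hyphen.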

  tokens_bounds($text)
    Takes text as input and finds the bounds of tokens in the text. It
    doesn't split the text into tokens; it only marks where token
    boundaries could be.

    Returns an arrayref of arrayrefs. Each inner arrayref consists of two
    elements: the boundary position in the text and its likelihood.

  bounds($text)
    Convenience alias for "tokens_bounds()".

TO DO
    get rid of gzipped files

KNOWN BUGS
    Version 0.07 introduced a small regression in F1 score (measured on
    OpenCorpora data).

SEE ALSO
    Lingua::RU::OpenCorpora::Tokenizer::Updater

    <http://mathlingvo.ru/nlpseminar/archive/s_49>

AUTHOR
    OpenCorpora.org team <http://opencorpora.org>

LICENSE
    This program is free software; you can redistribute it under the same
    terms as Perl itself.