GitHub - soshial/text-normalization: Python tool for normilizing text and text canonicalization (DISCONTINUED)

soshial / text-normalization Public

Notifications You must be signed in to change notification settings
Fork 17
Star 41

Python tool for normilizing text and text canonicalization (DISCONTINUED)

41 stars 17 forks Branches Tags Activity

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
hunspell		hunspell
in		in
numword		numword
out		out
LICENSE.txt		LICENSE.txt
README		README
main.py		main.py
normalizer_ru.php		normalizer_ru.php
num_base.py		num_base.py
num_en.py		num_en.py
num_ru.py		num_ru.py
parse_subs.php		parse_subs.php

Repository files navigation

This library is used for normalizing ru/en texts, according to http://en.wikipedia.org/wiki/Text_normalization
Best suites  for Text-to-Speech text normalizing.

The library is licensed under LGPL v3, full text of which can be found at http://www.gnu.org/copyleft/lesser.html
All sql files are licensed under GNU FDL.

Prerequisites:
* python 2.7 + libs: regex, nltk (+ download 'punkt' package), MySQLdb, ConfigParser
* pymorphy (install.sh converts AOT dics, than copy them into /normalization/dics directory)
* upgraded numword library (http://code.google.com/p/numword/, which is based on pynum2word) is already inluded

How to use:
    ./normalization/main.py [language] [in_folder] [out_folder]
    you should specify 2-letter [langugage] and [IN folder] with files to process and [OUT folder] to put the result
    the main script is a loop that processes all files from IN folder, which have .meta duplicates

There also should be ./normalization/normalization.cfg which specifies database detailsand russian AOT paths
    [db]
    host:...
    user:...
    passwd:...
    db:...

    [lms]
    rml_path:~/ling2/rml/
    aot_path:~/ling2/
    dicts:~/text-normalization/dicts/converted/ru/

numword lib (without num wrappers) is quite handy if you need to get text from any float or integer variable.
    There is also an easy way for inflecting numbers in Russian:
        >>> from numword import numword_ru
        >>> numw_class = numword_ru.NumWordRU()
        >>> numw_class.inflection_case = u'вн,мн,жр'
        >>> numw_class.ordinal(325002)
        ТРИСТА ДВАДЦАТЬ ПЯТЬ ТЫСЯЧ ДВЕ