tokenization

A general-purpose text tokenizing module for Python. Create application-specific tokenizers while writing little code.

Most tokenizing libraries require one to subclass a tokenizer class to achieve the desired functionality; tokenization instead takes a handful of simple arguments that fit nearly any use case. (That said, a library suited specifically to one's needs, if it exists, is always the better choice.)

Features

  • Choose which characters should be included in tokens and which should mark the end of a token
  • Read input from either a string or a stream, e.g. a file (see the sketch after this list)
  • Specify how "unknown" characters (characters not given an explicit classification) should be handled
  • Define "containers": pairs of characters that treat everything between them as a single token (useful containers might be parentheses "(...)" or square brackets "[...]")
  • Customizable escape sequence handling
  • Highly documented/commented source code
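
For instance, stream input might look like the following. This is a minimal sketch, not taken from the project's documentation: it assumes the same positional argument that accepts a string also accepts a file-like object, while the containers argument and get_all_tokens() are used exactly as in the examples below.

>>> import io
>>> from tokenization import Tokenizer
>>> stream = io.StringIO("hello 123 (a container)")
>>> # Assumption: Tokenizer takes a file-like object where it takes a string
>>> tokenizer = Tokenizer(stream, containers={"(": (")", True, False, True)})
>>> tokenizer.get_all_tokens()
['hello', '123', '(a container)']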

Install

pip install tokenization

Examples

Parentheses Container
>>> from tokenization import Tokenizer
>>> input_str = "hello 123 (this is a container) #comment"
>>> container = {"(": (")", True, False, True)}
>>> tokenizer = Tokenizer(input_str, containers=container)
>>> tokenizer.get_all_tokens()
['hello', '123', '(this is a container)']
Parentheses Container with Internal Tokenization
>>> from tokenization import Tokenizer
>>> input_str = "hello 123 (this is a container) #comment"
>>> container = {"(": (")", True, True, True)}
>>> tokenizer = Tokenizer(input_str, containers=container)
>>> tokenizer.get_all_tokens()
['hello', '123', ['(', 'this', 'is', 'a', 'container', ')']]
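
Each container maps an opening character to a tuple whose first element is the closing character, followed by three boolean flags. Comparing the two examples above, the middle flag appears to control internal tokenization: False keeps "(this is a container)" as a single token, while True splits the delimiters and contents into separate tokens. This reading is inferred from the examples alone; the commented source code is the authoritative reference for each flag.
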
Escaping Containers
>>> from tokenization import Tokenizer
>>> input_str = "hello 123 \\(this is a container) #comment"
>>> container = {"(": (")", True, True, True)}
>>> tokenizer = Tokenizer(input_str, containers=container)
>>> tokenizer.get_all_tokens()
['hello', '123', '(', 'this', 'is', 'a', 'container', ')']
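
In this last example the backslash escapes the opening parenthesis (the Python literal "\\(" is a single backslash followed by "("), so "(" no longer opens a container; as the output shows, the parentheses and the words between them are then tokenized individually.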
