A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens.
The format of the tokenizer definition is as follows:

{
    "name": <TOKENIZER_NAME>,
    "options": <TOKENIZER_OPTIONS>
}

<TOKENIZER_NAME>
: The name of the tokenizer to use.

<TOKENIZER_OPTIONS>
: The options to pass to the tokenizer. The available options depend on the tokenizer.
The following tokenizers are available:
- Character
- Exception
- Kagome
- Letter
- Regular Expression
- Single Token
- Unicode
- Web
- Whitespace
Outputs tokens made up of characters in the specified rune class. The following values can be set for the rune option.

graphic
: Graphic characters, which include letters, marks, numbers, punctuation, symbols, and spaces.

print
: Printable characters, which include letters, marks, numbers, punctuation, symbols, and the ASCII space character.

control
: Control characters.

letter
: Letter characters.

mark
: Mark characters.

number
: Number characters.

punct
: Unicode punctuation characters.

space
: Whitespace characters as defined by Unicode's White Space property; in the Latin-1 range these are '\t', '\n', '\v', '\f', '\r', ' ', U+0085 (NEL), and U+00A0 (NBSP).

symbol
: Symbolic characters.
Example:
{
    "name": "character",
    "options": {
        "rune": "letter"
    }
}
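The run-of-characters behavior can be sketched as follows. This is an illustrative Python sketch, not the actual implementation, and it models only the letter and number rune classes:

```python
import unicodedata

def character_tokenize(text, rune="letter"):
    """Emit maximal runs of characters in the given rune class.

    Illustrative sketch: only "letter" and "number" are modeled here;
    the real tokenizer supports all of the rune classes listed above.
    """
    predicates = {
        "letter": str.isalpha,
        "number": lambda ch: unicodedata.category(ch).startswith("N"),
    }
    keep = predicates[rune]
    tokens, current = [], []
    for ch in text:
        if keep(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

character_tokenize("Hello, world 42!")  # → ["Hello", "world"]
```

With rune set to "number", the same input yields only the token "42".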
Splits out substrings that match the specified regular expression patterns as single tokens, and tokenizes the remaining text with the Unicode tokenizer.
Example:
{
    "name": "exception",
    "options": {
        "patterns": [
            "[hH][tT][tT][pP][sS]?://(\\S)*",
            "[fF][iI][lL][eE]://(\\S)*",
            "[fF][tT][pP]://(\\S)*",
            "\\S+@\\S+"
        ]
    }
}
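The exception mechanism can be sketched in a few lines of Python. This is a simplified illustration, not the actual implementation: the fallback here is a crude word regex, whereas the real tokenizer delegates the non-matching text to the Unicode tokenizer.

```python
import re

# Two of the patterns from the example above (URL and e-mail).
patterns = [
    r"[hH][tT][tT][pP][sS]?://\S*",
    r"\S+@\S+",
]

def exception_tokenize(text, patterns):
    """Text matching any pattern becomes a single token; the rest is
    word-tokenized (crudely, as a stand-in for the Unicode tokenizer)."""
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    tokens, pos = [], 0
    for m in combined.finditer(text):
        tokens.extend(re.findall(r"\w+", text[pos:m.start()]))
        tokens.append(m.group())  # the exception, kept whole
        pos = m.end()
    tokens.extend(re.findall(r"\w+", text[pos:]))
    return tokens

exception_tokenize("see https://example.com/x now", patterns)
# → ["see", "https://example.com/x", "now"]
```

Without the exception patterns, the URL would be broken apart at the punctuation characters.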
Uses Kagome, a morphological analyzer for Japanese, to split Japanese text into tokens.

dictionary
: The dictionary to use. You can set IPADIC or UniDIC.

stop_tags
: The Japanese parts of speech to be removed. Tokens with the specified parts of speech are not output.

base_forms
: Converts tokens of the specified Japanese parts of speech to their base form. For example, 美しく is converted to 美しい.
Example:
{
    "name": "kagome",
    "options": {
        "dictionary": "IPADIC",
        "stop_tags": [
            "接続詞",
            "助詞",
            "助詞-格助詞",
            "助詞-格助詞-一般",
            "助詞-格助詞-引用",
            "助詞-格助詞-連語",
            "助詞-接続助詞",
            "助詞-係助詞",
            "助詞-副助詞",
            "助詞-間投助詞",
            "助詞-並立助詞",
            "助詞-終助詞",
            "助詞-副助詞/並立助詞/終助詞",
            "助詞-連体化",
            "助詞-副詞化",
            "助詞-特殊",
            "助動詞",
            "記号",
            "記号-一般",
            "記号-読点",
            "記号-句点",
            "記号-空白",
            "記号-括弧開",
            "記号-括弧閉",
            "その他-間投",
            "フィラー",
            "非言語音"
        ],
        "base_forms": [
            "動詞",
            "形容詞",
            "形容動詞"
        ]
    }
}
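The effect of stop_tags can be illustrated with a small sketch. The token and part-of-speech pairs below are hypothetical stand-ins for Kagome's actual analysis output, which is much richer:

```python
# Hypothetical morphological analysis of 美しい花が咲く ("a beautiful
# flower blooms") as (surface, part-of-speech) pairs.
analyzed = [
    ("美しい", "形容詞"),          # adjective
    ("花", "名詞"),               # noun
    ("が", "助詞-格助詞-一般"),    # case particle
    ("咲く", "動詞"),             # verb
]

# A subset of the stop_tags from the example above.
stop_tags = {"助詞", "助詞-格助詞", "助詞-格助詞-一般"}

def filter_stop_tags(analyzed, stop_tags):
    # Tokens whose part of speech appears in stop_tags are dropped.
    return [surface for surface, pos in analyzed if pos not in stop_tags]

filter_stop_tags(analyzed, stop_tags)  # → ["美しい", "花", "咲く"]
```

The particle が is removed because its part of speech is listed in stop_tags; the content words survive as tokens.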
This is the same as specifying letter for the rune option of the Character tokenizer.
Example:
{
    "name": "letter"
}
Outputs strings that match the specified regular expression as tokens.
Example:
{
    "name": "regexp",
    "options": {
        "pattern": "[0-9a-zA-Z_]*"
    }
}
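The behavior amounts to collecting every match of the pattern. A minimal Python sketch, assuming a "+" quantifier (the "*" in the example above would also admit empty matches):

```python
import re

def regexp_tokenize(text, pattern=r"[0-9a-zA-Z_]+"):
    # Every non-overlapping match of the pattern becomes a token.
    return re.findall(pattern, text)

regexp_tokenize("foo-bar_baz 42")  # → ["foo", "bar_baz", "42"]
```

Note that the underscore is part of the character class, so bar_baz stays a single token while the hyphen splits foo from bar_baz.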
Outputs the entire text as a single token.
Example:
{
    "name": "single_token"
}
Outputs tokens based on Unicode character categories.
Example:
{
    "name": "unicode"
}
Extracts e-mail addresses, URLs, Twitter handles, and Twitter hashtags from web content, based on the Exception tokenizer, and outputs them as tokens.
Example:
{
    "name": "web"
}
Outputs tokens by splitting the text on whitespace.
Example:
{
    "name": "whitespace"
}
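Whitespace splitting behaves like Python's str.split with no arguments, shown here as an illustrative sketch rather than the actual implementation:

```python
def whitespace_tokenize(text):
    # str.split() with no arguments splits on any run of whitespace
    # and discards leading and trailing whitespace.
    return text.split()

whitespace_tokenize("  hello\tworld\nfoo ")  # → ["hello", "world", "foo"]
```

Punctuation is not treated specially, so "hello," would remain a single token including the comma.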