This plugin enables URL tokenization and token filtering by URL part.
This repository has moved to trunk-based development because we only support a single Elasticsearch version internally. Fixes will not be backported to earlier versions, and we do not guarantee a tag for each release.
Supported version: Elasticsearch 8.5.0
Tokenizer options:
part: Defaults to null. If left null, all URL parts will be tokenized, and some additional tokens (host:port and protocol://host) will be included. Can be either a string (single URL part) or an array of multiple URL parts. Options are whole, protocol, host, port, path, query, and ref.
url_decode: Defaults to false. If true, URL tokens will be URL decoded.
allow_malformed: Defaults to false. If true, malformed URLs will not be rejected, but will be passed through without being tokenized.
tokenize_malformed: Defaults to false. Has no effect if allow_malformed is false. If both are true, an attempt will be made to tokenize malformed URLs using regular expressions.
tokenize_host: Defaults to true. If true, the host will be further tokenized using a reverse path hierarchy tokenizer with the delimiter set to ".".
tokenize_host_no_tld: Defaults to false. If true, and used in conjunction with tokenize_host, the TLD of the tokenized host will be omitted.
tokenize_path: Defaults to true. If true, the path will be tokenized using a path hierarchy tokenizer with the delimiter set to "/".
tokenize_query: Defaults to true. If true, the query string will be split on "&".
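The three tokenize_* options are easier to picture with a small model. The following Python sketch is not the plugin's implementation; it only approximates the default behavior described above, and the function names host_tokens, path_tokens, and query_tokens are hypothetical. Edge cases such as trailing slashes or empty query parameters may be handled differently by the plugin.

```python
def host_tokens(host):
    """Approximate tokenize_host: reverse path hierarchy on '.'.

    Emits the full host, then each parent domain down to the last label.
    """
    labels = host.split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def path_tokens(path):
    """Approximate tokenize_path: path hierarchy on '/'.

    Emits each prefix of the path, shortest first.
    """
    segments = [s for s in path.split("/") if s]
    return ["/" + "/".join(segments[:i]) for i in range(1, len(segments) + 1)]


def query_tokens(query):
    """Approximate tokenize_query: split the query string on '&'."""
    return [param for param in query.split("&") if param]


if __name__ == "__main__":
    print(host_tokens("foo.bar.com"))   # ['foo.bar.com', 'bar.com', 'com']
    print(path_tokens("/a/b/c"))        # ['/a', '/a/b', '/a/b/c']
    print(query_tokens("a=1&b=2"))      # ['a=1', 'b=2']
```

The host output matches the tokens shown in the analysis response for the host example in this document.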
Index settings:
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "url_host": {
              "type": "url",
              "part": "host"
            }
          },
          "analyzer": {
            "url_host": {
              "tokenizer": "url_host"
            }
          }
        }
      }
    }
Make an analysis request:
    curl -H 'Content-Type: application/json' 'http://localhost:9200/index_name/_analyze?pretty' -d '{"analyzer": "url_host", "text": "https://foo.bar.com/baz.html"}'
    {
      "tokens" : [ {
        "token" : "foo.bar.com",
        "start_offset" : 8,
        "end_offset" : 19,
        "type" : "host",
        "position" : 1
      }, {
        "token" : "bar.com",
        "start_offset" : 12,
        "end_offset" : 19,
        "type" : "host",
        "position" : 2
      }, {
        "token" : "com",
        "start_offset" : 16,
        "end_offset" : 19,
        "type" : "host",
        "position" : 3
      } ]
    }
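The start_offset and end_offset values in this response are character positions in the original URL, so each token can be recovered by slicing the input. A small Python check, using only the values from the response above (the helper name offsets_match is hypothetical):

```python
url = "https://foo.bar.com/baz.html"

# Token text and offsets copied from the analysis response above.
tokens = [
    {"token": "foo.bar.com", "start_offset": 8, "end_offset": 19},
    {"token": "bar.com", "start_offset": 12, "end_offset": 19},
    {"token": "com", "start_offset": 16, "end_offset": 19},
]


def offsets_match(text, tokens):
    """Return True if every token equals text[start_offset:end_offset]."""
    return all(
        text[t["start_offset"]:t["end_offset"]] == t["token"] for t in tokens
    )


if __name__ == "__main__":
    print(offsets_match(url, tokens))  # True
```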
Token filter options:
part: Defaults to whole, which will cause the entire URL to be returned. In this case, the filter only serves to validate incoming URLs. Other possible values are protocol, host, port, path, query, and ref. Can be either a single URL part (string) or an array of URL parts.
url_decode: Defaults to false. If true, the desired portion of the URL will be URL decoded.
allow_malformed: Defaults to false. If true, documents containing malformed URLs will not be rejected, and an attempt will be made to parse the desired URL part from the malformed URL string. If the desired part cannot be found, no value will be indexed for that field.
passthrough: Defaults to false. If true, allow_malformed is implied, and any non-URL tokens will be passed through the filter. Valid URLs will be tokenized according to the filter's other settings.
tokenize_host: Defaults to true. If true, the host will be further tokenized using a reverse path hierarchy tokenizer with the delimiter set to ".".
tokenize_host_no_tld: Defaults to false. If true, and used in conjunction with tokenize_host, the TLD of the tokenized host will be omitted.
tokenize_path: Defaults to true. If true, the path will be tokenized using a path hierarchy tokenizer with the delimiter set to "/".
tokenize_query: Defaults to true. If true, the query string will be split on "&".
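To make the part names concrete, here is a rough Python sketch that maps them onto the fields of urllib.parse.urlsplit and applies url_decode. It is an illustration only: the plugin runs inside Elasticsearch and may parse URLs differently, and the helper name extract_part is hypothetical. In particular, this sketch returns None when the URL has no explicit port; the plugin's behavior in that case is not modeled here.

```python
from urllib.parse import urlsplit, unquote


def extract_part(url, part, url_decode=False):
    """Return the requested URL part, or None if it is absent.

    The part names follow the option values documented above; mapping
    them to urlsplit() fields is an assumption made for this sketch.
    """
    parsed = urlsplit(url)
    parts = {
        "whole": url,
        "protocol": parsed.scheme,
        "host": parsed.hostname,
        "port": None if parsed.port is None else str(parsed.port),
        "path": parsed.path,
        "query": parsed.query,
        "ref": parsed.fragment,
    }
    value = parts[part]
    if value and url_decode:
        value = unquote(value)  # e.g. "%20" becomes a space
    return value or None


if __name__ == "__main__":
    example = "https://foo.bar.com:9200/a%20b/baz.html?x=1&y=2#top"
    for name in ("protocol", "host", "port", "path", "query", "ref"):
        print(name, extract_part(example, name, url_decode=True))
```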
Set up your index like so:
    {
      "settings": {
        "analysis": {
          "filter": {
            "url_host": {
              "type": "url",
              "part": "host",
              "url_decode": true,
              "tokenize_host": false
            }
          },
          "analyzer": {
            "url_host": {
              "filter": ["url_host"],
              "tokenizer": "whitespace"
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "url": {
            "type": "text",
            "fields": {
              "host": {"type": "text", "analyzer": "url_host"}
            }
          }
        }
      }
    }
Make an analysis request:
    curl -H 'Content-Type: application/json' 'http://localhost:9200/index_name/_analyze?pretty' -d '{"analyzer": "url_host", "text": "https://foo.bar.com/baz.html"}'
    {
      "tokens" : [ {
        "token" : "foo.bar.com",
        "start_offset" : 0,
        "end_offset" : 32,
        "type" : "word",
        "position" : 1
      } ]
    }
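The same analysis request can be prepared from a script. Recent Elasticsearch versions expect the _analyze parameters as a JSON body with a Content-Type header, so this Python sketch builds such a request with the standard library. The helper name build_analyze_request is hypothetical, and the host, index, and analyzer names are the placeholder values used in this document. Sending it requires a running cluster with the plugin installed.

```python
import json
from urllib.request import Request


def build_analyze_request(base_url, index, analyzer, text):
    """Build (but do not send) a POST request for the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return Request(
        f"{base_url}/{index}/_analyze?pretty",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_analyze_request(
        "http://localhost:9200",
        "index_name",
        "url_host",
        "https://foo.bar.com/baz.html",
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To send it against a live cluster:
    #   from urllib.request import urlopen
    #   print(urlopen(req).read().decode("utf-8"))
```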