Decompound plugin for Elasticsearch


This is an implementation of a word decompounder plugin for Elasticsearch.

Not all languages form compounds by joining several words into one. Compounding is common in German, the Scandinavian languages, Finnish, and Korean.

This code is a reworked implementation of the Baseforms Tool found in the ASV toolbox of Chris Biemann, Automatische Sprachverarbeitung of Leipzig University.

Lucene comes with two compound word token filters, a dictionary-based and a hyphenation-based variant. Both share a disadvantage: they require loading a word list into memory before they run. This decompounder does not require word lists and can process German text out of the box. For efficient word segmentation, it uses prebuilt Compact Patricia Tries provided by the ASV toolbox.

Installation

Use Gradle to build the plugin, then install it with the elasticsearch-plugin command. Check the "gradle.properties" file for the supported version.

Issues

All feedback is welcome! If you find issues, please post them on GitHub.

Example

PUT /test
{
   "settings": {
       "index": {
           "analysis": {
               "filter": {
                   "decomp":{
                       "type" : "decompound"
                   }
               },
               "analyzer": {
                   "decomp": {
                       "type": "custom",
                       "tokenizer" : "standard",
                       "filter" : [
                           "decomp",
                            "unique",
                            "german_normalization",
                            "lowercase"
                       ]
                   }
               }
           }
       }
   },
   "mappings": {
      "docs" : {
            "properties": {
                "text" : {
                    "type" : "text",
                    "analyzer": "decomp"
                }
            }
      }
   }
}

GET /test/docs/_mapping

PUT /test/docs/1
{
    "text" : "Die Jahresfeier der Rechtsanwaltskanzleien auf dem Donaudampfschiff hat viel Ökosteuer gekostet"
}

POST /test/docs/_search?explain
{
    "query": {
        "match": {
           "text": "dampf schiff"
        }
    }
}

"Die Jahresfeier der Rechtsanwaltskanzleien auf dem Donaudampfschiff hat viel Ökosteuer gekostet" will be tokenized into "Die", "Die", "Jahresfeier", "Jahr", "feier", "der", "der", "Rechtsanwaltskanzleien", "Recht", "anwalt", "kanzlei", "auf", "auf", "dem", "dem", "Donaudampfschiff", "Donau", "dampf", "schiff", "hat", "hat", "viel", "viel", "Ökosteuer", "Ökosteuer", "gekostet", "gekosten"

It is recommended to add the Unique token filter to skip tokens that occur more than once.
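
To make the effect of deduplication concrete, here is a minimal Python sketch that takes the token stream quoted above and drops repeated tokens while keeping their first-occurrence order. It only imitates the idea behind the Unique token filter outside of Elasticsearch; the function name `unique_tokens` is made up for this illustration and is not part of the plugin.

```python
def unique_tokens(tokens):
    """Drop repeated tokens, keeping the order of first occurrence."""
    seen = set()
    result = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            result.append(token)
    return result


# Token stream quoted above for the example sentence.
TOKENS = [
    "Die", "Die", "Jahresfeier", "Jahr", "feier", "der", "der",
    "Rechtsanwaltskanzleien", "Recht", "anwalt", "kanzlei",
    "auf", "auf", "dem", "dem",
    "Donaudampfschiff", "Donau", "dampf", "schiff",
    "hat", "hat", "viel", "viel", "Ökosteuer", "Ökosteuer",
    "gekostet", "gekosten",
]

deduplicated = unique_tokens(TOKENS)
print(len(TOKENS), "tokens before,", len(deduplicated), "after deduplication")
```

The comparison is case-sensitive, so "Die" and "der" are both kept; only exact repeats such as the second "Die" are removed.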

The input "Ein schöner Tag in Köln im Café an der Straßenecke" will be tokenized into "Ein", "schoner", "Tag", "in", "Koln", "im", "Café", "an", "der", "Strassenecke".

Threshold

The decomposition algorithm uses a threshold to decide whether a word has been decomposed successfully. If the threshold is too low, words can silently disappear from the index. In that case, adjust the threshold until words no longer disappear.

The default threshold value is 0.51. You can modify it in the settings:
"index" : {
    "analysis" : {
        "filter" : {
            "decomp" : {
                "type" : "decompound",
                "threshold" : 0.51
            }
        }
    }
}

Subwords

Sometimes only the decomposed subwords should be indexed. For this, set the parameter "subwords_only" to true:

"index" : {
    "analysis" : {
        "filter" : {
            "decomp" : {
                "type" : "decompound",
                "subwords_only" : true
            }
        }
    }
}
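
If you generate index settings from a script, the filter definition can be assembled programmatically. The following Python sketch builds the same JSON fragment as above. The helper name `decompound_filter_settings` is hypothetical, the threshold value 0.6 is an arbitrary example, and combining "threshold" with "subwords_only" in one filter definition is an assumption that this README does not state explicitly.

```python
import json


def decompound_filter_settings(name="decomp", threshold=None, subwords_only=None):
    """Build the index settings fragment for one decompound filter.

    Parameters left as None are omitted, so the plugin defaults apply.
    """
    filter_def = {"type": "decompound"}
    if threshold is not None:
        filter_def["threshold"] = threshold
    if subwords_only is not None:
        filter_def["subwords_only"] = subwords_only
    return {"index": {"analysis": {"filter": {name: filter_def}}}}


# Example: a stricter threshold (arbitrary value) that indexes subwords only.
settings = decompound_filter_settings(threshold=0.6, subwords_only=True)
print(json.dumps(settings, indent=4))
```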

Caching

The decompound computation can drastically increase your overall indexing time when it is applied to billions of tokens. You can configure the size (in number of entries) of the cache that maps a token to an array of decompound tokens. When the cache reaches its size limit, it is cleared and starts anew. The cache and this setting apply per node, so configure it in the elasticsearch.yml file:

# default: 8388608 entries
# minimum: 131072 entries
# decompound_max_cache_size: 8388608
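
The clear-on-full behavior described above can be illustrated with a small Python sketch. It is not the plugin's Java implementation; the class name `ClearOnFullCache`, the stand-in `fake_decompound` function, and the exact moment at which the cache is cleared (just before an insert that would exceed the limit) are assumptions made for this illustration.

```python
class ClearOnFullCache:
    """Token -> decompound list cache that is emptied once it is full."""

    def __init__(self, max_size, compute):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self.compute = compute  # function: token -> list of decompound tokens
        self._entries = {}

    def get(self, token):
        if token in self._entries:
            return self._entries[token]
        if len(self._entries) >= self.max_size:
            # Limit reached: drop everything and start anew.
            self._entries.clear()
        value = self.compute(token)
        self._entries[token] = value
        return value

    def __len__(self):
        return len(self._entries)


def fake_decompound(token):
    """Stand-in for the real decompounder, for demonstration only."""
    known = {"Donaudampfschiff": ["Donau", "dampf", "schiff"]}
    return known.get(token, [token])


cache = ClearOnFullCache(max_size=2, compute=fake_decompound)
print(cache.get("Donaudampfschiff"))  # computed, then cached
print(cache.get("Donaudampfschiff"))  # served from the cache
```

Clearing the whole cache is cheaper to implement than per-entry eviction, at the cost of recomputing frequent tokens after each reset.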

Exact phrase matches

Decompounding can lead to undesired results for phrase queries. After indexing, decompound tokens are indistinguishable from their original token. The phrase query "Deutsche Bank" could therefore match "Deutsche Spielbankgesellschaft", which is clearly an unexpected result. To enable "exact" phrase queries, each decompound token is tagged with additional payload data. To evaluate this payload data, use the newly introduced "exact_phrase" query as a wrapper around a query tree containing your phrase queries.

{
  "query": {
    "exact_phrase": {
      "query": {
        "query_string": {
          "query": "\"deutsche Bank\"",
          "fields": [
            "message"
          ]
        }
      }
    }
  }
}
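
The following Python sketch illustrates why the plain phrase query over-matches and how a per-token marker avoids it. It is a toy model, not the plugin's implementation: it assumes that subwords are indexed at the same position as their original token, it replaces the payload data with a simple boolean `is_subword` flag, and the split of "Spielbankgesellschaft" is supplied by hand. All names in it are hypothetical.

```python
from collections import namedtuple

# is_subword stands in for the payload data attached to decompound tokens.
Token = namedtuple("Token", "term position is_subword")


def index_tokens(words, decompounds):
    """Emit each word plus its subwords, all at the word's position."""
    tokens = []
    for position, word in enumerate(words):
        tokens.append(Token(word.lower(), position, False))
        for sub in decompounds.get(word, []):
            tokens.append(Token(sub.lower(), position, True))
    return tokens


def phrase_matches(tokens, phrase, exact=False):
    """Check whether the phrase terms occur at consecutive positions.

    With exact=True, tokens flagged as subwords are ignored.
    """
    terms = [term.lower() for term in phrase.split()]
    if not terms:
        return False
    positions = {}
    for token in tokens:
        if exact and token.is_subword:
            continue
        positions.setdefault(token.term, set()).add(token.position)
    return any(
        all(start + offset in positions.get(term, set())
            for offset, term in enumerate(terms))
        for start in positions.get(terms[0], set())
    )


doc = index_tokens(
    ["Deutsche", "Spielbankgesellschaft"],
    {"Spielbankgesellschaft": ["Spiel", "Bank", "Gesellschaft"]},  # hand-made split
)
print(phrase_matches(doc, "Deutsche Bank"))              # plain phrase query
print(phrase_matches(doc, "Deutsche Bank", exact=True))  # subwords ignored
```

In this model the plain query matches because the subword "bank" sits directly after "deutsche", while the exact variant rejects the document because it only considers original tokens.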

References

The Compact Patricia Trie data structure is described in

  • Morrison, D.: Patricia - practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15(4):514–534, 1968

The compound splitter used for generating features for document classification is described in

  • Witschel, F., Biemann, C.: Rigorous dimensionality reduction through linguistically motivated feature selection for text categorization. Proceedings of NODALIDA 2005, Joensuu, Finland

The base form reduction step (for Norwegian) is described in

  • Eiken, U.C., Liseth, A.T., Richter, M., Witschel, F. and Biemann, C.: Ord i Dag: Mining Norwegian Daily Newswire. Proceedings of FinTAL 2006, Turku, Finland

License

Decompounder Analysis Plugin for Elasticsearch

Copyright © 2012 Jörg Prante

Copyright © 2005 Abteilung Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Credits

GBI-Genios Deutsche Wirtschaftsdatenbank GmbH added the caching functionality and the "Exact phrase matches" feature.
