
Elasticsearch migration July 2019 codesprint

François Prunayre edited this page Aug 9, 2019 · 40 revisions

Where?

Address:

  • Monday: Olivier, Florent at Camptocamp / Jose, Francois at the farm
  • Tuesday/Wednesday: at the farm, 321, Route de la Mollière, Saint-Pierre-de-Genebroz, Chambéry, Savoie, Auvergne-Rhône-Alpes, France
  • Thursday: All at Camptocamp
  • Friday: Olivier, Florent at Camptocamp / Jose, Francois at the farm

Map: https://www.openstreetmap.org/search?query=321%20route%20de%20la%20molli%C3%A8re%2073360#map=19/45.46277/5.75553

Who?

  • Florent
  • Olivier
  • Jose
  • Francois
  • Pierre ?
  • Michel ?

Sponsors

  • EEA

Agenda

PR: https://github.com/geonetwork/core-geonetwork/pull/2830

Summary

During this sprint, we explored the benefits of using Elasticsearch as the search engine for GeoNetwork. Moving to Elasticsearch will help in two ways:

  • Better and more efficient/flexible searches
  • Scalability

We focused on 3 axes:

  • User interface, with a focus on EEA requirements, e.g. GEMET tree hierarchy classification, negative queries such as "not obsolete", better suggestions
  • CSW, which is one of the main search services we need in order to make GeoNetwork 4 usable in a production context
  • Wiring up the whole application, i.e. making the whole client application work again

The next section describes some of the significant search experience improvements that Elasticsearch can bring to the end user.

Some highlights

Facets

Facets can now be configured in the client application from the admin page (using JSON based on the Elasticsearch API). Users can configure the (ordered) list of fields, sorting, size, ...

This requires knowledge of the fields to be used but makes the list of facets easier to configure and customize.

GeoNetwork 3 already provides basic support for facet hierarchies when using keywords from a hierarchical thesaurus like GEMET. This functionality has been reimplemented with Elasticsearch and is now much more flexible.

Facet hierarchies are now supported using 2 approaches:

  • the sub-aggregation concept in Elasticsearch, which allows nested aggregations, e.g. resource type > format

  • a path hierarchy using a separator, e.g. on the GEMET thesaurus
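As a rough illustration of the first approach, the sketch below builds a nested terms aggregation request body. The field names `resourceType` and `format` and the helper `nested_terms_agg` are assumptions made for the example, not necessarily the names used in the GeoNetwork index.

```python
import json


def nested_terms_agg(parent_field, child_field, size=10):
    """Build a terms aggregation on parent_field with a sub-aggregation on child_field."""
    return {
        "size": 0,  # only aggregation buckets are needed, no search hits
        "aggs": {
            parent_field: {
                "terms": {"field": parent_field, "size": size},
                # each parent bucket gets its own breakdown by child_field
                "aggs": {
                    child_field: {"terms": {"field": child_field, "size": size}}
                },
            }
        },
    }


# Hypothetical field names, for illustration only
body = nested_terms_agg("resourceType", "format")
print(json.dumps(body, indent=2))
```

Each bucket of the parent aggregation then carries its own list of child buckets, which is what the client can render as a two-level tree.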

The path hierarchy mode can also be applied at indexing time to non-thesaurus elements in order to build a classification path, e.g. {resourceType}/{format|serviceType}.
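A minimal sketch of this idea, assuming a "/" separator and paths that do not start with the separator: `build_path` assembles the classification path at indexing time, and `path_tokens` mimics the prefixes that a path hierarchy tokenizer with default settings would emit for it. Both helpers and the sample values are hypothetical.

```python
def build_path(resource_type, detail=None, separator="/"):
    """Join the non-empty parts into a classification path."""
    return separator.join(part for part in (resource_type, detail) if part)


def path_tokens(path, separator="/"):
    """Return every prefix of the path, from the root to the full path."""
    parts = path.split(separator)
    return [separator.join(parts[: i + 1]) for i in range(len(parts))]


path = build_path("dataset", "GeoTIFF")
print(path)               # dataset/GeoTIFF
print(path_tokens(path))  # ['dataset', 'dataset/GeoTIFF']
```

Because every prefix is indexed as its own term, a terms aggregation on such a field returns counts for each level of the tree.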

Multiple facet choices can now be selected to make an OR query:

Selected choices are highlighted in green.

Users can also negate a query, e.g. not a service.

Negated choices are shown in red. Note that the permalink is not able to restore a negative query; this is something to improve.
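One way to express selected and negated facet choices is a bool query, sketched below. The helper `facet_query` and the field name `resourceType` are assumptions for illustration; the query actually built by the client application may differ.

```python
def facet_query(selected, negated):
    """Build a bool query from facet choices.

    selected / negated: dict mapping a field name to a list of values.
    """
    return {
        "query": {
            "bool": {
                # a terms query matches documents having ANY of the values (OR)
                "filter": [
                    {"terms": {field: values}} for field, values in selected.items()
                ],
                # documents matching any of these clauses are excluded
                "must_not": [
                    {"terms": {field: values}} for field, values in negated.items()
                ],
            }
        }
    }


# "dataset OR series", and separately "not a service"
or_query = facet_query({"resourceType": ["dataset", "series"]}, {})
not_query = facet_query({}, {"resourceType": ["service"]})
```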

The Elasticsearch API also provides paging in facet values. With this, users can navigate through all values:
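The page does not say which mechanism is used for paging. As one possible sketch, a composite aggregation returns buckets page by page, and the `after_key` of one response is passed as `after` in the next request. The helper `facet_page` and the field name `tag` are hypothetical.

```python
def facet_page(field, page_size=10, after_key=None):
    """Build a composite aggregation returning one page of facet values."""
    composite = {
        "size": page_size,
        "sources": [{field: {"terms": {"field": field}}}],
    }
    if after_key is not None:
        # resume after the last bucket of the previous page
        composite["after"] = after_key
    return {"size": 0, "aggs": {"facet_page": {"composite": composite}}}


first_page = facet_page("tag")
# after_key would normally be copied from the previous response
next_page = facet_page("tag", after_key={"tag": "hydrology"})
```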

Different types of facets are provided by the Elasticsearch API. We mainly use term facets in GeoNetwork, but other types (also used in Kibana) can be useful for better analysis. For example, histogram aggregation support makes it possible to build small charts for selecting a date range:
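A sketch of the two requests such a chart could rely on: a date histogram aggregation to draw the bars, and a range filter once the user picks a period. The field name `createDate` and both helpers are assumptions; `calendar_interval` is the parameter name in recent Elasticsearch 7.x versions, while older versions use `interval`.

```python
def date_histogram_agg(field, interval="year"):
    """Count records per calendar interval on a date field."""
    return {
        "size": 0,
        "aggs": {
            "dates": {
                "date_histogram": {
                    "field": field,
                    "calendar_interval": interval,
                    "min_doc_count": 1,  # skip empty buckets
                }
            }
        },
    }


def date_range_filter(field, start, end):
    """Filter clause for the period selected on the chart (end excluded)."""
    return {"range": {field: {"gte": start, "lt": end}}}


chart_request = date_histogram_agg("createDate")
selection = date_range_filter("createDate", "2018-01-01", "2019-01-01")
```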

Search suggestions, did you mean? …

GeoNetwork 3 provides basic autocompletion with some bugs and limitations (e.g. it does not work on phrase suggestions and can show tokens of private records). Various Elasticsearch queries can be used to configure autocompletion, where you define:

  • Which fields to search on / and how
  • Which fields to suggest

After several tests, we opted for a multi_match query on anytext returning the title:

  • Supports partial word match
  • Supports phrase queries

The autocompletion query can also be configured from the admin console:

By default, a multi_match on anytext and its associated ngram fields is configured in order to propose record titles based on partial word matches.
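A minimal sketch of such an autocompletion request, assuming the ngram variant is exposed as a sub-field named `anytext.ngram` and the title is stored in a field named `title`; both names and the helper `suggest_query` are illustrative and may not match the actual index.

```python
def suggest_query(text, size=10):
    """Search anytext and its ngram sub-field, returning only record titles."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                # hypothetical field names: full text plus its ngram variant
                "fields": ["anytext", "anytext.ngram"],
            }
        },
        "_source": ["title"],  # only the suggested value is returned
        "size": size,
    }


request = suggest_query("hydro")
```

Restricting `_source` to the title keeps the response small, since the suggestion list needs nothing else from the record.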

Scoring

By default, Lucene and Elasticsearch provide a scoring mechanism based on term frequency. The scoring can be customized with custom functions. Some were tested and are hardcoded for now (see https://github.com/geonetwork/core-geonetwork/commit/ff184e9402dfb181266074e89fe9f18dc6229ac9#diff-e5f71169531dd443a7dead183fbc8e52R42), e.g.

  • Promote grid instead of vector (a somewhat artificial example)
  • Score down old records (e.g. older than 200 days)
  • Promote records with a good rating
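The three rules above could be expressed with a function_score query along these lines. This is a sketch, not the hardcoded implementation referenced in the commit: the field names (`spatialRepresentationType`, `changeDate`, `rating`), the weights and the helper `scored_query` are all assumptions.

```python
def scored_query(base_query):
    """Wrap a query with scoring functions; all values are illustrative."""
    return {
        "query": {
            "function_score": {
                "query": base_query,
                "functions": [
                    # promote grid records
                    {
                        "filter": {"term": {"spatialRepresentationType": "grid"}},
                        "weight": 2,
                    },
                    # score down records not changed in the last 200 days
                    {
                        "filter": {"range": {"changeDate": {"lt": "now-200d"}}},
                        "weight": 0.5,
                    },
                    # promote well-rated records: ln(2 + rating), so that
                    # unrated records are not multiplied by zero
                    {
                        "field_value_factor": {
                            "field": "rating",
                            "modifier": "ln2p",
                            "missing": 0,
                        }
                    },
                ],
                "score_mode": "multiply",  # combine the function values
                "boost_mode": "multiply",  # then apply them to the query score
            }
        }
    }


request = scored_query({"match": {"anytext": "water"}})
```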

Similar records

Based on a similarity algorithm, Elasticsearch can provide records similar to the one the user is currently viewing. The similarity can be computed on specific fields. Similar records are presented in the record view and make it easy to navigate between records on the same topics:
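The page does not name the query used. One candidate in Elasticsearch is the more_like_this query, sketched here under that assumption; the index name, record identifier, field list and the helper `similar_records_query` are hypothetical.

```python
def similar_records_query(index, record_id, fields, size=5):
    """Find records similar to an indexed record, compared on the given fields."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": fields,
                # reference the record currently being viewed
                "like": [{"_index": index, "_id": record_id}],
                # low thresholds, since metadata fields are short texts
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
    }


request = similar_records_query("records", "abc-123", ["title", "abstract"])
```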

Search conclusion

We now have a working search interface with a full text search field and much more flexibility in the way we manage facets. The advanced form has been completely removed for now. The search service is also decorated with a privileges security filter and a portal filter.

Some bugs still need to be fixed, but a good minimum of functionality is already covered and usable.

We also managed to improve the performance of the search itself:

  • Only the fields required by the client application are returned, and this can be configured on a per-search basis, e.g. only the title is returned in the response if only a list of record titles is needed. GeoNetwork 3 always returned the same search response.
  • The Elasticsearch service is faster than the Lucene one we have

We can still do more on the service looking for related records, which slows down the user interface in various places.

CSW

The CSW service is now operational and provides better support for OR/AND and nested combinations of filters than what we used to support in GeoNetwork 3. Spatial operators are also working.
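To illustrate why nested AND/OR combinations map naturally onto Elasticsearch, here is a toy translation from a filter tree to a bool query. The tuple-based tree format and the function `to_es` are invented for this sketch and are not the GeoNetwork CSW implementation.

```python
def to_es(node):
    """Recursively translate a toy filter tree into an Elasticsearch query.

    Node shapes (hypothetical): ("eq", field, value), ("and", [nodes]),
    ("or", [nodes]), ("not", node).
    """
    op = node[0]
    if op == "eq":
        _, field, value = node
        return {"term": {field: value}}
    if op == "and":
        return {"bool": {"must": [to_es(child) for child in node[1]]}}
    if op == "or":
        return {
            "bool": {
                "should": [to_es(child) for child in node[1]],
                "minimum_should_match": 1,
            }
        }
    if op == "not":
        return {"bool": {"must_not": [to_es(node[1])]}}
    raise ValueError(f"Unsupported operator: {op}")


# type = dataset AND (format = GeoTIFF OR format = NetCDF)
tree = ("and", [
    ("eq", "type", "dataset"),
    ("or", [("eq", "format", "GeoTIFF"), ("eq", "format", "NetCDF")]),
])
query = to_es(tree)
```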

Core application

The client application functionalities have been restored, and the application is now usable for a good part of the 3 modules: search/edit/admin.

Among others, we restored:

  • Selection mechanism
  • ZIP export
  • PDF export
  • CSV export
  • Editing, including linking records together

We investigated the possibility of updating only one field in the index in order to improve performance. Currently, GeoNetwork 3 reindexes the full document when the rating changes, for example. With more recent versions of Lucene, only the rating field of a document in the index can be updated. This needs more work and could be applied to different cases, e.g. rating, privileges, popularity, status, category changes.
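On the Elasticsearch side, one way to send only the changed field is the update API with a partial document. The sketch below only builds the request, assuming the Elasticsearch 7 URL layout; the index name, record identifier and the helper `partial_update_request` are hypothetical.

```python
def partial_update_request(index, record_id, changes):
    """Describe an update request carrying only the changed fields."""
    return {
        "method": "POST",
        "path": f"/{index}/_update/{record_id}",
        # "doc" is merged into the existing document by Elasticsearch
        "body": {"doc": changes},
    }


request = partial_update_request("records", "abc-123", {"rating": 4})
print(request["path"])  # /records/_update/abc-123
```

The application then no longer needs to rebuild the whole record in order to change a rating, a privilege or a status.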

Conclusion

This codesprint was an opportunity to make significant improvements to the search application and the CSW service.

The next phase is probably to:

  • Make all the core functionalities of the application work
    • Editing
    • Subtemplates
    • Multilingual support
  • Harvesting
  • Testing
    • Restore the unit test build
    • Integration test with a running Elasticsearch instance
  • Packaging
    • Docker setup
    • Installer build
  • Documentation

A demo server is available at:

Also, not really related to this task, we have been discussing how to improve the performance of the client applications, and a couple of ideas would need some support/funding/attention:

  • Bootstrap the map application only once requested (and not on startup)
  • Decrease the number of watchers. This could make the Angular application faster.

Annexes

Facet / Support tree

  • Index a field with a separator, e.g. "Hydrologie/Salinité" (could be resourceType/(serviceType|spatialRepType)/(format))
  • Define the field in the index
"settings": {
    "analysis": {
      "analyzer": {
        "pathAnalyzer": {
          "tokenizer": "pathTokenizer"
        }
      },
      "tokenizer": {
        "pathTokenizer": {
          "type": "path_hierarchy",
          "delimiter": "/",
          "replacement": "/",
          "skip": 0,
          "reverse": false
        }
      }
    }
  },
...
  "mappings": {
    "dynamic_templates": [
      {
        "stringPathType": {
          "match": "ft_*_s_tree",
          "mapping": {
            "type": "text",
            "fielddata": true,
            "analyzer": "pathAnalyzer",
            "search_analyzer": "keyword"
          }
        }
      },

Facet / Support negative switch

Eg.
