Search Insights

René Reitmann edited this page Jun 16, 2021 · 6 revisions

Elasticsearch, the search engine that powers our metadata search, distinguishes between finding the documents that match a given query and sorting them by how well they match (also known as scoring).

This document describes which fields of our domain model are used for matching and which fields are used for scoring.

For each domain object, all fields (German and English) plus some fields from related domain objects (see Sub Documents) are used for matching. In our search, every word the user types has to match "exactly" (case-insensitively), but the words do not have to appear in the same order.
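The matching rule can be illustrated with a minimal Python sketch. It is not the Elasticsearch configuration we actually use: the helper names `tokenize` and `matches` are hypothetical, and splitting text on word characters is an assumption made only for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of whole words (assumed analyzer)."""
    return set(re.findall(r"\w+", text.lower()))


def matches(query, field_values):
    """Return True if every query word occurs as a whole word in any field value."""
    document_words = set()
    for value in field_values:
        document_words |= tokenize(value)
    # Word order is irrelevant: each word only has to be present somewhere.
    return all(word in document_words for word in tokenize(query))


# Hypothetical field values of one document.
fields = ["DZHW", "Absolventen Querschnitt"]
in_any_order = matches("querschnitt dzhw", fields)
partial_word = matches("absolvent", fields)
```

In this sketch `in_any_order` is true because both words are present regardless of case and order, while `partial_word` is false because 'absolvent' is only a prefix of a word in the document.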

Sub Documents

The following list describes all fields of all sub documents. These fields are used for matching in addition to the fields of the domain object itself. For example, when you search for variables and type 'dzhw', many variables will match because the institution of most Data Package Sub Documents is 'DZHW', and that value is copied into each Variable of the data package.

  • Data Package Sub Document:
    • id
    • dataAcquisitionProjectId
    • institution
    • sponsor
    • surveySeries
    • title
    • projectContributors
    • doi
    • surveyDesign
  • Survey Sub Document:
    • id
    • dataAcquisitionProjectId
    • number
    • population (only used for search of data packages, no other objects)
    • surveyMethod
    • title
    • fieldPeriod
    • sample
    • serialNumber
    • dataType
  • Instrument Sub Document:
    • id
    • dataAcquisitionProjectId
    • title
    • subtitle
    • description
    • number
    • surveyIds
  • Question Sub Document:
    • id
    • dataAcquisitionProjectId
    • instrumentId
    • instrumentNumber
    • number
    • questionText
    • topic
  • Data Set Sub Document:
    • id
    • dataAcquisitionProjectId
    • description
    • type
    • format
    • number
    • subDataSets
    • maxNumberOfObservations
    • accessWays
  • Variable Sub Document:
    • id
    • dataAcquisitionProjectId
    • name
    • label
    • dataSetId
    • dataSetNumber
  • Related Publication Sub Document:
    • id
    • doi
    • title
    • authors
    • language
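The effect of copying sub document fields into a domain object can be sketched as follows. This is an illustration only: the function name `matchable_values`, the flat dictionary layout and the example data are assumptions, not the real index structure.

```python
def matchable_values(domain_object, sub_documents):
    """Collect the values used for matching: own fields plus sub document fields."""
    values = [str(value) for value in domain_object.values()]
    for sub_document in sub_documents:
        values.extend(str(value) for value in sub_document.values())
    return values


# Hypothetical example data, not taken from the real index.
variable = {"id": "var-1", "name": "gender", "label": "Geschlecht"}
data_package_sub_document = {
    "id": "stu-1",
    "institution": "DZHW",
    "title": "Absolventenpanel",
}

values = matchable_values(variable, [data_package_sub_document])
# The variable itself never mentions 'dzhw', but the copied institution does.
found = any(value.lower() == "dzhw" for value in values)
```

Here `found` is true even though none of the variable's own fields contain 'dzhw', which mirrors the example given in the introduction of this section.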

Scoring

So far we have described how we find matches for the words the user typed into our search box. This section describes how we decide which matches are more relevant than others. Put simply, a match receives a reward when it contains a word the user typed in a specific field. For example, if the user searches for data packages and types 'absolventen', then all data packages having 'absolventen' in the title receive a reward (one point in this case).

We have three levels of rewards:

  1. Super (10 Points): For fields such as dataPackage.surveyDesign (e.g. a search for 'dzhw querschnitt'), whose matches have to outrank other matches.
  2. Major (1 Point): For all fields used in the top section of our search result cards. These fields should be relevant to identify a domain object.
  3. Minor (0.1 Point): For all fields used in the lower section of our search result cards. These fields usually give further explanations.

Because we can find matches for English words even when the user is viewing the GUI in German, and vice versa, we have three additional levels for the language that is currently not used in the GUI: Super (0.01 Point), Major (0.001 Point) and Minor (0.0001 Point).

To summarize: for each word the user typed that is present in one of the fields described below, the match receives a reward.
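The reward levels can be turned into a small Python sketch. It assumes that rewards are simply added up per word, field and language, which the text above does not state explicitly; the function names, the document layout and the example data are hypothetical, and the real scoring is done by Elasticsearch.

```python
import re

# Reward levels from the text: GUI language versus the other language.
GUI_LANGUAGE_REWARDS = {"super": 10.0, "major": 1.0, "minor": 0.1}
OTHER_LANGUAGE_REWARDS = {"super": 0.01, "major": 0.001, "minor": 0.0001}


def tokenize(text):
    """Lowercase the text and split it into a set of whole words (assumed analyzer)."""
    return set(re.findall(r"\w+", text.lower()))


def score(query, document, field_levels, gui_language):
    """Sum one reward per query word found in a scored field (assumed to be additive)."""
    total = 0.0
    for word in tokenize(query):
        for field, level in field_levels.items():
            for language, text in document.get(field, {}).items():
                if word in tokenize(text):
                    if language == gui_language:
                        total += GUI_LANGUAGE_REWARDS[level]
                    else:
                        total += OTHER_LANGUAGE_REWARDS[level]
    return total


# Subset of the Study Search levels; the two documents are made-up examples.
levels = {"title": "major", "surveyDesign": "super", "description": "minor"}
panel = {
    "title": {"de": "Absolventen Panel", "en": "Graduate Panel"},
    "surveyDesign": {"de": "Panel", "en": "Panel"},
}
cross_section = {
    "title": {"de": "Absolventen Befragung", "en": "Graduate Survey"},
    "surveyDesign": {"de": "Querschnitt", "en": "Cross-Section"},
}

query = "absolventen querschnitt"
ranked = sorted(
    [("panel", panel), ("cross_section", cross_section)],
    key=lambda hit: score(query, hit[1], levels, "de"),
    reverse=True,
)
```

With the GUI in German, both documents receive the major reward for 'absolventen' in the title, but only `cross_section` also receives the super reward for 'querschnitt' in surveyDesign, so it is ranked first. With the GUI in English the same German words still score, but only with the much smaller rewards of the other language.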

Fields and Rewards for scoring

  • Study Search
    • title (major)
    • surveyDesign (super)
    • surveyDataType (super)
    • id (major)
    • projectContributors (major)
    • description (minor)
  • Survey Search
    • title (major)
    • dataType (major)
    • surveyMethod (major)
    • id (major)
    • population (minor)
    • sample (minor)
  • Instrument Search
    • description (major)
    • id (major)
    • type (major)
    • title (minor)
  • Question Search
    • instrument.description (major)
    • id (major)
    • number (super)
    • type (major)
    • questionText (minor)
  • Data Set Search
    • description (major)
    • id (major)
    • type (major)
    • surveys.title (minor)
    • subDataSets.accessWays (minor)
  • Variable Search
    • label (major)
    • name (major)
    • id (major)
    • dataType (major)
    • scaleLevel (major)
    • surveys.title (minor)
  • Related Publication Search
    • title (major)
    • authors (major)
    • id (major)
    • year (super)
    • sourceReference (minor)