Vespa powered search of the CORD-19 dataset

We have indexed the COVID-19 Open Research Dataset (CORD-19) in a Vespa instance running in Vespa Cloud. There is a simple search frontend at https://cord19.vespa.ai/ and API access at https://api.cord19.vespa.ai/search/?query=covid-19&summary=short
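
As a minimal sketch of how the API URL above is put together, the snippet below builds it from its query parameters. The helper name `build_search_url` is hypothetical; only the endpoint and the `query` and `summary` parameters are taken from the example URL.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL above.
API_ENDPOINT = "https://api.cord19.vespa.ai/search/"


def build_search_url(query, summary="short"):
    """Return a GET URL for the search API (hypothetical helper)."""
    params = {"query": query, "summary": summary}
    # urlencode escapes spaces and other reserved characters in the values.
    return API_ENDPOINT + "?" + urlencode(params)


url = build_search_url("covid-19")
print(url)
```

Assuming the service is reachable, the resulting URL can be fetched with any HTTP client, for example `requests.get(url).json()`.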

Front end query language

The https://cord19.vespa.ai/ query interface supports the Vespa simple query language:

  • Use +query_term to specify that the result must include the term and -query_term for must not.
  • "viral transmissions" searches for the phrase
  • Example: +"viral transmission" covid-19 -1918 means the result must have the phrase "viral transmission", should have 'covid-19', and must not include the term '1918'.
  • To search specific fields, use fieldname:query_term, e.g. +title:"reproduction number".
  • Use () for OR: +("SARS-COV-2" "coronavirus 2" "novel coronavirus") specifies that documents must match any of these three phrases.
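
The rules above can be sketched as a small helper that assembles a query string from required, optional and excluded terms. This is only an illustration: `build_simple_query` and its parameter names are hypothetical and not part of Vespa or of this service.

```python
def _quote(term):
    """Wrap multi-word terms in double quotes so they are searched as a phrase."""
    return f'"{term}"' if " " in term else term


def build_simple_query(must=(), should=(), must_not=()):
    """Compose a simple-query-language string (hypothetical helper)."""
    parts = [f"+{_quote(t)}" for t in must]        # +term: must include
    parts += [_quote(t) for t in should]           # bare term: should include
    parts += [f"-{_quote(t)}" for t in must_not]   # -term: must not include
    return " ".join(parts)


query = build_simple_query(
    must=["viral transmission"], should=["covid-19"], must_not=["1918"]
)
print(query)  # +"viral transmission" covid-19 -1918
```

The output reproduces the example from the list, so it can be pasted into the frontend search box or sent as the query parameter of an API request.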

Query examples:

Similar articles

See Similar articles for how the similar-articles feature works.

API Access

To use the Vespa Search API, see the API documentation and the YQL Query Language reference. For the full document definition, see doc.sd.

High level field description

These are the most important fields in the dataset:

| field | source in CORD-19 | indexed/searchable | summary (returned with hit) | available for grouping | matching | Vespa type |
| --- | --- | --- | --- | --- | --- | --- |
| default | title + abstract | yes | no | no | tokenized and stemmed (match:text) | fieldset |
| all | title + abstract + body_text | yes | no | no | tokenized and stemmed (match:text) | fieldset |
| title | title from metadata or from contents of sha json file | yes | yes with bolding | no | tokenized and stemmed (match:text) | string |
| abstract | abstract | yes | yes with bolding and dynamic summary | no | tokenized and stemmed (match:text) | string |
| body_text | All body_text sections | yes | yes with bolding and dynamic summary | no | tokenized and stemmed (match:text) | string |
| datestring | datestring from metadata | no | yes | yes | no | string |
| timestamp | Epoch Unix time stamp parsed from datestring | yes | yes | yes | range and exact matching - can also be sorted on | long |
| license | license | yes | yes | yes | exact matching | string |
| journal | journal | yes | yes | yes | exact matching | string |
| has_full_text | has_full_text | yes | yes | yes | exact matching | bool |
| doi | https:// + doi from metadata | no | yes | no | no | string |
| id | row id from metadata.csv | yes | yes | yes | yes | int |
| title_embedding | SciBERT-NLI embedding from title | yes (using nearestNeighbor()) | no | no | yes | tensor(x[768]) |
| abstract_embedding | SciBERT-NLI embedding from abstract | yes (using nearestNeighbor()) | no | no | yes | tensor(x[768]) |
| authors | authors in metadata or authors from sha json if found | yes using sameElement() | yes | yes | yes | array of struct |
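
Because timestamp holds epoch time and supports range matching, a date filter needs the date converted to a Unix timestamp first. The sketch below shows one way to do that. It assumes the field stores seconds and that dates are interpreted as UTC, neither of which is stated in the table; `to_epoch_seconds` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_epoch_seconds(year, month, day):
    """Return the Unix timestamp (seconds) for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


cutoff = to_epoch_seconds(2020, 1, 1)
print(cutoff)  # 1577836800

# Assumed usage: restrict results to documents dated after the cutoff.
yql = f"select id, title from sources * where userQuery() and timestamp > {cutoff};"
print(yql)
```

The value for 2020-01-01 matches the constant used in the timestamp filter among the API examples in this document.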

Ranking

See Vespa's Ranking documentation. There are four ranking profiles available:

| Ranking | Description |
| --- | --- |
| default | The default Vespa ranking function (nativeRank), which also uses term proximity for multi-term queries |
| bm25 | Linear sum: `bm25(title) + bm25(abstract) + bm25(body_text)` |
| bm25fw | Linear weighted sum: `0.6*bm25(title) + 0.3*bm25(abstract) + 0.1*bm25(body_text)` |
| freshness | By decreasing timestamp |

See Vespa BM25 and Vespa nativeRank

The ranking profiles are defined in the document definition (doc.sd).
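
To illustrate how the two BM25 profiles differ, the sketch below combines per-field scores with and without the 0.6, 0.3 and 0.1 weights listed above. It is plain arithmetic, not the rank expression from doc.sd, and the input scores are made-up values chosen only for illustration.

```python
def bm25_sum(title, abstract, body_text):
    """Unweighted linear sum, as described for the bm25 profile."""
    return title + abstract + body_text


def bm25fw_sum(title, abstract, body_text):
    """Weighted linear sum, as described for the bm25fw profile."""
    return 0.6 * title + 0.3 * abstract + 0.1 * body_text


# Hypothetical per-field BM25 scores for a single document.
scores = {"title": 10.0, "abstract": 5.0, "body_text": 20.0}

print(bm25_sum(**scores))              # 35.0
print(round(bm25fw_sum(**scores), 2))  # 9.5
```

With the weighted profile, a strong body_text match contributes far less than a title match of the same strength, so documents whose titles match the query tend to rank higher.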

Example API queries

To use the Vespa Search API, see the API documentation and the YQL Query Language reference. The examples below use Python with the requests library and send queries to the search API with POST.

import requests

# Search for documents matching all query terms (in either title or abstract)
search_request_all = {
  'yql': 'select id,title, abstract, doi from sources * where userQuery();',
  'hits': 5,
  'summary': 'short',
  'timeout': '1.0s',
  'query': 'coronavirus temperature sensitivity',
  'type': 'all',
  'ranking': 'default'
}

# Search for documents matching any of the query terms (in either title or abstract)
search_request_any = {
  'yql': 'select id,title, abstract, doi from sources * where userQuery();',
  'hits': 5,
  'summary': 'short',
  'timeout': '1.0s',
  'query': 'coronavirus temperature sensitivity',
  'type': 'any',
  'ranking': 'default'
}

# Restrict matching to the abstract field and filter on has_full_text and timestamp
search_request_all_abstract = {
  'yql': 'select id,title, abstract, doi from sources * where userQuery() and has_full_text=true and timestamp > 1577836800;',
  'default-index': 'abstract',
  'hits': 5,
  'summary': 'short',
  'timeout': '1.0s',
  'query': '"sars-cov-2" temperature',
  'type': 'all',
  'ranking': 'default'
}

# Search authors, which is an array of struct, using the sameElement operator
search_request_authors = {
  'yql': 'select id,authors from sources * where authors contains sameElement(first contains "Keith", last contains "Mansfield");',
  'hits': 5,
  'summary': 'short',
  'timeout': '1.0s',
}

# Sample request
endpoint = 'https://api.cord19.vespa.ai/search/'
response = requests.post(endpoint, json=search_request_all)
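
Once a response comes back, the hits still have to be pulled out of the JSON body. The sketch below assumes the standard Vespa result layout, in which hits are listed under root.children and each hit carries relevance and fields. `extract_hits` is a hypothetical helper, and the sample payload contains made-up values purely to show the shape.

```python
def extract_hits(result):
    """Return (relevance, fields) pairs from a Vespa search result dict."""
    # A result with no matches may have no "children" key at all.
    children = result.get("root", {}).get("children", [])
    return [(hit.get("relevance"), hit.get("fields", {})) for hit in children]


# Illustrative payload only; the id, title and relevance are invented.
sample_result = {
    "root": {
        "fields": {"totalCount": 1},
        "children": [
            {"relevance": 0.42, "fields": {"id": 1, "title": "Example title"}}
        ],
    }
}

for relevance, fields in extract_hits(sample_result):
    print(relevance, fields.get("id"), fields.get("title"))
```

With a live response, calling `response.raise_for_status()` and then `extract_hits(response.json())` would give the same list of pairs for the hits actually returned.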