We have indexed the COVID-19 Open Research Dataset (CORD-19) in a Vespa instance running in Vespa Cloud. There is a simple search frontend at https://cord19.vespa.ai/ and API access at https://api.cord19.vespa.ai/search/?query=covid-19&summary=short
The https://cord19.vespa.ai/ query interface supports the Vespa simple query language:
- Use +query_term to require that a result includes the term, and -query_term to require that it does not.
- "viral transmission" (in double quotes) searches for the exact phrase.
- Example: +"viral transmission" covid-19 -1918 means the result must contain the phrase "viral transmission", should contain "covid-19", and must not contain the term "1918".
- To search a specific field, use fieldname:query_term, e.g. +title:"reproduction number".
- Use () for OR: +("SARS-COV-2" "coronavirus 2" "novel coronavirus") specifies that documents must match at least one of these three phrases.
Query examples:
- +covid-19 +temperature impact on viral transmission
- basic reproduction numbers for covid-19 in +"south korea"
- Impact of school closure to handle COVID-19 pandemic
- +title:"reproduction number" +abstract:MERS
- +authors.last:knobel
- +("SARS-COV-2" "coronavirus 2" "novel coronavirus")
- +chloroquine +(covid-19 coronavirus)
- authors.name:"Neil M Ferguson"
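The syntax rules above can be sketched in Python as a small helper that assembles a simple-query string and encodes it into a GET URL for the API endpoint mentioned earlier. The function names `compose_query` and `build_search_url` are hypothetical helpers introduced for illustration; only the endpoint and the `query` and `summary` parameters come from this document.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.cord19.vespa.ai/search/"


def _quote(term):
    """Wrap multi-word terms in double quotes so they are matched as a phrase."""
    return f'"{term}"' if " " in term else term


def compose_query(must=(), should=(), must_not=()):
    """Build a simple-query string: +term (must), term (should), -term (must not)."""
    parts = [f"+{_quote(t)}" for t in must]
    parts += [_quote(t) for t in should]
    parts += [f"-{_quote(t)}" for t in must_not]
    return " ".join(parts)


def build_search_url(query, summary="short"):
    """URL-encode the query so characters such as '+' and '"' survive transport."""
    return f"{API_ENDPOINT}?{urlencode({'query': query, 'summary': summary})}"


query = compose_query(
    must=["viral transmission"], should=["covid-19"], must_not=["1918"]
)
url = build_search_url(query)
```

Encoding matters here because a raw `+` in a URL query string is decoded as a space, which would silently drop the "must include" operator.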
See Similar articles for how the similar-articles feature works.
- Frontend: https://cord19.vespa.ai/
- Full API access: https://api.cord19.vespa.ai/search/
- Sample Kaggle Notebook: Semantic Search Using Vespa.ai's CORD-19 index
To use the Vespa Search API, see the API documentation and the YQL Query Language reference. For the full document definition, see doc.sd.
These are the most important fields in the dataset:
field | source in CORD-19 | indexed/searchable | summary (returned with hit) | available for grouping | matching | Vespa type |
---|---|---|---|---|---|---|
default | title + abstract | yes | no | no | tokenized and stemmed (match:text) | fieldset |
all | title + abstract + body_text | yes | no | no | tokenized and stemmed (match:text) | fieldset |
title | title from metadata or from contents of sha json file | yes | yes with bolding | no | tokenized and stemmed (match:text) | string |
abstract | abstract | yes | yes with bolding and dynamic summary | no | tokenized and stemmed (match:text) | string |
body_text | All body_text sections | yes | yes with bolding and dynamic summary | no | tokenized and stemmed (match:text) | string |
datestring | datestring from metadata | no | yes | yes | no | string |
timestamp | Epoch Unix time stamp parsed from datestring | yes | yes | yes | range and exact matching - can also be sorted on | long |
license | license | yes | yes | yes | exact matching | string |
journal | journal | yes | yes | yes | exact matching | string |
has_full_text | has_full_text | yes | yes | yes | exact matching | bool |
doi | https:// + doi from metadata | no | yes | no | no | string |
id | row id from metadata.csv | yes | yes | yes | yes | int |
title_embedding | SciBERT-NLI embedding from title | yes (using nearestNeighbor()) | no | no | yes | tensor(x[768]) |
abstract_embedding | SciBERT-NLI embedding from abstract | yes (using nearestNeighbor()) | no | no | yes | tensor(x[768]) |
authors | authors in metadata or authors from sha json if found | yes using sameElement() | yes | yes | yes | array of struct |
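Because `timestamp` is stored as epoch seconds and supports range matching, a date filter has to be converted before it goes into YQL. The sketch below shows one way to do that. It assumes the timestamps are interpreted as UTC (the table does not say), and `to_epoch_seconds` and `published_after_request` are hypothetical helper names.

```python
from datetime import datetime, timezone


def to_epoch_seconds(year, month, day):
    """Convert a calendar date to epoch seconds, assuming midnight UTC."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def published_after_request(query, year, month, day, hits=5):
    """Build a POST body that combines a user query with a timestamp range filter."""
    since = to_epoch_seconds(year, month, day)
    return {
        "yql": (
            "select id, title, journal, timestamp from sources * "
            f"where userQuery() and timestamp > {since};"
        ),
        "query": query,
        "type": "all",
        "hits": hits,
    }


request_body = published_after_request("coronavirus temperature", 2020, 1, 1)
```

The resulting dictionary can be sent as the JSON body of a POST request to the search endpoint.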
See Vespa's Ranking documentation. The following four ranking profiles are available:
Ranking | Description |
---|---|
default | The default Vespa ranking function (nativeRank) which also uses term proximity for multi-term queries |
bm25 | Linear sum: bm25(title) + bm25(abstract) + bm25(body_text) |
bm25fw | Linear weighted sum: 0.6 * bm25(title) + 0.3 * bm25(abstract) + 0.1 * bm25(body_text) |
freshness | By decreasing timestamp |
See Vespa BM25 and Vespa nativeRank
The ranking profiles are defined in the document definition (doc.sd).
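To make the difference between `bm25` and `bm25fw` concrete, the following sketch reproduces the two sums from the table in plain Python and shows how a profile could be selected through the `ranking` request parameter. The per-field scores are made-up numbers for illustration only; the actual scores are computed by Vespa, and the helper names are hypothetical.

```python
RANKING_PROFILES = ("default", "bm25", "bm25fw", "freshness")

# Field weights of the bm25fw profile, as listed in the table above.
BM25FW_WEIGHTS = {"title": 0.6, "abstract": 0.3, "body_text": 0.1}


def bm25_score(field_scores):
    """Unweighted sum of the per-field BM25 scores (the bm25 profile)."""
    return sum(field_scores.get(field, 0.0) for field in BM25FW_WEIGHTS)


def bm25fw_score(field_scores):
    """Weighted sum of the per-field BM25 scores (the bm25fw profile)."""
    return sum(
        weight * field_scores.get(field, 0.0)
        for field, weight in BM25FW_WEIGHTS.items()
    )


def with_ranking(request_body, profile):
    """Return a copy of the request body that selects the given ranking profile."""
    if profile not in RANKING_PROFILES:
        raise ValueError(f"unknown ranking profile: {profile}")
    return {**request_body, "ranking": profile}


# Hypothetical per-field scores for a single document.
example_scores = {"title": 2.0, "abstract": 4.0, "body_text": 10.0}
unweighted = bm25_score(example_scores)
weighted = bm25fw_score(example_scores)
```

With these example numbers, a long body-text match dominates the unweighted sum, while the weighted profile lets title and abstract matches count for more.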
The examples below use Python with the requests library to call the search API via HTTP POST. See the API documentation and the YQL Query Language reference for details.
import requests

# Search for documents matching all query terms (in either title or abstract)
search_request_all = {
    'yql': 'select id, title, abstract, doi from sources * where userQuery();',
    'hits': 5,
    'summary': 'short',
    'timeout': '1.0s',
    'query': 'coronavirus temperature sensitivity',
    'type': 'all',
    'ranking': 'default'
}

# Search for documents matching any of the query terms (in either title or abstract)
search_request_any = {
    'yql': 'select id, title, abstract, doi from sources * where userQuery();',
    'hits': 5,
    'summary': 'short',
    'timeout': '1.0s',
    'query': 'coronavirus temperature sensitivity',
    'type': 'any',
    'ranking': 'default'
}

# Restrict matching to the abstract field, and filter on has_full_text and timestamp
search_request_all_abstract = {
    'yql': 'select id, title, abstract, doi from sources * where userQuery() and has_full_text=true and timestamp > 1577836800;',
    'default-index': 'abstract',
    'hits': 5,
    'summary': 'short',
    'timeout': '1.0s',
    'query': '"sars-cov-2" temperature',
    'type': 'all',
    'ranking': 'default'
}

# Search authors (an array of struct) using the sameElement operator,
# so that first and last name must match within the same author entry
search_request_authors = {
    'yql': 'select id, authors from sources * where authors contains sameElement(first contains "Keith", last contains "Mansfield");',
    'hits': 5,
    'summary': 'short',
    'timeout': '1.0s',
}

# Sample request
endpoint = 'https://api.cord19.vespa.ai/search/'
response = requests.post(endpoint, json=search_request_all)
response.raise_for_status()  # raise an exception on HTTP errors
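Once a request succeeds, the hits have to be pulled out of the JSON response. The sketch below assumes Vespa's default JSON result layout, where hits appear under `root.children` and each hit carries a `relevance` score and a `fields` object. The `extract_hits` helper and the sample payload are illustrative and are not real data from the index.

```python
def extract_hits(result):
    """Flatten a search result into a list of dicts, one per hit."""
    children = result.get("root", {}).get("children", [])
    return [
        {"relevance": child.get("relevance"), **child.get("fields", {})}
        for child in children
    ]


# Hand-written sample payload in the assumed response shape.
sample_result = {
    "root": {
        "fields": {"totalCount": 1},
        "children": [
            {
                "relevance": 0.42,
                "fields": {"id": 1, "title": "An example title"},
            }
        ],
    }
}

hits = extract_hits(sample_result)
for hit in hits:
    print(hit["relevance"], hit["title"])
```

With a live response, the same helper would be called as `extract_hits(response.json())`. A query with no matches is assumed to omit `children`, which this helper turns into an empty list.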