This service uses the RDF triple store of the ORKG to create a papers dump in which papers
are associated with their abstracts. The abstract data is fetched from the
semanticscholar and crossref metadata registries and may therefore only be used in
accordance with their usage policies (see license).
The papers dump is stored in CSV format, which makes it easy to load into a
`pandas.DataFrame`. The dump has the following column schema:
- `uri`: Paper URI in the ORKG.
- `title`: Paper's title.
- `doi`: Paper's DOI (Digital Object Identifier).
- `abstract`: The abstract text exactly as obtained from the sources.
- `processed_abstract`: Processed abstract text:
  - HTML tags are removed.
  - Redundant whitespace characters are removed.
  - Text is lowercased.
- `research_field`: The research field(s) of the paper.
- `s2_research_field`: The `s2FieldsOfStudy` field from SemanticScholar.
- `publisher`: The name of the publisher.
- `date`: The publication date.
- `*_source`: These columns denote the source (`semanticscholar` or `crossref`) of the field named by `*`, e.g. `abstract_source`.
Note: we use the token `<SEP>` to separate values of list-typed cells.
Example usage:

```python
import pandas as pd

df = pd.read_csv('path/to/the/dump')  # currently hosted under 'https://orkg.org/files/orkg_papers.csv'
print(df.columns)
```
and the output will be

```
Index(['uri', 'title', 'doi', 'abstract', 'abstract_source', 'date',
       'date_source', 'processed_abstract', 'publisher', 'publisher_source',
       'research_field', 'research_field_source', 's2_research_field',
       's2_research_field_source'],
      dtype='object')
```
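Because list-typed cells are stored as `<SEP>`-separated strings, you may want to split them into Python lists after loading. A minimal sketch, continuing from the example above (the column choice is just illustrative):

```python
# Split <SEP>-separated list cells into Python lists;
# empty or missing cells become empty lists.
df['research_field'] = (
    df['research_field']
    .fillna('')
    .apply(lambda s: s.split('<SEP>') if s else [])
)
```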
The latest versions of the file, including a changelog, can be found at https://orkg.org/files/.
You can download the abstracts dump directly by clicking on this link; the same link can also be used in your downstream application to download the data without any authentication.
E.g.:

```bash
wget <the link here> -O orkg_papers.csv
```
NOTE: In case you're running the service for the first time and your `ORKG_PAPERS_HOST_DATA_DIRECTORY`
path is different from `orkg-papers/data/`, we recommend copying its contents to your provided path so that
you do not need to fetch the data from scratch. If you are executing the script with `docker-compose`,
you do not need to copy the data manually, since this is already configured as long as you set the
environment variables properly.
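A minimal sketch of that copy step, assuming your host data directory is `/my/data/dir` (the destination path is hypothetical):

```bash
# Seed the host data directory with the bundled dump so the first run
# does not have to fetch all abstracts from scratch.
cp -r orkg-papers/data/. /my/data/dir/
```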
- python >= 3.7.1
- poetry (instructions)
```bash
git clone https://gitlab.com/TIBHannover/orkg/orkg-abstracts.git
cd orkg-abstracts
```
then create an `.env` file with the proper environment variables and finally run
the service by executing:

```bash
# only needed if your system uses another base Python version
poetry env use <path-to-your-python3.7-interpreter>
poetry install
poetry run python -m src.update
```
To mount the host data directory into the container, please make sure to set the environment variables properly (a sketch of the wiring follows).
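For illustration only, this is how such a bind mount is typically expressed in a `docker-compose.yml`; the service name, build context, and exact wiring here are assumptions, not the repository's actual file:

```yaml
# Hypothetical docker-compose.yml excerpt: bind-mount the host data
# directory (from .env) to the container path the service reads from.
services:
  orkg-papers:            # assumed service name
    build: .
    env_file: .env
    volumes:
      - ${ORKG_PAPERS_HOST_DATA_DIRECTORY}:${ORKG_PAPERS_DOCKER_DATA_DIRECTORY}
```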
```bash
git clone https://gitlab.com/TIBHannover/orkg/orkg-abstracts.git
cd orkg-abstracts
docker-compose up
```
The following environment variables are required to run the service
and are defined in the `.env` file.
| Variable | Description |
|---|---|
| `ORKG_TRIPLE_STORE` | URL to the ORKG triple store. |
| `ORKG_PAPERS_HOST_DATA_DIRECTORY` | Path to the directory in which the dump file will be/is already stored. |
| `ORKG_PAPERS_DOCKER_DATA_DIRECTORY` | Path to the directory in which the dump file will be/is already stored inside the Docker container. Only required when running with Docker Compose. |
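For illustration, a hypothetical `.env` with placeholder values (substitute your own triple store URL and paths; none of these values are defaults):

```
# .env -- values below are illustrative assumptions
ORKG_TRIPLE_STORE=https://example.org/sparql
ORKG_PAPERS_HOST_DATA_DIRECTORY=./data
ORKG_PAPERS_DOCKER_DATA_DIRECTORY=/app/data
```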
Since this service only aggregates data from different sources, we rely on the sources' data licenses. Please check each of: