Justin Reese edited this page Jun 4, 2020 · 68 revisions

Knowledge Graph Hub concept

A Knowledge Graph Hub (KG Hub) is software that downloads data and transforms it into a central location for building knowledge graphs (KGs) from different combinations of data sources, in an automated, YAML-driven way. The workflow is:

  • download data
  • transform data for each data source into two TSV files (edges.tsv and nodes.tsv) as specified here
  • merge the graphs for each data source of interest using KGX to produce a merged knowledge graph

To facilitate interoperability of datasets, biolink categories are added to nodes and biolink associations are added to edges during transformation.
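
As a rough illustration of the transform step, the sketch below serializes a couple of hypothetical records into the two tab-separated tables. The column names (`id`, `name`, `category`, `subject`, `edge_label`, `object`, `provided_by`) and the `EX:` identifiers are assumptions made for this example, not the authoritative KGX layout; the format description linked above remains the reference.

```python
import csv
import io

# Column names below are assumptions for illustration; consult the KGX TSV
# format description for the authoritative list.
NODE_COLUMNS = ["id", "name", "category"]
EDGE_COLUMNS = ["subject", "edge_label", "object", "provided_by"]


def to_tsv(rows, columns):
    """Serialize a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical records; the EX: identifiers are placeholders, not real IDs.
nodes = [
    {"id": "EX:protein1", "name": "example protein 1", "category": "biolink:Protein"},
    {"id": "EX:protein2", "name": "example protein 2", "category": "biolink:Protein"},
]
edges = [
    {"subject": "EX:protein1", "edge_label": "biolink:interacts_with",
     "object": "EX:protein2", "provided_by": "example_source"},
]

nodes_tsv = to_tsv(nodes, NODE_COLUMNS)  # text that would be saved as nodes.tsv
edges_tsv = to_tsv(edges, EDGE_COLUMNS)  # text that would be saved as edges.tsv
print(nodes_tsv)
print(edges_tsv)
```

Each node row carries a biolink category and each edge row carries a biolink association, which is what lets graphs from different sources be merged later.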

KG-COVID-19 project

The KG-COVID-19 project is the first such KG Hub. It downloads and transforms COVID-19/SARS-CoV-2 and related data, and emits a knowledge graph that can be loaded into KGX and used for machine learning or other uses to produce actionable knowledge.

Download knowledge graph:

A merged knowledge graph comprising data from all available transforms is here:

RDF

TSV

See here for a description of the KGX TSV format.

Summary of data (Apr 2020):


A detailed summary of data in kg-covid-19 is here, with contents of the knowledge graph broken down by biolink categories and biolink associations for nodes and edges, respectively.

A few organizing principles:

  • UniProtKB IDs are used for genes and proteins when possible
  • For drugs and compounds, IDs are chosen in descending order of preference: CHEBI > CHEMBL > DRUGBANK > PUBCHEM
  • Less is more: for each data source, we ingest only the subset of data that is most relevant to the KG-Hub in question (here KG-COVID-19)
  • We avoid ingesting data from a source that isn't authoritative for the data in question (e.g. do not ingest protein interaction data from a drug database)
  • Each ingest should make an effort to add provenance data by adding a provided_by column in each edge TSV file, populated with the source of each datum
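
The ID preference rule for drugs and compounds can be sketched as a small helper. This is a minimal illustration, not the project's code: the function name `pick_preferred_id` is hypothetical, and it assumes IDs are written as `PREFIX:local_id` with exactly the prefixes listed above.

```python
# Prefix order taken from the principle above, most preferred first.
PREFERRED_PREFIXES = ["CHEBI", "CHEMBL", "DRUGBANK", "PUBCHEM"]


def pick_preferred_id(candidate_ids):
    """Return the ID whose prefix ranks highest, or None if no prefix matches.

    Hypothetical helper: assumes each ID looks like "PREFIX:local_id".
    """
    rank = {prefix: position for position, prefix in enumerate(PREFERRED_PREFIXES)}
    ranked = []
    for curie in candidate_ids:
        prefix = curie.split(":", 1)[0].upper()
        if prefix in rank:
            ranked.append((rank[prefix], curie))
    # The smallest rank is the most preferred prefix.
    return min(ranked)[1] if ranked else None


# Placeholder local IDs, used only to exercise the ordering.
chosen = pick_preferred_id(["PUBCHEM:1", "DRUGBANK:2", "CHEMBL:3"])
print(chosen)
```

With no CHEBI ID among the candidates, the helper falls back to the CHEMBL one; IDs with unlisted prefixes are ignored.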

People:

The code:

  • Here is the GitHub repo for this project.

  • Here is the GitHub repo for Embiggen, an implementation of node2vec and other methods to generate embeddings and apply machine learning to graphs.

Installation:

git clone https://github.com/Knowledge-Graph-Hub/kg-covid-19
cd kg-covid-19
pip install .
pip install -r requirements.txt

Running the code:

python run.py download
python run.py transform
python run.py merge
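
The three commands above are meant to run in that order. As a convenience, a small wrapper along the following lines could chain them and stop at the first failure. This is only a sketch: `build_commands` and `run_pipeline` are hypothetical names, not part of the repository, and the wrapper assumes it is started from the repository root, where `run.py` lives.

```python
import subprocess
import sys

# Stage names taken from the commands above, in execution order.
STAGES = ["download", "transform", "merge"]


def build_commands(python=sys.executable, script="run.py"):
    """Return one argument list per stage, e.g. [python, "run.py", "download"]."""
    return [[python, script, stage] for stage in STAGES]


def run_pipeline(runner=subprocess.run):
    """Run each stage in order.

    With the default runner, check=True makes subprocess.run raise
    CalledProcessError on a non-zero exit code, so later stages are skipped
    if an earlier one fails.
    """
    for command in build_commands():
        runner(command, check=True)
```

Calling `run_pipeline()` from the repository root would then run download, transform, and merge in sequence. The `runner` parameter exists only so that the function can be exercised without launching real processes.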

Contributing:

  • Here is a more detailed description, and instructions on how to help.