Home
A Knowledge Graph Hub (KG Hub) is software to download and transform data to a central location for building knowledge graphs (KGs) from different combinations of data sources, in an automated, YAML-driven way. The workflow is:
- download data
- transform data for each data source into two TSV files (edges.tsv and nodes.tsv) as specified here
- merge the graphs for each data source of interest using KGX to produce a merged knowledge graph
To facilitate interoperability of datasets, biolink categories are added to nodes and biolink associations are added to edges during transformation.
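As a rough illustration of what a transform emits, the sketch below writes a tiny nodes.tsv and edges.tsv pair in memory. The column names, the placeholder IDs, and the biolink values are assumptions chosen for illustration only; the KGX TSV format description is the authority on the actual columns.

```python
import csv
import io

# Assumed column sets for illustration; consult the KGX TSV format
# description for the columns that are actually required.
NODE_COLUMNS = ["id", "name", "category"]
EDGE_COLUMNS = ["subject", "predicate", "object", "provided_by"]


def to_tsv(rows, columns):
    """Serialize a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder records: the IDs are not real identifiers.
nodes = [
    {"id": "UniProtKB:PLACEHOLDER1", "name": "example protein",
     "category": "biolink:Protein"},
    {"id": "DRUGBANK:PLACEHOLDER2", "name": "example drug",
     "category": "biolink:Drug"},
]
edges = [
    {"subject": "UniProtKB:PLACEHOLDER1", "predicate": "biolink:interacts_with",
     "object": "DRUGBANK:PLACEHOLDER2", "provided_by": "example_source"},
]

nodes_tsv = to_tsv(nodes, NODE_COLUMNS)
edges_tsv = to_tsv(edges, EDGE_COLUMNS)
```

In a real transform the two strings would be written to nodes.tsv and edges.tsv in that data source's output directory.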
The KG-Covid-19 project is the first such KG Hub. It downloads and transforms COVID-19/SARS-CoV-2 and related data and emits a knowledge graph that can be loaded into KGX and used for machine learning or other uses, to produce actionable knowledge.
Download knowledge graph:
A merged knowledge graph comprised of data from all available transforms is here:
See here for a description of the KGX TSV format.
Summary of data (Apr 2020):
A detailed summary of data in kg-covid-19 is here, with contents of the knowledge graph broken down by biolink categories and biolink associations for nodes and edges, respectively.
A few organizing principles:
- UniProtKB IDs are used for genes and proteins when possible
- For drug/compound IDs, these IDs are preferred, in descending order of preference: CHEBI > CHEMBL > DRUGBANK > PUBCHEM
- Less is more: for each data source, we ingest only the subset of data that is most relevant to the KG-Hub in question (here KG-COVID-19)
- We avoid ingesting data from a source that isn't authoritative for the data in question (e.g. do not ingest protein interaction data from a drug database)
- Each ingest should make an effort to add provenance data by adding a provided_by column in each edge TSV file, populated with the source of each datum
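The drug/compound ID preference above could be applied with a small helper like the following. This is a minimal sketch, not code from the project: the function name is hypothetical, and it assumes each ID is a CURIE whose prefix (the part before the colon) is exactly one of CHEBI, CHEMBL, DRUGBANK, or PUBCHEM.

```python
# Descending order of preference, as stated in the organizing principles.
PREFERENCE = ["CHEBI", "CHEMBL", "DRUGBANK", "PUBCHEM"]


def preferred_compound_id(curies):
    """Return the most preferred CURIE from a list, or None if none match.

    Hypothetical helper: assumes IDs look like "PREFIX:local_id".
    """
    rank = {prefix: position for position, prefix in enumerate(PREFERENCE)}

    def prefix_of(curie):
        return curie.split(":", 1)[0].upper()

    candidates = [c for c in curies if prefix_of(c) in rank]
    if not candidates:
        return None
    # The lowest rank number is the most preferred prefix.
    return min(candidates, key=lambda c: rank[prefix_of(c)])
```

For example, given equivalent IDs with the prefixes PUBCHEM, DRUGBANK, and CHEBI for one compound, the helper returns the CHEBI one. Real sources may use variant prefixes, which would need to be normalized first.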
People:
- Justin Reese
- Deepak Unni
- Marcin Joachimiak
- Peter Robinson
- Chris Mungall
- Tiffany Callahan
- Luca Cappelletti
- Vida Ravanmehr
The code:
- Here is the GitHub repo for this project.
- Here is the GitHub repo for Embiggen, an implementation of node2vec and other methods to generate embeddings and apply machine learning to graphs.
Installation:
git clone https://github.com/Knowledge-Graph-Hub/kg-covid-19
cd kg-covid-19
pip install .
pip install -r requirements.txt
Running the code:
python run.py download
python run.py transform
python run.py merge
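The three steps are meant to run in this order, since each one consumes the output of the previous one. One way to chain them so that a failure stops the pipeline is sketched below; the wrapper functions are hypothetical and not part of the repository, and they assume run.py is in the current working directory.

```python
import subprocess
import sys

# The three subcommands shown above, in the order they must run.
STEPS = ["download", "transform", "merge"]


def build_commands(python=sys.executable, script="run.py"):
    """Build one argument list per pipeline step."""
    return [[python, script, step] for step in STEPS]


def run_pipeline(runner=subprocess.run):
    """Run each step in order.

    With the default runner, check=True raises CalledProcessError on a
    non-zero exit status, so later steps never run on incomplete data.
    """
    for command in build_commands():
        runner(command, check=True)
```

Calling run_pipeline() from the repository root would then be equivalent to typing the three commands one after another.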
Contributing:
- Here is a more detailed description, and instructions on how to help.