Commit

Merge branch 'main' of github.com:elixir-europe/infectious-diseases-toolkit into ppalagi-patch-1
bedroesb committed Oct 1, 2024
2 parents 735ebb4 + 84eddb3 commit ad8b7ab
Showing 13 changed files with 427 additions and 66 deletions.
1 change: 1 addition & 0 deletions .github/workflows/jekyll.yml
Original file line number Diff line number Diff line change
@@ -2,6 +2,7 @@ name: Jekyll site CI

on:
  push:
    branches: [ master, main ]
  pull_request:
    branches: [ master, main ]
  workflow_dispatch:
5 changes: 5 additions & 0 deletions _data/CONTRIBUTORS.yaml
@@ -289,5 +289,10 @@ Reagon Karki:
  email: [email protected]
  orcid: https://orcid.org/0000-0002-1815-0037
  affiliation: Fraunhofer ITMP/EU-OpenScreen
Francesco Messina:
  orcid: 0000-0001-8076-7217
  git: INMIbioinfo
  affiliation: IRCCS (INMI)
  Email: [email protected]


4 changes: 4 additions & 0 deletions _data/news.yml
@@ -138,3 +138,7 @@
  date: 2024-09-05
  linked_pr: 339
  description: A showcase page was added about an open source workflow, integrating biological databases for FAIR data compliant Knowledge Graphs, in the Showcase section. [Discover the page here](/showcase/knowledge-graph-generator)
- name: "New page: Data Analysis of Pathogen Characterisation data"
  date: 2024-09-19
  linked_pr: 308
  description: Content was added to the Pathogen Characterisation page on Data Analysis. [Discover the page here](/data-analysis/pathogen-characterisation)
2 changes: 2 additions & 0 deletions _data/sidebars/main.yml
@@ -14,6 +14,8 @@ subitems:
    subitems:
      - title: Human biomolecular data
        url: /data-analysis/human-biomolecular-data
      - title: Pathogen characterisation
        url: /data-analysis/pathogen-characterisation

  - title: Data communication
    url: /data-communication/
260 changes: 229 additions & 31 deletions _data/tool_and_resource_list.yml

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions about/contributors.md
@@ -1,6 +1,7 @@
---
title: Contributors
custom_editme: _data/CONTRIBUTORS.yaml
toc: false
---

This project would not be possible without the many amazing community contributors. Infectious Diseases Toolkit is an open community project, and you are welcome to [join us](/contribute/)!
4 changes: 2 additions & 2 deletions about/editorial-board.md
@@ -4,7 +4,7 @@ title: Editorial board

## Meet the editorial board members

{% include contributor-carousel-selection.html custom="Bert Droesbeke, Eva Garcia Alvarez, Hedi Peterson, Katharina Lauer, Laura Portell Silva, Liane Hughes, Patricia Palagi, Rafael Andrade Buono, Rudolf Wittner, Martin Cook, Shona Cosgrove, Stian Soiland-Reyes, Romain David" %}
{% include contributor-carousel-selection.html custom="Bert Droesbeke, Eva Garcia Alvarez, Hedi Peterson, Katharina Lauer, Laura Portell Silva, Liane Hughes, Patricia Palagi, Rafael Andrade Buono, Rudolf Wittner, Shona Cosgrove, Stian Soiland-Reyes, Romain David" %}

## Responsibilities

@@ -19,7 +19,7 @@ title: Editorial board

In this section we would like to thank contributions of our past editorial members.

{% include contributor-tiles-all.html custom="Iris Van Dam" %}
{% include contributor-tiles-all.html custom="Iris Van Dam, Martin Cook" %}


## Contact
4 changes: 0 additions & 4 deletions attributing-credit/index.md
@@ -4,10 +4,6 @@ toc: false
---



{% include section-navigation-tiles.html type="attributing_credit" except="index.md" %}


**We are still working on the content for this page.** If you are interested in adding to the page, then:

[Feel free to contribute](/contribute/){: class="btn btn-primary btn-lg rounded-pill"}
2 changes: 1 addition & 1 deletion data-analysis/human-biomolecular-data.md
@@ -112,7 +112,7 @@ There are several types of analysis that can be performed on human biomolecular
- *Interaction databases*: {% tool "biogrid" %} and {% tool "intact" %}
- *Network analysis*: {% tool "cytoscape" %} and {% tool "genemania" %}
- **Metabolomics analysis**: This involves measuring the levels of small molecules (metabolites) in biological samples and comparing them across different conditions or groups of samples. This can help to identify biomarkers of disease or drug response.
- *Data processing*: {% tool "xcms" %}, {% tool "mzmine" %} and {% tool "openms" %}
- *Data processing*: {% tool "xcms-online" %}, {% tool "mzmine" %} and {% tool "openms" %}
- *Statistical analysis*: {% tool "metaboanalyst" %} and {% tool "metsign" %}

## Postprocessing
188 changes: 172 additions & 16 deletions data-analysis/pathogen-characterisation.md

Large diffs are not rendered by default.

2 changes: 0 additions & 2 deletions data-communication/index.md
@@ -12,8 +12,6 @@ rdmkit:
url: https://rdmkit.elixir-europe.org/processing#what-is-data-processing
---

{% include section-navigation-tiles.html type="data_communication" except="index.md" %}

## Introduction

Data can only reach its full potential when communicated well to the audience. In a crisis situation, people are thirsty for information, and clear data communication becomes especially crucial. Communicating data as tables might be the easiest for data providers, but the trends and effects associated with infectious diseases are best shown using data visualisations.
12 changes: 6 additions & 6 deletions data-sources/human-biomolecular-data.md
@@ -70,9 +70,9 @@ Please note that these considerations are general in nature and may vary dependi

### Existing approaches

- **Public databases:** Various publicly accessible databases serve as repositories for human biomolecular data, such as the National Center for Biotechnology Information ([NCBI](https://www.ncbi.nlm.nih.gov/)) databases (e.g., {% tool "genbank" %}, {% tool "geo" %}, {% tool "sra" %} and European Bioinformatics Institute ({% tool "ebi" %}) databases (e.g., {% tool "european-nucleotide-archive" %}, {% tool "arrayexpress" %}).
- **Public databases:** Various publicly accessible databases serve as repositories for human biomolecular data, such as the {% tool "ncbi" %} databases (e.g., {% tool "genbank" %}, {% tool "geo" %}, {% tool "sra" %}) and European Bioinformatics Institute ({% tool "ebi" %}) databases (e.g., {% tool "european-nucleotide-archive" %}, {% tool "arrayexpress" %}).
- **Controlled access repositories:** Some data deposition platforms, like dbGaP ({% tool "dbgap" %}) and EGA ({% tool "ega" %}), adopt a controlled access model to protect sensitive human biomolecular data. Researchers interested in accessing the data need to request permission and comply with specific data usage policies.
- **Data integration platforms:** Platforms like the Global Alliance for Genomics and Health ([GA4GH](https://www.ga4gh.org/)) provide frameworks and standards for federated data access and integration across multiple repositories. These initiatives aim to facilitate the aggregation and analysis of human biomolecular data from diverse sources while maintaining data privacy and security.
- **Data integration platforms:** Platforms like the {% tool "ga4gh" %} provide frameworks and standards for federated data access and integration across multiple repositories. These initiatives aim to facilitate the aggregation and analysis of human biomolecular data from diverse sources while maintaining data privacy and security.
- **Data citation and DOI assignment:** To acknowledge and promote the contributions of researchers who deposit human biomolecular data, many repositories assign unique digital object identifiers (DOIs) to datasets. This enables proper citation and recognition of the deposited data, enhancing its visibility and impact.
- **Data submission portals:** Some repositories offer user-friendly web portals or submission systems that guide researchers through the process of depositing human biomolecular data. These portals often provide templates, validation checks, and step-by-step instructions to ensure the completeness and quality of the deposited data.
- **Consortium-specific databases:** Collaborative research initiatives often establish dedicated databases for sharing and depositing human biomolecular data, such as The Cancer Genome Atlas ({% tool "tcga" %}) for cancer genomics data or the Genotype-Tissue Expression ({% tool "gtex" %}) project for gene expression data across different tissues.
@@ -208,7 +208,7 @@ Consequently, we have compiled some of the main tools, portals, and data sharing
- {% tool "fega" %}, which provides secure controlled access sharing of sensitive patient and research subject data sets relating to COVID-19 while complying with stringent privacy national laws.
- {% tool "covid-19-data-portal" %}, which brings together and continuously updates relevant COVID-19 datasets and tools, will host sequence data sharing and will facilitate access to other SARS-CoV-2 resources.

You can find further information about the Covid-19 Data Portal in the link [here](https://rdmkit.elixir-europe.org/covid19_data_portal).
You can find further information about the Covid-19 Data Portal on [RDMkit](https://rdmkit.elixir-europe.org/covid19_data_portal).

## Data access and transfer

@@ -231,7 +231,7 @@ When looking for solutions to human biomolecular data access, you should conside
- **Scalability and Performance:** Look for solutions capable of efficiently handling large-scale biomolecular data sets while maintaining optimal performance, supporting advanced analysis tools for meaningful insights.
- **User-Friendly Interface:** Opt for solutions with intuitive interfaces and flexible access controls, enabling researchers of varying technical backgrounds to access, analyze, and interpret data effectively.

When looking for solutions to data transfer, you can check [this](https://rdmkit.elixir-europe.org/data_transfer) documentation.
When looking for solutions to data transfer, you can check [RDMkit](https://rdmkit.elixir-europe.org/data_transfer).

### Existing approaches

@@ -247,7 +247,7 @@ When looking for solutions to data transfer, you can check [this](https://rdmkit
- By depositing your data to one of the existing controlled access repositories, they will already show the data use conditions (e.g. [EGAD00001007777](https://ega-archive.org/datasets/EGAD00001007777))
- A data access committee (DAC) is a group responsible for reviewing and approving requests for access to sensitive data, such as human biomolecular data. Its role is to ensure that requests are in compliance with relevant laws and regulations, that data is being used for legitimate scientific purposes, and that privacy and security are being maintained. To know more about what is a DAC and how to become one, you can check the [European Genome-phenome Archive - Data Access Committee](https://ega-archive.org/submission/data_access_committee) website.

You can find further information about sharing human data [here](https://rdmkit.elixir-europe.org/human_data#sharing-and-reusing-of-human-data).
You can find further information about sharing human data on [RDMkit](https://rdmkit.elixir-europe.org/human_data#sharing-and-reusing-of-human-data).

## Data harmonisation

@@ -268,6 +268,6 @@ Thanks to the Sars-CoV-2 outbreak, the scientific community has established stan

### Existing approaches

* When looking for solutions to standards, schemas, ontologies and vocabularies, you can check [this](https://rdmkit.elixir-europe.org/metadata_management#how-do-you-find-appropriate-standard-metadata-for-datasets-or-samples) documentation.
* When looking for solutions to standards, schemas, ontologies and vocabularies, you can check [the RDMkit](https://rdmkit.elixir-europe.org/metadata_management#how-do-you-find-appropriate-standard-metadata-for-datasets-or-samples) for documentation.
* {% tool "fairsharing" %} is also a good resource to find metadata standards that can be useful for your research.

8 changes: 4 additions & 4 deletions showcase/knowledge-graph-generator.md
@@ -3,30 +3,30 @@ title: Knowledge Graph Generator (KGG) - A fully automated workflow for creating
contributors: [Reagon Karki]
description: Open source workflow integrating biological databases for FAIR data compliant Knowledge Graphs
affiliations: [Fraunhofer ITMP, EU-OpenScreen]
page_id: knowledge-graph-generator
page_id: knowledge_graph_generator
---

## Introduction

Knowledge Graphs (KGs) are advanced forms of networks that capture the semantics of the constituent entities and the interactions among them. They facilitate ontology-driven data consolidation via integration/harmonization of heterogeneous data and serve as a graphical database. Such KGs in place have the potential to answer complex queries and form the basis of domain-specific analyses. In context of biomedicine and life sciences, KGs represent disease-associated biological and pathophysiological phenomena by systematically assembling various inter-related entities such as proteins and their biological processes, molecular functions and pathways, chemicals and their mechanism of actions and adverse effects and so on. They have been deployed in several use cases and downstream analyses related to healthcare, pharmaceutical and clinical settings. However, the process of creating KGs is expensive and time-consuming because it requires a lot of manual curation. Moreover, machine-aided methods such as text-mining workflows and Large Language Models (LLMs) have their own shortcomings and are improving gradually.

This showcase introduces a fully automated workflow, namely Knowledge Graph Generator (KGG), for creating KGs that represent chemotype and phenotype of diseases. The KGG embeds underlying schema of curated public databases to retrieve relevant knowledge which is regarded as the gold standard for high quality data. The KGG is leveraged on our previous contributions to the BY-COVID project where we developed workflows for identification of bio-active analogs for fragments identified in COVID-NMR studies ([Berg, H et al.](https://doi.org/10.1007/s00259-021-05215-4)) and representation of Mpox biology ([Karki, R et al.](https://doi.org/10.1093/bioadv/vbad045)). The programmatic scripts and methods for KGG are written in python (version 3.10) and are available ([here](https://github.com/Fraunhofer-ITMP/kgg)).
This showcase introduces a fully automated workflow, namely Knowledge Graph Generator (KGG), for creating KGs that represent chemotype and phenotype of diseases. The KGG embeds underlying schema of curated public databases to retrieve relevant knowledge which is regarded as the gold standard for high quality data. The KGG is leveraged on our previous contributions to the BY-COVID project where we developed workflows for identification of bio-active analogs for fragments identified in COVID-NMR studies ([Berg, H et al.](https://doi.org/10.1007/s00259-021-05215-4)) and representation of Mpox biology ([Karki, R et al.](https://doi.org/10.1093/bioadv/vbad045)). The programmatic scripts and methods for KGG are written in python (version 3.10) and are available [on GitHub](https://github.com/Fraunhofer-ITMP/kgg).

## Who is the showcase intended for?

The KGG is developed for a broad spectrum of researchers and scientists, especially for those who are into pre-clinical drug discovery, understanding disease mechanisms/comorbidity and drug-repurposing. Although KGG is a programmatic tool, it comes with a user-friendly interface to take just a couple of input from a user to run the underlying scripts and methods. Therefore, it is designed to enable researchers with minimal knowledge of programming to generate KGs at ease. The computer scientists can make maximum advantage of the workflow by modifying the scripts according to their needs.

## What is the showcase?

{% include image.html file="/kgg_showcase_overview.png" caption="Figure 1. A schematic representation of the KGG workflow depicting its three phases. The python-based workflow fetches real-time knowledge from curated databases and uses ([OpenBEL](https://doi.org/10.1016/j.drudis.2013.12.011)) framework to systematically encode the knowledge and relevant metadata." %}
{% include image.html file="/kgg_showcase_overview.png" caption="Figure 1. A schematic representation of the KGG workflow depicting its three phases. The python-based workflow fetches real-time knowledge from curated databases and uses the OpenBEL framework ([Slater, T](https://doi.org/10.1016/j.drudis.2013.12.011)) to systematically encode the knowledge and relevant metadata." %}

The automated workflow creating disease-specific KGs is subdivided into three phases and are described below:

Phase I: Disease lookup and identification - The KGG workflow uses standard disease identifiers from widely accepted ontologies such as EFO, OMIM, MeSH, MONDO and so on. Therefore, the identification of a proper disease identifier for a specific disease is the foremost task in the workflow. In order to facilitate this task, we have designed KGG in such a way that the users can search disease names as keywords which are eventually passed as queries to the Open Target Platforms’s API. This step of the KGG workflow is termed as disease lookup which yields a list of diseases and identifiers closest to the keyword search. The users are then prompted to identify their disease of interest and the process of generating a KG can be initiated by using the corresponding identifier.

Phase II: Real-time knowledge retrieval - The identified disease identifier from Phase I is used as a query for curated databases to retrieve relevant disease associated knowledge in real time. This is achieved by embedding the APIs of OTP, ChEMBL, UniProt, Integrated Interaction Database (IID) and GWAS Central into our programmatic scripts and methods.

Phase III: KG compilation and generation - The retrieved knowledge from Phase II is stored as semantic triples (i.e., subject-predicate-object) using OpenBEL framework, which are both human and computer-readable. The language enables systematic representation of biological and molecular interactions by enforcing usage of standard ontologies. The implementation was performed using the open-source ([PyBEL](https://doi.org/10.1093/bioinformatics/btx660)) framework. It is a resource developed to help with triples formation, meta-data annotation, data parsing, validation, compilation and visualization of KG. It also offers a wide-range of functions to explore, query, and analyze KGs. The KGs can be exported to various standard formats such as json, csv, sql, graphml, and Neo4j, allowing comparison and integration with other KGs.
Phase III: KG compilation and generation - The retrieved knowledge from Phase II is stored as semantic triples (i.e., subject-predicate-object) using {% tool "openbel" %} framework, which are both human and computer-readable. The language enables systematic representation of biological and molecular interactions by enforcing usage of standard ontologies. The implementation was performed using the open-source {% tool "pybel" %} framework. It is a resource developed to help with triples formation, meta-data annotation, data parsing, validation, compilation and visualization of KG. It also offers a wide-range of functions to explore, query, and analyze KGs. The KGs can be exported to various standard formats such as json, csv, sql, graphml, and Neo4j, allowing comparison and integration with other KGs.
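
The semantic triples described in Phase III can be sketched in plain Python. This is only a minimal illustration and not the KGG or PyBEL implementation: the `Triple` class, the two helper functions and the placeholder entities are all assumptions made for this example, and only the standard library is used.

```python
import csv
import io
import json
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object statement (hypothetical structure, not KGG's own)."""
    subject: str
    predicate: str
    obj: str


def to_json(triples: list[Triple]) -> str:
    """Serialise triples as a JSON array of objects."""
    return json.dumps([t._asdict() for t in triples], indent=2)


def to_csv(triples: list[Triple]) -> str:
    """Serialise triples as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(Triple._fields)
    writer.writerows(triples)
    return buffer.getvalue()


# Placeholder entities for illustration only; not retrieved from any database.
triples = [
    Triple("protein:P1", "associatedWith", "disease:D1"),
    Triple("chemical:C1", "inhibits", "protein:P1"),
]

print(to_json(triples))
print(to_csv(triples))
```

Keeping each statement as a flat three-field record is what makes export to tabular and JSON formats straightforward; a real BEL graph additionally carries namespaces and metadata annotations that this sketch leaves out.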

## What can you use the tool for?

