diff --git a/.github/workflows/jekyll.yml b/.github/workflows/jekyll.yml index 59958f45..d3020426 100644 --- a/.github/workflows/jekyll.yml +++ b/.github/workflows/jekyll.yml @@ -2,6 +2,7 @@ name: Jekyll site CI on: push: + branches: [ master, main ] pull_request: branches: [ master, main ] workflow_dispatch: diff --git a/_data/CONTRIBUTORS.yaml b/_data/CONTRIBUTORS.yaml index caab7015..27f29a9b 100644 --- a/_data/CONTRIBUTORS.yaml +++ b/_data/CONTRIBUTORS.yaml @@ -289,5 +289,10 @@ Reagon Karki: email: reagon.karki@itmp.fraunhofer.de orcid: https://orcid.org/0000-0002-1815-0037 affiliation: Fraunhofer ITMP/EU-OpenScreen +Francesco Messina: + orcid: https://orcid.org/0000-0001-8076-7217 + git: INMIbioinfo + affiliation: IRCCS (INMI) + email: francesco.messina@inmi.it diff --git a/_data/news.yml b/_data/news.yml index 0a924bfb..c841df21 100755 --- a/_data/news.yml +++ b/_data/news.yml @@ -138,3 +138,7 @@ date: 2024-09-05 linked_pr: 339 description: A showcase page was added about an open source workflow, integrating biological databases for FAIR data compliant Knowledge Graphs, in the Showcase section. [Discover the page here](/showcase/knowledge-graph-generator) +- name: "New page: Data Analysis of Pathogen Characterisation data" + date: 2024-09-19 + linked_pr: 308 + description: Content was added to the Pathogen Characterisation page on Data Analysis.
[Discover the page here](/data-analysis/pathogen-characterisation) \ No newline at end of file diff --git a/_data/sidebars/main.yml b/_data/sidebars/main.yml index 420d1def..70abf4f5 100644 --- a/_data/sidebars/main.yml +++ b/_data/sidebars/main.yml @@ -14,6 +14,8 @@ subitems: subitems: - title: Human biomolecular data url: /data-analysis/human-biomolecular-data + - title: Pathogen characterisation + url: /data-analysis/pathogen-characterisation - title: Data communication url: /data-communication/ diff --git a/_data/tool_and_resource_list.yml b/_data/tool_and_resource_list.yml index 5f3a5b2b..4cd77212 100644 --- a/_data/tool_and_resource_list.yml +++ b/_data/tool_and_resource_list.yml @@ -187,12 +187,6 @@ biotools: deseq2 tess: DESeq2 url: https://bioconductor.org/packages/release/bioc/html/DESeq2.html -- description: DMP online is an online planning tool to help you write an effective DMP based on an institutional or funder template. - id: dmp-online - name: DMP Online - registry: - tess: DMP Online - url: https://dmptool.org/ - description: Docker is a software for the execution of applications in virtualized environments called containers. It is linked to DockerHub, a library for sharing container images id: docker name: Docker @@ -205,8 +199,7 @@ id: dragen-gatk name: Dragen-GATK url: https://gatk.broadinstitute.org/hc/en-us/articles/360045944831 -- description: 'Dryad is an open-source, community-led data curation, publishing, and preservation platform for CC0 publicly available research data. Dryad has a long-term data preservation strategy, and is a Core Trust Seal Certified Merritt repository with storage in US and EU at the San Diego Supercomputing Center, DANS, and Zenodo. While data is undergoing peer review, it is embargoed if the related journal requires / allows this. 
Dryad is an independent non-profit that works directly with: researchers to publish datasets utilising best practices for discovery and reuse; publishers to support the integration of data availability statements and data citations into their workflows; and institutions to enable scalable campus support for research data management best practices at low cost. Costs are covered by institutional, publisher, and funder members, otherwise a one-time fee of $120 for authors to cover cost of curation and preservation. Dryad also receives direct funder support through - grants.' +- description: Dryad is an open-source, community-led data curation, publishing, and preservation platform for CC0 publicly available research data. id: dryad name: Dryad registry: @@ -233,7 +226,7 @@ fairsharing: mya1ff tess: European Genome-phenome Archive (EGA) url: https://ega-archive.org/ -- description: 'The European Language Social Science Thesaurus (ELSST) is a broad-based, multilingual thesaurus for the social sciences. It is owned and published by the Consortium of European Social Science Data Archives (CESSDA) and its national Service Providers. The thesaurus consists of over 3,300 concepts and covers the core social science disciplines: politics, sociology, economics, education, law, crime, demography, health, employment, information, communication technology, and environmental science. ELSST is used for data discovery within CESSDA and facilitates access to data resources across Europe, independent of domain, resource, language, or vocabulary. ELSST is currently available in 16 languages: Danish, Dutch, Czech, English, Finnish, French, German, Greek, Hungarian, Icelandic, Lithuanian, Norwegian, Romanian, Slovenian, Spanish, and Swedish' +- description: The European Language Social Science Thesaurus (ELSST) is a broad-based, multilingual thesaurus for the social sciences. 
It is owned and published by the Consortium of European Social Science Data Archives (CESSDA) and its national Service Providers. id: european-language-social-science-thesaurus name: European Language Social Science Thesaurus (ELSST) registry: @@ -251,12 +244,6 @@ id: ena-webin-cli name: ENA Webin CLI url: https://toolshed.g2.bx.psu.edu/repository?repository_id=dfa4f0fc31027b52 -- description: Functional Enrichment Analysis and Network Construction - id: enrichr - name: Enrichr - registry: - biotools: enrichr - url: https://github.com/guokai8/EnrichR - description: The Estonian Biobank has established a population-based biobank of Estonia with a current cohort size of more than 200,000 individuals (genotyped with genome-wide arrays), reflecting the age, sex and geographical distribution of the adult Estonian population. Considering the fact that about 20% of Estonia's adult population has joined the programme, it is indeed a database that is very important for the development of medical science both domestically and internationally. id: estonian-biobank name: Estonian Biobank @@ -273,14 +260,14 @@ fairsharing: dj8nt8 tess: European Nucleotide Archive (ENA) url: https://www.ebi.ac.uk/ena/browser/home -- description: FAIRsharing is a FAIR-supporting resource that provides an informative and educational registry on data standards, databases, repositories and policy, alongside search and visualization tools and services that interoperate with other FAIR-enabling resources. fairsharing guides consumers to discover, select and use standards, databases, repositories and policy with confidence, and producers to make their resources more discoverable, more widely adopted and cited. Each record in fairsharing is curated in collaboration with the maintainers of the resource themselves, ensuring that the metadata in the fairsharing registry is accurate and timely. Every record is manually reviewed at least once a year. 
Records can be collated into collections, based on a project, society or organisation, or Recommendations, where they are collated around a policy, such as a journal or funder data policy. +- description: FAIRsharing is a FAIR-supporting resource that provides an informative and educational registry on data standards, databases, repositories and policy, alongside search and visualization tools and services that interoperate with other FAIR-enabling resources. FAIRsharing guides consumers to discover, select and use standards, databases, repositories and policy with confidence, and producers to make their resources more discoverable, more widely adopted and cited. Each record in FAIRsharing is curated in collaboration with the maintainers of the resource themselves, ensuring that the metadata in the FAIRsharing registry is accurate and timely. id: fairsharing name: FAIRsharing registry: fairsharing: 2abjs5 tess: FAIRsharing url: https://fairsharing.org/ -- description: Figshare is a generalist, subject-agnostic repository for many different types of digital objects that can be used without cost to researchers. Data can be submitted to the central figshare repository (described here), or institutional repositories using the figshare software can be installed locally, e.g. by universities and publishers. Metadata in figshare is licenced under is CC0. figshare has also partnered with DuraSpace and Chronopolis to offer further assurances that public data will be archived under the stewardship of Chronopolis. figshare is supported through Institutional, Funder, and Governmental service subscriptions. +- description: Figshare is a generalist, subject-agnostic repository for many different types of digital objects that can be used without cost to researchers. Data can be submitted to the central figshare repository (described here), or institutional repositories using the figshare software can be installed locally, e.g. by universities and publishers.
id: figshare name: Figshare registry: @@ -294,20 +281,20 @@ biotools: Flye tess: Flye url: https://github.com/fenderglass/Flye -- description: FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs, indels, MNPs, and complex events smaller than the length of a short-read sequencing alignment. +- description: freebayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs, indels, MNPs, and complex events smaller than the length of a short-read sequencing alignment. id: freebayes - name: FreeBayes + name: freebayes registry: biotools: freebayes - tess: FreeBayes + tess: freebayes url: https://github.com/freebayes/freebayes - description: The metadata model for GA4GH, an international coalition of both public and private interested parties, formed to enable the sharing of genomic and clinical data. id: ga4gh - name: GA4GH + name: Global Alliance for Genomics and Health (GA4GH) registry: biotools: ga4gh fairsharing: 2tpx4v - tess: GA4GH + tess: Global Alliance for Genomics and Health (GA4GH) url: https://github.com/ga4gh/schemas - description: Open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses. id: galaxy @@ -319,6 +306,8 @@ - description: The European Galaxy server. Provides access to thousands of tools for scalable and reproducible analysis. id: galaxy-europe name: Galaxy Europe + registry: + tess: Galaxy Europe url: https://usegalaxy.eu/ - description: The University of Tartu Galaxy instance. Enables local university users to run their analyses in the Galaxy environment. Was heavily used during the KoroGenoEST sequencing studies. 
id: galaxy-university-of-tartu @@ -369,7 +358,7 @@ id: gitlab name: GitLab registry: - fairsharing: 530e61 + fairsharing: "530e61" tess: GitLab url: https://about.gitlab.com/ - description: GO is to perform enrichment analysis on gene sets. @@ -513,9 +502,9 @@ url: http://mzmine.github.io/ - description: The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information. id: ncbi - name: NCBI + name: National Center for Biotechnology Information (NCBI) registry: - tess: NCBI + tess: National Center for Biotechnology Information (NCBI) url: https://www.ncbi.nlm.nih.gov/ - description: Nextflow is a framework for data analysis workflow execution id: nextflow @@ -672,14 +661,13 @@ registry: biotools: wtdbg2 url: https://github.com/ruanjue/wtdbg2 -- description: Metabolomic and lipidomic platform - id: xcms - name: XCMS +- description: A systems biology tool for analyzing metabolomic data. It automatically superimposes raw metabolomic data onto metabolic pathways and integrates it with transcriptomic and proteomic data. + id: xcms-online + name: XCMS Online registry: - biotools: xcms - tess: XCMS + biotools: xcms_online url: https://xcmsonline.scripps.edu/landing_page.php?pgcontent=mainPage -- description: Zenodo is a generalist research data repository built and developed by OpenAIRE and CERN. It was developed to aid Open Science and is built on open source code. Zenodo helps researchers receive credit by making the research results citable and through OpenAIRE integrates them into existing reporting lines to funding agencies like the European Commission. Citation information is also passed to DataCite and onto the scholarly aggregators. Content is available publicly under any one of 400 open licences (from opendefinition.org and spdx.org). Restricted and Closed content is also supported. Free for researchers below 50 GB/dataset.
Content is both online on disk and offline on tape as part of a long-term preservation policy. Zenodo supports managed access (with an access request workflow) as well as embargoing generally and during peer review. The base infrastructure of Zenodo is provided by CERN, a non-profit IGO. Projects are funded through grants. +- description: Zenodo is a generalist research data repository built and developed by OpenAIRE and CERN. id: zenodo name: Zenodo registry: @@ -1012,3 +1000,213 @@ id: mbat name: Mouse Brain Alignment Tool (MBAT) url: https://github.com/Turku-BioImaging/mouse-brain-alignment-tool +- description: Pure Python package for parsing and handling biological networks encoded in the Biological Expression Language (BEL). + id: pybel + name: PyBEL + registry: + biotools: pybel + url: https://github.com/pybel/pybel +- description: The OpenBEL Framework is an open-platform technology for managing, publishing, and using biological knowledge represented using the Biological Expression Language (BEL). + id: openbel + name: OpenBEL + url: https://github.com/OpenBEL/openbel-framework +- description: Velvet is an algorithm package that has been designed to deal with de novo genome assembly and short read sequencing alignments. + id: velvet + name: Velvet + registry: + biotools: velvet + tess: Velvet + url: https://github.com/dzerbino/velvet +- description: A tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies + id: raxml + name: RAxML + registry: + biotools: raxml + url: https://github.com/stamatak/standard-RAxML +- description: IQ-TREE is designed to efficiently handle large phylogenomic datasets, utilize multicore and distributed parallel computing for faster analysis, and automatically resume interrupted analyses through checkpointing. 
+ id: iqtree + name: IQ-TREE + registry: + biotools: iqtree + url: https://github.com/iqtree/iqtree2 +- description: MrBayes is a program for Bayesian inference and model choice across a wide range of phylogenetic and evolutionary models. MrBayes uses Markov chain Monte Carlo (MCMC) methods to estimate the posterior distribution of model parameters. + id: mrbayes + name: MrBayes + registry: + biotools: mrbayes + url: https://nbisweden.github.io/MrBayes/ +- description: BEAST is a cross-platform program for Bayesian phylogenetic analysis, estimating rooted, time-measured phylogenies using strict or relaxed molecular clock models. It uses Markov chain Monte Carlo (MCMC) to average over tree space and includes a graphical user interface for setting up analyses and tools for result analysis. + id: beast + name: BEAST + registry: + biotools: beast + url: https://www.beast2.org/ +- description: Rapid haploid variant calling and core genome alignment. + id: snippy + name: Snippy + registry: + biotools: snippy + url: https://github.com/tseemann/snippy +- description: Convert ThermoFinnigan RAW mass spectrometry files to the mzXML format. + id: readw + name: ReAdW + registry: + biotools: readw + url: https://github.com/PedrioliLab/ReAdW +- description: X! Tandem is open source software that can match tandem mass spectra with peptide sequences, in a process that has come to be known as protein identification. + id: x-tandem + name: X! Tandem + url: https://www.thegpm.org/TANDEM/ +- description: OMSSA (Open Mass Spectrometry Search Algorithm) is a tool to identify peptides in tandem mass spectrometry (MS/MS) data. The OMSSA algorithm uses a classic probability score to compute specificity. See also The NCBI C++ Toolkit and The NCBI C++ Toolkit Book.
+ id: omssa + name: OMSSA + registry: + biotools: omssa + url: https://ftp.ncbi.nlm.nih.gov/pub/lewisg/omssa/ +- description: MaxQuant is a quantitative proteomics software package designed for analyzing large mass-spectrometric data sets. It is specifically aimed at high-resolution MS data. + id: maxquant + name: MaxQuant + registry: + biotools: maxquant + tess: MaxQuant + url: https://www.maxquant.org/ +- description: The Absolute Protein Expression (APEX) Quantitative Proteomics Tool is a free and open source Java implementation of the APEX technique for the quantitation of proteins based on standard LC-MS/MS proteomics data. + id: apex + name: APEX + registry: + biotools: apex + url: http://sourceforge.net/projects/apexqpt/ +- description: Framework for processing and visualization of chromatographically separated and single-spectra mass spectral data. + id: xcms + name: xcms + registry: + biotools: xcms + tess: xcms + url: http://bioconductor.org/packages/release/bioc/html/xcms.html +- description: A Meta-Search Peptide Identification Platform for Tandem Mass Spectra + id: peparml + name: PepArML + registry: + biotools: peparml + url: https://peparml.sourceforge.net/ +- description: A commercial software package for NMR spectral processing that offers a semi-automated tool for spectral deconvolution, enabling interactive fitting of metabolite peaks to reference spectra and quantifying their concentrations. + id: chenomx + name: Chenomx + url: https://www.chenomx.com/ +- description: ResFinder identifies acquired genes and/or finds chromosomal mutations mediating antimicrobial resistance in total or partial DNA sequence of bacteria. + id: resfinder + name: ResFinder + registry: + biotools: resfinder + url: http://genepi.food.dtu.dk/resfinder +- description: Pathogenwatch provides species and taxonomy prediction for over 60,000 variants of bacteria, viruses, and fungi.
+ id: pathogenwatch + name: Pathogenwatch + url: https://pathogen.watch/ +- description: CellDesigner is a structured diagram editor for drawing gene-regulatory and biochemical networks. + id: celldesigner + name: CellDesigner + url: https://www.celldesigner.org/ +- description: 'A curated database containing nearly all published HIV RT and protease sequences: a resource designed for researchers studying evolutionary and drug-related variation in the molecular targets of anti-HIV therapy.' + id: hivdb-stanford + name: Stanford HIV Drug Resistance Database (HIVDB) + url: https://hivdb.stanford.edu/ +- description: Nextstrain is an open-source project to harness the scientific and public health potential of pathogen genome data. + id: nextstrain + name: Nextstrain + registry: + biotools: nextstrain.org + tess: Nextstrain + url: http://nextstrain.org +- description: g:GOSt performs functional enrichment analysis, also known as over-representation analysis (ORA) or gene set enrichment analysis, on an input gene list. + id: g-profiler + name: g:Profiler + registry: + biotools: gprofiler + tess: g:Profiler + url: https://biit.cs.ut.ee/gprofiler/gost +- description: The EuroHPC Joint Undertaking is a joint initiative between the EU, European countries and private partners to develop a World Class Supercomputing Ecosystem in Europe. + id: eurohpc + name: EuroHPC + url: https://eurohpc-ju.europa.eu/ +- description: BEAUti is a graphical user-interface (GUI) application for generating BEAST XML files. + id: beauti + name: BEAUti + url: https://beast.community/beauti.html +- description: QIIME 2 is a powerful, extensible, and decentralized microbiome analysis package with a focus on data and analysis transparency. + id: qiime2 + name: QIIME 2 + registry: + biotools: qiime2 + tess: QIIME 2 + url: https://docs.qiime2.org/ +- description: MEGAHIT is an ultra-fast and memory-efficient NGS assembler optimized for metagenomes.
+ id: megahit + name: MEGAHIT + registry: + biotools: megahit + url: https://github.com/voutcn/megahit +- description: A taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds. + id: kraken2 + name: Kraken 2 + url: https://ccb.jhu.edu/software/kraken2/ +- description: The COVID-19 Disease Map is an assembly of molecular interaction diagrams, established based on literature evidence. + id: covid19map + name: COVID-19 Disease Map + url: https://covid19map.elixir-luxembourg.org/ +- description: Freyja is a tool to recover relative lineage abundances from mixed SARS-CoV-2 samples from a sequencing dataset (BAM aligned to the Hu-1 reference). + id: freyja + name: Freyja + registry: + biotools: freyja + url: https://github.com/andersen-lab/Freyja +- description: The cojac package comprises a set of command-line tools to analyse co-occurrence of mutations on amplicons. + id: cojac + name: COJAC + registry: + biotools: cojac + url: https://github.com/cbg-ethz/cojac +- description: Lineagespot is a framework written in R that aims to identify SARS-CoV-2-related mutations based on a single variant file or a list of variant files. + id: lineagespot + name: Lineagespot + registry: + biotools: lineagespot + url: https://github.com/BiodataAnalysisGroup/lineagespot +- description: Kallisto is a program for quantifying abundances of transcripts from bulk and single-cell RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. + id: kallisto + name: Kallisto + registry: + biotools: kallisto + url: https://pachterlab.github.io/kallisto/about.html +- description: PiGx SARS-CoV-2 is a pipeline for analysing data from sequenced wastewater samples and identifying given lineages of SARS-CoV-2.
+ id: pigxs + name: PiGx SARS-CoV-2 Wastewater Sequencing Pipeline + url: https://github.com/BIMSBbioinfo/pigx_sars-cov-2 +- description: A GitHub repository from the CBG-ETHZ group offering tools for detecting SARS-CoV-2 variants in Switzerland. + id: cowwid + name: COWWID + url: https://github.com/cbg-ethz/cowwid +- description: A SARS-CoV-2 Contextual Data Specification from PHA4GE. + id: sars-pha4ge + name: SARS-CoV-2 Contextual Data Specification + url: https://github.com/pha4ge/SARS-CoV-2-Contextual-Data-Specification +- description: A data model to improve wastewater surveillance through interoperable data. + id: phes-odm + name: PHES-ODM + url: https://github.com/Big-Life-Lab/PHES-ODM +- description: A pipeline for lineage abundance estimation from wastewater sequencing data. + id: vlq + name: VLQ + url: https://github.com/baymlab/wastewater_analysis +- description: CFSAN Wastewater Analysis Pipeline to estimate the percentage of SARS-CoV-2 variants in a sample. + id: c-wap + name: C-WAP + url: https://github.com/CFSAN-Biostatistics/C-WAP +- description: Functional Enrichment Analysis and Network Construction + id: enrichr + name: Enrichr + registry: + biotools: enrichr + url: https://github.com/guokai8/EnrichR diff --git a/about/contributors.md b/about/contributors.md index c9dc6fde..4eb048d4 100644 --- a/about/contributors.md +++ b/about/contributors.md @@ -1,6 +1,7 @@ --- title: Contributors custom_editme: _data/CONTRIBUTORS.yaml +toc: false --- This project would not be possible without the many amazing community contributors. Infectious Diseases Toolkit is an open community project, and you are welcome to [join us](/contribute/)! 
diff --git a/about/editorial-board.md b/about/editorial-board.md index 7e8a4423..fdfa0ca0 100644 --- a/about/editorial-board.md +++ b/about/editorial-board.md @@ -4,7 +4,7 @@ title: Editorial board ## Meet the editorial board members -{% include contributor-carousel-selection.html custom="Bert Droesbeke, Eva Garcia Alvarez, Hedi Peterson, Katharina Lauer, Laura Portell Silva, Liane Hughes, Patricia Palagi, Rafael Andrade Buono, Rudolf Wittner, Martin Cook, Shona Cosgrove, Stian Soiland-Reyes, Romain David" %} +{% include contributor-carousel-selection.html custom="Bert Droesbeke, Eva Garcia Alvarez, Hedi Peterson, Katharina Lauer, Laura Portell Silva, Liane Hughes, Patricia Palagi, Rafael Andrade Buono, Rudolf Wittner, Shona Cosgrove, Stian Soiland-Reyes, Romain David" %} ## Responsibilities @@ -19,7 +19,7 @@ title: Editorial board In this section we would like to thank contributions of our past editorial members. -{% include contributor-tiles-all.html custom="Iris Van Dam" %} +{% include contributor-tiles-all.html custom="Iris Van Dam, Martin Cook" %} ## Contact diff --git a/attributing-credit/index.md b/attributing-credit/index.md index e2eae782..54abfdff 100644 --- a/attributing-credit/index.md +++ b/attributing-credit/index.md @@ -4,10 +4,6 @@ toc: false --- - -{% include section-navigation-tiles.html type="attributing_credit" except="index.md" %} - - **We are still working on the content for this page.** If you are interested in adding to the page, then: [Feel free to contribute](/contribute/){: class="btn btn-primary btn-lg rounded-pill"} diff --git a/data-analysis/human-biomolecular-data.md b/data-analysis/human-biomolecular-data.md index 49d9f913..1d544d90 100644 --- a/data-analysis/human-biomolecular-data.md +++ b/data-analysis/human-biomolecular-data.md @@ -112,7 +112,7 @@ There are several types of analysis that can be performed on human biomolecular - *Interaction databases*: {% tool "biogrid" %} and {% tool "intact" %} - *Network analysis*: {% tool 
"cytoscape" %} and {% tool "genemania" %} - **Metabolomics analysis**: This involves measuring the levels of small molecules (metabolites) in biological samples and comparing them across different conditions or groups of samples. This can help to identify biomarkers of disease or drug response. - - *Data processing*: {% tool "xcms" %}, {% tool "mzmine" %} and {% tool "openms" %} + - *Data processing*: {% tool "xcms-online" %}, {% tool "mzmine" %} and {% tool "openms" %} - *Statistical analysis*: {% tool "metaboanalyst" %} and {% tool "metsign" %} ## Postprocessing diff --git a/data-analysis/pathogen-characterisation.md b/data-analysis/pathogen-characterisation.md index 2c2018bc..7774f385 100644 --- a/data-analysis/pathogen-characterisation.md +++ b/data-analysis/pathogen-characterisation.md @@ -1,27 +1,183 @@ --- title: Pathogen characterisation -description: Generic workflows for different data types. -contributors: [] -no_robots: true -search_exclude: true -sitemap: false +description: Analysing Pathogen related data. 
+contributors: [Eva Garcia Alvarez, Francesco Messina, Fotis Psomopoulos, Rafael Andrade Buono] page_id: pc_data_analysis redirect_from: /pathogen-characterisation/data-analysis -rdmkit: - - name: - url: +related_pages: + showcase: [covid19_galaxy_project] training: - - name: - registry: - url: -# More information on how to fill in this metadata section can be found here https://www.infectious-diseases-toolkit.org/contribute/page-metadata + - name: SARS-CoV-2 data analysis + registry: Carpentries + url: https://gallantries.github.io/video-library/modules/covid-analysis + - name: SARS-CoV-2, viruses and bacteria data analysis + registry: Carpentries + url: https://gallantries.github.io/video-library/modules/one-health + - name: Pathway analysis with the MINERVA Platform + registry: Other + url: https://gxy.io/GTN:T00437 +rdmkit: + - name: Data Analysis + url: https://rdmkit.elixir-europe.org/data_analysis --- +## Introduction + +Data analysis for pathogen characterisation allows us to understand the evolution of pathogens and the relationships among different strains, and provides insights into host-pathogen interactions and drug resistance. The tasks can involve processing data collected from a diverse spectrum of sources, from both clinical and environmental samples. As in every data analysis procedure, the general workflow involves: + +- Preprocessing: Includes the initial steps required to prepare data, genomic and otherwise, for further analysis. + +- Analysis: Is the core stage where the actual detection and characterisation of pathogens occur. This stage employs many techniques for pathogen characterisation, such as Next-Generation Sequencing (NGS). + +- Postprocessing: Includes interpreting and validating the data obtained from the analysis stage, as well as integrating it into broader contexts. Moreover, this is often followed by reporting and communication, and archiving and data management.
+ + +Each stage is crucial for the accurate and comprehensive characterisation of pathogens, from the initial handling of samples to the final reporting and data management, and will be detailed below. +Scalable and reproducible data analysis activities enable rapid surveillance of epidemics of emerging and re-emerging infectious pathogens in foodborne and hospital settings, as well as in local community outbreaks. Ensuring reproducibility is critical for the usability of the analysis results. Following community-recognised best practices and the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) is fundamental for guaranteeing the trustworthiness of the results and enabling collaboration and sharing of information. + + +### General considerations + +Important points to consider when analysing pathogen data involved in a health emergency or epidemic outbreak are: +- Define the pathogen and the specific aspects to be investigated, e.g. genomic features of interest +- Collect suitable reference data about the pathogen of interest, preferentially from community-accepted repositories, e.g. {% tool "european-nucleotide-archive" %} and {% tool "gisaid" %}. It is worth noting that the right reference should be chosen taking into account mutation features, time of isolation, classification, phenotype, and genomic structure. +- Before analysing the data, define which specific aspect of the pathogen’s variability will be investigated. For example, if your aim is to describe the whole variability along the genome, the data should be compared with the whole reference genome. +- Define the type of data you are using, e.g. DNA or RNA-seq for viral genome characterisation +- Select the tools best suited for the analysis of your data +- Estimate the computing resources needed +- Define which computing infrastructure is most suitable, e.g.
cluster or cloud +- Ensure that you follow the FAIR principles when handling data +- Guarantee findability of the data and tools for all collaborators for reproducibility by providing your: + - Code + - Execution environment + - Workflows + - Data analysis execution, including parameters used + - Documentation that lists all parameters and other relevant information needed to reproduce the findings + + +### Existing approaches +- **Containers and environments**: Consider using containers and environments to collect and isolate dependencies for tools and pipelines. Environment management systems, such as Conda, help with reproducibility but are not inherently portable across platforms. Containers provide a higher level of portability, being able to encapsulate both the software and its dependencies. +- **Web-based code collaboration platforms**: Consider using a centralised location for software developers to store, manage, collaborate on, and share their code. For instance, {% tool "github" %}, {% tool "gitlab" %}, or {% tool "bitbucket" %}. +- **Workflow management systems**: Allow you to formalise your workflows in a standardised format and execute them locally or on a remote computer infrastructure. Popular systems are {% tool "nextflow" %} and {% tool "snakemake" %}. +- **Workflow platforms**: Allow users to manage data, run formalised workflows, and review their results. Platforms, such as {% tool "galaxy" %}, may offer multiple interfaces, e.g. web, GUI, and APIs. +- **Reference databases**: Collect suitable reference data about the pathogens to be investigated. {% tool "european-nucleotide-archive" %} and {% tool "gisaid" %} are examples of genomic databases with which researchers share their data. In this context, the European Pathogens Portal aggregates databases relating to pathogens, as well as hosts and their vectors. Other countries host their own instance of the {% tool "pathogens-portal" %}, e.g.
see the {% tool "swedish-pathogens-portal" %} [showcase](https://www.infectious-diseases-toolkit.org/showcase/swedish-pathogens-portal). +- **Workflow registries**: Register workflows in platforms, such as {% tool "workflowhub" %}, that facilitate sharing, versioning, and authorship attribution of the pipelines. + + +For more general information and solutions on data analysis, you may have a look at the content available on the [RDMkit data analysis page](https://rdmkit.elixir-europe.org/data_analysis#what-are-the-best-practices-for-data-analysis). +While the examples on this page focus on the genomic characterisation of pathogens, similar principles apply to other data types. + +## Preprocessing + +Data preprocessing is an initial step in data analysis involving the preparation of raw data for the main analysis. It is an important factor in quality control, and involves steps for cleaning the data by identifying inconsistencies, errors, and missing values. Preprocessing may also include data conversion and transformation steps to get the data into a format compatible with the expected inputs of the chosen analysis pipelines. + +### Considerations + +Some typical considerations involved in this step: +- **Data cleaning**: Find and correct errors in the data, for example by eliminating duplicates, removing overly short genomic reads, and trimming out uninformative content such as contaminating host data. +- **Quality control checks**: Should be conducted at each step to ensure that the data is suitable for the intended analysis. +- **Exclusion of low-quality samples**: Samples with low quality scores should be marked and removed. In genomics studies, samples with missing values, low sequencing depth, or contamination might be removed. + +### Existing approaches + +Preprocessing steps may depend on the technology used and the pathogen being studied and thus should be adjusted accordingly.
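To make the cleaning and filtering steps concrete, here is a minimal Python sketch of a per-read filter. The thresholds, the `(read_id, sequence, quality)` record format, and the function names are illustrative assumptions; real projects should rely on the dedicated tools listed below.

```python
# Illustrative read filter: drop reads that are too short or whose mean
# Phred quality is too low. MIN_LEN and MIN_QUAL are hypothetical
# thresholds chosen for demonstration only.

MIN_LEN = 50    # assumed minimum read length (bases)
MIN_QUAL = 20   # assumed minimum mean Phred quality score

def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Sanger/Phred+33 encoding)."""
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)

def keep_read(sequence, quality_string):
    """True if the read passes both the length and the quality check."""
    return len(sequence) >= MIN_LEN and mean_phred(quality_string) >= MIN_QUAL

def filter_reads(records):
    """Filter an iterable of (read_id, sequence, quality) tuples."""
    return [r for r in records if keep_read(r[1], r[2])]
```

The same pattern, per-read checks followed by filtering, is what dedicated trimming tools implement in optimised form, alongside adapter removal and more sophisticated quality heuristics.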
Some common approaches in genomics studies include: + +- Raw sequence quality check: {% tool "fastqc" %} +- Trimming out adapters and low-quality sequences: {% tool "trimmomatic" %} +- Quality checks: further information can be found on the [Quality control - Pathogen characterisation](/quality-control/pathogen-characterisation) page. + +## Analysis + +The analysis of data to characterise a pathogen of interest can involve methodologies from different fields. While genomics approaches are of common interest, the analysis of other data types, such as proteomics and metabolomics, and their combination can be of special importance. + +### Considerations + +- **Computational resources**: Verify that the appropriate computational resources are available. Depending on the volume and complexity of the data, you might need to make use of large computing clusters or cloud computing resources. +- **The location of your data**: Ensure that the chosen computing infrastructure and platforms have access to the data. It is important to consider the distance between the data storage and computing, as it can significantly impact transfer times and costs. +- **Document the steps**: Report every step of the data analysis process, including the software versions, the parameters utilised, the computing environment, and the reference genome used, as well as any “manual” data curation steps. More information on recording provenance can be found on the [Provenance pages](/provenance/). +- **Collaborative analysis**: It is important that partners have access to the data, tools, and workflows. It is crucial that systems are in place to track changes to the tools and workflows used, and that the history of modifications is accessible to all collaborators. + +### Existing approaches + +There are several types of analysis that can be performed on pathogen-related data, depending on the specific research question and type of data being analysed.
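At its simplest, genomic characterisation amounts to comparing a sample sequence against a chosen reference. The following Python sketch lists single-nucleotide differences between two pre-aligned sequences; it is a toy stand-in for real variant calling, and the function name and gap convention are assumptions made for illustration.

```python
# Toy single-nucleotide variant (SNV) listing between two pre-aligned
# sequences of equal length. Alignment gaps ('-') are skipped. Purely
# illustrative; production variant calling should use dedicated tools.

def list_snvs(reference, sample):
    """Return (1-based position, reference base, sample base) tuples
    for each mismatching, ungapped position."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned to equal length")
    return [
        (pos + 1, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base and "-" not in (ref_base, alt_base)
    ]
```

For example, `list_snvs("ACGT", "ACTT")` reports a single difference at position 3. Real pipelines must additionally handle insertions and deletions, sequencing errors, and base quality, which is why dedicated variant callers are preferred.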
Here are some solutions: +- Consider using the available computational infrastructure to scale up your analysis capabilities. This may include applying for access to large computing cluster resources with e.g. {% tool "eurohpc" %} or making use of public Galaxy servers such as {% tool "galaxy-europe" %}. +- **Genomic analysis**: Including whole genome sequencing (WGS), this analysis allows the interpretation of genetic information encoded along the genome (DNA or RNA). Genomic analysis can be used for a wide range of applications to characterise many aspects of pathogen variability, such as Variants of Concern (VOC) and antimicrobial resistance (AMR) profiles in bacteria. Examples of tools that allow us to take into account the genomic characteristics of pathogens (e.g. genomic structure and size, gene annotations, mobile genetic elements) are: + - Sequence Alignment: {% tool "bowtie2" %}, {% tool "bwa" %} and {% tool "samtools" %} + - Genome Assembly: {% tool "canu" %}, {% tool "velvet" %} and {% tool "spades" %} + - Phylogenetic Analysis: {% tool "clustalw" %}, {% tool "muscle" %}, {% tool "mafft" %}, {% tool "raxml" %} and {% tool "iqtree" %} + - Molecular Clock: {% tool "mrbayes" %}, {% tool "beast" %} and {% tool "beauti" %} + - Variant calling: {% tool "dragen-gatk" %}, {% tool "freebayes" %} and {% tool "varscan" %} + - Annotation: {% tool "annovar" %}, {% tool "snpeff" %}, {% tool "vep" %} and {% tool "dbnsfp" %} + - All-in-one Bioinformatic Tools: {% tool "snippy" %} + +- **Metagenomics analysis**: Sequencing all genetic material in a sample can provide comprehensive data about the composition of the microbial community. In the context of infectious diseases, it can aid in identifying multiple pathogens simultaneously in clinical as well as environmental samples.
Examples of tools in this type of analysis are: + - 16S rRNA sequencing: {% tool "qiime2" %} + - Shotgun sequencing: {% tool "spades" %} and {% tool "megahit" %} + - Assigning taxonomic labels: {% tool "kraken2" %} + +- **Proteomics analysis**: Proteomics, primarily utilising mass spectrometry techniques, offers powerful tools for examining proteins and their interplay. This can provide valuable insights into irregularities associated with infectious diseases and potentially uncover mechanisms of drug resistance. Examples of tools in this type of analysis are: + - Mass Spectrometry Data Extraction Software: {% tool "readw" %} + - Search Algorithms: {% tool "x-tandem" %}, {% tool "omssa" %} and {% tool "maxquant" %} + - Statistical Validation: {% tool "peparml" %} + - Quantitative Tools: {% tool "apex" %} and {% tool "maxquant" %} + +- **Metabolomics analysis**: This involves measuring the levels of small molecules (metabolites) produced by specific pathogens in biological samples, comparing them across different conditions or groups of samples. Examples of tools in this type of analysis are: + - Mass Spectrometry Software: {% tool "xcms" %} and {% tool "metaboanalyst" %} + - NMR Spectroscopy Software: {% tool "chenomx" %} + - Data Processing: {% tool "xcms" %}, {% tool "mzmine" %} and {% tool "openms" %} + +## Postprocessing +In pathogen characterisation, the postprocessing steps are crucial to evaluate and interpret the results. These steps are important to identify strain relationships and specific molecular variation patterns linked to particular phenotypes of pathogens (e.g. drug resistance, virulence, and transmission rate). Such results must be biologically meaningful and reproducible, also taking into account the clinical aspects and treatment implications. + +### Considerations + +Some considerations about postprocessing steps in pathogen characterisation include: +- **Interpretation**: It is important to interpret the results in a biologically meaningful context.
This should consider the following aspects: reporting the variability of specific pathogens; detecting new strains that could become a concern; and identifying specific genes or mutations associated with pathogenic variation. +- **Transformation**: Consider having postprocessing steps to ensure that outputs are transformed or converted into interoperable and open formats. This ensures that subsequent pipelines and collaborators can readily make use of the results. +- **Visualisation**: Visualise the results clearly so that they can be interpreted by all professionals involved, including those in clinical practice. + +### Existing approaches + +- **Spatial-temporal analysis and visualisation**: Combining phylogenetic, spatial distribution, and molecular clock analyses aids in designing strategies to control and prevent the spread of infectious diseases, as well as in the development of effective treatments and vaccines. + - Spatial distribution of strains: {% tool "nextstrain" %} +- **Drug resistance characterisation**: Genomic analysis can be used to characterise pathogens for specific resistance against drugs and help develop strategies to fight the spread of drug-resistant strains. + - Antimicrobial resistance (AMR): {% tool "resfinder" %} and {% tool "pathogenwatch" %} + - Viral drug resistance: {% tool "hivdb-stanford" %} +- **Interaction analysis and functional enrichment analysis**: Placing the identified protein interactions and regulatory networks in the context of the affected biological pathways allows for a better understanding of disease mechanisms and potential drug targets.
+ - Network analysis: {% tool "cytoscape" %} and {% tool "celldesigner" %} + - Gene enrichment analysis: {% tool "enrichr" %}, {% tool "go" %} and {% tool "g-profiler" %} + - Interaction Databases: {% tool "biogrid" %} and {% tool "intact" %} + - Integrative diagrams: + - A [disease map](https://disease-maps.org/) can be used to represent a conceptual model of the molecular mechanisms of a disease. An example is the {% tool "covid19map" %}. + +## Data analysis of wastewater surveillance for infectious diseases + +Wastewater surveillance has emerged as a valuable tool for monitoring infectious diseases, providing a non-invasive method to track the spread of pathogens within communities. This approach has gained significant attention during the COVID-19 pandemic, particularly for detecting and analysing SARS-CoV-2 variants. By analysing wastewater samples, researchers can identify the presence and prevalence of infectious agents, offering insights into public health trends. Here we focus on the analysis of wastewater with an emphasis on SARS-CoV-2. + +### Considerations + +The considerations for this specific field are very similar to those described in the previous sections, although some of the approaches used are particular to wastewater surveillance. + +### Existing approaches + +Several tools and workflows have been developed or adapted for the analysis of wastewater data, especially in the context of SARS-CoV-2 surveillance: + - **Specific Tools for SARS-CoV-2**: Certain tools (such as {% tool "freyja" %}, {% tool "cojac" %}, and {% tool "lineagespot" %}) are specifically designed for analysing SARS-CoV-2 data, providing capabilities such as variant detection and lineage tracking.
+ - **Repurposed Tools**: Originally developed for other types of genomic data, tools like {% tool "kallisto" %} or {% tool "kraken2" %} have been successfully applied to wastewater data analysis, offering high performance in read alignment and taxonomic classification. +- In addition, here are **several bioinformatics protocols and solutions** that could be used in the context of wastewater next-generation sequencing (NGS) data analysis: + - {% tool "pigxs" %}: provides a comprehensive solution for sequencing and analysing SARS-CoV-2 in wastewater. + - Detection of SARS-CoV-2 variants in Switzerland by genomic analysis of wastewater samples [medRxiv](https://www.medrxiv.org/content/10.1101/2021.01.08.21249379v2): COWWID, a GitHub repository from the CBG-ETHZ group offering tools for detecting SARS-CoV-2 variants in Switzerland + - [CDC Module 2.7](https://www.cdc.gov/amd/training/covid-toolkit/module2-7.html): Wastewater-based variant tracking for SARS-CoV-2 + - The Public Health Alliance for Genomic Epidemiology GitHub organisation makes available a mapping to the {% tool "european-nucleotide-archive" %}: {% tool "sars-pha4ge" %} + - {% tool "phes-odm" %} as an open data model for wastewater surveillance + - Viral Lineage Quantification (VLQ), Kallisto-Approach: [Lineage abundance estimation for SARS-CoV-2 in wastewater using transcriptome quantification techniques](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02805-9) and corresponding repository at {% tool "vlq" %} + - [Performance benchmark of tools](https://peerj.com/articles/14596/), evaluating tools like Kraken2, Kallisto, and Freyja, implemented in {% tool "c-wap" %} + - Wastewater quality control workflow in GalaxyTrakr [(SSquAWK4)](https://dx.doi.org/10.17504/protocols.io.kxygxzk5dv8j/v9).
Further quality control aspects are discussed in the [Quality Control - Pathogen Characterisation page](/quality-control/pathogen-characterisation). + - ECDC [Guidance document](https://www.ecdc.europa.eu/sites/default/files/documents/Guidance-for-representative-and-targeted-genomic-SARS-CoV-2-monitoring-updated-with%20erratum-20-May-2021.pdf) for representative and targeted genomic SARS-CoV-2 monitoring + + + -**We are still working on the content for this page.** If you are interested in adding to the page, then: -[Feel free to contribute](/contribute/){: class="btn btn-primary btn-lg rounded-pill"} -This is a community-driven website, so contributions are welcome! You will, of course, be listed as a contributor on the page. -New content is announced on the [home page](/) and [news page](/about/news), so please check for updates there. You can also watch for changes on this page by using a free service like [Visual Ping](https://visualping.io/) or [Distill Web Monitor](https://distill.io/), or by using a [browser add-on](https://chrome.google.com/webstore/detail/distill-web-monitor/inlikjemeeknofckkjolnjbpehgadgge?hl=en). diff --git a/data-communication/index.md b/data-communication/index.md index aa5b9694..76aafef3 100755 --- a/data-communication/index.md +++ b/data-communication/index.md @@ -12,8 +12,6 @@ rdmkit: url: https://rdmkit.elixir-europe.org/processing#what-is-data-processing --- -{% include section-navigation-tiles.html type="data_communication" except="index.md" %} - ## Introduction Data can only reach its full potential when communicated well to the audience. In a crisis situation, people are thirsty for information, and clear data communication becomes especially crucial. Communicating data as tables might be the easiest for data providers, but the trends and effects associated with infectious diseases are best shown using data visualisations.
diff --git a/data-sources/human-biomolecular-data.md b/data-sources/human-biomolecular-data.md index 7d1e0440..9c7e34fd 100644 --- a/data-sources/human-biomolecular-data.md +++ b/data-sources/human-biomolecular-data.md @@ -70,9 +70,9 @@ Please note that these considerations are general in nature and may vary dependi ### Existing approaches -- **Public databases:** Various publicly accessible databases serve as repositories for human biomolecular data, such as the National Center for Biotechnology Information ([NCBI](https://www.ncbi.nlm.nih.gov/)) databases (e.g., {% tool "genbank" %}, {% tool "geo" %}, {% tool "sra" %} and European Bioinformatics Institute ({% tool "ebi" %}) databases (e.g., {% tool "european-nucleotide-archive" %}, {% tool "arrayexpress" %}). +- **Public databases:** Various publicly accessible databases serve as repositories for human biomolecular data, such as the {% tool "ncbi" %} databases (e.g., {% tool "genbank" %}, {% tool "geo" %}, {% tool "sra" %}) and European Bioinformatics Institute ({% tool "ebi" %}) databases (e.g., {% tool "european-nucleotide-archive" %}, {% tool "arrayexpress" %}). - **Controlled access repositories:** Some data deposition platforms, like dbGaP ({% tool "dbgap" %}) and EGA ({% tool "ega" %}), adopt a controlled access model to protect sensitive human biomolecular data. Researchers interested in accessing the data need to request permission and comply with specific data usage policies. -- **Data integration platforms:** Platforms like the Global Alliance for Genomics and Health ([GA4GH](https://www.ga4gh.org/)) provide frameworks and standards for federated data access and integration across multiple repositories. These initiatives aim to facilitate the aggregation and analysis of human biomolecular data from diverse sources while maintaining data privacy and security. 
+- **Data integration platforms:** Platforms like the {% tool "ga4gh" %} provide frameworks and standards for federated data access and integration across multiple repositories. These initiatives aim to facilitate the aggregation and analysis of human biomolecular data from diverse sources while maintaining data privacy and security. - **Data citation and DOI assignment:** To acknowledge and promote the contributions of researchers who deposit human biomolecular data, many repositories assign unique digital object identifiers (DOIs) to datasets. This enables proper citation and recognition of the deposited data, enhancing its visibility and impact. - **Data submission portals:** Some repositories offer user-friendly web portals or submission systems that guide researchers through the process of depositing human biomolecular data. These portals often provide templates, validation checks, and step-by-step instructions to ensure the completeness and quality of the deposited data. - **Consortium-specific databases:** Collaborative research initiatives often establish dedicated databases for sharing and depositing human biomolecular data, such as The Cancer Genome Atlas ({% tool "tcga" %}) for cancer genomics data or the Genotype-Tissue Expression ({% tool "gtex" %}) project for gene expression data across different tissues. @@ -208,7 +208,7 @@ Consequently, we have compiled some of the main tools, portals, and data sharing - {% tool "fega" %}, which provides secure controlled access sharing of sensitive patient and research subject data sets relating to COVID-19 while complying with stringent privacy national laws. - {% tool "covid-19-data-portal" %}, which brings together and continuously updates relevant COVID-19 datasets and tools, will host sequence data sharing and will facilitate access to other SARS-CoV-2 resources. - You can find further information about the Covid-19 Data Portal in the link [here](https://rdmkit.elixir-europe.org/covid19_data_portal). 
+ You can find further information about the Covid-19 Data Portal on [RDMkit](https://rdmkit.elixir-europe.org/covid19_data_portal). ## Data access and transfer @@ -231,7 +231,7 @@ When looking for solutions to human biomolecular data access, you should conside - **Scalability and Performance:** Look for solutions capable of efficiently handling large-scale biomolecular data sets while maintaining optimal performance, supporting advanced analysis tools for meaningful insights. - **User-Friendly Interface:** Opt for solutions with intuitive interfaces and flexible access controls, enabling researchers of varying technical backgrounds to access, analyze, and interpret data effectively. -When looking for solutions to data transfer, you can check [this](https://rdmkit.elixir-europe.org/data_transfer) documentation. +When looking for solutions to data transfer, you can check [RDMkit](https://rdmkit.elixir-europe.org/data_transfer). ### Existing approaches @@ -247,7 +247,7 @@ When looking for solutions to data transfer, you can check [this](https://rdmkit - By depositing your data to one of the existing controlled access repositories, they will already show the data use conditions (e.g. [EGAD00001007777](https://ega-archive.org/datasets/EGAD00001007777)) - A data access committee (DAC) is a group responsible for reviewing and approving requests for access to sensitive data, such as human biomolecular data. Its role is to ensure that requests are in compliance with relevant laws and regulations, that data is being used for legitimate scientific purposes, and that privacy and security are being maintained. To know more about what is a DAC and how to become one, you can check the [European Genome-phenome Archive - Data Access Committee](https://ega-archive.org/submission/data_access_committee) website. -You can find further information about sharing human data [here](https://rdmkit.elixir-europe.org/human_data#sharing-and-reusing-of-human-data). 
+You can find further information about sharing human data on [RDMkit](https://rdmkit.elixir-europe.org/human_data#sharing-and-reusing-of-human-data). ## Data harmonisation @@ -268,6 +268,6 @@ Thanks to the Sars-CoV-2 outbreak, the scientific community has established stan ### Existing approaches -* When looking for solutions to standards, schemas, ontologies and vocabularies, you can check [this](https://rdmkit.elixir-europe.org/metadata_management#how-do-you-find-appropriate-standard-metadata-for-datasets-or-samples) documentation. +* When looking for solutions to standards, schemas, ontologies and vocabularies, you can check [the RDMkit](https://rdmkit.elixir-europe.org/metadata_management#how-do-you-find-appropriate-standard-metadata-for-datasets-or-samples) for documentation. * {% tool "fairsharing" %} is also a good resource to find metadata standards that can be useful for your research. diff --git a/showcase/knowledge-graph-generator.md b/showcase/knowledge-graph-generator.md index 365c4048..27ef87c1 100644 --- a/showcase/knowledge-graph-generator.md +++ b/showcase/knowledge-graph-generator.md @@ -3,14 +3,14 @@ title: Knowledge Graph Generator (KGG) - A fully automated workflow for creating contributors: [Reagon Karki] description: Open source workflow integrating biological databases for FAIR data compliant Knowledge Graphs affiliations: [Fraunhofer ITMP, EU-OpenScreen] -page_id: knowledge-graph-generator +page_id: knowledge_graph_generator --- ## Introduction Knowledge Graphs (KGs) are advanced forms of networks that capture the semantics of the constituent entities and the interactions among them. They facilitate ontology-driven data consolidation via integration/harmonization of heterogeneous data and serve as a graphical database. Such KGs in place have the potential to answer complex queries and form the basis of domain-specific analyses. 
In context of biomedicine and life sciences, KGs represent disease-associated biological and pathophysiological phenomena by systematically assembling various inter-related entities such as proteins and their biological processes, molecular functions and pathways, chemicals and their mechanism of actions and adverse effects and so on. They have been deployed in several use cases and downstream analyses related to healthcare, pharmaceutical and clinical settings. However, the process of creating KGs is expensive and time-consuming because it requires a lot of manual curation. Moreover, machine-aided methods such as text-mining workflows and Large Language Models (LLMs) have their own shortcomings and are improving gradually. -This showcase introduces a fully automated workflow, namely Knowledge Graph Generator (KGG), for creating KGs that represent chemotype and phenotype of diseases. The KGG embeds underlying schema of curated public databases to retrieve relevant knowledge which is regarded as the gold standard for high quality data. The KGG is leveraged on our previous contributions to the BY-COVID project where we developed workflows for identification of bio-active analogs for fragments identified in COVID-NMR studies ([Berg, H et al.](https://doi.org/10.1007/s00259-021-05215-4)) and representation of Mpox biology ([Karki, R et al.](https://doi.org/10.1093/bioadv/vbad045)). The programmatic scripts and methods for KGG are written in python (version 3.10) and are available ([here](https://github.com/Fraunhofer-ITMP/kgg)). +This showcase introduces a fully automated workflow, namely Knowledge Graph Generator (KGG), for creating KGs that represent chemotype and phenotype of diseases. The KGG embeds underlying schema of curated public databases to retrieve relevant knowledge which is regarded as the gold standard for high quality data. 
The KGG builds on our previous contributions to the BY-COVID project where we developed workflows for identification of bio-active analogs for fragments identified in COVID-NMR studies ([Berg, H et al.](https://doi.org/10.1007/s00259-021-05215-4)) and representation of Mpox biology ([Karki, R et al.](https://doi.org/10.1093/bioadv/vbad045)). The programmatic scripts and methods for KGG are written in Python (version 3.10) and are available [on GitHub](https://github.com/Fraunhofer-ITMP/kgg). ## Who is the showcase intended for? @@ -18,7 +18,7 @@ The KGG is developed for a broad spectrum of researchers and scientists, especia ## What is the showcase? -{% include image.html file="/kgg_showcase_overview.png" caption="Figure 1. A schematic representation of the KGG workflow depicting its three phases. The python-based workflow fetches real-time knowledge from curated databases and uses ([OpenBEL](https://doi.org/10.1016/j.drudis.2013.12.011)) framework to systematically encode the knowledge and relevant metadata." %} +{% include image.html file="/kgg_showcase_overview.png" caption="Figure 1. A schematic representation of the KGG workflow depicting its three phases. The python-based workflow fetches real-time knowledge from curated databases and uses the OpenBEL framework ([Slater, T](https://doi.org/10.1016/j.drudis.2013.12.011)) to systematically encode the knowledge and relevant metadata." %} The automated workflow creating disease-specific KGs is subdivided into three phases and are described below: @@ -26,7 +26,7 @@ Phase I: Disease lookup and identification - The KGG workflow uses standard dise Phase II: Real-time knowledge retrieval - The identified disease identifier from Phase I is used as a query for curated databases to retrieve relevant disease associated knowledge in real time. This is achieved by embedding the APIs of OTP, ChEMBL, UniProt, Integrated Interaction Database (IID) and GWAS Central into our programmatic scripts and methods.
-Phase III: KG compilation and generation - The retrieved knowledge from Phase II is stored as semantic triples (i.e., subject-predicate-object) using OpenBEL framework, which are both human and computer-readable. The language enables systematic representation of biological and molecular interactions by enforcing usage of standard ontologies. The implementation was performed using the open-source ([PyBEL](https://doi.org/10.1093/bioinformatics/btx660)) framework. It is a resource developed to help with triples formation, meta-data annotation, data parsing, validation, compilation and visualization of KG. It also offers a wide-range of functions to explore, query, and analyze KGs. The KGs can be exported to various standard formats such as json, csv, sql, graphml, and Neo4j, allowing comparison and integration with other KGs. +Phase III: KG compilation and generation - The retrieved knowledge from Phase II is stored as semantic triples (i.e., subject-predicate-object) using the {% tool "openbel" %} framework, which are both human and computer-readable. The language enables systematic representation of biological and molecular interactions by enforcing usage of standard ontologies. The implementation was performed using the open-source {% tool "pybel" %} framework. It is a resource developed to help with triples formation, meta-data annotation, data parsing, validation, compilation and visualization of KG. It also offers a wide range of functions to explore, query, and analyze KGs. The KGs can be exported to various standard formats such as json, csv, sql, graphml, and Neo4j, allowing comparison and integration with other KGs. ## What can you use the tool for?