Domain page Human Pathogen Genomics #1263

Merged
merged 26 commits into from
Aug 4, 2023
Commits
0191f48
Create human_pathogen_genomics.md
wna-se May 22, 2023
0c29ba3
last changed at May 10, 2023 10:46 AM, pushed by Wolmar Nyberg Åkerström
May 10, 2023
ff747aa
Add Human pathogen genomics to navigation
wna-se May 22, 2023
c726488
Conclusion of day RDMkit hackathon day 1
May 22, 2023
3c3c169
Conclusion of day 2 RDMkit hackathon activities
May 23, 2023
f3932d9
Merge branch 'elixir-europe:master' into master
wna-se May 24, 2023
65e0745
Conclusion of day 3 RDMkit hackathon activities
May 24, 2023
9531b9a
Fixed issues with rendering the “pipe” | character
May 26, 2023
216a8ba
Post-contentathon follow-up 1
May 31, 2023
02e8838
Post-contentathon work session 2
Jun 2, 2023
f622e04
Update before WP9 meeting
Jun 12, 2023
d511efb
Attempt to fix links to related_pages
Jun 13, 2023
c92818b
Attempt to fix links to related_pages
Jun 13, 2023
31c722d
Added GenEpiO Consortium
Jun 13, 2023
b721494
Added PHA4GE
Jun 13, 2023
a42cc6c
Attempt at adding Hypothes.is for comments
wna-se Jun 16, 2023
0d59576
General phrasing about protecting host related and sensitive contextu…
Jun 17, 2023
2be3777
Structural update to several sections to make the text more consisten
Jun 19, 2023
2471dd4
Restructured information about contextual information during data col…
Jun 19, 2023
d40d7af
Spelling and grammar check
Jun 19, 2023
7e33b54
Added references to GA4GH GDPR Briefs
Jun 19, 2023
6e50513
Fix: Link to GDPR Briefs is incompatible with markdown
Jun 19, 2023
1f9052a
Minor updates to punctuation and emphasis
Jun 20, 2023
fac6c73
Revert to origin
wna-se Jun 21, 2023
715f570
Fixed inconsistencies in the headings
Jun 28, 2023
613f50b
Added page description
wna-se Jul 10, 2023
2 changes: 2 additions & 0 deletions _data/sidebars/data_management.yml
@@ -33,6 +33,8 @@ subitems:
url: /epitranscriptome_data
- title: Human data
url: /human_data
- title: Human pathogen genomics
url: /human_pathogen_genomics
- title: Intrinsically disordered proteins
url: /intrinsically_disordered_proteins
- title: Marine metagenomics
2 changes: 1 addition & 1 deletion index.html
@@ -207,4 +207,4 @@ <h2 class="mt-5 mb-3">RDMkit in numbers</h2>

});

</script>
</script>
158 changes: 158 additions & 0 deletions pages/your_domain/human_pathogen_genomics.md
@@ -0,0 +1,158 @@
---
title: Human pathogen genomics
description: Data management solutions for human pathogen genomics
contributors: [Diana Pilvar, Espen Åberg, Wolmar Nyberg Åkerström, Rafael Andrade Buono]

Review comment by @floradanna (Collaborator), Aug 4, 2023:
Please add two contributors (Diana Pilvar and Espen Åberg) to the contributors.yaml file.

page_id: human_pathogen_genomics
related_pages:
your_tasks:
- data_brokering
- metadata_management
- data_transfer
- data_protection
- data_quality
tool_assembly:
- covid19_data_portal
your_domain:

Review comment (Collaborator):
Please delete your_domain from the related pages. A domain page cannot list other domain pages as related.

- rare_disease_data
- human_data
# More information on which page id you can use can be found at https://rdmkit.elixir-europe.org/website_overview
#training:
# - name:
# registry:
# url:
# More information on how to fill in this metadata section can be found here https://rdmkit.elixir-europe.org/page_metadata
---

## Introduction
The human pathogen genomics domain focuses on studying the genetic code of organisms that cause disease in humans. Studies to identify and understand pathogens are conducted across different types of organisations ranging from research institutes to regional public health authorities. The aims can include urgent outbreak response, prevention measures, and developing remedies such as treatments and vaccines.

Data management challenges in this domain include the potential urgency of data sharing and secondary use of data across initiatives emerging from research, public health and policy. While the pathogenic organisms are the object of interest, there are many considerations to account for when dealing with samples collected from patients, pathogen surveillance, and human research subjects.

The genomic data can represent anything from the genetic sequence of a single pathogen isolate to various fragments of genetic material from a flora of pathogens in a larger population. Other data can represent a wide range of contextual information about the human host, the disease, and various environmental factors.

## Planning a study with pathogen genomic data

### Description
While the objects of interest in this domain are pathogens, the data is usually derived from samples originating from patients and human research subjects. This means that you must plan to either remove or handle [human data](human_data) during your study.

### Considerations

* What legal and ethical aspects do you need to consider?
    * Can you separate pathogen and human host material and data?
    * What data protection measures should be implemented in contracts and procedures dealing with suppliers and collaborators?
    * What is the appropriate scope for the legal and ethical agreements necessary for the study?
    * How should statements related to data processing be phrased to allow timely and efficient data sharing?
    * How much time would be required to negotiate access to the samples and data for the study?
* What public health and research initiatives should you consider aligning with?
    * What data could be shared with or reused from other initiatives during the project?
    * How will you align your practices with these initiatives to maximise the impact of the data and insight generated by the project?
    * How will you share data with your collaborators and other initiatives?
* What conventions will you adopt when planning your study?
    * What existing protocols should you consider adopting for sample preparation, sequencing, variant calling, etc.?
    * What conventions should you adopt for documenting your research?

### Solutions

#### Working with human data
* Ensure that the project’s procedures conform with good practices for handling [human data](human_data). In particular, see the following sections of the RDMkit:
    * [Planning for projects with human data](human_data#planning-for-projects-with-human-data)
    * [Processing and analysing human data](human_data#processing-and-analysing-human-data)

#### Isolate pathogen from host information
* Depending on the pathogen, how it interacts with the host, or the methods applied, it can be possible to generate clean isolates that do not contain host related material. Data produced from a clean isolate could potentially be handled with few restrictions, while other data will be considered personal and [sensitive](sensitive_data) data that need [protection](data_protection).

#### Public health initiatives
* National and international recommendations from public health authorities, epidemic surveillance programs and research data communities should be considered when planning a new study or surveillance programme. In particular, you could consult conventions for relevant surveillance programs while considering widely adopted guidelines for research documentation, and instructions from the data sharing platforms.
    * The [European Centre for Disease Prevention and Control (ECDC)](https://www.ecdc.europa.eu/en) coordinates [Disease and laboratory networks](https://www.ecdc.europa.eu/en/about-ecdc/what-we-do/partners-and-networks/disease-and-laboratory-networks) and also issues [Surveillance and reporting protocols](https://www.ecdc.europa.eu/en/search?s=protocol) and other [Technical guidance on sequencing](https://www.ecdc.europa.eu/en/search?s=protocol).
    * See the [WHO genomic surveillance strategy](https://www.who.int/initiatives/genomic-surveillance-strategy) and [guidance on implementation for maximum impact on public health](https://apps.who.int/iris/handle/10665/338480). There are also published reports that advise on [Implementing Quality Management Systems in Public Health Laboratories](https://doi.org/10.1128/jcm.00261-19).
    * The US Centers for Disease Control and Prevention (CDC) offers guidance on [Pathogen genomics](https://www.cdc.gov/genomics/pathogen/index.htm) for its work in monitoring, investigating, and controlling infectious diseases.
    * Refer to [National resources](national_resources) for information on regional authorities and national considerations.

#### Sequencing experiments
* Good practices for genome experiments suggest that the documentation, at a minimum, should describe the design of the study or surveillance program, the collected specimens and how the samples were prepared, the experimental setup and protocols, and the analysis workflow.
    * Adopt recommendations specifically for genomics and pathogen genomics such as [Ten simple rules for annotating sequencing experiments](https://doi.org/10.1371/journal.pcbi.1008260).
    * Refer to general guidance on how to provide [documentation and metadata](Documentation_and_metadata) during your project.
* Adopt standards, conventions and robust protocols to maximise the reuse potential of the data in parallel initiatives and your future projects.
    * The Genomic Standards Consortium (GSC) develops and maintains the [GSC Minimum Information about any Sequence (MIxS)](https://fairsharing.org/FAIRsharing.9aa0zp) set of core and extended descriptors for genomes and metagenomes with associated samples and their environment to guide scientists on how to capture the metadata essential for high quality research.
    * The GenEpiO Consortium develops and maintains the [Genomic Epidemiology Application Ontology (GenEpiO)](https://doi.org/10.25504/FAIRsharing.y1mmbv) to support data sharing and integration specifically for foodborne infectious disease surveillance and outbreak investigations.
    * The [Public Health Alliance for Genomic Epidemiology (PHA4GE)](https://pha4ge.org/) supports openness and interoperability in public health bioinformatics. The [Data Structures working group](https://pha4ge.org/working-groups/) develops, adapts and standardises data models for microbial sequence data, contextual metadata, results and workflow metrics, such as the [SARS-CoV-2 contextual data specification](https://github.com/pha4ge/SARS-CoV-2-Contextual-Data-Specification).
    * ISO (the International Organization for Standardization) has issued standards that can be referenced when designing or commissioning genomic sequencing and informatics services, such as
        * [ISO 20397-1:2022 Biotechnology — Massively parallel sequencing — Part 1: Nucleic acid and library preparation](https://www.iso.org/standard/74054.html)
        * [ISO 20397-2:2021 Biotechnology — Massively parallel sequencing — Part 2: Quality evaluation of sequencing data](https://www.iso.org/standard/67895.html)
        * [ISO/TS 20428:2017 Health informatics — Data elements and their metadata for describing structured clinical genomic sequence information in electronic health records](https://www.iso.org/standard/67981.html)
        * [ISO/TS 22692:2020 Genomics informatics — Quality control metrics for DNA sequencing](https://www.iso.org/standard/73693.html)
        * [ISO 23418:2022 Microbiology of the food chain — Whole genome sequencing for typing and genomic characterization of bacteria — General requirements and guidance](https://www.iso.org/obp/ui/#iso:std:iso:23418:ed-1:v1:en)

## Collecting and processing pathogen genomic data

### Considerations

* What information should you consider recording when collecting data?
    * What should you note when collecting, storing and preparing the samples?
    * How will you capture information about the configuration and quality of the sequencing results?
    * How will you ensure that the information captured is complete and correct?
* What data and file formats should you consider for your project?
    * What are the *de facto* standards used for the experiment type and downstream analysis pipelines?
    * Where are the instrument specific aspects of the data and file formats documented?
* What existing data will you integrate or use as a reference in your project?
    * What reference genome(s) will you need access to?
    * What is the recommended citation for the data and their versions?

### Solutions

#### Filtering genomic reads corresponding to human DNA fragments

* Data files with reads produced by sequencing experiments sometimes contain fragments of the host organism’s DNA. When the host is a human research subject or patient, these fragments can be masked or removed to produce files that could potentially be handled with fewer restrictions. The approach chosen to mask the host associated reads leads to different trade-offs. Make sure to include this as a factor in your risk assessment.
    * Mapping to (human) host reference genomes [can inadvertently leave some host associated reads unmasked](https://doi.org/10.1099%2Fmgen.0.000393).
    * Mapping to pathogen reference genomes can inadvertently mask some pathogen associated reads and still leave some host associated reads unmasked.
    * [Removal of human reads from SARS-CoV-2 sequencing data \| Galaxy training](https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/human-reads-removal/tutorial.html)
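
The host-mapping approach described above can be sketched in a few lines. This is only an illustration, assuming that an aligner of your choice has already reported which read names map to the human reference. The helper names `parse_fastq` and `drop_host_reads` and the example reads are hypothetical, and a production pipeline should follow a validated protocol such as the Galaxy tutorial linked above.

```python
from typing import Iterable, Iterator, List, Set, Tuple

FastqRecord = Tuple[str, str, str]  # (read name, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumes no wrapped sequences)."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    # An incomplete trailing record is ignored in this sketch.
    for start in range(0, len(rows) - 3, 4):
        header, sequence, _separator, quality = rows[start:start + 4]
        # The read name is the text after '@' up to the first whitespace.
        yield header[1:].split()[0], sequence, quality


def drop_host_reads(records: Iterable[FastqRecord],
                    host_read_names: Set[str]) -> List[FastqRecord]:
    """Keep only reads whose name was NOT reported as mapping to the host."""
    return [record for record in records if record[0] not in host_read_names]


fastq_text = """@read1
ACGT
+
IIII
@read2 extra-description
GGCC
+
IIII
"""
host_mapped = {"read2"}  # hypothetical output of a mapping step
kept = drop_host_reads(parse_fastq(fastq_text.splitlines()), host_mapped)
print([name for name, _, _ in kept])
```

Reads that the mapping step fails to assign to the host pass through this filter unchanged, which is the residual risk noted in the list above.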


#### Contextual information about the sample

* Information about the host phenotype, context and disease is often necessary to answer questions in a research study or policy perspective. Other contextual information can include non-host related environmental factors, such as interactions with other pathogens, drugs and geographic proliferation. It can also include information about the sampled material and how it was processed for sequencing.
* Adopt common reporting checklists, data dictionaries, terms and vocabularies to simplify data sharing across initiatives.
* ENA hosts a selection of [sample checklists](https://www.ebi.ac.uk/ena/browser/checklists) that can be used to annotate sequencing experiments, including checklists derived from the [MIxS consortium](http://w3id.org/mixs). The [ENA virus pathogen reporting standard checklist](https://www.ebi.ac.uk/ena/browser/view/ERC000033) has been widely used for SARS-CoV-2 genomic studies.
* Reuse terms and definitions from existing vocabularies, such as the [Phenotypic QualiTy Ontology](https://www.ebi.ac.uk/ols4/ontologies/pato), [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy), [Disease Ontology](https://disease-ontology.org), [Chemical Entities of Biological Interest](https://bioportal.bioontology.org/ontologies/CHEBI/?p=summary), and [UBER anatomy ONtology](https://bioportal.bioontology.org/ontologies/UBERON).
* The [PHA4GE SARS-CoV-2 contextual data specification](https://github.com/pha4ge/SARS-CoV-2-Contextual-Data-Specification) is a comprehensive example including a reporting checklist, related protocols, and mappings to relevant vocabularies and data sharing platforms.
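
As a rough sketch of how a reporting checklist can be enforced at collection time, the snippet below checks a sample record for required fields and for values from a controlled vocabulary. The field names and the allowed values are hypothetical placeholders, not the content of any ENA, MIxS or PHA4GE checklist; in practice they would be taken from the checklist you adopt.

```python
from typing import Dict, List

# Hypothetical checklist: replace with the fields of the checklist you adopt.
REQUIRED_FIELDS = {"sample_id", "collection_date", "geographic_location", "host_species"}
# Hypothetical controlled vocabulary for one field.
ALLOWED_HOST_SPECIES = {"Homo sapiens"}


def check_sample(record: Dict[str, object]) -> List[str]:
    """Return a list of problems found in one sample record (empty if none)."""
    problems = []
    for field in sorted(REQUIRED_FIELDS):
        value = record.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing required field: {field}")
    host = record.get("host_species")
    if host and host not in ALLOWED_HOST_SPECIES:
        problems.append(f"host_species not in vocabulary: {host}")
    return problems


complete = {
    "sample_id": "S1",
    "collection_date": "2023-05-22",
    "geographic_location": "Sweden",
    "host_species": "Homo sapiens",
}
incomplete = {"sample_id": "S2", "host_species": "human"}

print(check_sample(complete))
print(check_sample(incomplete))
```

Running such a check before submission makes gaps visible while the information can still be recovered from the lab or the clinic.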

#### Generating genomic data
* Establish protocols and document the steps taken in the lab to process the sample and in the computational workflow to prepare the resulting data. Make sure to keep information from quality assurance procedures and strive to make your labwork and computational process as reproducible as possible.
* [High-Throughput Sequencing \| LifeScienceRDMLookUp](https://elixir.no/rdm-lookup/sequencing)
* [The Beyond One Million Genomes (B1MG)](https://b1mg-project.eu) project provides guidelines that cover the minimum [quality requirements](https://zenodo.org/record/5018495) for the generation of genome sequencing data.
* Data repositories generally have information about recommended [data file formats](Data_publication) and [metadata](metadata_management).
* The [FAIR Cookbook](https://faircookbook.elixir-europe.org/content/home.html) provides instructions on [validation of file formats](https://faircookbook.elixir-europe.org/content/recipes/interoperability/fastq-file-format-validators.html)
* A good place to look for scientific and technical information about data quality validation software tools for pathogenomics is [bio.tools](https://bio.tools/t?page=1&q=validation&sort=score&topicID=%22topic_3168%22).
* The [Infectious Diseases Toolkit (IDTk)](https://www.infectious-diseases-toolkit.org/) has a showcase on [An automated SARS-CoV-2 genome surveillance system built around Galaxy](https://www.infectious-diseases-toolkit.org/showcase/covid19-galaxy).
* The Galaxy Training Network provides free online [training materials on quality control](https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html).
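
To give a feel for what file format validation and quality control involve, here is a minimal sketch that checks the structure of a single four-line FASTQ record and computes its mean base quality. It assumes Phred+33 quality encoding, and the function names are hypothetical. It is not a substitute for the dedicated validators and quality control tools referenced above.

```python
from typing import List


def validate_fastq_record(header: str, sequence: str,
                          separator: str, quality: str) -> List[str]:
    """Return structural problems found in one four-line FASTQ record."""
    errors = []
    if not header.startswith("@"):
        errors.append("header must start with '@'")
    if not separator.startswith("+"):
        errors.append("separator must start with '+'")
    if len(sequence) != len(quality):
        errors.append("sequence and quality lengths differ")
    return errors


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (assumes Phred+33 by default)."""
    if not quality:
        raise ValueError("empty quality string")
    return sum(ord(char) - offset for char in quality) / len(quality)


print(validate_fastq_record("@read1", "ACGT", "+", "IIII"))
print(validate_fastq_record("read1", "ACGT", "+", "III"))
print(mean_phred("IIII"))
```

Keeping the output of such checks alongside the data is one way to retain information from quality assurance procedures.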


## Sharing and preserving pathogen genomic data

### Considerations

* What data need to be preserved by the project and for how long?
* What is preserved by others and how would someone find and access the data?
* What databases should you use to share human pathogen genomics data?
* What other research information (such as protocols, computational tools, samples) can the project share?


### Solutions

#### Sharing host related and other contextual information
* Some host related information can be personal and/or sensitive and care should be taken when storing and sharing it. Apply data masking and aggregation techniques to pseudonymise or anonymise the contextual information and take measures to separate personal and sensitive information from the pathogen data when possible.
* Adopt solutions for federated analysis to support distributed analyses on information that could otherwise not be shared, such as establishing contractual agreements with suitable regional or international data infrastructures.
* [GA4GH (Global Alliance for Genomics and Health)](https://www.ga4gh.org/what-we-do/) is a global organisation that frames policy and builds standards to meet the real-world needs of the genomics and health community. Its [GDPR & International Health Data Sharing Forum](https://www.ga4gh.org/product/gdpr-international-health-data-sharing-forum/) shares *GDPR Briefs* that represent a consensus position among its Forum Members (not legal advice) regarding the current understanding of the GDPR and its implications for genomic and health-related research, such as
    * [GDPR Brief: data protection implications of publishing metadata to enable discovery](https://www.ga4gh.org/news_item/ga4gh-gdpr-brief-data-protection-implications-of-publishing-metadata-to-enable-discovery/)
    * [GDPR Brief: federated analysis for responsible data sharing under the GDPR](https://www.ga4gh.org/news_item/ga4gh-gdpr-brief-federated-analysis-for-responsible-data-sharing-under-the-gdpr/)
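
The masking and aggregation techniques mentioned above could look roughly like the sketch below, which replaces a direct identifier with a keyed hash and coarsens age and collection date. The record fields, the key and the function names are hypothetical. Note that a keyed hash is a pseudonymisation measure rather than anonymisation, and whether the resulting record can be shared is a question for your own risk assessment.

```python
import hashlib
import hmac
from typing import Dict


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash; the key must be kept separately."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def age_band(age: int, width: int = 10) -> str:
    """Aggregate an exact age into a band, for example 37 into '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def mask_record(record: Dict[str, object], secret_key: bytes) -> Dict[str, str]:
    """Build a shareable record without the direct identifier or exact values."""
    return {
        "host_pseudonym": pseudonymise(str(record["patient_id"]), secret_key),
        "host_age_band": age_band(int(record["host_age"])),
        # Assumes an ISO 8601 date (YYYY-MM-DD); keep only year and month.
        "collection_month": str(record["collection_date"])[:7],
        "pathogen": str(record["pathogen"]),
    }


key = b"replace-with-a-secret-key"  # hypothetical; manage real keys securely
raw = {
    "patient_id": "P-001",
    "host_age": 37,
    "collection_date": "2023-05-22",
    "pathogen": "SARS-CoV-2",
}
masked = mask_record(raw, key)
print(masked["host_age_band"], masked["collection_month"])
```

Because the same key always yields the same pseudonym, records from one host can still be linked within the project without exposing the original identifier.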

#### Sharing pathogen genomic data
* You should adopt good practices for [data sharing](sharing) and identify which data sharing platforms to use to reach the relevant stakeholders. You can use more than one platform but care should be taken to make sure that data is interconnected where possible to enable deduplication in downstream analyses.
* European healthcare surveillance systems, such as [ECDC’s TESSy/EpiPulse](https://www.ecdc.europa.eu/en/publications-data/epipulse-european-surveillance-portal-infectious-diseases), are administered and used by public health authorities.
* International research data exchanges include the [European Nucleotide Archive (ENA)](https://www.ebi.ac.uk/ena/browser/submit) for non-sensitive genomic data and the [Federated EGA](https://ega-archive.org/federated) network for sensitive data.
* There are also pathogen specific initiatives, such as [EMBL-EBI Pathogens](https://www.ebi.ac.uk/ena/pathogens/home) and [NCBI Pathogen Detection](https://www.ncbi.nlm.nih.gov/pathogens/), and initiatives focusing specifically on viruses, certain pathogens or certain data types, such as [GISAID (Global Initiative on Sharing All Influenza Data)](https://gisaid.org/) for observations and assembled consensus sequences on a selection of pathogens.
* Investigate if there are [national resources](national_resources) or a [data brokering](data_brokering) organisation available to facilitate data sharing.
* [EBI Pathogens data hubs](https://www.ebi.ac.uk/ena/pathogens/v2/)
* [Submit new data \| European COVID-19 platform](https://www.covid19dataportal.org/submit-data)
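
The advice above on interconnecting data shared through more than one platform can be illustrated with a small deduplication sketch. It assumes every submission carries a shared sample reference that downstream users can group on. The platform names, accessions and field names are hypothetical and do not reflect the data model of any of the platforms listed.

```python
from collections import defaultdict
from typing import Dict, List

Record = Dict[str, str]


def group_by_shared_sample(records: List[Record]) -> Dict[str, List[Record]]:
    """Group records from different platforms by their shared sample reference."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        groups[record["sample_ref"]].append(record)
    return dict(groups)


def deduplicate(records: List[Record], preferred_platform: str) -> List[Record]:
    """Keep one record per sample, favouring the preferred platform if present."""
    unique = []
    for group in group_by_shared_sample(records).values():
        preferred = [r for r in group if r["platform"] == preferred_platform]
        unique.append((preferred or group)[0])
    return unique


submissions = [
    {"platform": "platform_a", "accession": "A-0001", "sample_ref": "sample-1"},
    {"platform": "platform_b", "accession": "B-0042", "sample_ref": "sample-1"},
    {"platform": "platform_b", "accession": "B-0043", "sample_ref": "sample-2"},
]
unique = deduplicate(submissions, preferred_platform="platform_a")
print([record["accession"] for record in unique])
```

Without a shared reference the two submissions of `sample-1` would be counted as separate observations in a downstream analysis.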