Mapping proteins to functions: method and benchmark development #1

Open
pverscha opened this issue Sep 5, 2019 · 2 comments

pverscha commented Sep 5, 2019

Abstract

Unipept is an ecosystem of tools for the taxonomic and functional analysis of (meta)proteomics datasets. The Unipept project aims to be easy to use and highly accessible by providing users with a web-based application. A command-line interface (CLI) and an API are also provided, allowing users to process more samples and increase analysis throughput. Unipept started with a taxonomic analysis pipeline, which was recently expanded with a new functional analysis pipeline that supports GO terms and EC numbers. Functional annotations are directly linked to proteins, for which taxonomic information is also available. This link allows researchers to reveal which functions are performed by which organisms and vice versa. This project proposal aims at improving Unipept’s functional analysis pipeline with annotations for metabolic pathways. If desired, we could also take a more generic approach and explore how best to link different functional annotations, how to map them to the available pathway data sources, and how to visualise them. The results could be useful for other projects as well.

Work plan

A first step would be to identify and score the available functional annotations and to determine how they could be combined. We already have experience with GO terms, EC numbers and InterPro annotations within Unipept.

Next, the same needs to be done for metabolic pathway resources. Each of these should be scored on data quality, coverage, availability and ease of use. We will start by building a small prototype for each resource and benchmarking it. Candidates that we would like to include in our comparison are Reactome, KEGG, MetaCyc, and BioCyc.
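One possible way to make the scoring of pathway resources comparable is a weighted sum over the four criteria named above. The sketch below is only an illustration: the weights, the resource names (`resource_a`, `resource_b`) and all score values are placeholders, not assessments of any real database.

```python
from dataclasses import dataclass

# Criteria come from the work plan; the weights are placeholder assumptions.
WEIGHTS = {
    "data_quality": 0.4,
    "coverage": 0.3,
    "availability": 0.2,
    "ease_of_use": 0.1,
}


@dataclass
class ResourceScore:
    name: str
    scores: dict  # criterion -> score between 0 and 1

    def total(self, weights=WEIGHTS):
        """Weighted sum of the per-criterion scores."""
        missing = set(weights) - set(self.scores)
        if missing:
            raise ValueError(f"{self.name}: missing scores for {sorted(missing)}")
        return sum(weights[c] * self.scores[c] for c in weights)


def rank(resources, weights=WEIGHTS):
    """Return the resources sorted from highest to lowest total score."""
    return sorted(resources, key=lambda r: r.total(weights), reverse=True)


# Hypothetical resources with made-up scores, for illustration only.
candidates = [
    ResourceScore("resource_a", {"data_quality": 0.9, "coverage": 0.5,
                                 "availability": 0.2, "ease_of_use": 0.5}),
    ResourceScore("resource_b", {"data_quality": 0.7, "coverage": 0.8,
                                 "availability": 1.0, "ease_of_use": 0.8}),
]

for resource in rank(candidates):
    print(f"{resource.name}: {resource.total():.2f}")
```

Keeping the weights in one dictionary makes it easy to discuss and adjust them during the hackathon without touching the ranking logic.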

After identifying suitable candidates, we can create a higher-level proof of concept workflow which starts from a list of peptides or proteins and ends with a list of interesting pathways. The Unipept API can be used to query for some of these annotations.
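As a rough sketch of such a workflow, the function below starts from a list of peptides, looks up their annotations, maps those annotations to pathways, and ranks pathways by the number of supporting peptides. Both lookup tables (`peptide_to_ec`, `ec_to_pathways`) and all identifiers in them are hypothetical placeholders; in practice the first could be filled from the Unipept API and the second from one of the pathway resources.

```python
from collections import Counter


def rank_pathways(peptides, peptide_to_ec, ec_to_pathways):
    """Rank pathways by how many input peptides support them.

    A peptide supports a pathway if at least one of its annotations maps to
    that pathway. Each peptide is counted at most once per pathway.
    """
    counts = Counter()
    for peptide in peptides:
        pathways = set()
        for ec in peptide_to_ec.get(peptide, ()):
            pathways.update(ec_to_pathways.get(ec, ()))
        counts.update(pathways)
    # Sort by descending count, then by name for a deterministic order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Placeholder data: none of these identifiers refer to real entries.
peptides = ["PEP_A", "PEP_B", "PEP_C"]
peptide_to_ec = {"PEP_A": {"ec_1"}, "PEP_B": {"ec_1", "ec_2"}}
ec_to_pathways = {"ec_1": {"pathway_x"}, "ec_2": {"pathway_x", "pathway_y"}}

print(rank_pathways(peptides, peptide_to_ec, ec_to_pathways))
```

Counting peptides rather than annotations is only one possible notion of an "interesting" pathway; pathway coverage or enrichment would be alternatives to compare.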

Technical details

Currently, we have a set of bash scripts and Java tools to extract information from UniProt. The Unipept framework is built with Ruby on Rails, but its APIs can be queried from any programming language and return standard JSON. Our current visualisation tools are written in JavaScript and TypeScript.
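To illustrate what querying a JSON API from another language could look like, here is a minimal Python sketch that builds a request URL and parses a response. The base URL, the `pept2ec` endpoint name, the `input[]` and `equate_il` parameters, and the shape of the response record are assumptions that should be checked against the Unipept API documentation; the sample response is hand-written, not real output.

```python
import json
from urllib.parse import urlencode

# Assumption: base URL and endpoint layout; verify against the API docs.
BASE_URL = "https://api.unipept.ugent.be/api/v1"


def build_query_url(endpoint, peptides, equate_il=True):
    """Build a GET URL that passes each peptide as a repeated input[] parameter."""
    params = [("input[]", peptide) for peptide in peptides]
    params.append(("equate_il", "true" if equate_il else "false"))
    return f"{BASE_URL}/{endpoint}.json?{urlencode(params)}"


def ec_numbers_by_peptide(response_text):
    """Parse a JSON array of per-peptide records into {peptide: [ec_number, ...]}.

    Assumes each record has a "peptide" key and an optional "ec" list whose
    items carry an "ec_number" key.
    """
    records = json.loads(response_text)
    return {
        record["peptide"]: [entry["ec_number"] for entry in record.get("ec", [])]
        for record in records
    }


url = build_query_url("pept2ec", ["PEPTIDEK"])
print(url)

# Hand-written sample in the assumed response shape (not real API output).
sample = '[{"peptide": "PEPTIDEK", "ec": [{"ec_number": "1.1.1.1", "protein_count": 3}]}]'
print(ec_numbers_by_peptide(sample))
```

The URL can then be fetched with any HTTP client, for example `urllib.request.urlopen(url)`.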

Contact information

Bart Mesuere - Ghent University (Belgium) - [email protected]
Pieter Verschaffelt - Ghent University (Belgium) - [email protected]

Member

RalfG commented Oct 8, 2019

This hackathon project will be merged with #5 by @rababerladuseladim:

Abstract

Metaproteomics is the analysis of proteins in samples composed of multiple organisms. One major use case is the investigation of the functional composition of a sample. Multiple tools can connect identified sequences with functional information (e.g. Unipept, Prophane, MetaGOmics). Unfortunately, the performance of these tools is not easy to assess, due to a lack of data with a known ground truth at the functional level. The target benchmark dataset would consist of a diverse range of peptides/proteins with high-quality, experimentally validated functional annotations. Two obstacles need to be overcome to create such a dataset: (1) the protein inference problem, which is more complicated in metaproteomics than in single-organism proteomics (peptides can match homologues within the same organism and across multiple organisms), and (2) the low annotation level of proteins in the metaproteomic context (many proteins have no function, not even an assumed one, assigned to them). We plan to develop a concept for how the ideal gold standard dataset should be composed and to generate it accordingly. Based on this dataset, a functional benchmark of the aforementioned tools can be initiated.
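The two obstacles can be made concrete with a small sketch that labels each peptide by how clearly its function can be inferred from the proteins it matches. The mappings and all identifiers below are invented toy data, and the four labels are just one possible way to partition the cases.

```python
def classify_peptides(peptide_to_proteins, protein_to_functions):
    """Label each peptide by the functional agreement of its matching proteins.

    - "unmatched":   the peptide matches no protein
    - "unannotated": no matching protein has any function assigned
    - "consistent":  all matching proteins share the same function set
    - "ambiguous":   matching proteins disagree (including a mix of
                     annotated and unannotated proteins)
    """
    labels = {}
    for peptide, proteins in peptide_to_proteins.items():
        if not proteins:
            labels[peptide] = "unmatched"
            continue
        function_sets = {
            frozenset(protein_to_functions.get(protein, ()))
            for protein in proteins
        }
        if function_sets == {frozenset()}:
            labels[peptide] = "unannotated"
        elif len(function_sets) == 1:
            labels[peptide] = "consistent"
        else:
            labels[peptide] = "ambiguous"
    return labels


# Toy data: identifiers are placeholders, not real proteins or functions.
peptide_to_proteins = {
    "pep_1": {"prot_a", "prot_b"},  # homologues with the same function
    "pep_2": {"prot_a", "prot_c"},  # homologues with different functions
    "pep_3": {"prot_d"},            # protein without any annotation
}
protein_to_functions = {
    "prot_a": {"func_1"},
    "prot_b": {"func_1"},
    "prot_c": {"func_2"},
}

print(classify_peptides(peptide_to_proteins, protein_to_functions))
```

A gold standard dataset could, for instance, report results separately for each label, so that tools are not penalised for peptides whose function is undetermined by construction.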

Work plan

  • Compile a sequence database of proteins with validated functions
  • Generate simulated peptide identification lists based on the database, closely resembling result characteristics in metaproteomics
  • Specify benchmarking criteria
  • (Potentially) benchmark existing tools against the generated data
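The second step of this plan could start from an in silico tryptic digest of the reference proteins. The sketch below cleaves after K or R unless the next residue is P, optionally allows missed cleavages, and then draws a reproducible random subset as a stand-in for an identification list. The function names, default parameters and toy sequences are assumptions; realistic result characteristics such as peptide detectability, abundance and false identifications are not modelled here.

```python
import random


def tryptic_digest(sequence, missed_cleavages=0, min_length=6, max_length=50):
    """Digest a protein sequence: cleave after K or R, except before P."""
    fragments = []
    start = 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if is_last or (residue in "KR" and sequence[i + 1] != "P"):
            fragments.append(sequence[start:i + 1])
            start = i + 1

    peptides = []
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` additional adjacent fragments.
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if min_length <= len(peptide) <= max_length:
                peptides.append(peptide)
    return peptides


def simulate_identifications(proteins, fraction=0.5, seed=0, **digest_options):
    """Sample a reproducible subset of all distinct peptides from the proteins."""
    peptides = sorted(
        {p for seq in proteins.values() for p in tryptic_digest(seq, **digest_options)}
    )
    count = round(fraction * len(peptides))
    return random.Random(seed).sample(peptides, count)


# Toy sequences, invented for illustration only.
toy_proteins = {"prot_1": "AAKRPLLRGG", "prot_2": "MKTAYIAKQR"}

print(tryptic_digest("AAKRPLLRGG", missed_cleavages=1, min_length=1))
print(simulate_identifications(toy_proteins, fraction=0.5, seed=42, min_length=1))
```

Fixing the seed keeps the simulated list identical between runs, which matters when several tools are benchmarked against the same input.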

Technical details

  • Datasets are derived from reference databases such as SwissProt
  • Tools for benchmarking:
    • Unipept
    • Prophane
    • MetaGOmics

Contact information

Henning Schiebenhoefer - Robert Koch-Institut (Germany) - [email protected]

RalfG changed the title from "Mapping proteins to pathways" to "Mapping proteins to functions: method and benchmark development" on Oct 8, 2019
Author

pverscha commented Oct 12, 2019

Abstract

Metaproteomics is the analysis of proteins in samples composed of multiple organisms. One major use case is the investigation of the functional composition of a sample. A multitude of functional annotation databases are available, which vary strongly in quality, price and accessibility. Multiple tools can connect identified sequences with functional information (e.g. Unipept, Prophane, MetaGOmics). One of these tools, Unipept, was recently expanded with a basic functional analysis pipeline. Functional annotations are directly linked to proteins, for which taxonomic information is also available. This link allows researchers to reveal which functions are performed by which organisms and vice versa. By expanding the Unipept functional analysis pipeline with support for metabolic pathways, we can further increase researchers' insight into the complex processes taking place in an environment. To achieve this, we can choose from several functional annotation and pathway databases. To determine the best way forward, we need to overcome two challenges: (1) ideally, build a prototype for each data source and (2) benchmark each of these prototypes against a gold standard database. Due to a lack of data with a known ground truth at the functional level, no such gold standard exists at this point, making it very hard to assess the performance of each pipeline and to compare tools with each other.

This project proposal aims at developing a concept for how the ideal gold standard dataset should be composed and at generating it accordingly. We could then use it to evaluate several tools and potential annotation sources for Unipept.

Work plan

  • Compile a sequence database of proteins with validated functions.
  • Generate simulated peptide identification lists based on the database, closely resembling result characteristics in metaproteomics.
  • Specify benchmarking criteria.
  • Generalise and build a prototype for different metabolic pathway data sources.
  • Benchmark each of the prototypes.
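As one candidate for the benchmarking criteria, the sketch below compares the functions a tool reports for each peptide with the gold standard annotations and computes precision, recall and F1, averaged over all peptides in the gold standard. The choice of metric, the per-peptide averaging and all identifiers are assumptions made for illustration, not agreed criteria.

```python
def precision_recall_f1(predicted, expected):
    """Set-based precision, recall and F1 for one peptide."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def benchmark(tool_output, ground_truth):
    """Average the per-peptide metrics over every peptide in the ground truth.

    Peptides missing from the tool output count as an empty prediction.
    """
    if not ground_truth:
        raise ValueError("ground truth must contain at least one peptide")
    totals = [0.0, 0.0, 0.0]
    for peptide, expected in ground_truth.items():
        metrics = precision_recall_f1(tool_output.get(peptide, ()), expected)
        totals = [total + value for total, value in zip(totals, metrics)]
    return tuple(total / len(ground_truth) for total in totals)


# Placeholder identifiers, not real peptides or functional annotations.
ground_truth = {"pep_1": {"func_1", "func_2"}, "pep_2": {"func_3"}}
tool_output = {"pep_1": {"func_1"}, "pep_2": {"func_3", "func_4"}}

precision, recall, f1 = benchmark(tool_output, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact set matching ignores the hierarchy of vocabularies such as GO, so a criterion that gives partial credit for related terms may be worth specifying as well.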

Technical details

Currently, Unipept consists of a set of bash scripts and Java tools to extract information from UniProt. The Unipept framework is built with Ruby on Rails, but its APIs can be queried from any programming language and return standard JSON. The Unipept visualisation tools are written in JavaScript and TypeScript.

To construct a gold standard benchmarking database, we will derive datasets from reference databases such as SwissProt. We will benchmark Unipept, Prophane and MetaGOmics on our new database.

Contact information

Bart Mesuere - Ghent University (Belgium) - [email protected]
Henning Schiebenhoefer - Robert Koch-Institut (Germany) - [email protected]
Pieter Verschaffelt - Ghent University (Belgium) - [email protected]
