Mapping proteins to functions: method and benchmark development #1

pverscha · 2019-09-05T15:12:24Z

Abstract

Unipept is an ecosystem of tools for the taxonomic and functional analysis of (meta)proteomics datasets. The project aims to be easy to use and widely accessible by offering a web-based application. A command-line interface (CLI) and an API are also provided, allowing users to process more samples and increase analysis throughput. Unipept started as a taxonomic analysis pipeline and was recently expanded with a functional analysis pipeline that supports GO terms and EC numbers. Functional annotations are directly linked to proteins, for which taxonomic information is also available. This link allows researchers to reveal which functions are performed by which organisms and vice versa. This project proposal aims to extend Unipept's functional analysis pipeline with annotations for metabolic pathways. If there is interest, we could also take a more generic approach and explore how best to link different functional annotations, how to map them to the available pathway data sources, and how to visualise them. The results could be useful for other projects as well.

Work plan

A first step would be to identify and score the available functional annotations and to determine how they could be combined. We already have experience with GO terms, EC numbers and InterPro annotations within Unipept.

Next, the same needs to be done for metabolic pathway resources. Each resource should be scored on data quality, coverage, availability and ease of use. We will start by building a small prototype for each resource and benchmarking it. Candidates that we would like to include in our comparison are Reactome, KEGG, MetaCyc, and BioCyc.
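One way the scoring step could be made explicit is a weighted sum over the four criteria named above. The sketch below is only an illustration: the 0 to 5 scale, the weights, and the names `ResourceScore`, `weighted_total` and `rank` are assumptions, not something the proposal specifies.

```python
from dataclasses import dataclass

# The four criteria named in the work plan.
CRITERIA = ("data_quality", "coverage", "availability", "ease_of_use")


@dataclass(frozen=True)
class ResourceScore:
    """Scores for one pathway resource (hypothetical 0-5 scale per criterion)."""
    name: str
    scores: dict


def weighted_total(resource, weights):
    """Return the weighted mean of a resource's scores over all criteria."""
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(resource.scores[c] * weights[c] for c in CRITERIA) / total_weight


def rank(resources, weights):
    """Sort resources from highest to lowest weighted total."""
    return sorted(resources, key=lambda r: weighted_total(r, weights), reverse=True)
```

The weights would have to be agreed on before scoring, so that the ranking is not tuned towards a preferred resource afterwards.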

After identifying suitable candidates, we can create a higher-level proof of concept workflow which starts from a list of peptides or proteins and ends with a list of interesting pathways. The Unipept API can be used to query for some of these annotations.
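The workflow described above could be sketched roughly as follows. This is a minimal illustration that assumes EC numbers as the intermediate annotation and takes two lookup tables as input; `rank_pathways`, `peptide_to_ec` and `ec_to_pathways` are hypothetical names, and in practice the first table might be filled from the Unipept API and the second from one of the pathway resources.

```python
from collections import Counter


def rank_pathways(peptides, peptide_to_ec, ec_to_pathways):
    """Count, per pathway, how many input peptides support it.

    peptide_to_ec:  dict mapping a peptide to an iterable of EC numbers
    ec_to_pathways: dict mapping an EC number to an iterable of pathway ids
    Returns (pathway, peptide_count) pairs, most supported pathway first.
    """
    counts = Counter()
    for peptide in peptides:
        # Collect pathways in a set so each peptide counts once per pathway.
        pathways = set()
        for ec in peptide_to_ec.get(peptide, ()):
            pathways.update(ec_to_pathways.get(ec, ()))
        counts.update(pathways)
    return counts.most_common()
```

Counting peptides per pathway is only one possible notion of "interesting"; weighting by coverage of the pathway or by taxonomic specificity would be alternatives to evaluate.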

Technical details

Currently, we have a set of bash scripts and Java tools to extract information from UniProt. The Unipept framework is built with Ruby on Rails, but its APIs can be queried from any programming language and return standard JSON. Our current visualisation tools are written in JavaScript and TypeScript.
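As an illustration of querying the API from another language, the sketch below builds a request URL and decodes the JSON response using only the Python standard library. The base URL, the `input[]` and `equate_il` parameter names, and the `pept2ec` endpoint used in the example are assumptions here and should be checked against the Unipept API documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the Unipept API; verify against the API documentation.
BASE_URL = "https://api.unipept.ugent.be/api/v1"


def build_query_url(endpoint, peptides, equate_il=True):
    """Build a GET URL that passes each peptide as a repeated `input[]` parameter."""
    params = [("input[]", peptide) for peptide in peptides]
    params.append(("equate_il", "true" if equate_il else "false"))
    return f"{BASE_URL}/{endpoint}.json?{urlencode(params)}"


def query(endpoint, peptides):
    """Send the request and return the decoded JSON response."""
    with urlopen(build_query_url(endpoint, peptides)) as response:
        return json.load(response)
```

A call such as `query("pept2ec", ["AIPQLEVARPADAYETAEAYR"])` would then return whatever JSON structure the endpoint produces; the sketch deliberately makes no assumption about its fields.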

Contact information

Bart Mesuere - Ghent University (Belgium) - [email protected]
Pieter Verschaffelt - Ghent University (Belgium) - [email protected]

RalfG · 2019-10-08T15:46:07Z

This hackathon project will be merged with #5 by @rababerladuseladim:

Abstract

Metaproteomics is the analysis of proteins in samples composed of multiple organisms. One major use case is the investigation of the functional composition of a sample. Multiple tools can connect identified sequences with functional information (e.g. Unipept, Prophane, MetaGOmics). Unfortunately, the performance of these tools is not easy to assess, due to a lack of data with known ground truth at the functional level. The target benchmark dataset would consist of a diverse range of peptides/proteins with high-quality, experimentally validated functional annotations. The obstacles that need to be overcome to create such a dataset are: (1) the protein inference problem, which is more complicated in metaproteomics than in single-organism proteomics (peptides can match homologues in the same organism and in multiple organisms) and (2) low annotation levels of proteins in the metaproteomic context (many proteins have no function assigned to them, not even an assumed one). We plan to develop a concept for how the ideal gold standard dataset should be composed and to generate it accordingly. Based on this dataset, a functional benchmark of the aforementioned tools can be initiated.

Work plan

Compile sequence database of proteins with validated functions.
Generate simulated peptide identification lists based on the database, closely resembling result characteristics in metaproteomics.
Specify benchmarking criteria.
(Potentially) benchmark existing tools against generated data.
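One ingredient of simulating peptide identification lists is an in silico digest of the database proteins. The sketch below assumes trypsin with the common rule of cleaving after K or R unless the next residue is P; the function name, the `missed_cleavages` option and the default `min_length` are illustrative choices. It does not yet model detectability, identification errors or abundance, which a realistic simulation would also need.

```python
import re


def tryptic_digest(sequence, missed_cleavages=0, min_length=6):
    """Return tryptic peptides of a protein sequence, in order of start position.

    Cleaves after K or R, except when the next residue is P. Peptides that
    span up to `missed_cleavages` uncleaved sites are included as well.
    """
    # Zero-width split: position must follow K/R and must not precede P.
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    peptides = []
    for start in range(len(fragments)):
        last = min(start + missed_cleavages, len(fragments) - 1)
        for end in range(start, last + 1):
            peptide = "".join(fragments[start:end + 1])
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides
```

Because each simulated peptide comes from a protein with a validated function, the expected annotation of every peptide is known, which is what the benchmark needs as ground truth.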

Technical details

Datasets are derived from reference databases such as SwissProt.
Tools for benchmarking: Unipept, Prophane, MetaGOmics.

Contact information

Henning Schiebenhoefer - Robert Koch-Institut (Germany) - [email protected]

pverscha · 2019-10-12T06:21:07Z

Abstract

Metaproteomics is the analysis of proteins in samples composed of multiple organisms. One major use case is the investigation of the functional composition of a sample. A multitude of functional annotation databases are available, which vary strongly in quality, price and accessibility. Multiple tools can connect identified sequences with functional information (e.g. Unipept, Prophane, MetaGOmics). One of these tools, Unipept, was recently expanded with a basic functional analysis pipeline. Functional annotations are directly linked to proteins, for which taxonomic information is also available. This link allows researchers to reveal which functions are performed by which organisms and vice versa. By expanding the Unipept functional analysis pipeline with support for metabolic pathways, we can give researchers further insight into the complex processes taking place in an environment. To achieve this, we can choose from several functional annotation and pathway databases. To determine the best way forward, we need to (1) ideally build a prototype for each data source and (2) benchmark each of these prototypes against a gold standard database. Due to a lack of data with known ground truth at the functional level, no such gold standard exists at this point, making it very hard to assess the performance of each pipeline and to compare tools with each other.

This project proposal aims to develop a concept for how the ideal gold standard dataset should be composed and to generate it accordingly. We could then use it to evaluate several tools and potential annotation sources for Unipept.

Work plan

Compile sequence database of proteins with validated functions.
Generate simulated peptide identification lists based on the database, closely resembling result characteristics in metaproteomics.
Specify benchmarking criteria.
Generalise and build a prototype for different metabolic pathway data sources.
Benchmark each of the prototypes.
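As one possible benchmarking criterion, the annotations a tool predicts for a peptide or sample can be compared with the validated annotations as sets, giving precision, recall and F1. This is a sketch of a standard set-based metric, not a criterion the proposal has settled on; it treats annotations as flat identifiers and ignores, for example, the GO hierarchy.

```python
def precision_recall_f1(predicted, truth):
    """Compare predicted annotation ids with ground-truth ids as sets.

    Returns (precision, recall, f1); each value is 0.0 when undefined.
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Applied to the output of each prototype on the same gold standard dataset, these values would allow the prototypes and existing tools to be compared on equal terms.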

Technical details

Currently, Unipept consists of a set of bash scripts and Java tools to extract information from UniProt. The Unipept framework is built with Ruby on Rails, but its APIs can be queried from any programming language and return standard JSON. The Unipept visualisation tools are written in JavaScript and TypeScript.

To construct a gold standard benchmarking database, we will derive datasets from reference databases such as SwissProt. We will then benchmark Unipept, Prophane and MetaGOmics on this new database.

Contact information

Bart Mesuere - Ghent University (Belgium) - [email protected]
Henning Schiebenhoefer - Robert Koch-Institut (Germany) - [email protected]
Pieter Verschaffelt - Ghent University (Belgium) - [email protected]

RalfG mentioned this issue Oct 8, 2019

Building a Gold-Standard Protein Sequence Dataset for Functional Annotation #5

Closed

RalfG changed the title ~~Mapping proteins to pathways~~ Mapping proteins to functions: method and benchmark development Oct 8, 2019
