BIP

Pipeline for improving BRAKER2 gene predictions with MS/MS data. This is work in progress and the current version is experimental.

Description

The main idea of the pipeline is to expand the protein search space with a relaxed BRAKER2 prediction, to find MS supported proteins, to define BRAKER2 transcript score cutoffs based on supported proteins, and finally to combine supported and high-scoring proteins into the final prediction (Fig. 1).

Figure 1: The workflow for the improvement of the default BRAKER2 prediction with MS/MS data. The yellow boxes stand for gene predictions, green boxes for peptide sets, the blue box for a number. While supported proteins from the relaxed prediction are selected using gene-specific and protein-specific peptides, only protein-specific peptides are considered in case of the default prediction. The sets of supported proteins and highly supported proteins are then utilised to define BRAKER2 score cutoffs and, subsequently, to select high-scoring proteins. Finally, supported and high-scoring proteins from both predictions are united into the final prediction.

Prerequisites

Python3 with the following modules: pandas, re, sys, os.
Unix

Running

Make a directory X. Make a directory scripts and inputs inside the directory X. The directory scripts should contain the following scripts: make_tx_scores_tsv.py, find_highly_supp_prot.py, find_gene_spec_pept.py, select_supp_prot.py, find_and_apply_score_filter.py, unite_gtf.py, run_bip.sh. The directory inputs should contain the following files without headers:

Two .gtf files with the default and relaxed BRAKER2 predictions. The files must be named default_pred.gtf and relaxed_pred.gtf. The instructions for producing a relaxed BRAKER2 prediction can be found in additional_files/BRAKER2_sensitive_instructions.txt
Two .tsv files with transcript BRAKER2 scores. There should be following columns in the .tsv files: 1) protein id (=transcriptid from .gtf files); 2) BRAKER2 transcript score. The files should be named tx_scores_default.tsv and tx_scores_relaxed.tsv. If a .gtf file contains transcript scores, this file can be produced by running:

  python3 /path/to/directory_X/scripts/make_tx_scores_tsv.py \
          /path/to/directory_X/inputs/pred_file.gtf \
          /path/to/directory_X/inputs/output_file.tsv

Two directories containing .tsv files with peptide mapping data. Each .tsv file stands for one tissue and should be named accordingly. There should be following columns in the .tsv files: 1) peptide sequence, 2) protein id (=transcript_id from .gtf files), 3) + if peptide is unique (found in one protein), - if not. These two directories should be named mapped_default and mapped_relaxed.

Go to the directory X and run the pipeline.

  bash /path/to/directory_X/scripts/run_bip.sh

The final file with the improved prediction is named bip.gtf.

Contributors

Author: Kateryna Neishsalo.
Supervisors: Prof. Dmitrij Frishman, Prof. Mathias Wilhelm, Dr. Nils Rugen.
Additional support: Prof. Bernhard Küster, Prof. Mark Borodovsky, Dr. Tomáš Brůna.

Name		Name	Last commit message	Last commit date
Latest commit History 78 Commits
additional_files		additional_files
scripts		scripts
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BIP

Description

Prerequisites

Running

Contributors

About

Releases

Packages

Languages

License

katriken/BIP

Folders and files

Latest commit

History

Repository files navigation

BIP

Description

Prerequisites

Running

Contributors

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages