PrecisionProDB (Precision protein database), a tool improving the proteomics performance for precision medicine.
PrecisionProDB is a Python package for proteogenomics, which can generate a customized protein database for peptide search in mass spectrometry.
PrecisionProDB v2 has been developed to maintain compatibility with the previous version of the software. Notably, it introduces the option of utilizing SQLite files for the storage of intermediate data. This innovation has significantly reduced the runtime of the program, particularly when handling smaller VCF files.
PrecisionProDB v2 now supports TSV input files, accommodating both single and multiple samples. Additionally, it can process multiple VCF files simultaneously. When using VCF input, the software supports multiple samples as well. Users can employ the "--sample ALL_SAMPLES" option to create a population proteomic database similar to ProHap. Alternatively, the "--sample ALL_VARIANTS" option allows for the generation of a database that focuses solely on variants, disregarding genotype information across different samples.
With a pre-built SQLite file, it is very fast to check the effect of variants given in string format, such as "chr1-942451-T-C,1-6253878-C-T,1-2194700-C-G,1-1719406-G-A".
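The variant string format shown above can be split into individual variants with a few lines of Python. This is a minimal sketch for illustration only: the `Variant` type and the `parse_variant_string` function are hypothetical names and are not part of PrecisionProDB.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_variant_string(text: str) -> List[Variant]:
    """Split 'chr-pos-ref-alt,chr-pos-ref-alt,...' into Variant records."""
    variants = []
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        # Split from the right so a chromosome name containing '-' survives.
        chrom, pos, ref, alt = item.rsplit("-", 3)
        variants.append(Variant(chrom, int(pos), ref, alt))
    return variants


variants = parse_variant_string("chr1-942451-T-C,1-6253878-C-T")
print(variants[0])
```

Note that the chromosome may appear with or without the "chr" prefix, as in the example string; the sketch keeps the name exactly as written.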
PrecisionProDB has been updated to support the human genome assembly T2T-CHM13v2.0 and its RefSeq annotation.
For more help information, visit the wiki page.
For more running examples, visit the example folder and the test_output folder.
PrecisionProDB is now available here for faster checking of the effects of variants on protein sequences. Users may also upload a small TSV file (tsv or tsv.gz) to use the pre-defined gene models.
- Description
- Installation
- Citing PrecisionProDB
- Usage Information
- Outputs
- Benchmark
- PrecisionProDB_references
- Contact Information
The major goal of PrecisionProDB is to generate personalized protein sequences for protein identification in mass spectrometry (MS). Main features:
- Supports multithreading, which improves the speed of the program. A typical customized human protein database can be generated in 15 to 20 mins using 8 threads.
- Optimized for several widely used human gene models, including:
  - GENCODE: PrecisionProDB can download the latest version from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human, as shown in https://www.gencodegenes.org/human/.
  - RefSeq: PrecisionProDB can download the latest version from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/annotation_releases/current
  - Ensembl: PrecisionProDB can download the latest version from:
    - ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/
    - ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/
    - ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/pep/
  - UniProt: PrecisionProDB can download the latest version from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/
    - The files are UP000005640/UP000005640_9606.fasta.gz and UP000005640/UP000005640_9606_additional.fasta.gz, which may change in the future.
- The non-standard codons and rare amino acids (e.g. Selenocysteine (Sec or U)) in the human genome can be properly incorporated.
  - PrecisionProDB stands out by utilizing the codons derived directly from the input protein FASTA sequences, rather than relying on standard or reference codon sets. We believe that gene annotation sources such as GENCODE, RefSeq, and other genomic databases use non-standard codons for a reason, reflecting unique biological contexts and potentially crucial variations. By preserving these non-standard codons in our analysis, PrecisionProDB offers a more accurate, context-sensitive interpretation of protein sequences, ensuring that the nuances of the original data are maintained for more reliable downstream applications.
  - Internal stops (*) in proteins are preserved.
- Supports variant files in text or VCF format.
- All input files can be in compressed gzip (.gz) format.
- Supports user-generated gene models in GTF/GFF format. Species other than human are also supported.
  - For user-generated GTF files, protein annotations generated by TransDecoder were tested.
  - We provide an example of running TransDecoder with example files.
Click to see the scheme of how PrecisionProDB works on the wiki page
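The idea of taking codon meanings from the input protein sequences, rather than from a fixed codon table, can be illustrated with a small sketch. This is not PrecisionProDB's implementation: the function name `translate_like_reference` is hypothetical, and the sketch assumes a substitution-only change so that the reference and altered coding sequences have the same length.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built without external dependencies.
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_like_reference(ref_cds, ref_protein, new_cds):
    """Translate new_cds, reusing the reference amino acid for untouched codons.

    Assumes a substitution-only change, so both CDS have the same length.
    Codons that differ from the reference fall back to the standard table.
    """
    protein = []
    for i in range(0, len(new_cds) - 2, 3):
        codon, aa_index = new_cds[i:i + 3], i // 3
        unchanged = codon == ref_cds[i:i + 3] and aa_index < len(ref_protein)
        protein.append(ref_protein[aa_index] if unchanged else STANDARD[codon])
    return "".join(protein)


# TGA is annotated as selenocysteine (U) in this toy reference protein.
altered = translate_like_reference("ATGTGAGCT", "MUA", "ATGTGAGTT")
print(altered)
```

In this toy case the untouched TGA codon keeps the annotated U, whereas a plain standard-table translation would report a stop at that position.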
PrecisionProDB is tested under the base environment of Anaconda. It requires Python 3, Biopython and pandas.
If Anaconda is installed, only Biopython needs to be installed:
conda install -c conda-forge biopython
Otherwise, it is recommended to use conda to manage the packages and the virtual environment. Install the required packages:
conda install numpy
conda install pandas
conda install -c conda-forge biopython
If conda is not installed, pip (or pip3, as Python 3 is required) can be used. pip is already installed if you are using Python 3 >= 3.4 downloaded from python.org.
pip3 install numpy
pip3 install pandas
pip3 install biopython
If the user has no root privilege on the system, the packages can be installed using the "--user" option:
pip3 install numpy --user
pip3 install pandas --user
pip3 install biopython --user
The "--user" option takes no argument; it installs the packages into the current user's own site-packages directory.
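After installation, a quick check can confirm that the interpreter in use can find the required packages. The following is a minimal sketch, not a script shipped with PrecisionProDB; `missing_packages` is a hypothetical helper, and the package list simply mirrors the install commands above.

```python
import importlib.util
import sys

REQUIRED = ("numpy", "pandas", "Bio")  # Biopython is imported as "Bio"


def missing_packages(names=REQUIRED):
    """Return the packages that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if sys.version_info < (3,):
    print("Python 3 is required")
missing = missing_packages()
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Run it with the same Python executable that will be used to run PrecisionProDB, since different interpreters on one system can see different packages.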
To get the latest development version:
git clone https://github.com/ATPs/PrecisionProDB.git
To install other versions, download from the release page directly.
The running examples are provided without installing PrecisionProDB to $PYTHONPATH. However, if you would like to run it directly from the console (versions after v2.0.0), change the working directory to the PrecisionProDB path and run
pip install -e .
This creates a precisionprodb.egg-link in $PYTHONPATH, after which you may run
PrecisionProDB --help
downloadHuman --help
vcf2mutation --help
from the console directly. Do not remove the PrecisionProDB folder after installation, since precisionprodb.egg-link points to this folder.
Note:
- The contents of the examples and test_output folders are not included in the installation package from the release page.
- If you want to test with the examples or check more running examples and results, use git clone https://github.com/ATPs/PrecisionProDB.git.
python Path_of_PrecisionProDB/src/PrecisionProDB_test.py -h
The output of running the test files is shown in the test_output folder.
Xiaolong Cao, Jinchuan Xing, PrecisionProDB: improving the proteomics performance for precision medicine, Bioinformatics, Volume 37, Issue 19, October 2021, Pages 3361–3363, https://doi.org/10.1093/bioinformatics/btab218
Note: python in the example scripts below refers to Python 3. If you are unsure about the version of your python, use python --version to show it. On some systems you might need to use python3 to specify Python 3, or use the full path of the Python executable (e.g., /home/xcao/p/anaconda3/bin/python3.7), if multiple versions of Python exist on the system or Python is not in the system PATH.
We suppose that in most cases, users will have a variant file in VCF format. If there is only one sample in the VCF file, the simplest command will be:
python Path_of_PrecisionProDB/src/PrecisionProDB.py -m Name_of_variant_file -D GENCODE -o Prefix_of_output
- -m Name_of_variant_file defines the input variant file (include the full path if the input file is not in the current folder). If the variant file name ends with '.vcf' (case ignored), it is treated as a VCF file. In all other cases, the file is treated as a TSV file. Files ending with '.gz' (e.g., '.vcf.gz' or '.tsv.gz') are treated as gzip-compressed files.
- -D GENCODE defines the annotation reference to be used. In this example, personalized protein sequences based on the GENCODE annotation will be generated. PrecisionProDB will download the required GENCODE files automatically. To use gene models from other supported resources, GENCODE can be changed to RefSeq, Ensembl or Uniprot.
- -o Prefix_of_output defines the prefix of the output filenames.
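The file-type rules described for the -m option can be written down as a short function. This is a sketch of the stated rules, not the code PrecisionProDB uses; `detect_variant_format` is a hypothetical name, and treating the '.gz' suffix case-insensitively is an assumption.

```python
def detect_variant_format(filename):
    """Return (format, is_gzipped) following the extension rules for -m."""
    name = filename.lower()
    # Assumption: '.gz' is matched case-insensitively, like '.vcf'.
    gzipped = name.endswith(".gz")
    if gzipped:
        name = name[:-3]  # drop ".gz" before checking the remaining extension
    fmt = "vcf" if name.endswith(".vcf") else "tsv"
    return fmt, gzipped


for example in ("sample.VCF", "celline.vcf.gz", "gnomAD.variant.txt.gz"):
    print(example, detect_variant_format(example))
```

Under these rules, any file that is not recognized as VCF, including .txt and .csv files, is read as tab-separated text.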
If there are multiple samples in the VCF file, the -s option should be used to specify which sample in the VCF file to use.
python Path_of_PrecisionProDB/src/PrecisionProDB.py -m Name_of_variant_file -D GENCODE -o Prefix_of_output -s Sample_name
If there is a local version of gene annotation files from Ensembl, the command will be:
python Path_of_PrecisionProDB/src/PrecisionProDB.py -m Name_of_variant_file -o Prefix_of_output -s Sample_name -g Ensembl_Genome -p Ensembl_protein -f Ensembl_gtf -a Ensembl_GTF
Ensembl_Genome, Ensembl_protein, and Ensembl_gtf are the locations of the Ensembl genome, protein, and GTF files, respectively. These files can be downloaded from the Ensembl website as mentioned previously, or with the downloadHuman module in the package.
python Path_of_PrecisionProDB/src/downloadHuman.py -d Ensembl -o Output_folder
Output_folder is the path of output folder to store the downloaded files.
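When PrecisionProDB is called from another Python script, the command line for a run with local annotation files can be assembled as a list and passed to `subprocess.run`. The helper below is a hypothetical sketch that only uses the options shown in the commands above (-m, -o, -s, -g, -p, -f, -a); the file names in the usage lines are placeholders.

```python
import shlex


def build_local_command(script, variant_file, prefix, genome, protein, gtf,
                        annotation="Ensembl_GTF", sample=None):
    """Assemble the argument list for a run with local annotation files."""
    cmd = ["python", script, "-m", variant_file, "-o", prefix,
           "-g", genome, "-p", protein, "-f", gtf, "-a", annotation]
    if sample is not None:
        cmd += ["-s", sample]  # only needed for multi-sample VCF files
    return cmd


cmd = build_local_command(
    "Path_of_PrecisionProDB/src/PrecisionProDB.py",
    "Name_of_variant_file", "Prefix_of_output",
    "Ensembl_Genome", "Ensembl_protein", "Ensembl_gtf",
    sample="Sample_name",
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```

To execute it, the list could be passed to `subprocess.run(cmd, check=True)`; using a list rather than a single string avoids quoting problems with paths that contain spaces.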
If the variant file is in the tab-separated values (TSV) format:
- It needs to include a header row with at least four columns: chr, pos, ref, alt. There is no requirement for the order of these columns, as pandas is used to parse the file.
- Additional columns are allowed, but will be ignored.
- The chr, pos, ref and alt columns are coded as in the VCF format. This means that a deletion should be written as "chr1 10146 AC A", rather than "chr1 10147 C .". Also, pos is 1-based as in a VCF file, not 0-based as in a BED file.
- The simplest text file looks like:
chr	pos	ref	alt
1	10146	AC	A
1	15274	A	G
1	28563	A	G
1	49298	T	C
1	52238	T	G
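The VCF-style coding of deletions described above requires the base immediately before the deleted bases as an anchor. The sketch below shows one way to convert a deletion given by its first deleted position into that form. It is illustrative only: `deletion_to_vcf_style` is a hypothetical function, and the reference sequence is a toy string built to match the example coordinates.

```python
def deletion_to_vcf_style(chrom, first_deleted_pos, deleted, reference):
    """Convert a deletion given by its first deleted base into VCF-style fields.

    reference is the chromosome sequence as a Python string (0-based indexing);
    first_deleted_pos is 1-based. Returns (chrom, pos, ref, alt).
    """
    start = first_deleted_pos - 1  # 0-based index of the first deleted base
    if reference[start:start + len(deleted)] != deleted:
        raise ValueError("deleted bases do not match the reference")
    anchor_pos = first_deleted_pos - 1  # 1-based position of the preceding base
    anchor = reference[anchor_pos - 1]  # same base, as a 0-based index
    return chrom, anchor_pos, anchor + deleted, anchor


# Toy reference: position 10146 is 'A' and position 10147 is 'C'.
toy_reference = "N" * 10145 + "ACGT"
record = deletion_to_vcf_style("chr1", 10147, "C", toy_reference)
print(record)
```

Here the deletion of C at position 10147 becomes position 10146 with ref AC and alt A, which is the form expected in the variant file.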
python Path_of_PrecisionProDB/src/PrecisionProDB.py -m Name_of_variant_file -D GENCODE -o Prefix_of_output
- Name_of_variant_file is the name of the variant file. If the file name ends with '.vcf' (case ignored), it is treated as a VCF file, as described above. In all other cases, the file is treated as a TSV file. Files ending with '.gz' (e.g., '.vcf.gz' or '.tsv.gz') are treated as gzip-compressed files.
- For text file input, the -s option is ignored, as there is only one sample.
- Here, -D is set to GENCODE, so GENCODE-related files will be downloaded.
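The TSV requirements listed earlier (a header row, the four columns chr, pos, ref and alt in any order, extra columns ignored) can be checked before running the program. PrecisionProDB itself parses the file with pandas; the sketch below uses the standard csv module instead to stay dependency-free, and `read_variant_tsv` is a hypothetical name.

```python
import csv
import io

REQUIRED_COLUMNS = ("chr", "pos", "ref", "alt")


def read_variant_tsv(handle):
    """Yield (chr, pos, ref, alt) tuples from a tab-separated file with a header."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))
    for row in reader:
        # Columns are looked up by name, so order and extra columns do not matter.
        yield row["chr"], int(row["pos"]), row["ref"], row["alt"]


example = io.StringIO(
    "pos\tchr\tnote\tref\talt\n"
    "10146\t1\tdeletion\tAC\tA\n"
    "15274\t1\tsnv\tA\tG\n"
)
rows = list(read_variant_tsv(example))
print(rows)
```

For a gzip-compressed file, the same function can be given a handle opened with `gzip.open(path, "rt")`.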
We tested GTF annotation generated by TransDecoder.
Run TransDecoder in the "starting from a genome-based transcript structure GTF file" mode.
python Path_of_PrecisionProDB/src/PrecisionProDB.py -m Name_of_variant_file -o Prefix_of_output -s Sample_name -g TransDecoder_Genome -p TransDecoder_protein -f TransDecoder_gtf -a gtf
cd Path_of_PrecisionProDB/examples
python ../src/PrecisionProDB.py -m celline.vcf.gz -g GENCODE.genome.fa.gz -p GENCODE.protein.fa.gz -f GENCODE.gtf.gz -o vcf_variant
Five files will be generated in the examples folder:
- vcf_variant.pergeno.aa_mutations.csv: annotations of amino acid changes.
- vcf_variant.pergeno.protein_all.fa: all proteins after incorporating the variants.
- vcf_variant.pergeno.protein_changed.fa: all proteins that differ from the input protein sequences after incorporating the variants.
- vcf_variant.vcf2mutation_1.tsv: variants extracted from the VCF file in text format, for the first alternative alleles.
- vcf_variant.vcf2mutation_2.tsv: variants extracted from the VCF file in text format, for the second alternative alleles.
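The two vcf2mutation files correspond to the two alleles of a diploid genotype. As a rough sketch of that idea, the function below splits a single VCF record into per-allele variant lists based on its GT field. It is not the vcf2mutation code: `split_genotype` is a hypothetical name, and the real module may treat unphased, missing or multi-allelic calls differently.

```python
def split_genotype(chrom, pos, ref, alts, genotype):
    """Return (allele1_variants, allele2_variants) for one diploid VCF record.

    alts is the list of ALT alleles; genotype is a GT string such as "0/1" or "1|2".
    """
    calls = genotype.replace("|", "/").split("/")
    per_allele = ([], [])
    for slot, call in zip(per_allele, calls):
        if call.isdigit() and int(call) > 0:  # skip reference (0) and missing (.)
            slot.append((chrom, pos, ref, alts[int(call) - 1]))
    return per_allele


first, second = split_genotype("chr1", 52238, "T", ["G"], "0/1")
print(first, second)
```

In this example the first allele carries the reference base, so only the second list receives the T to G variant.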
Note:
- For altered proteins, __1, __2 or __12 will be added to the ID of the protein. __1 and __2 mean that the alleles of the protein are from the first and the second variant file, respectively. __12 means that the altered protein sequences are the same for the first and the second alleles.
  - e.g.,
    >ENSP00000308367.7|ENST00000312413.10|ENSG00000011021.23|OTTHUMG00000002299|-|CLCN6-201|CLCN6|847__12 changed,
    ENSP00000263934.6|ENST00000263934.10|ENSG00000054523.18|OTTHUMG00000001817|OTTHUMT00000005103.1|KIF1B-201|KIF1B|1770__2 changed,
    ENSP00000332771.4|ENST00000331433.5|ENSG00000186510.12|OTTHUMG00000009529|OTTHUMT00000026326.1|CLCNKA-201|CLCNKA|687__1 changed,
    ENSP00000493376.2|ENST00000641515.2|ENSG00000186092.6|OTTHUMG00000001094|OTTHUMT00000003223.1|OR4F5-202|OR4F5|326 unchanged.
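The allele suffix and the changed/unchanged flag can be recovered from an output header with simple string handling. The sketch below is a hypothetical helper, not part of PrecisionProDB. It assumes that the ID is the first whitespace-separated field and the status is the last one, which holds for the GENCODE-style headers shown above but may not hold for every input.

```python
HAPLOTYPE_SUFFIXES = {"1", "2", "12"}


def parse_protein_header(header):
    """Split a FASTA header into (protein_id, allele_suffix, status)."""
    fields = header.lstrip(">").split()
    protein_id, status = fields[0], fields[-1]
    base, sep, suffix = protein_id.rpartition("__")
    if sep and suffix in HAPLOTYPE_SUFFIXES:
        return base, suffix, status
    return protein_id, None, status  # no allele suffix, e.g. unchanged proteins


parsed = parse_protein_header(
    ">ENSP00000308367.7|ENST00000312413.10|ENSG00000011021.23|"
    "OTTHUMG00000002299|-|CLCN6-201|CLCN6|847__12\tchanged"
)
print(parsed)
```

Because the suffix is separated from the ID by a double underscore, input protein IDs that already contain "__" would make this kind of parsing ambiguous.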
- The variant file looks like
chr	pos	ref	alt
chr1	52238	T	G
chr1	53138	TAA	T
chr1	55249	C	CTATGG
chr1	55299	C	T
chr1	61442	A	G
cd Path_of_PrecisionProDB/examples
python ../src/PrecisionProDB.py -m gnomAD.variant.txt.gz -g GENCODE.genome.fa.gz -p GENCODE.protein.fa.gz -f GENCODE.gtf.gz -o text_variant
Three files will be generated in the examples folder:
- text_variant.pergeno.aa_mutations.csv: annotations of amino acid changes.
- text_variant.pergeno.protein_all.fa: all proteins after incorporating the variants.
- text_variant.pergeno.protein_changed.fa: all proteins that differ from the input protein sequences after incorporating the variants.
Note:
- Protein names and descriptions in the FASTA file are the same as in the input protein file, with a Tab symbol (\t) followed by changed or unchanged added to indicate whether the protein sequence is altered.
- e.g.,
  ENSP00000328207.6|ENST00000328596.10|ENSG00000186891.14|OTTHUMG00000001414|OTTHUMT00000004085.1|TNFRSF18-201|TNFRSF18|255 unchanged,
  ENSP00000424920.1|ENST00000502739.5|ENSG00000162458.13|OTTHUMG00000003079|OTTHUMT00000368044.1|FBLIM1-210|FBLIM1|144 changed.
There are several files in the src folder. Each of them is designed so that it can be run independently. To get help, run
python Path_of_PrecisionProDB/src/MODULE_NAME.py -h
where MODULE_NAME should be the name of a file in the src folder, without the .py extension.
To get help for the main program, run
python Path_of_PrecisionProDB/src/PrecisionProDB.py -h
The printed message is provided on the wiki page, where further explanations can be found.
Notes:
- If the chromosome names in the GTF file and the mutation file are different, -a RefSeq is needed to match them, and -k needs to be adjusted to match the names in the protein file and in the GTF file. For ORFanage translation with the RefSeq-CHM13 model and a mutation file whose chromosome names start with "chr", the parameters should look like -a RefSeq -k transcript_id
- The input protein IDs should not contain '__' (a double underscore).
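Both points in the notes above can be checked on the input files before a run. The helpers below are a hypothetical pre-check, not part of PrecisionProDB: one reports whether the GTF and the mutation file share any chromosome names, and the other lists protein IDs that contain the reserved double underscore. The names in the usage lines are made-up examples.

```python
def shared_chromosomes(gtf_chroms, variant_chroms):
    """Return the chromosome names present in both files."""
    return set(gtf_chroms) & set(variant_chroms)


def ids_with_double_underscore(protein_ids):
    """Return the protein IDs that contain the reserved '__' separator."""
    return [pid for pid in protein_ids if "__" in pid]


# An accession-style GTF name never matches a 'chr'-style name directly.
overlap = shared_chromosomes(["NC_000001.11"], ["chr1", "chr2"])
if not overlap:
    print("No shared chromosome names; name matching (e.g. -a RefSeq) may be needed")

bad_ids = ids_with_double_underscore(["NP_000001.1", "my__protein"])
print("IDs to rename:", bad_ids)
```

An empty overlap suggests that the two files use different naming schemes, which is the situation the -a RefSeq setting is described as handling.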
For more information, visit the wiki page: https://github.com/ATPs/PrecisionProDB/wiki/Outputs-of-PrecisionProDB
The number of altered proteins is shown while PrecisionProDB is running. In "PREFIX.pergeno.protein_all.fa", the word "changed" or "unchanged" is at the end of each FASTA header, and users may count the number of changed proteins based on this annotation.
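Counting the proteins by this annotation takes only a few lines. The sketch below is a hypothetical helper that reads any text handle; the toy records stand in for a real "PREFIX.pergeno.protein_all.fa" file.

```python
import io
from collections import Counter


def count_protein_status(handle):
    """Count FASTA headers whose last field is 'changed' or 'unchanged'."""
    counts = Counter()
    for line in handle:
        if line.startswith(">"):
            fields = line.split()
            status = fields[-1] if fields else ""
            if status in ("changed", "unchanged"):
                counts[status] += 1
    return counts


toy_fasta = io.StringIO(
    ">p1\tunchanged\nMAAA\n"
    ">p2__1\tchanged\nMAAT\n"
    ">p3__12\tchanged\nMCCC\n"
)
counts = count_protein_status(toy_fasta)
print(counts["changed"], counts["unchanged"])
```

For a real output file, the handle can come from `open(path)` or, for compressed files, `gzip.open(path, "rt")`.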
Generally, users can find annotations for variants in the "PREFIX.pergeno.aa_mutations.csv" file, which describes the effects of different variants, including AA substitutions, insertions, deletions, stop-loss, stop-gain, and frame-changes.
Users may use tools like https://github.com/pwilmart/fasta_utilities to further compare the difference of trypsin digested peptides.
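To give a sense of what such a comparison involves, the sketch below performs a simple in-silico trypsin digestion and reports the peptides that appear only in the altered protein. It is independent of fasta_utilities and does not reproduce that tool. The cleavage rule (after K or R, but not before P), the absence of missed cleavages and the `min_length` default are simplifying assumptions, and the sequences are toy examples.

```python
import re


def trypsin_digest(protein, min_length=7):
    """Cleave after K or R unless followed by P; no missed cleavages."""
    peptides = re.split(r"(?<=[KR])(?!P)", protein)
    # Keep only peptides long enough to be useful; this also drops empty pieces.
    return {p for p in peptides if len(p) >= min_length}


reference = "MAAAAAAKCCCCCCCRDDDDDDDK"
altered = "MAAAAAAKCCCTCCCRDDDDDDDK"
new_peptides = trypsin_digest(altered) - trypsin_digest(reference)
print(new_peptides)
```

Only peptides that differ between the two digests are candidates for detecting the variant by mass spectrometry, which is why a single amino acid change often yields just one new peptide.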
The Genome Aggregation Database (gnomAD) project provides variant allele frequencies in different populations, based on the genomes and exomes of hundreds of thousands of individuals, and this information can be integrated into a protein database. We applied PrecisionProDB to alleles from different populations in the gnomAD 3.1 data. Results can be found at https://github.com/ATPs/PrecisionProDB_references.
Please leave comments on the issue tab.