DeepFilter is a metaproteomics filtering tool based on a deep learning model. It aims to improve peptide identification in microbial communities from a collection of tandem mass spectra. Details are available at https://arxiv.org/pdf/2009.11241.pdf
- python == 3.6
- numpy == 1.17.2
- scikit-learn >= 0.21.3
- PyTorch (GPU version) >= 1.4.0
- CUDA version 10.2
- Linux operating system
- More than 8 GB of GPU memory for inference mode; otherwise, the batch size should be reduced
- More than 20 GB of GPU memory for training mode
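The GPU memory requirements above can be checked before launching a run. The following is a minimal sketch, not part of DeepFilter: the function names `check_gpu_memory` and `detect_gpu_memory_gb` are hypothetical, and the thresholds are taken directly from the list above. The detection helper assumes PyTorch is installed and simply returns `None` when it is not.

```python
# Hypothetical helper, not shipped with DeepFilter.
# Thresholds mirror the requirements list: > 8 GB for inference, > 20 GB for training.
INFERENCE_MIN_GB = 8
TRAINING_MIN_GB = 20


def check_gpu_memory(gpu_memory_gb):
    """Map each DeepFilter mode to a short status for the given GPU memory (GB)."""
    return {
        # Inference can still run on smaller GPUs if the batch size is reduced.
        "inference": "ok" if gpu_memory_gb > INFERENCE_MIN_GB else "reduce batch size",
        "training": "ok" if gpu_memory_gb > TRAINING_MIN_GB else "insufficient memory",
    }


def detect_gpu_memory_gb(device_index=0):
    """Return the total memory of one GPU in GB, or None if PyTorch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(device_index).total_memory / 1024 ** 3


if __name__ == "__main__":
    memory_gb = detect_gpu_memory_gb()
    if memory_gb is None:
        print("No CUDA-capable GPU detected through PyTorch.")
    else:
        print("GPU memory: {:.1f} GB -> {}".format(memory_gb, check_gpu_memory(memory_gb)))
```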
The toy example is provided as a quick start. Its files include:
- OSU_D2_FASP_Elite_02262014_01.ms2 -> experimental tandem mass spectrum data
- OSU_D2_FASP_Elite_02262014_1.pin -> database search results from Comet
- temp_model/ directory -> contains three models; the file "benchmark.pt" is the pre-trained model for inference
- The FASTA file for filtering is available at https://myunt-my.sharepoint.com/:u:/r/personal/xuan_guo_unt_edu/Documents/Shichao/Metaproteomics%20Deep%20Learning/testdata.fasta.zip?csf=1&web=1&e=c8as9q

The file inference.sh rescores the PSMs from existing database search results. The usage:
#!/bin/bash
./inference.sh -in OSU_D2_FASP_Elite_02262014_01.ms2 -s OSU_D2_FASP_Elite_02262014_1.pin -m temp_model/benchmark.pt -o test.rescore.txt
The files produced during processing include:
- test.rescore.txt -> rescoring results for the PSMs
- testidx.txt, testcharge.txt, testpeptide.fasta -> intermediate files used to generate the isotope distributions
- test.expEncode.txt -> results of grouping the observed spectra by charge
- test.theoryEncode.txt -> results of grouping the isotope distributions of the peptide sequences by charge and ion type
- test.feature.txt -> 11 extra features extracted from the initial PSM score, the observed spectrum, and the peptide sequence
Execute the filtering.py file as:
python filtering.py test.rescore.txt OSU_D2_FASP_Elite_02262014_1.pin test.psm.txt test.pep.txt
The first argument is the rescoring results file generated by the deep learning model in inference mode, the second argument is the results file from the database search engine (Comet), and the third and fourth arguments are user-defined output files. The output files contain the identification results at the PSM level and the peptide level, respectively, at an FDR of 1%.
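To illustrate what filtering at an FDR of 1% means, here is a minimal sketch of the common target-decoy approach. This README does not describe how filtering.py estimates FDR, so the decoys/targets estimator, the function name `filter_at_fdr`, and the `(score, is_decoy)` input format are all assumptions made for illustration.

```python
def filter_at_fdr(psms, fdr_threshold=0.01):
    """Keep target PSMs from the longest score-ranked prefix within the FDR threshold.

    psms: iterable of (score, is_decoy) tuples, where a higher score is a better match.
    FDR is estimated as (#decoys / #targets) among the PSMs ranked so far
    (an assumed estimator, not necessarily the one used by filtering.py).
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = 0
    decoys = 0
    cutoff = 0  # number of top-ranked PSMs that will be accepted
    for rank, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest rank at which the estimated FDR is still acceptable.
        if targets > 0 and decoys / targets <= fdr_threshold:
            cutoff = rank
    # Report only target hits from the accepted prefix.
    return [psm for psm in ranked[:cutoff] if not psm[1]]


if __name__ == "__main__":
    example = [(0.9, False), (0.8, False), (0.7, True), (0.6, False)]
    print(filter_at_fdr(example, fdr_threshold=0.01))
```

The same idea applies at the peptide level by first keeping only the best-scoring PSM for each peptide sequence.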
Execute the sipros_peptides_assembling.py file as:
python sipros_peptides_assembling.py
The output file "test.pro.txt" contains the protein-level identification results at an FDR of 1%.
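The three steps above (rescoring, filtering, protein assembly) can be chained from Python. The following is a minimal sketch that only reuses the commands shown in this README; the function names `inference_commands` and `run_all` are hypothetical. Output names are kept as in the toy example because sipros_peptides_assembling.py is invoked without arguments here, so it presumably depends on those default file names.

```python
import subprocess


def inference_commands(ms2_file, pin_file, model="temp_model/benchmark.pt"):
    """Return the argv lists for the rescoring, filtering, and assembling steps."""
    rescore_file = "test.rescore.txt"
    return [
        # Step 1: rescore the PSMs with the pre-trained model.
        ["./inference.sh", "-in", ms2_file, "-s", pin_file, "-m", model, "-o", rescore_file],
        # Step 2: filter at the PSM and peptide levels.
        ["python", "filtering.py", rescore_file, pin_file, "test.psm.txt", "test.pep.txt"],
        # Step 3: assemble peptides into proteins (takes no arguments in this README).
        ["python", "sipros_peptides_assembling.py"],
    ]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `run_all(inference_commands("OSU_D2_FASP_Elite_02262014_01.ms2", "OSU_D2_FASP_Elite_02262014_1.pin"))` from the toy example directory would run the steps in sequence.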
- train_process.py: this script performs charge detection on the observed mass spectra. The first argument is the ms2 file of observed mass spectra and the second argument is the output file for the charge detection results. The usage:
python train_process.py OSU_D2_FASP_Elite_02262014_01.ms2 expEncode.txt
- theory_process.py and Sipros_OpenMP: the Python script and the binary file are used together to generate the isotope distributions of the PSM candidates. The usage:
python theory_process.py OSU_D2_FASP_Elite_02262014_1.pin idx.txt charge.txt peptide.fasta feature.txt
./Sipros_OpenMP -i1 idx.txt -i2 charge.txt -i3 peptide.fasta -i4 theoryEncode.txt
- Label_process.py: this script annotates the PSM candidates for model training. The first and second arguments are the target and decoy PSM files generated by Percolator, the third argument is the prefix for the annotation files, and the last argument is the number of files the user wants to annotate. The usage:
python Label_process.py percolator_results_target.csv percolator_results_decoy.csv Label 1
- train.py: this script trains the DeepFilter model. The first and second arguments are the prefixes of the files containing the processed observed spectra and the isotope distributions. The third and fourth arguments are the prefixes of the 11-extra-feature files and the annotation files. The final argument is the number of files used for training. The usage:
python train.py expEncode.txt theoryEncode.txt feature.txt Label 1
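The training steps listed above can be collected into one ordered command list. This is a minimal sketch that reuses only the commands given in this README; the function name `training_commands` is hypothetical, and running the scripts in the order they are listed is an assumption.

```python
import shlex


def training_commands(ms2_file, pin_file, target_csv, decoy_csv, n_files=1):
    """Return the argv lists for preprocessing, labeling, and training, in listed order."""
    n = str(n_files)
    return [
        # Charge detection on the observed spectra.
        ["python", "train_process.py", ms2_file, "expEncode.txt"],
        # Isotope distributions of the PSM candidates (script plus binary).
        ["python", "theory_process.py", pin_file, "idx.txt", "charge.txt", "peptide.fasta", "feature.txt"],
        ["./Sipros_OpenMP", "-i1", "idx.txt", "-i2", "charge.txt", "-i3", "peptide.fasta", "-i4", "theoryEncode.txt"],
        # Annotation of PSM candidates from Percolator target/decoy results.
        ["python", "Label_process.py", target_csv, decoy_csv, "Label", n],
        # Model training on the processed files.
        ["python", "train.py", "expEncode.txt", "theoryEncode.txt", "feature.txt", "Label", n],
    ]


if __name__ == "__main__":
    commands = training_commands(
        "OSU_D2_FASP_Elite_02262014_01.ms2",
        "OSU_D2_FASP_Elite_02262014_1.pin",
        "percolator_results_target.csv",
        "percolator_results_decoy.csv",
    )
    # Print the commands for review instead of executing them.
    for argv in commands:
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the commands first makes it easy to review file names before starting a long training run.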