How to reduce MS-GF+ search time #108

Open
Jokendo-collab opened this issue Sep 3, 2020 · 7 comments

@Jokendo-collab

I am running MS-GF+ and it takes days to finish. How can I shorten the sequence database search time? I have increased the number of threads to 32 and the RAM to 32 GB, but the search time has not dropped as I expected. Could you help me figure out how to speed this up? I have 70 raw files, which have taken five days to run on an HPC cluster.

@alchemistmatt
Collaborator

Search times are dependent on three things:

  • Fully tryptic vs. partially tryptic search
  • FASTA file size
  • Number of dynamic modifications

I suspect you are performing a partially tryptic search on a large FASTA (200 MB or larger) and using several dynamic mods. I suggest you change your search to be a fully tryptic search (ntt = 2) and run a test search on one of your 70 .raw files that already finished. Compare the results: did the partially tryptic search reveal more than ~3% additional identifications?
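One way to run that comparison is sketched below. It assumes both searches have been exported to tab-separated text with a header column named `QValue`; the column name, the 0.01 cutoff, and the function names `count_ids` and `pct_gain` are illustrative assumptions, not part of MS-GF+ itself.

```shell
# count_ids TSV [MAX_QVALUE]
# Counts result rows whose QValue is at or below the cutoff (default 0.01).
# The column is located by its header name, assumed here to be "QValue".
count_ids() {
    awk -F'\t' -v max="${2:-0.01}" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "QValue") col = i; next }
        col && $col + 0 <= max { n++ }
        END { print n + 0 }
    ' "$1"
}

# pct_gain FULLY_TRYPTIC_COUNT PARTIALLY_TRYPTIC_COUNT
# Prints the percentage of additional identifications found by the
# partially tryptic search, relative to the fully tryptic search.
pct_gain() {
    awk -v full="$1" -v partial="$2" 'BEGIN {
        if (full <= 0) { print "NA"; exit 1 }
        printf "%.1f\n", 100 * (partial - full) / full
    }'
}
```

With hypothetical file names, `pct_gain "$(count_ids full.tsv)" "$(count_ids partial.tsv)"` prints the gain; a value at or below roughly 3 would suggest the fully tryptic search is sufficient.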

I am, of course, just guessing here. You'll need to tell us:

  1. The number of MS/MS scans in one of your representative .raw files
  2. The size of your FASTA file, in MB
  3. What arguments you're using for searching, especially NTT
  4. Which dynamic modifications you're searching for (mod name and affected residues)
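The first two numbers can be collected from the command line. The sketch below assumes the .raw files have already been converted to mzML and that each spectrum's `ms level` cvParam (accession `MS:1000511`) sits on its own line, as is typical for converter output; the function names are made up for illustration.

```shell
# count_msms_scans MZML
# Counts MS/MS (MS level 2) spectra by matching the "ms level" cvParam.
count_msms_scans() {
    grep -cE 'accession="MS:1000511".*value="2"' "$1"
}

# fasta_summary FASTA
# Prints the FASTA size in MB and the number of protein entries
# (lines that start with ">").
fasta_summary() {
    local bytes entries
    bytes=$(wc -c < "$1")
    entries=$(grep -c '^>' "$1")
    awk -v b="$bytes" -v n="$entries" \
        'BEGIN { printf "%.1f MB, %d entries\n", b / 1048576, n }'
}
```

For example, `count_msms_scans sample.mzML` and `fasta_summary database.fasta`, where both file names are placeholders.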

@Jokendo-collab
Author

Hi,
Below is my code. I am using only two modifications (one fixed and one dynamic). The FASTA file is 1.5 GB in size.
```
msgfplus=/scratch/oknjav001/bal_mzML_raw_files/databaseComparisonProject/msgfplus/searchEngine/MSGFPlus.jar
mods=/scratch/oknjav001/bal_mzML_raw_files/databaseComparisonProject/msgfplus/searchEngine/MSGFPlus_Mods1.txt
fastadb=/scratch/oknjav001/bal_mzML_raw_files/humanDatabase/fullmicribiome.fasta

#============baseline==================================
cd /scratch/oknjav001/bal_mzML_raw_files/completeBAL_mzMLfiles/mzMLfile/SIM/baseline

for mzml in *.mzML
do
    java -Xmx16G -jar "$msgfplus" -s "$mzml" -d "$fastadb" -mod "$mods" -inst 3 -maxMissedCleavages 1 -t 20ppm -ti -1,2 -ntt 2 -tda 1
done

#============================bcg=================================
cd /scratch/oknjav001/bal_mzML_raw_files/completeBAL_mzMLfiles/mzMLfile/SIM/bcg

for mzml in *.mzML
do
    java -Xmx16G -jar "$msgfplus" -s "$mzml" -d "$fastadb" -mod "$mods" -inst 3 -maxMissedCleavages 1 -t 20ppm -ti -1,2 -ntt 2 -tda 1
done
```

@alchemistmatt
Collaborator

You are using -ntt 2 so that's good. Please paste the contents of searchEngine/MSGFPlus_Mods1.txt here

The big problem is that 1.5 GB FASTA file. I'm not sure that 16 GB is enough for it; hopefully it is. Provided Java does not report an out-of-memory exception, there really isn't much that can be done to speed up the search: a 1.5 GB FASTA file is very large and will take time to search. The only option would be to remove any dynamic mods in MSGFPlus_Mods1.txt (which is why I'm curious what it has).
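A quick way to check for the out-of-memory case is sketched below. It assumes the console output of each search was redirected to its own log file (for example by appending `> sample.log 2>&1` to the java command, which the posted script does not do); `oom_logs` is a hypothetical helper name.

```shell
# oom_logs LOG...
# Lists the log files that contain a Java out-of-memory error.
# Prints nothing (and returns non-zero) when no log mentions one.
oom_logs() {
    grep -lF 'java.lang.OutOfMemoryError' "$@"
}
```

Running `oom_logs *.log` in the directory holding the logs would then list only the searches that ran out of heap space.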

Splitting the 1.5 GB FASTA file into smaller chunks (using https://github.com/PNNL-Comp-Mass-Spec/Fasta-File-Splitter ) is an option, but that won't reduce the total search time on a single machine. It's really only useful if Java is running out of memory, or if you're able to run multiple copies of MS-GF+ simultaneously, ideally on different systems.
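If the FASTA file has already been split, the per-chunk searches could be generated along these lines. This is only a sketch: it assumes the chunks are `*.fasta` files in one directory, reuses the search arguments from the script posted above, names each result with the `-o` option, and assumes paths without spaces. The function name and the output naming scheme are illustrative.

```shell
# print_split_searches JAR MODS PARTS_DIR MZML...
# Prints one MS-GF+ command per (mzML file, FASTA chunk) pair instead of
# running it, so the commands can be reviewed or handed to a scheduler.
print_split_searches() {
    local jar="$1" mods="$2" parts_dir="$3"
    shift 3
    local mzml part base
    for mzml in "$@"; do
        base=$(basename "$mzml" .mzML)
        for part in "$parts_dir"/*.fasta; do
            printf 'java -Xmx16G -jar %s -s %s -d %s -mod %s -inst 3 -maxMissedCleavages 1 -t 20ppm -ti -1,2 -ntt 2 -tda 1 -o %s_%s.mzid\n' \
                "$jar" "$mzml" "$part" "$mods" "$base" "$(basename "$part" .fasta)"
        done
    done
}
```

Each printed line is independent, so the lines can be distributed across cluster nodes; for instance, `print_split_searches "$msgfplus" "$mods" fasta_parts *.mzML > jobs.txt`, where `fasta_parts` and `jobs.txt` are placeholder names.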

@alchemistmatt
Collaborator

alchemistmatt commented Sep 5, 2020

Ah, I just noticed in #10 that the software is, in fact, crashing, and you need a copy of the
Fasta-File-Splitter binary (which does work on Linux via Mono -- I just tested it).

Here you go:

Note that the Fasta-File-Splitter is a VB.NET program (while most of our software is C#). Thus, you need a version of Mono new enough to support VB.NET (it has had support for 6+ years, but package managers for older Linux distros might ship an old version of Mono). See https://www.mono-project.com/download/stable/

You will split the FASTA file (probably into 10 parts), then run MS-GF+ 10 times for each .mzML file. Once you have the .mzid files from all of the searches, you will need to re-combine them and re-compute EValues. For that, use the MzIdMerger:

@Jokendo-collab
Author

@alchemistmatt this is the information in my modification file.
NumMods=2

C2H3N1O1,C,fix,any,Carbamidomethyl # Fixed Carbamidomethyl C

# Variable Modifications (default: none)

O1,M,opt,any,Oxidation # Oxidation M

#15.994915,M,opt,any,Oxidation # Oxidation M (mass is used instead of CompositionStr)
#H-1N-1O1,NQ,opt,any,Deamidated # Negative numbers are allowed.
#C2H3NO,*,opt,N-term,Carbamidomethyl # Variable Carbamidomethyl N-term
#H-2O-1,E,opt,N-term,Glu->pyro-Glu # Pyro-glu from E
#H-3N-1,Q,opt,N-term,Gln->pyro-Glu # Pyro-glu from Q
#C2H2O,*,opt,Prot-N-term,Acetyl # Acetylation Protein N-term
#C2H2O1,K,opt,any,Acetyl # Acetylation K
#CH2,K,opt,any,Methyl # Methylation K
#HO3P,STY,opt,any,Phospho # Phosphorylation STY

@ATPs

ATPs commented Dec 6, 2020

Comet runs fast after indexing the database, and the indexed database includes the modifications. I think MS-GF+ could be much faster if the modifications were included in the indexing step and sorted properly, I guess...

@FarmGeek4Life
Collaborator

@ATPs Implementing such an idea would be a significant amount of work.
