data format matters? #142

Open
hxxhust163 opened this issue Nov 25, 2022 · 4 comments

Comments

@hxxhust163

Dear maintainers,

Thank you for your work on MS-GF+; it is a wonderful search engine. However, I am confused about how the input data format affects the search. I am using MS-GF+ for immunopeptidomics. At first, I converted the raw file to centroided mzML and searched with 64 GB of RAM and 64 threads; the search took nearly 6 hours for a single mzML file. Because that was so slow, I converted the same raw file to centroided MGF and searched again with the same parameters, and that search finished in about 20 minutes. However, the output mzid files of the two searches are not the same: the mzML search identified twice as many PSMs as the MGF search. Why is that? It seems very strange.

@hxxhust163
Author

I ran the same comparison with the Comet search engine and got identical identification results for the two formats, which suggests that the data format itself does not affect the identifications.

@FarmGeek4Life
Collaborator

  1. Are you sure you used the same parameters for the MS-GF+ mzML and MGF searches? Generally, in our testing the results match or are very similar (see below). However, you may need to make the search parameters exactly the same by overriding defaults that are set from information available in an mzML file but not in an MGF file. See https://msgfplus.github.io/msgfplus/MSGFPlus.html, in particular '-m FragmentMethodID'.
  2. One potential reason for a difference in MS-GF+ results between mzML input and other formats is specific to Thermo Orbitrap instruments: MS-GF+ reads the 'Thermo Trailer Extra: Monoisotopic m/z' value that ProteoWizard/MSConvert writes to the mzML file when converting Thermo Orbitrap .raw files, and uses it instead of the 'selected m/z' value that is used for all other instruments. However, this should only lead to minor differences.
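To illustrate point 1, here is a minimal sketch that builds the MS-GF+ command line for both input formats with the fragmentation method and instrument set explicitly, so neither search falls back to a format-dependent default. The helper name, file names, jar path, and memory setting are hypothetical; the values `-m 3` (HCD) and `-inst 3` (Q Exactive) are taken from the settings discussed in this thread and should be checked against the MS-GF+ documentation for your version.

```python
from typing import List


def build_msgfplus_command(
    spectrum_file: str,
    fasta: str,
    output_mzid: str,
    frag_method: int = 3,      # assumption from this thread: 3 = HCD
    instrument: int = 3,       # assumption from this thread: 3 = Q Exactive
    jar: str = "MSGFPlus.jar", # hypothetical path to the MS-GF+ jar
    max_heap: str = "8G",      # hypothetical JVM heap size
) -> List[str]:
    """Return an argument list that pins -m and -inst explicitly."""
    return [
        "java", f"-Xmx{max_heap}", "-jar", jar,
        "-s", spectrum_file,
        "-d", fasta,
        "-o", output_mzid,
        "-m", str(frag_method),
        "-inst", str(instrument),
    ]


# Hypothetical file names: the same run converted to both formats.
mzml_cmd = build_msgfplus_command("run01.mzML", "db.fasta", "run01_mzML.mzid")
mgf_cmd = build_msgfplus_command("run01.mgf", "db.fasta", "run01_mgf.mzid")
```

The resulting lists can be passed to `subprocess.run(mzml_cmd, check=True)`. Only the `-s` and `-o` values differ between the two commands, which is the point of the comparison.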

@hxxhust163
Author

Thanks very much! I figured out the problem. The data in my search were acquired on a Q Exactive instrument with HCD fragmentation. In my initial run, I used 'FragmentationMethodID=0, InstrumentID=3' in my parameters. When I searched the mzML file, MS-GF+ read the fragmentation method (HCD) from the file. But when I searched the MGF file with the same parameters, MS-GF+ treated the spectra as CID by default, because the MGF file contains no fragmentation information. That is the reason for the difference.
By the way, I have another question. For the same mzML data mentioned above, which is better to use: 'FragmentationMethodID=0, InstrumentID=3' or 'FragmentationMethodID=3, InstrumentID=3'? I ran the search with each setting separately and got two different results.
Thanks in advance!
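The behaviour described above can be modelled in a few lines. This is only an illustrative sketch of the rule as stated in this thread (0 means "use what the spectrum says, or CID if nothing is recorded"), not MS-GF+'s actual code; the function name and the ID-to-name table are assumptions for illustration.

```python
from typing import Optional

# Assumed mapping, following the values mentioned in this thread and the docs.
FRAGMENTATION_METHODS = {1: "CID", 2: "ETD", 3: "HCD"}


def effective_fragmentation(method_id: int, method_in_file: Optional[str]) -> str:
    """Model which fragmentation method a search ends up using.

    method_id      -- the FragmentationMethodID search parameter
    method_in_file -- the method recorded for the spectrum, or None when the
                      format carries no such information (e.g. MGF)
    """
    if method_id == 0:
        # Default: trust the file, fall back to CID when it says nothing.
        return method_in_file if method_in_file is not None else "CID"
    # An explicit ID overrides whatever the file says.
    return FRAGMENTATION_METHODS[method_id]


mzml_default = effective_fragmentation(0, "HCD")   # mzML records HCD
mgf_default = effective_fragmentation(0, None)     # MGF records nothing
mgf_explicit = effective_fragmentation(3, None)    # explicit HCD for MGF
```

With the default setting the two formats end up scored under different fragmentation models, while setting the ID to 3 makes them agree.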

@alchemistmatt
Collaborator

In theory, the only way you should get different results for FragmentationMethodID=0 vs. FragmentationMethodID=3 is if the file has a mix of HCD and non-HCD spectra; see the comment in this example parameter file:
https://github.com/MSGFPlus/msgfplus/blob/master/docs/ParameterFiles/MSGFPlus_PartTryp_MetOx_20ppmParTol.txt#L15

In reality, there might be some unexpected side effect that I don't know about. I would suggest using the option that gives more filter-passing results.
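One way to check whether a file really contains a mix of HCD and non-HCD spectra is to tally the dissociation methods recorded in the mzML `activation` elements. The sketch below uses only the Python standard library; the function name is hypothetical, and matching on the word "dissociation" in the cvParam name is a heuristic assumption rather than a complete list of controlled-vocabulary terms.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def _local(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}activation'."""
    return tag.rsplit("}", 1)[-1]


def count_dissociation_methods(mzml_source) -> Counter:
    """Count dissociation-method cvParam names across all precursors.

    mzml_source may be a file path or a binary file-like object.
    """
    counts: Counter = Counter()
    for _event, elem in ET.iterparse(mzml_source, events=("end",)):
        tag = _local(elem.tag)
        if tag == "activation":
            for child in elem:
                name = child.get("name", "")
                # Heuristic: method terms contain "dissociation";
                # this skips entries such as "collision energy".
                if _local(child.tag) == "cvParam" and "dissociation" in name:
                    counts[name] += 1
        elif tag == "spectrum":
            elem.clear()  # free memory once a spectrum is fully processed
    return counts
```

For example, `count_dissociation_methods("run01.mzML")` (hypothetical file name) returns a `Counter` keyed by method name. If more than one key appears, the file mixes fragmentation methods, which is the case where FragmentationMethodID=0 and FragmentationMethodID=3 would be expected to differ.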

3 participants