Benchmarking of the id dda workflow (ms2rescore, percolator, SNR) #410

ypriverol · 2024-08-16T22:06:48Z

PXD001819 Analysis

Currently, we have a workflow that can perform peptide identification using: -> ms2rescore -> SNR + spectrum properties -> percolator

The results can be found here (https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD001819-id-ms2rescore/).

Total number of PSMs

Comet only + Percolator: 495306
Comet + MSGF + Percolator: 572496 (15.58% increase)
Comet + MSGF + ms2rescore: 589200 (18.95% increase)
Comet + MSGF + (SNR + ms2rescore): 587972 (18.71% increase)
Comet + MSGF + SAGE + (SNR + ms2rescore): 592918 (19.68% increase)

Total number of PSMs by RAW file and combination

Currently, ms2rescore alone yields the most PSM identifications, followed by ms2rescore + SNR.

The following questions would be interesting to understand:

When the spectrum quality metrics are introduced, are the PSMs of higher quality? That is, although ms2rescore + SNR yields fewer PSMs, are they of higher quality than those from ms2rescore alone?
Do we see the same results in other datasets?
What is the impact at peptide level?


ypriverol · 2024-08-17T17:53:50Z

PXD014415

Currently, we have a workflow that can perform peptide identification using: -> ms2rescore -> SNR + spectrum properties -> percolator

Each of these combinations can be turned off. We used the dataset PXD014415 to benchmark the peptide identifications with some of the combinations:

The results can be found here (https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD014415-id-ms2rescore/).

Combinations and PSM counts:

Comet only + Percolator: 1401471
Comet + MSGF + Percolator: 1576657 (12.50% increase)
Comet + MSGF + ms2rescore: 1620560 (15.63% increase)
Comet + MSGF + (SNR + ms2rescore): 1617000 (15.38% increase)
Comet + MSGF + SAGE + (SNR + ms2rescore): 1646795 (17.50% increase)
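
The percentages above are relative increases over the "Comet only + Percolator" baseline. A minimal sketch of that arithmetic, using the PXD014415 counts listed in this comment (the function name `percent_increase` is only illustrative):

```python
def percent_increase(count, baseline):
    """Relative increase of `count` over `baseline`, in percent."""
    return (count - baseline) / baseline * 100.0


# PSM counts copied from the list above (PXD014415).
baseline = 1401471  # Comet only + Percolator
counts = {
    "Comet + MSGF + Percolator": 1576657,
    "Comet + MSGF + ms2rescore": 1620560,
    "Comet + MSGF + (SNR + ms2rescore)": 1617000,
    "Comet + MSGF + SAGE + (SNR + ms2rescore)": 1646795,
}

for name, n in counts.items():
    print(f"{name}: {n} ({percent_increase(n, baseline):.2f}% increase)")
```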

Total number of PSMs by RAW file and combination

Currently, the combination of ms2rescore (Comet + MSGF + SAGE) and SNR yields the most PSM identifications.

jpfeuffer · 2024-08-18T09:35:59Z

What is "non-sage"? Comet?

ypriverol · 2024-08-18T11:11:21Z

Sorry, "non-sage" is Comet + MSGF.

jpfeuffer · 2024-08-18T11:38:05Z

Sage comes on top or as replacement?

How expensive is the snr/feature calculation? I think we could improve the pyopenms script if it is expensive.

Have you tried a more robust snr estimation? I.e. RMS of Top 10 / RMS of all? The max seems prone to outliers but seems to work well enough.

Have you planned any false positive evaluation (like ground truth, entrapment, cross-species)?

ypriverol · 2024-08-19T07:47:30Z

Sage comes on top or as replacement?

On top.

How expensive is the snr/feature calculation? I think we could improve the pyopenms script if it is expensive.

It is really fast, so there is no urgent need for improvements.

Have you tried a more robust snr estimation? I.e. RMS of Top 10 / RMS of all? The max seems prone to outliers but seems to work well enough.

We can add more metrics like RMS of the Top 10. Feel free to do a PR to quantms utils https://github.com/bigbio/quantms-utils/blob/main/quantmsutils/features/snr.py
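
For reference, a minimal sketch of the "RMS of Top 10 / RMS of all" idea next to a max-based ratio. This is not the code in `snr.py`: the max-based form used here (highest peak over the RMS of all peaks) is an assumption made only for comparison, and the function names are hypothetical.

```python
import math


def rms(values):
    """Root mean square of a sequence; 0.0 for an empty sequence."""
    values = list(values)
    if not values:
        return 0.0
    return math.sqrt(sum(v * v for v in values) / len(values))


def snr_max(intensities):
    """Max-based ratio (assumed form): highest peak over the RMS of all peaks."""
    noise = rms(intensities)
    return max(intensities) / noise if noise > 0 else 0.0


def snr_top_n_rms(intensities, n=10):
    """RMS of the n most intense peaks over the RMS of all peaks."""
    noise = rms(intensities)
    top = sorted(intensities, reverse=True)[:n]
    return rms(top) / noise if noise > 0 else 0.0


# Toy spectra: same peaks, except one peak is replaced by a 10x outlier.
clean = [1.0] * 90 + [10.0] * 10
spiked = [1.0] * 90 + [10.0] * 9 + [100.0]

for name, fn in (("max", snr_max), ("top-10 RMS", snr_top_n_rms)):
    print(f"{name}: clean={fn(clean):.2f}, spiked={fn(spiked):.2f}")
```

In this toy example the single outlier peak roughly triples the max-based value, while the top-10 RMS ratio changes by only a few percent.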

Have you planned any false positive evaluation (like ground truth, entrapment, cross-species)?

I'm open to suggestions. I would love to evaluate whether this 5% increase in PSMs affects the FDR in some way. I'm also open to suggestions on how to evaluate the difference between SNR + ms2rescore and ms2rescore alone. I have manually checked some IDs (in proteogenomics - https://www.biorxiv.org/content/10.1101/2024.05.24.595489v1), and I know that ms2rescore can rescue (identify) some low-quality spectra, which is why we added the SNR. It would be nice to have a benchmark to prove it.
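
As one possible starting point for the entrapment option mentioned above, here is a minimal sketch. It assumes the search database is extended with entrapment sequences (for example from an unrelated species) that total `r` times the size of the original target database, and that every accepted entrapment PSM is a false positive. The "combined" estimate below is one commonly used form; the function and parameter names are illustrative and not part of quantms.

```python
def entrapment_fdp(n_target, n_entrapment, r=1.0):
    """Estimate the false discovery proportion among accepted PSMs.

    n_target:     accepted PSMs matching original target sequences
    n_entrapment: accepted PSMs matching entrapment sequences
    r:            entrapment database size / original target database size

    Assumption: false hits land on entrapment and original targets in
    proportion to database size, so each entrapment hit stands for
    1/r additional (hidden) false hits among the original targets.
    """
    total = n_target + n_entrapment
    if total == 0:
        return 0.0
    return n_entrapment * (1.0 + 1.0 / r) / total


# Made-up counts, for illustration only.
print(f"{entrapment_fdp(n_target=990, n_entrapment=10, r=1.0):.3f}")
```

If the estimate at a given FDR threshold (say 1%) is well above that threshold for one feature set but not for another, that would suggest the extra PSMs come at the cost of FDR control.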

ypriverol · 2024-08-20T17:25:53Z

Today I was reading about MSAmanda + ms2rescore; the increase in PSMs there is 6%.

RalfG · 2024-08-21T11:30:17Z

Any increase in #PSMs will depend on the type of dataset and search space. Generally, we see modest increases for simple searches (for instance the yeast UPS search above) and significant increases for difficult searches (46% for immunopeptidomics, 10.1016/j.mcpro.2022.100266).

Note that even with modest increases in sensitivity, the separation between true and false PSMs is expected to be better, which means that in most cases you could increase the specificity to 0.1% FDR without losing sensitivity (for instance shown in doi:10.1016/j.mcpro.2021.100076).

ypriverol · 2024-08-23T08:42:54Z

Thanks @RalfG for this response:

Any increase in #PSMs will depend on the type of dataset and search space. Generally, we see modest increases for simple searches (for instance the yeast UPS search above) and significant increases for difficult searches (46% for immunopeptidomics, 10.1016/j.mcpro.2022.100266).

Note that even with modest increases in sensitivity, the separation between true and false PSMs is expected to be better, which means that in most cases you could increase the specificity to 0.1% FDR without losing sensitivity (for instance shown in doi:10.1016/j.mcpro.2021.100076).

How do you test this? Distribution of the PEP scores or the original scores for targets and decoys?

RalfG · 2024-08-23T10:49:09Z

Usually just by plotting the amount of confidently identified PSMs at each FDR threshold, as in figure 1 of doi:10.1016/j.mcpro.2021.100076.
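
A minimal sketch of how such a curve could be computed before plotting, assuming a target-decoy search where each PSM has a score (higher is better) and a decoy flag. The q-value estimate here is the simple decoys/targets ratio with a running minimum; the function names are illustrative and not taken from ms2rescore or Percolator.

```python
def q_values(scores, is_decoy):
    """Simple target-decoy q-values; returned in the input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # q-value: lowest FDR at which this PSM would still be accepted.
    qvals = [0.0] * len(scores)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        qvals[order[rank]] = running_min
    return qvals


def targets_at_fdr(scores, is_decoy, thresholds=(0.001, 0.01, 0.05)):
    """Number of target PSMs accepted at each FDR threshold."""
    qvals = q_values(scores, is_decoy)
    return {
        t: sum(1 for q, d in zip(qvals, is_decoy) if not d and q <= t)
        for t in thresholds
    }


# Toy input, for illustration only.
scores = [10, 9, 8, 7, 6, 5, 4, 3]
decoy = [False, False, False, True, False, False, True, True]
print(targets_at_fdr(scores, decoy, thresholds=(0.01, 0.25)))
```

Running this once per feature set (Percolator only, ms2rescore, ms2rescore + SNR) and plotting the counts against the thresholds would give the kind of comparison described above.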

jonasscheid · 2024-09-10T09:33:53Z

I'm a bit curious about

We can add more metrics like RMS of the Top 10. Feel free to do a PR to quantms utils https://github.com/bigbio/quantms-utils/blob/main/quantmsutils/features/snr.py

Did you check the feature weights of percolator for this feature? I would guess that the Comet Xcorr implicitly penalizes for high SNR: https://willfondrie.com/2019/02/an-intuitive-look-at-the-xcorr-score-function-in-proteomics/

Would be great to see how high search engine scores / predicted features weights are in percolator!

daichengxin · 2024-09-22T07:50:24Z

Thanks for your suggestions. Here are the latest benchmark results from PXD001819 and PXD014415. The top 20 Percolator weights are shown in figure3 and figure4 (top panel: Comet; bottom panel: MSGF), and the SNR features are plotted in figure5 and figure6, where (a) is the Percolator method, (b) is ms2rescore, and (c) is ms2rescore + SNR.

I think we can draw some conclusions:

Multiple search engines improved identification by >10%.
Adding MS2rescore features enhanced the separation between true and false PSMs, which means the specificity can be increased to 0.1% FDR (a 3% increase).
Peptide length and Comet:spscore have significant weights. For peptide length, I think that the longer the peptide, the more key ions are produced and matched, so it is easier to differentiate between false and true PSMs. The weight of XCorr is positive, which indicates that a high XCorr gives a better hit. The weight of absdM is negative, which indicates that a large difference between observed and calculated mass gives a worse score. These results are the same as in https://github.com/percolator/percolator/wiki/Example.
After adding MS2rescore features, the weight distribution changed. RT difference and ion intensity difference features become important. For example, the weight of ionb_mse_norm is negative, which indicates that large differences between observed and predicted b-ion intensities give a worse score.
After adding SNR features, the weight of quantms:snr is positive, which indicates that a high SNR gives a better score. The weights of quantms:SpectralEntropy and quantms:FracTICinTop10Peaks are positive in PXD001819, which indicates that higher values of these signal-distribution features give a better score. But this may differ across search engines and datasets.
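
To make the sign interpretation above concrete: Percolator's final score is a linear combination of (normalized) features, so a positive weight raises the score as the feature grows and a negative weight lowers it. A minimal sketch follows; the weights and feature values are hypothetical, chosen only to match the signs discussed above, and are not the values from the figures.

```python
# Hypothetical weights (signs only follow the discussion above).
weights = {
    "XCorr": 0.8,
    "absdM": -0.3,
    "ionb_mse_norm": -0.5,
    "quantms:snr": 0.2,
}


def linear_score(features, weights, bias=0.0):
    """Linear score: bias plus the weighted sum of the feature values."""
    return bias + sum(w * features[name] for name, w in weights.items())


# Two made-up PSMs with normalized feature values.
psm_a = {"XCorr": 1.5, "absdM": 0.1, "ionb_mse_norm": 0.2, "quantms:snr": 1.0}
psm_b = {"XCorr": 1.5, "absdM": 0.1, "ionb_mse_norm": 1.2, "quantms:snr": -1.0}

# psm_b has a worse predicted-intensity match and a lower SNR, so it scores lower.
print(linear_score(psm_a, weights), linear_score(psm_b, weights))
```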
PXD001819:

PXD014415:

PXD001819:

PXD014415:

Looking forward to your feedback!

ypriverol added the enhancement New feature or request label Aug 16, 2024

ypriverol assigned jpfeuffer and daichengxin Aug 16, 2024

ypriverol mentioned this issue Sep 6, 2024

additional_scores as Record in parquet instead of list of String bigbio/quantms.io#52

Closed
