Refine and finalize metadata and CV terms #7
Comments
Here is my list of attributes to define at the library level; I will add an * to those I think are mandatory.
|
I'm not sure this is the right place for this comment, but note that NIST has both the MSP format and the SDF format for storing spectral information (I do not see SDF mentioned here yet). MSP seems to be the more common exchange format, but SDF has the advantage that the full structure AND the spectrum can be in it ... and this paper is a great example of using SDF for NMR data exchange: |
It might be better to keep SDF (structure) separate, as the format should cover proteomics, metabolomics, and potentially other MS-based applications |
See parallel conversation for similar comments! MassBank/MassBank-web#110 |
I'm not sure if this is addressed in the MassBank format or other formats, but one thing we try to track on the GNPS side is the provenance: the filename and scan number that the reference spectrum came from. It's not perfect in the record, though, and maybe it is more appropriate to track it externally (which is done at GNPS) and reference those records through an accession number. |
Let's make the largest list possible first (with each field marked as required/optional) and we can whittle it down. Maybe it is easier to edit a Google Doc together?
My quick thoughts below.
At library level, we need:
Format version (e.g. mzl 1.0)
A universal library identifier (similar to the universal spectrum identifier), e.g. mzlib:PXL0000100:NIST_cow_2018
Publisher/source, including contact (e.g. NIST)
Publishing date (or library version or serial number)
Library name/descriptor
Software generating the library, and its version
Organisms (do we need this? If yes, we have to allow multiple organisms, or none at all)
All modifications (do we need this at the library level? Probably only necessary to define special mods not already in PSI-MOD or UNIMOD, or to shorten the tags in each library entry by defining them here)
Instrumentation/fragmentation (similar. We need this at spectrum level anyway, as many libraries contain a mixture of spectra from different instruments. Do we need it here?)
Comments
Provenance
(From my experience, it is often necessary to modify/merge/filter libraries to create new custom ones. It would be nice to have a place to keep track of what has been done to the library, e.g. "This library was created from the NIST 2014 one by filtering for all tryptic peptides and merging it with a decoy library..." This can be put into the Comments field, but it may be useful to have a separate "Provenance" field.)
For spectrum level, it is a lot more complicated. Maybe we should have a separate thread/doc for this. |
Re: organisms - the format should be designed flexibly to allow extra metadata, but it should not be too biologically focused. There are a lot of people who use spectral libraries who do not have any organism context. The MassBank requirement for a "natural / not natural" tag has caused many headaches for us environmental people because we never have the context (caffeine, e.g., is a natural product, but for us it is a chemical found in the environment) and such classifications are extremely hard to auto-classify from the wrong context... (i.e. please do not force people to provide information they may not have, so that they end up filling in "something" that is likely incorrect just to fill a field).
|
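As a rough illustration of the library-level list proposed above, the fields could be held in a record where only a few are required and everything else may be left empty. This is a minimal sketch, not an agreed schema: the class name, the attribute names, the choice of which three fields are required, and the library name in the usage lines are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LibraryMetadata:
    """Hypothetical library-level metadata record (names are illustrative)."""

    # Treated as required here purely for the sake of the example.
    format_version: str                      # e.g. "mzl 1.0"
    library_identifier: str                  # e.g. "mzlib:PXL0000100:NIST_cow_2018"
    name: str                                # library name/descriptor
    # Optional fields: nobody is forced to invent a value they do not have.
    publisher: Optional[str] = None          # publisher/source, including contact
    publishing_date: Optional[str] = None    # or library version / serial number
    software: Optional[str] = None           # generating software, with version
    organisms: List[str] = field(default_factory=list)      # none, one, or several
    modifications: List[str] = field(default_factory=list)  # mods not in PSI-MOD/UNIMOD
    instruments: List[str] = field(default_factory=list)    # may also live at spectrum level
    comments: List[str] = field(default_factory=list)
    provenance: List[str] = field(default_factory=list)     # ordered processing history


library = LibraryMetadata(
    format_version="mzl 1.0",
    library_identifier="mzlib:PXL0000100:NIST_cow_2018",
    name="NIST cow 2018",  # made-up descriptor for the example
)
# Record what was done to the library instead of burying it in the comments.
library.provenance.append("Filtered for tryptic peptides")
library.provenance.append("Merged with a decoy library")
```

Keeping organisms as a possibly empty list covers both the "multiple organisms" and the "no organism context" cases raised in this thread.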
The Google document is this one: This issue is for discussing the metadata at the level of the library; we have another issue, #9, to discuss the metadata of the individual spectra |
My comment stands for both the individual spectrum and library level ... many spectral libraries will likewise not come from an organism ... although some may and in this case it would be valuable information to be captured. A more generic description may be more flexible? |
@schymane The idea of having organisms, instruments, and modifications in the library metadata is for dedicated libraries where, for example, the library has been created or filtered for those properties. If that is not the case, then those properties should be captured at the spectrum level, because the number of species, instruments, and especially modifications in one library can be huge. |
@henryhlam @edeutsch @sneumann @schymane I have updated the document with the new fields you provided that need to be captured at the level of the library. Please have a look here https://docs.google.com/document/d/1LgSGtR_t5IcUS9rV7YtsLveDVX9X8KsOpU-5NJ4vuYI/edit |
When a spectral library is searched, there are two fragment tolerances to consider: the user specifies the tolerance of the input data, and something must specify the tolerance of peaks in the library spectra. These two tolerances could easily differ (e.g. search 10ppm data against a 0.5Da library). It would be nice if fragment tolerance were a file-level attribute. Also, it's very valuable to allow library entries to override file-level attributes. This allows specifying defaults (e.g. default instrument or default organism). It reduces metadata clutter and still lets you mix entries from different sources in the same file. |
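The override idea in the comment above (file-level defaults that individual entries may replace) can be sketched with a layered lookup. This is only an illustration under assumed names: the attribute keys and the `effective_attributes` helper are hypothetical and not part of any proposed specification.

```python
from collections import ChainMap
from typing import Mapping


def effective_attributes(file_level: Mapping[str, str],
                         entry_level: Mapping[str, str]) -> ChainMap:
    """Return a view in which entry-level values shadow file-level defaults."""
    # ChainMap searches its mappings left to right, so the entry wins.
    return ChainMap(dict(entry_level), dict(file_level))


# Hypothetical file-level defaults.
file_defaults = {
    "instrument": "Orbitrap",
    "organism": "Bos taurus",
    "fragment tolerance": "0.5 Da",
}
# An entry merged in from another source overrides only what differs.
entry = {"instrument": "QTOF"}

attributes = effective_attributes(file_defaults, entry)
```

With this layout, an entry that carries no attributes of its own simply inherits every file-level value, which is what keeps the per-entry metadata short.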
yes indeed. I put in precursor mass accuracy and fragment mass accuracy as desirable attributes in the library. Mass tolerance strikes me more as a software parameter and a user preference than an inherent property of a library or a spectrum. But I can see other opinions, too. |
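To make the two-tolerance situation concrete (e.g. 10 ppm data searched against a 0.5 Da library), here is a small sketch that converts either unit to an absolute window and compares two peaks. Using the wider of the two windows is an assumption made for this example only; the thread does not settle how a search tool should combine them, and both function names are hypothetical.

```python
from typing import Tuple

Tolerance = Tuple[float, str]  # (value, unit), unit is "ppm" or "Da"


def tolerance_in_da(mz: float, tolerance: Tolerance) -> float:
    """Convert a tolerance to an absolute m/z window at the given m/z."""
    value, unit = tolerance
    if unit == "ppm":
        return mz * value * 1e-6
    if unit == "Da":
        return value
    raise ValueError(f"unknown tolerance unit: {unit!r}")


def peaks_match(query_mz: float, library_mz: float,
                query_tol: Tolerance, library_tol: Tolerance) -> bool:
    """Compare two peaks using the wider of the two windows (an assumption)."""
    window = max(tolerance_in_da(query_mz, query_tol),
                 tolerance_in_da(library_mz, library_tol))
    return abs(query_mz - library_mz) <= window


# 10 ppm query data against a 0.5 Da library: the library window dominates.
low_res_library = peaks_match(500.0, 500.3, (10, "ppm"), (0.5, "Da"))
# The same peaks against a 10 ppm library are too far apart.
high_res_library = peaks_match(500.0, 500.3, (10, "ppm"), (10, "ppm"))
```

Whether the library-side value is called "mass accuracy" or "tolerance", the search tool needs it from somewhere; storing it with the library avoids asking the user to guess it.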
Hi everyone, I have updated the document to reflect notes that I have been taking and the overall direction of the document, which is metadata at all levels, not just the spectrum level or library level. I defined FOUR levels of metadata:
The spectrum level is somewhat further divided into merged spectra, individual spectra, and common to both merged and individual spectra. I reorganized the document a little. I hope I didn't mess up anything in anyone's view. Have a look and see what you think. |
I'm glad to see chimeric spectra taken into account. What about intact crosslinked peptides, e.g. disulfide bonds, or looplinked or cyclic peptides? Protein-level data is inherently problematic in a peptide-centric spectral library. Example problems: there could be no parent protein (e.g. de novo identification); accession format must be restricted (if it's not restricted, it's a free text field); peptide sequence could appear in several locations in a protein; peptide is usually found in more than one protein; consensus spectra could have conflicting parent proteins; accession formats could differ between library entries and between libraries (and will differ if you compare or merge results with a database search). Shouldn't you also record the FASTA name and version if it was a database search? What about the protein description? I realise this may be an unpopular view, but how about prohibiting protein-level metadata? Or at least move them out of peptide metadata and into the experiment-level metadata. Protein attributes help explain how the peptide was identified rather than being an inherent part of a peptide identification. |
I second @vrkosk's view that protein-level information should not be added directly. It is to be expected that spectral libraries will be merged quite heavily, and protein-level information will then definitely cause issues. If we do not add protein-level information from the start, users and search engines will expect that they need to supply a FASTA database. In my opinion, this is the cleanest solution. |
I also agree: for biological entities, all information should remain at the peptide level, without protein information. |
I agree that we should make sure that cross-linked peptides are supported. I think it is mostly already there with multiple simultaneous identifications already supported. But I suppose we need a flag to distinguish cases when the multiple peptides are chimeric vs. cross-linked. I will add that. Regarding the encoding of proteins, I would disagree with the prevailing thought that we should prohibit protein information. I certainly agree that it should not be required, and I agree it could get a bit complex. But I suspect some people would like to encode that information and it seems to me that providing a standardized optional way of doing that is a better choice than attempting prohibition. |
All,
My idealistic and pedantic side would agree with all of you that protein
information should not be included. I also struggled with these
inconsistency issues when I developed SpectraST.
However, I am with Eric in that keeping the protein information as an
optional field is the way to go. In designing a format for everyone to use,
we should value continuity and practical utility of the format over
semantic purity. Many users of these formats are less into these issues and
would just want a format that serves their needs. After all, that's why we
started off at last year's PSI by deciding to see how we could evolve an
existing format into something better, rather than tearing it up and
starting from philosophical principles about what a library should be.
The fear is that if we define a "perfect" format that is too far from the
existing ones, no one will want to rewrite all their existing code just to fit
the new format. So I would advocate a more flexible format with many
optional fields, which can accommodate most use cases and give all existing
tools an easier time switching over. This means we want it to capture
most of the useful features of the existing formats, and not dismiss them
so easily.
For instance, I am sure NIST puts all those hard-to-decipher fields in
there for a reason. They are there to support some functionalities in their
tools. If we tell them, sorry, you can't have them any more because they are
not supposed to be there, they will just not use our format, or they will
find all kinds of back-door ways to stuff the information back in there.
That's not what we want to see.
Back to the specific point of the protein field. The argument for having a
protein field is for convenience and efficiency. Efficiency is important!
Typically, users who search a peptide spectral library will want to know
what proteins their IDs map to. If the search step is followed by another
tool which will do the peptide-protein mapping, then all is well. (This
step would require the user to supply a FASTA file.) But sometimes it is
not. Remember that sequence search engines naturally provide that protein
information, and that's the benchmark that spectral library engines are
held up to. From my point of view as the developer of SpectraST, I cannot
really tell users that no, a library search is not supposed to tell you
that, you need to install another tool. So practically speaking, the
spectral search engine will need to do the mapping post-search every time a
search is done, not to mention the awkwardness of asking the user to always
specify a FASTA file to accompany the library.
The other use is for filtering. Often a user would want to filter his/her
library by protein(s). If a protein field is present, then it is a simple
thing. If not, then again the user has to look up the protein sequence, get
all the possible peptide sequences of that protein, and then do a search by
peptide.
By the way, SpectraST already has a function to re-map all library entries
to proteins, based on a given FASTA file. If a user downloads a library but
would like to use their own set of protein identifiers, it can do the
re-mapping. It can be used to fix errors in the mapping, or to update to a
new FASTA file. But if you don't allow me to store the protein somewhere in
the library file, then I have to do this mapping every time a search is
done!
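For readers unfamiliar with the re-mapping step described above, here is a deliberately naive sketch of it: read a FASTA file and list, for each library peptide, every protein whose sequence contains it. This is not SpectraST's implementation; the function names are made up for the example, and it assumes unmodified peptide sequences, non-empty FASTA headers, and exact substring matching (no I/L equivalence or enzyme rules).

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {accession: sequence}."""
    proteins: Dict[str, str] = {}
    accession = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the accession.
            accession = line[1:].split()[0]
            proteins[accession] = ""
        elif accession is not None:
            proteins[accession] += line
    return proteins


def map_peptides(peptides: Iterable[str],
                 proteins: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each peptide to all proteins containing it (possibly none)."""
    return {
        peptide: [acc for acc, seq in proteins.items() if peptide in seq]
        for peptide in peptides
    }


# Toy FASTA content; accessions and sequences are invented.
fasta = [">P1 example protein", "MKTAYIAK", "QRQISFVK", ">P2", "AAAYIAKQR"]
mapping = map_peptides(["YIAKQR", "QISFVK", "WWWW"], read_fasta(fasta))
```

The result allows zero, one, or several proteins per peptide, which is the cardinality an optional protein field would need. Storing that result in the library is what saves repeating the scan on every search.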
The reality is that most peptides map to a small number of proteins, and
the mapping is quite stable. We are here to deal with 95% of the cases, not
the 5%. As long as the field is not mandatory, and we allow multiple
proteins, it will serve all purposes and not break anything. Ultimately we
have to trust the tools to use these fields wisely. It will make the tools
run faster and minimize unnecessary repeated tasks.
I understand the argument that the protein is really not part of the
analyte -- it is merely where it occurs in the natural world -- so it
should not be stored with the library entry. We are saying, essentially,
the source or any auxiliary information about the analyte should not be
stored. But then what about organism? What about target/decoy? (The tool
can figure that out from trying to map it to the FASTA! No need to store
that field either.) What about natural/synthetic (the metabolomics people
will want this field)? Oh, look up in some online database instead -- none
of the business of the library. Synonyms of metabolites? Too messy, just
store the InChI key and let the user look it up themselves. None of this
has anything to do with the one-to-one correspondence between the analyte
and its characteristic fragmentation pattern, which is, in a pure sense,
what a library entry should be about. Are we really going to go down the
road of cutting out anything that should not be part of this correspondence?
I think our overriding concern, at this point of the exercise, should be to
ensure that all existing tools are willing to switch over. If we make it
too hard on the tool developer or the user, then we may have a beautiful
and well-designed format that no one will use.
Henry
|
I agree wholeheartedly. We should be flexible and be able to store all existing information from existing libraries (even if we don't necessarily see the use), and by allowing appropriate optional (not mandatory) fields we should be able to achieve this. |
By design, every PSI file format has a way to add additional properties, in some cases through CVParams and UserParams (mzIdentML, mzML) or through optional columns (mzTab). In the current document, we are specifying which fields we want to capture, with their cardinality, and we should guarantee that every section has a mechanism to add additional fields as CVParams. In the specification document, we can add a section about how to report protein information. By design, we should have the flexibility to add such information, and protein information is one of those cases. |
In my view, there would need to be a way to encode protein level information, since some people/tools may need it. What I would avoid is to capture all the underlying complexity related to protein inference. That would ideally go somewhere else. |
To update this issue: All metadata is listed in the following document https://docs.google.com/document/d/1rN5DJSowp2micxlwJQlPxlv39ZiaLEfv/edit. When editing, ALWAYS make sure you are using the "Suggesting" mode. To activate this, click on View > Mode > Suggesting. |
Hi,
could you open the document in "Comment" mode for everyone with the link?
That should allow "Suggesting" mode editing. And/or could you respond to
the "request permission" notification I sent?
Thanks, Yours, Steffen
|
Switched the document so that anyone with the link can suggest, as requested. |
Newer document: https://docs.google.com/document/d/1o11m7grfHvMzfbTozvDY0twJ1g2dzk6I/edit But @RalfG is working on a system to encode this in JSON. The next time we have @RalfG on the Friday call, we should spend some time on this table of information. |
The current MSP and other spectral library formats only capture the metadata around each entry in the library (cluster, consensus spectra, peptides, small molecules), but not the way the spectral library was generated. We need to define a general metadata section at the beginning of the file for this metadata. Similar to mzTab, I think it would be great to have something like:
The MTD prefix helps readers know that this is a metadata field. The second column is the key of the metadata attribute and the third is the value of the metadata field.
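The three-column layout described here (an MTD marker, a key, and a value) could be read and written along the following lines. This is a sketch under assumptions: it takes the columns to be tab-separated as in mzTab, and the keys and values in the sample lines are invented for illustration rather than taken from the issue.

```python
from typing import Dict, Iterable, List


def parse_mtd(lines: Iterable[str]) -> Dict[str, str]:
    """Collect key/value pairs from lines of the form MTD<TAB>key<TAB>value."""
    metadata: Dict[str, str] = {}
    for line in lines:
        # Split into at most three columns so values may contain tabs.
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) == 3 and parts[0] == "MTD":
            metadata[parts[1]] = parts[2]
    return metadata


def format_mtd(metadata: Dict[str, str]) -> List[str]:
    """Write key/value pairs back out as MTD lines."""
    return [f"MTD\t{key}\t{value}" for key, value in metadata.items()]


# Hypothetical header followed by a non-metadata line, which is skipped.
header = [
    "MTD\tversion\t1.0",
    "MTD\ttitle\tExample spectral library",
    "MTD\tsoftware[1]\tExampleTool 0.1",
    "Name: PEPTIDEK/2",
]
metadata = parse_mtd(header)
```

Because every metadata line starts with the same marker, a reader that only wants the entries can skip the header without understanding any of its keys.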
The following fields can be reused from mzTab:
Can we add to this issue all the fields we think are interesting or important to trace?