Batch XIC and spectra extraction in ThermoRawFileParser #7

Open
caetera opened this issue Sep 19, 2019 · 2 comments

caetera commented Sep 19, 2019

Abstract

ThermoRawFileParser is open-source, cross-platform software that converts raw files from Thermo MS instruments to open data formats. Common open MS data formats are either "heavy" (XML-based formats, such as mzML) or too simple to include all necessary metadata (text-based formats, such as MGF). There is a need for a more "light-weight" data representation that can be used in web services and applications. Moreover, it is often necessary to obtain specific information from a data file, such as a set of eXtracted Ion Chromatograms (XICs) or spectra with certain properties, rather than converting the file completely. This project aims to address both needs by developing a tool for batch retrieval of XICs and spectra in JSON format, using the existing codebase of ThermoRawFileParser.

Work plan

1. Group discussion/brain-storming

The following issues have to be discussed; the list can be extended during the discussion. The brain-storming can begin before the meeting.

  • What information must be present in the input (for example, m/z + tolerance for XICs, filter string for spectra)?
  • What is its exact representation (units for m/z, chemical formula format, spectra selection parameters, etc.)?
  • In which format should this information be provided (XML, JSON, text, command-line arguments, etc.)?
  • What is the main priority: human-readability or ease-of-parsing?
  • What kind of (meta)data has to be included in the output (for example, units of time in XICs, representation of numbers, etc.)? We should take special care with the representation of mass spectra, i.e. it should be easy to extend with new metadata.
  • Design concepts for the tool (for example, parallelization, processing multiple raw files, etc.).

As a result, we should come to a detailed specification of input and output and key design concepts for the tool to be developed.
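To make the input questions above concrete, here is a minimal sketch of one possible JSON input for m/z-based XICs and a validating parser. It is written in Python only for brevity (the tool itself is planned in C#). The field names `mz`, `tolerance` and `tolerance_unit`, the class `XicQuery` and the function `parse_xic_queries` are all hypothetical; nothing here is an agreed specification or part of ThermoRawFileParser.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class XicQuery:
    """One XIC request (hypothetical layout, for discussion only)."""
    mz: float
    tolerance: float
    tolerance_unit: str  # "ppm" (relative) or "da" (absolute)

    def window(self):
        """Return the (low, high) m/z bounds covered by this query."""
        if self.tolerance_unit == "ppm":
            delta = self.mz * self.tolerance * 1e-6
        else:
            delta = self.tolerance
        return self.mz - delta, self.mz + delta


def parse_xic_queries(text):
    """Parse a JSON array of query objects, rejecting malformed entries."""
    queries = []
    for index, item in enumerate(json.loads(text)):
        unit = str(item.get("tolerance_unit", "ppm")).lower()
        if unit not in ("ppm", "da"):
            raise ValueError(f"query {index}: unknown tolerance unit {unit!r}")
        mz = float(item["mz"])
        tolerance = float(item["tolerance"])
        if mz <= 0 or tolerance < 0:
            raise ValueError(f"query {index}: m/z must be > 0 and tolerance >= 0")
        queries.append(XicQuery(mz, tolerance, unit))
    return queries


# Assumed example input: one relative (ppm) and one absolute (Da) tolerance.
EXAMPLE_INPUT = """
[
  {"mz": 500.0, "tolerance": 10, "tolerance_unit": "ppm"},
  {"mz": 445.12, "tolerance": 0.01, "tolerance_unit": "da"}
]
"""
```

Stating the tolerance unit explicitly in every query is one way to avoid ambiguity between ppm and Da; whether a default unit should exist at all is one of the points to settle in the discussion.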

2. Drafting the roadmap of development

We should start by reviewing the existing codebase of ThermoRawFileParser. Then, using the specification developed in stage 1, we should draw up a roadmap of the features to be implemented, ordered from the most important (easy to implement) to the least important (complicated). For example, we will first focus on batch retrieval of m/z-based XICs, with a future plan to include chemical-formula-based ones.
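As an illustration of the first roadmap item, the following Python sketch builds an m/z-based XIC from scans that are already in memory. It assumes each scan is a `(retention_time, mz_values, intensities)` tuple and that the signal inside the tolerance window is summed; both are assumptions made for this example, and the function name `extract_xic` is hypothetical. Reading scans from a raw file is deliberately left out.

```python
from typing import Iterable, List, Sequence, Tuple

# Assumed in-memory scan layout: (retention time, m/z values, intensities).
Scan = Tuple[float, Sequence[float], Sequence[float]]


def extract_xic(scans: Iterable[Scan], target_mz: float,
                tolerance_ppm: float) -> List[Tuple[float, float]]:
    """Return (retention time, summed intensity) pairs for one target m/z."""
    delta = target_mz * tolerance_ppm * 1e-6
    low, high = target_mz - delta, target_mz + delta
    xic = []
    for retention_time, mzs, intensities in scans:
        # Sum every peak whose m/z falls inside the tolerance window;
        # scans without a matching peak contribute a zero point.
        total = sum(
            (intensity for mz, intensity in zip(mzs, intensities)
             if low <= mz <= high),
            0.0,
        )
        xic.append((retention_time, total))
    return xic
```

Summing is only one option; taking the most intense peak in the window is an equally plausible choice and could be exposed as a parameter.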

3. Building a working prototype

We will start with a prototype tool that implements the most important features. Depending on the number of participants, the work can be done in parallel, with small groups focusing on isolated tasks such as input parsing, output formatting, XIC creation, and spectra filtering and retrieval.

4. Improving the prototype

Depending on the available time/resources, we will continue adding new features to the tool according to the roadmap. Additionally, we have to agree on how to continue with the development after the end of the hackathon.

5. (Bonus) Working on JSON representation of mass spectral data

If time allows, we can discuss and draft a format for the JSON representation of a complete raw file. Part of it (the representation of mass spectra) should already be discussed during stage 1, and we should build on that to formulate a draft of a JSON-based MS data format. It is unlikely that we will be able to provide the complete specification during the hackathon; however, we can present it as a draft open for public discussion.
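As a starting point for that discussion, here is a minimal Python sketch of how a single spectrum could be serialized to JSON. Every key name (`scan`, `msLevel`, `retentionTime`, `mz`, `intensity`, `metadata`) and the function `spectrum_to_json` are hypothetical and do not come from any existing format. The sketch keeps units explicit and collects optional fields in a separate `metadata` object, so that new metadata can be added without changing the core layout.

```python
import json


def spectrum_to_json(scan_number, ms_level, retention_time_min,
                     mzs, intensities, **metadata):
    """Serialize one spectrum to a JSON string (hypothetical layout)."""
    if len(mzs) != len(intensities):
        raise ValueError("mz and intensity arrays must have equal length")
    record = {
        "scan": scan_number,
        "msLevel": ms_level,
        # The unit is stored next to the value, not implied by the format.
        "retentionTime": {"value": retention_time_min, "unit": "minute"},
        "mz": list(mzs),
        "intensity": list(intensities),
        # Optional, extensible part: unknown keys can be ignored by readers.
        "metadata": metadata,
    }
    return json.dumps(record, sort_keys=True)
```

For example, `spectrum_to_json(1, 2, 12.5, [100.0], [1.0], precursorMz=450.2)` places `precursorMz` under `metadata` and leaves the core keys unchanged.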

The results of the hackathon, i.e. the specifications from stage 1, the roadmap from stage 2, and the code from stages 3 and 4, will be published on GitHub, possibly as a separate branch inside the ThermoRawFileParser repository.

The draft of the JSON-based MS data format should be published as a separate repository, open for comments and suggestions.

Technical details

  • Main programming language: C# (Mono and .NET)
  • Raw files from different (Orbitrap-based) MS instruments will be used for testing

Contact information

Vladimir Gorshkov, University of Southern Denmark, vgor(at)bmb.sdu.dk
Niels Hulstaert, Ghent University, niels.hulstaert(at)ugent.vib.be
Yasset Perez-Riverol, EMBL-EBI, ypriverol(at)gmail.com

cpanse commented Nov 4, 2019

We (@rolivella and I) developed a simple prototype that performs this task during a core4life micro hackathon (about 4 h). The idea was to have proof-of-concept code for feeding XICs into the http://qcloud2.crg.eu/ system.

https://github.com/coreforlife/c4lProteomics/tree/master/RawFileReader-XIC-json

caetera commented Nov 5, 2019

Hi @cpanse, thank you for letting us know about your prototype. We will look into your code for inspiration.
