Parquet needs some standardization #19

ypriverol · 2018-09-21T08:22:26Z

We need to standardize the Parquet format so that other people can understand the files.

bgruening · 2021-02-13T12:21:41Z

Yeah, that would be nice, and we should give it a proper name :)

sorenwacker · 2021-05-06T17:54:25Z

I like 'parquet', as it is pretty clear what library to use to open it.

Regarding column names, I had a few thoughts:

The column name Mass or Masses is technically wrong, since the values are m/z, not masses. Or do you convert the m/z values into masses internally?
Intensities could be Intensity, even though it holds an array.
RetentionTime was used in mzXML files; in mzML files I have seen it as ScanTime, which is a bit more general and may be more accurate, since it does not imply that a chromatographic step was used.
Columns like TIC are convenient but somewhat redundant: the TIC could be calculated in one line of code if the data were in long format.

df_long.groupby('scan_time_min').sum().plot(y='intensity')
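To make the long-format idea concrete, here is a minimal sketch with made-up numbers. The column names `scan_time_min`, `mz`, and `intensity` and the `scans` list are assumptions for illustration only, not the current schema of the Parquet output. Each row holds one peak, so the TIC per scan is just a grouped sum.

```python
import pandas as pd

# Hypothetical scans: (scan time in minutes, m/z array, intensity array).
# These values are invented purely for illustration.
scans = [
    (0.5, [100.0, 200.0], [10.0, 20.0]),
    (1.0, [100.0, 200.0, 300.0], [5.0, 15.0, 30.0]),
]

# Long format: one row per (scan, peak) pair instead of one row per scan
# with array-valued columns.
rows = [
    {"scan_time_min": t, "mz": m, "intensity": i}
    for t, mzs, intensities in scans
    for m, i in zip(mzs, intensities)
]
df_long = pd.DataFrame(rows)

# TIC per scan: sum of all peak intensities sharing the same scan time.
tic = df_long.groupby("scan_time_min")["intensity"].sum()
print(tic)
```

The trade-off is that the scan time is repeated for every peak, although a columnar format such as Parquet typically compresses repeated values well.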

I am quite new to metabolomics/proteomics, though. I am looking at the problem more from a Python-biased data science perspective.

ypriverol added the enhancement New feature or request label Sep 21, 2018

ypriverol self-assigned this Sep 21, 2018

bernt-matthias mentioned this issue Feb 12, 2021

make ThermoRawFileParser run parallel galaxyproteomics/tools-galaxyp#560

Merged
