Parquet needs some standardization #19

Open
ypriverol opened this issue Sep 21, 2018 · 2 comments
Labels
enhancement New feature or request

Comments

@ypriverol
Collaborator

We need to standardize our Parquet format so that other people can understand the file layout.

@bgruening
Contributor

Yeah, that would be nice and give it a proper name :)

@sorenwacker

sorenwacker commented May 6, 2021

I like 'parquet', as it is pretty clear what library to use to open it.

Regarding column names, I had a few thoughts:

  • The column name Mass or Masses is technically wrong, as the values are m/z values. Or do you convert the m/z values into masses internally?

  • Intensities could be Intensity even if it is an array.

  • RetentionTime was used in mzXML files; in mzML files I have seen it as ScanTime, which is a bit more general and may be more accurate. It would not imply that a chromatographic step was used.

  • Things like TIC are maybe convenient, but also somewhat redundant: it could be calculated easily in one line of code if the data were in long format.

    df_long.groupby('scan_time_min').sum().plot(y='intensity')
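
To make the long-format idea concrete, here is a minimal sketch of how a table with one row per scan and array-valued columns could be flattened to one row per peak, after which the TIC is a plain group-by sum. The column names (`scan_time_min`, `mz`, `intensity`) and the toy values are assumptions for illustration only, not the project's actual schema.

```python
import pandas as pd

# Hypothetical wide-format table: one row per scan, array-valued columns.
# Column names and values are illustrative, not an agreed standard.
df_wide = pd.DataFrame({
    "scan_time_min": [0.1, 0.2],
    "mz": [[100.0, 150.0, 200.0], [100.0, 150.0]],
    "intensity": [[10.0, 20.0, 30.0], [5.0, 15.0]],
})

# Long format: one row per (scan, peak) pair.
rows = [
    (scan_time, mz, intensity)
    for scan_time, mzs, intensities in df_wide.itertuples(index=False)
    for mz, intensity in zip(mzs, intensities)
]
df_long = pd.DataFrame(rows, columns=["scan_time_min", "mz", "intensity"])

# Total ion current per scan: sum of all peak intensities in that scan.
tic = df_long.groupby("scan_time_min")["intensity"].sum()
```

Calling `tic.plot()` would then draw the chromatogram (this needs matplotlib installed), so a stored TIC column would indeed be derivable rather than essential.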

I am quite new to metabolomics/proteomics, though. I am looking at the problem more from a data-science, Python-biased perspective.
