feat: writer for mz_parquet v0.2 format #188

Open · wants to merge 1 commit into base: dotnetcore

Conversation

@lazear (Contributor) commented Nov 15, 2024

This PR implements a new writer for the parquet output format, in line with the specification used here: https://github.com/lazear/mz_parquet

This is a "long" format for mass spectrometry data, where each row in the output file represents a single ion with its associated m/z and intensity from the raw data. This enables highly efficient downstream applications (XIC generation, searching data, etc.), since filters can be constructed over one column at a time (e.g. mz, or precursor_mz) and can leverage SIMD.

The other benefit of the long format is flexibility: writers can add additional columns, and downstream software that doesn't need a column can simply ignore it (it is never read from disk).
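As a toy illustration of why the long layout is easy to filter (this is not the parquet writer itself, and the values are made up; only the column roles mirror the spec), compare filtering flat rows against filtering nested per-scan arrays:

```python
# Toy in-memory illustration of the two layouts; the real data lives in
# parquet files, but the shape of the filtering work is the same.

# Long format: one row per ion -> flat columns, one predicate per column.
long_rows = [
    # (rt, mz, intensity)
    (12.1, 284.0105, 1e4),
    (12.2, 500.2500, 2e4),
    (13.9, 284.0110, 3e4),
]

# Wide format: one row per scan, m/z values nested in a list.
wide_rows = [
    # (scan_start_time, [mz values])
    (12.1, [284.0105, 500.2500]),
    (13.9, [284.0110]),
]

mz, rt = 284.0106, 12.7
tol = mz * 20e-6

# Long: a simple conjunction over flat columns (vectorizable, SIMD-friendly).
long_hits = [r for r in long_rows
             if rt - 1 <= r[0] <= rt + 1 and mz - tol <= r[1] <= mz + tol]

# Wide: must descend into the nested list for every candidate scan.
wide_hits = [r for r in wide_rows
             if rt - 1 <= r[0] <= rt + 1
             and any(mz - tol <= x <= mz + tol for x in r[1])]

print(len(long_hits), len(wide_hits))
```

Both layouts return the same matches; the difference is that the long filter is two range predicates over flat columns, while the wide filter has to unpack a nested list per scan.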

Note that this implementation (raw -> mz_parquet) produces mzparquet files with slightly different results than the raw -> mzML -> mz_parquet conversion path, likely because of differences in how centroid streams are created. Please feel free to suggest the best way to pull out the raw data; I'm happy to change it.

@timosachsenberg commented

@ypriverol would be interesting to benchmark XIC extraction (e.g. randomly placed m/z + rt ranges).

@lazear (Contributor, Author) commented Nov 16, 2024

> @ypriverol would be interesting to benchmark XIC extraction (e.g. randomly placed m/z + rt ranges).

```python
# github.com/lazear/mz_parquet commit: ef5c124f4890c0f025544f8fe7c6e09a33a813e8
# long format: cargo run --release LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzML --long
# wide format: cargo run --release LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzML
import polars as pl
import time

mz = 284.0106
tol = mz * 20E-6
rt = 12.7

t0 = time.time()
df = pl.scan_parquet("LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzparquet.long").filter(
    pl.col("rt").is_between(rt - 1, rt + 1)
    & pl.col("mz").is_between(mz - tol, mz + tol)
).collect()
t1 = time.time()

print(f"long: {t1 - t0:0.4}s {len(df)} rows")

t0 = time.time()
df = pl.scan_parquet("LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzparquet.wide").filter(
    pl.col("scan_start_time").is_between(rt - 1, rt + 1)
    & pl.col("mz")
        .list.eval(pl.element().is_between(mz - tol, mz + tol))
        .list.any()
).collect()
t1 = time.time()

print(f"wide: {t1 - t0:0.4}s {len(df)} rows")
```

```
long: 0.01192s 1493 rows
wide: 0.6345s 1493 rows
```

The files end up being the same size, since the physical layout of bytes in the parquet files is nearly identical between long and wide (in the wide format, the mz column is a list of m/z values for that scan), but the ability to construct more efficient predicate-pushdown filters dramatically speeds up reading from disk. Writing a low-level parquet writer/reader for the long format is also much simpler: working with repetition levels, arrays, and maps is pretty complicated when you're dealing directly with Dremel encoding.
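The pushdown win comes from parquet's per-row-group column statistics: a reader can skip entire row groups whose [min, max] range for mz cannot contain the query. A minimal sketch of that skipping logic (the row groups and counts below are made up, and this is not polars internals, just the mechanism):

```python
# Sketch of row-group skipping via min/max statistics, the mechanism behind
# parquet predicate pushdown. Row-group boundaries and row counts are made up.
row_groups = [
    {"mz_min": 200.0, "mz_max": 283.9, "rows": 100_000},
    {"mz_min": 283.9, "mz_max": 284.5, "rows": 100_000},
    {"mz_min": 284.5, "mz_max": 2000.0, "rows": 100_000},
]

mz = 284.0106
tol = mz * 20e-6
lo, hi = mz - tol, mz + tol

# Only row groups whose statistics overlap [lo, hi] need to be read at all.
to_read = [g for g in row_groups if g["mz_min"] <= hi and lo <= g["mz_max"]]
skipped = sum(g["rows"] for g in row_groups) - sum(g["rows"] for g in to_read)

print(len(to_read), skipped)
```

In the wide format the mz column is a list per scan, so each row group's min/max spans every scan's full m/z range, making the statistics far less selective for a narrow m/z window.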

If you load the files into memory and repeatedly query random m/z and RT ranges, the long format still wins (there is much less work to do):

```python
import polars as pl
import time
import numpy as np

all_long = 0
all_wide = 0

long = pl.read_parquet("LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzparquet.long")
wide = pl.read_parquet("LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzparquet.wide")

for _ in range(100):
    mz = np.random.uniform(200, 2000)
    tol = mz * 20E-6
    rt = np.random.uniform(0, 200)

    t0 = time.time()
    df = long.filter(
        pl.col("rt").is_between(rt - 1, rt + 1)
        & pl.col("mz").is_between(mz - tol, mz + tol)
    )
    t1 = time.time()
    all_long += t1 - t0

    t0 = time.time()
    df = wide.filter(
        pl.col("scan_start_time").is_between(rt - 1, rt + 1)
        & pl.col("mz")
            .list.eval(pl.element().is_between(mz - tol, mz + tol))
            .list.any()
    )
    t1 = time.time()
    all_wide += t1 - t0

print(f"long: {all_long / 100}")
print(f"wide: {all_wide / 100}")
```

```
long: 0.0079762601852417
wide: 0.13979199886322022
```

@timosachsenberg commented Nov 18, 2024

Stupid beginner question, but would you also need a collect() in the later benchmark?
Update: never mind, got it :)

Review comment (Collaborator) on:

```csharp
var msOrder = raw.GetScanEventForScanNumber(scanNumber).MSOrder;

if (msOrder == MSOrderType.Ms)
else
```

The third (else) branch seems to do the same as the second one (else if). Is this intended?


Review comment (Collaborator) on:

```csharp
CentroidStream centroidStream = new CentroidStream();

//check for FT mass analyzer data
// Pull out m/z and intensity values
```

Do we always want to have centroids written to the file? Should we use the same approach as the other file formats: by default only centroids, and profile data if -p is provided?


Review comment (Collaborator) on:

```csharp
// map last (msOrder - 1) -> scan number (e.g. mapping precursors)
// note, this assumes time dependence of MS1 -> MS2 -> MSN
var last_scans = new Dictionary<int, uint>();
```

Do we want to detect the precursor scan the same way as for the other formats, or keep this (much simpler) way?
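The comment in the diff above describes the mapping the writer maintains: as scans arrive in time order, remember the last scan number seen at each MS level, so an MSn scan's precursor scan is the most recent scan at level n-1. A Python sketch of that logic (the scan numbers and levels here are invented for illustration):

```python
# Sketch of the last-scan precursor mapping described in the C# comment:
# for an MSn scan, the precursor scan is the most recent scan at level n-1.
# Assumes scans arrive in time order (MS1 -> MS2 -> MSn dependence).
scans = [
    # (scan_number, ms_level)
    (1, 1),
    (2, 2),
    (3, 2),
    (4, 1),
    (5, 2),
]

last_scans = {}    # ms_level -> last scan number seen at that level
precursor_of = {}  # MSn scan number -> its precursor scan number

for scan_number, ms_level in scans:
    if ms_level > 1:
        precursor_of[scan_number] = last_scans[ms_level - 1]
    last_scans[ms_level] = scan_number

print(precursor_of)
```

This is why the approach breaks down if the acquisition order ever violates the assumed MS1 -> MS2 -> MSn time dependence, which is what the review comment is questioning.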

Review comment (Collaborator) on:

```csharp
throw new NotImplementedException();
if ((int)msOrder > 1)
{
    var rx = scanFilter.GetReaction(0);
```

This will only read the first reaction, i.e. for anything other than MS2 without supplemental activation it won't be accurate. Is this intended?


Review comment (Collaborator) on:

```csharp
//var output = outputDirectory + "//" + Path.GetFileNameWithoutExtension(sourceRawFileName);
// this assumes symmetrical quad window
isolation_lower = (float)(rx.PrecursorMass - rx.IsolationWidth / 2);
```

Do we want to keep the result consistent with the other formats?

Review comment (Collaborator) on:

```csharp
// {
//     scan.Noises = dummyVal;
// }
var trailer = raw.GetTrailerExtraInformation(scanNumber);
```

It is probably better to use the Trailer object, similarly to the mzML implementation.

@caetera added the enhancement (New feature or request) label on Jan 3, 2025