feat: writer for mz_parquet v0.2 format #188

Open · wants to merge 1 commit into base: dotnetcore

Conversation

@lazear (Contributor) commented Nov 15, 2024

This PR implements a new writer for the parquet output format, in line with the specification used here: https://github.com/lazear/mz_parquet

This is a "long" format for mass spectrometry data, where each row in the output file represents a single ion with its associated m/z and intensity from the raw data. This enables highly efficient downstream applications (XIC generation, searching data, etc.), since filters can be constructed over one column at a time (e.g. mz, or precursor_mz) and can leverage SIMD.

The other benefit of the long format is flexibility: writers can add additional columns, and downstream software that doesn't need a column can simply ignore it (it is never read from disk).
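As a toy illustration of why the long layout is easy to filter (this is not the parquet writer itself, and the values are made up; only the column roles mirror the spec), compare filtering flat rows against filtering nested per-scan arrays:

```python
# Toy in-memory illustration of the two layouts; the real data lives in
# parquet files, but the shape of the filtering work is the same.

# Long format: one row per ion -> flat columns, one predicate per column.
long_rows = [
    # (rt, mz, intensity)
    (12.1, 284.0105, 1e4),
    (12.2, 500.2500, 2e4),
    (13.9, 284.0110, 3e4),
]

# Wide format: one row per scan, m/z values nested in a list.
wide_rows = [
    # (scan_start_time, [mz values])
    (12.1, [284.0105, 500.2500]),
    (13.9, [284.0110]),
]

mz, rt = 284.0106, 12.7
tol = mz * 20e-6

# Long: a simple conjunction over flat columns (vectorizable, SIMD-friendly).
long_hits = [r for r in long_rows
             if rt - 1 <= r[0] <= rt + 1 and mz - tol <= r[1] <= mz + tol]

# Wide: must descend into the nested list for every candidate scan.
wide_hits = [r for r in wide_rows
             if rt - 1 <= r[0] <= rt + 1
             and any(mz - tol <= x <= mz + tol for x in r[1])]

print(len(long_hits), len(wide_hits))
```

Both layouts return the same matches; the difference is that the long filter is two range predicates over flat columns, while the wide filter has to unpack a nested list per scan.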

Note that this implementation (raw -> mz_parquet) produces mzparquet files with slightly different results than the raw -> mzML -> mz_parquet conversion path, likely because of differences in how centroid streams are created. Please feel free to suggest the best way to pull out the raw data; I'm happy to change it.

@timosachsenberg commented

@ypriverol would be interesting to benchmark XIC extraction (e.g. randomly placed m/z + rt ranges).

@lazear (Contributor, Author) commented Nov 16, 2024

> @ypriverol would be interesting to benchmark XIC extraction (e.g. randomly placed m/z + rt ranges).

```python
# github.com/lazear/mz_parquet commit: ef5c124f4890c0f025544f8fe7c6e09a33a813e8
# long format: cargo run --release LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzML --long
# wide format: cargo run --release LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzML
import polars as pl
import time

mz = 284.0106
tol = mz * 20E-6
rt = 12.7

t0 = time.time()
df = pl.scan_parquet("LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzparquet.long").filter(
    pl.col("rt").is_between(rt - 1, rt + 1)
    & pl.col("mz").is_between(mz - tol, mz + tol)
).collect()
t1 = time.time()

print(f"long: {t1 - t0:0.4}s {len(df)} rows")

t0 = time.time()
df = pl.scan_parquet("LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzparquet.wide").filter(
    pl.col("scan_start_time").is_between(rt - 1, rt + 1)
    & pl.col("mz")
        .list.eval(pl.element().is_between(mz - tol, mz + tol))
        .list.any()
).collect()
t1 = time.time()

print(f"wide: {t1 - t0:0.4}s {len(df)} rows")
```

```
long: 0.01192s 1493 rows
wide: 0.6345s 1493 rows
```

The files end up being the same size, since the physical layout of bytes in the parquet files is nearly identical between long and wide (in the wide format, the mz column is a list of m/z values for that scan), but the ability to construct more efficient predicate-pushdown filters dramatically speeds up reading from disk. Writing a low-level parquet writer/reader for the long format is also much simpler: working with repetition levels, arrays, and maps is pretty complicated when you're dealing directly with Dremel encoding.
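The pushdown win comes from parquet's per-row-group column statistics: a reader can skip entire row groups whose [min, max] range for mz cannot contain the query. A minimal sketch of that skipping logic (the row groups and counts below are made up, and this is not polars internals, just the mechanism):

```python
# Sketch of row-group skipping via min/max statistics, the mechanism behind
# parquet predicate pushdown. Row-group boundaries and row counts are made up.
row_groups = [
    {"mz_min": 200.0, "mz_max": 283.9, "rows": 100_000},
    {"mz_min": 283.9, "mz_max": 284.5, "rows": 100_000},
    {"mz_min": 284.5, "mz_max": 2000.0, "rows": 100_000},
]

mz = 284.0106
tol = mz * 20e-6
lo, hi = mz - tol, mz + tol

# Only row groups whose statistics overlap [lo, hi] need to be read at all.
to_read = [g for g in row_groups if g["mz_min"] <= hi and lo <= g["mz_max"]]
skipped = sum(g["rows"] for g in row_groups) - sum(g["rows"] for g in to_read)

print(len(to_read), skipped)
```

In the wide format the mz column is a list per scan, so each row group's min/max spans every scan's full m/z range, making the statistics far less selective for a narrow m/z window.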

If you load the files into memory and repeatedly query random m/z and RT ranges, the long format still wins (there is much less work to do):

```python
import polars as pl
import time
import numpy as np

all_long = 0
all_wide = 0

long = pl.read_parquet("LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzparquet.long")
wide = pl.read_parquet("LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzparquet.wide")

for _ in range(100):
    mz = np.random.uniform(200, 2000)
    tol = mz * 20E-6
    rt = np.random.uniform(0, 200)

    t0 = time.time()
    df = long.filter(
        pl.col("rt").is_between(rt - 1, rt + 1)
        & pl.col("mz").is_between(mz - tol, mz + tol)
    )
    t1 = time.time()
    all_long += t1 - t0

    t0 = time.time()
    df = wide.filter(
        pl.col("scan_start_time").is_between(rt - 1, rt + 1)
        & pl.col("mz")
            .list.eval(pl.element().is_between(mz - tol, mz + tol))
            .list.any()
    )
    t1 = time.time()
    all_wide += t1 - t0

print(f"long: {all_long / 100}")
print(f"wide: {all_wide / 100}")
```

```
long: 0.0079762601852417
wide: 0.13979199886322022
```

@timosachsenberg commented Nov 18, 2024

Stupid beginner question, but would you also need a collect() in the later benchmark?
Update: never mind, got it :)

Review comment (Collaborator) on:

```csharp
var msOrder = raw.GetScanEventForScanNumber(scanNumber).MSOrder;

if (msOrder == MSOrderType.Ms)
else
```

The third (else) branch seems to do the same as the second one (else if). Is this intended?


Review comment (Collaborator) on:

```csharp
CentroidStream centroidStream = new CentroidStream();

//check for FT mass analyzer data
// Pull out m/z and intensity values
```

Do we always want to have centroids written to the file? Should we use the same approach as the other file formats: by default only centroids, and profile data if -p is provided?


Review comment (Collaborator) on:

```csharp
// map last (msOrder - 1) -> scan number (e.g. mapping precursors)
// note, this assumes time dependence of MS1 -> MS2 -> MSN
var last_scans = new Dictionary<int, uint>();
```

Do we want to detect the precursor scan the same way as for the other formats, or keep this (much simpler) way?
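The comment in the diff above describes the mapping the writer maintains: as scans arrive in time order, remember the last scan number seen at each MS level, so an MSn scan's precursor scan is the most recent scan at level n-1. A Python sketch of that logic (the scan numbers and levels here are invented for illustration):

```python
# Sketch of the last-scan precursor mapping described in the C# comment:
# for an MSn scan, the precursor scan is the most recent scan at level n-1.
# Assumes scans arrive in time order (MS1 -> MS2 -> MSn dependence).
scans = [
    # (scan_number, ms_level)
    (1, 1),
    (2, 2),
    (3, 2),
    (4, 1),
    (5, 2),
]

last_scans = {}    # ms_level -> last scan number seen at that level
precursor_of = {}  # MSn scan number -> its precursor scan number

for scan_number, ms_level in scans:
    if ms_level > 1:
        precursor_of[scan_number] = last_scans[ms_level - 1]
    last_scans[ms_level] = scan_number

print(precursor_of)
```

This is why the approach breaks down if the acquisition order ever violates the assumed MS1 -> MS2 -> MSn time dependence, which is what the review comment is questioning.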

Review comment (Collaborator) on:

```csharp
throw new NotImplementedException();
if ((int)msOrder > 1)
{
    var rx = scanFilter.GetReaction(0);
```

This will only read the first reaction, i.e. for anything other than MS2 without supplemental activation it won't be accurate. Is this intended?


Review comment (Collaborator) on:

```csharp
//var output = outputDirectory + "//" + Path.GetFileNameWithoutExtension(sourceRawFileName);
// this assumes symmetrical quad window
isolation_lower = (float)(rx.PrecursorMass - rx.IsolationWidth / 2);
```

Do we want to keep the result consistent with the other formats?

Review comment (Collaborator) on:

```csharp
// {
//     scan.Noises = dummyVal;
// }
var trailer = raw.GetTrailerExtraInformation(scanNumber);
```

It is probably better to use the Trailer object, similarly to the mzML implementation.

@caetera added the enhancement (New feature or request) label on Jan 3, 2025