feat: writer for mz_parquet v0.2 format #188
base: dotnetcore
Conversation
@ypriverol It would be interesting to benchmark XIC extraction (e.g. randomly placed m/z + rt ranges).
# github.com/lazear/mz_parquet commit: ef5c124f4890c0f025544f8fe7c6e09a33a813e8
# long format: cargo run --release LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzML --long
# wide format: cargo run --release LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzML
import polars as pl
import time
mz = 284.0106
tol = mz * 20E-6
rt = 12.7
t0 = time.time()
df = pl.scan_parquet("LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzparquet.long").filter(
pl.col("rt").is_between(rt - 1, rt + 1)
& pl.col("mz").is_between(mz - tol, mz + tol)
).collect()
t1 = time.time()
print(f"long: {t1 - t0:0.4}s {len(df)} rows")
t0 = time.time()
df = pl.scan_parquet("LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzparquet.wide").filter(
pl.col("scan_start_time").is_between(rt - 1, rt + 1)
& pl.col("mz")
.list.eval(pl.element().is_between(mz - tol, mz + tol))
.list.any()
).collect()
t1 = time.time()
print(f"wide: {t1 - t0:0.4}s {len(df)} rows")
The files end up being the same size, since the physical layout of bytes in the parquet files is nearly identical between long and wide.

If you load the files into memory and repeatedly query random m/z and RT ranges, the long format still wins (much less work to do):

import polars as pl
import time
import numpy as np
all_long = 0
all_wide = 0
long = pl.read_parquet("LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzparquet.long")
wide = pl.read_parquet("LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzparquet.wide")
for _ in range(100):
    mz = np.random.uniform(200, 2000)
    tol = mz * 20E-6
    rt = np.random.uniform(0, 200)
    t0 = time.time()
    df = long.filter(
        pl.col("rt").is_between(rt - 1, rt + 1)
        & pl.col("mz").is_between(mz - tol, mz + tol)
    )
    t1 = time.time()
    all_long += t1 - t0
    t0 = time.time()
    df = wide.filter(
        pl.col("scan_start_time").is_between(rt - 1, rt + 1)
        & pl.col("mz")
            .list.eval(pl.element().is_between(mz - tol, mz + tol))
            .list.any()
    )
    t1 = time.time()
    all_wide += t1 - t0
print(all_long / 100)
print(all_wide / 100)

long: 0.0079762601852417
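To sanity-check the point above about the two layouts ending up nearly identical on disk, the parquet metadata can be inspected directly. A quick sketch, assuming pyarrow is installed and the same file names as in the benchmark:

import pyarrow.parquet as pq

for path in [
    "LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzparquet.long",
    "LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzparquet.wide",
]:
    meta = pq.ParquetFile(path).metadata
    print(path, "-", meta.num_row_groups, "row groups,", meta.num_rows, "rows")
    # compressed size of each column chunk in the first row group
    rg = meta.row_group(0)
    for i in range(rg.num_columns):
        col = rg.column(i)
        print(f"  {col.path_in_schema}: {col.total_compressed_size} bytes")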
Stupid beginner question, but would you also need a collect() in the later benchmark?
var msOrder = raw.GetScanEventForScanNumber(scanNumber).MSOrder;

if (msOrder == MSOrderType.Ms)
else
The third (else) branch seems to do the same as the second one (else if). Is this intended?
CentroidStream centroidStream = new CentroidStream();

//check for FT mass analyzer data
// Pull out m/z and intensity values
Do we always want to have centroids written to the file? Should we use the same approach as for the other file formats: by default only centroids, and profile data only if -p is provided?
// map last (msOrder - 1) -> scan number (e.g. mapping precursors)
// note, this assumes time dependence of MS1 -> MS2 -> MSN
var last_scans = new Dictionary<int, uint>();
Do we want to detect precursor scan the same way as for the other formats or keep this (much simpler) way?
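For reference, the logic the code comment describes amounts to remembering the most recent scan number at each MS level and assigning a level-n scan the last level n-1 scan as its precursor. A minimal sketch in Python (illustration only, not the PR's C# implementation; the scan sequence is a made-up example, assumed sorted by acquisition time):

# hypothetical scan sequence: (scan_number, ms_order) in acquisition-time order
scans = [(1, 1), (2, 2), (3, 2), (4, 1), (5, 2), (6, 3)]

last_scans = {}     # MS level -> most recent scan number seen at that level
precursor_of = {}   # scan number -> inferred precursor scan number

for scan_number, ms_order in scans:
    if ms_order > 1:
        # the precursor is the last scan seen one MS level lower
        precursor_of[scan_number] = last_scans.get(ms_order - 1)
    last_scans[ms_order] = scan_number

print(precursor_of)  # {2: 1, 3: 1, 5: 4, 6: 5}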
throw new NotImplementedException();

if ((int)msOrder > 1)
{
    var rx = scanFilter.GetReaction(0);
This will only read the first reaction, i.e. for anything other than MS2 without the supplemental activation it won't be accurate. Is this intended?
//var output = outputDirectory + "//" + Path.GetFileNameWithoutExtension(sourceRawFileName);

// this assumes symmetrical quad window
isolation_lower = (float)(rx.PrecursorMass - rx.IsolationWidth / 2);
Do we want to keep the result consistent with other formats?
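For context, the symmetric assumption centers the isolation window on the precursor m/z; if the instrument reports an isolation window offset, the center shifts. A small sketch of the arithmetic (Python, purely illustrative; the offset parameter is an assumption, not a specific CommonCore property):

def isolation_window(precursor_mz, width, offset=0.0):
    # offset shifts the window center relative to the precursor m/z;
    # offset = 0.0 reproduces the symmetric assumption in the PR
    center = precursor_mz + offset
    return center - width / 2, center + width / 2

print(isolation_window(500.25, 2.0))        # (499.25, 501.25)
print(isolation_window(500.25, 2.0, 0.25))  # (499.5, 501.5)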
// {
//     scan.Noises = dummyVal;
// }
var trailer = raw.GetTrailerExtraInformation(scanNumber);
It is probably better to use the Trailer object, similarly to the mzML implementation.
This PR implements a new writer for the parquet output format, in line with the specification used here: https://github.com/lazear/mz_parquet
This is a "long" format for mass spectrometry data, where each row in the output file represents an ion with an associated m/z & intensity in the raw data. This enables highly efficient downstream applications for XIC generation, searching data, etc, since filters can be constructed for one 'column' (e.g.
mz
, orprecursor_mz
) at a time and leverage SIMD.The other benefit of the long format is that it is flexible - additional columns can be added by writers, and simply ignored (never read from disk) by downstream software if they don't need that column.
Note that this implementation (raw -> mz_parquet) produces mzparquet files with slightly different results than the raw -> mzML -> mz_parquet converter, likely because of differences in how centroid streams are created. Please feel free to suggest the best way to pull out the raw data, and I'm happy to change it.
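One way to quantify those differences would be to compare peak counts per scan between the two outputs. A rough polars sketch (file names are placeholders, and the scan identifier column name "scan" is an assumption about the mz_parquet schema):

import polars as pl

direct = pl.scan_parquet("sample.direct.mzparquet")      # raw -> mz_parquet (this PR)
via_mzml = pl.scan_parquet("sample.via_mzml.mzparquet")   # raw -> mzML -> mz_parquet

def peaks_per_scan(lf):
    # number of (m/z, intensity) rows written for each scan
    return lf.group_by("scan").agg(pl.len().alias("n_peaks"))

diff = (
    peaks_per_scan(direct)
    .join(peaks_per_scan(via_mzml), on="scan", suffix="_mzml")
    .with_columns((pl.col("n_peaks") - pl.col("n_peaks_mzml")).alias("delta"))
    .filter(pl.col("delta") != 0)
    .sort("delta")
    .collect()
)
print(diff)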