"Pre-Filter" Before Deserialization #569
Unanswered
puppydog1973 asked this question in Q&A
Replies: 0 comments
I am admittedly new to Parquet files and a little confused by how differently they approach data compared to what I'm used to in the .NET world.
So with that in mind, here is what I'm trying to accomplish. We have an API that needs to retrieve (read-only / select) information specific to a client / user. That's it. Essentially: SELECT * FROM Table WHERE CreatedDate > StartDate AND CreatedDate < EndDate
One of the constraints imposed on us is that we HAVE to take this from the Gold layer in Databricks.
Another is that we have to stay in the .NET world, so using Python is not an option.
The data team put out a sample table for us to access and test with from our .NET API.
Using the Parquet.Net library I was easily able to deserialize (ParquetSerializer.DeserializeAsync) the Parquet file into a List, which I can then query with LINQ to extract the date range I want.
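Roughly, the approach described above looks like the sketch below. The `SampleRow` class, its properties, and the `LoadRangeAsync` name are placeholders for illustration only, not the real table schema, and the sketch assumes Parquet.Net 4.x or later.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Parquet.Serialization;

// Hypothetical row type; the real table has its own columns.
public class SampleRow
{
    public DateTime? CreatedDate { get; set; }
    public string? ClientId { get; set; }
}

public static class FullLoad
{
    public static async Task<List<SampleRow>> LoadRangeAsync(
        string path, DateTime start, DateTime end)
    {
        await using FileStream fs = File.OpenRead(path);

        // Materializes every row of every row group before any filtering happens.
        IList<SampleRow> all = await ParquetSerializer.DeserializeAsync<SampleRow>(fs);

        // The date filter only runs once the whole file is already in memory.
        return all
            .Where(r => r.CreatedDate > start && r.CreatedDate < end)
            .ToList();
    }
}
```

The cost here is dominated by the `DeserializeAsync` call, since the LINQ filter cannot reduce what gets read from the file.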
However, it took 30 seconds to load this sample table into memory (the deserialization step). I would imagine that with production data the deserialization might take several minutes.
Is there a way to "pre-filter" and only deserialize the records I want?
When I tried using the low-level reader, it felt like I was looping through all the columns in a row group, then having to retrieve the data for each column and "rebuild" the data structure myself. I fully recognize that I may be using the low-level reader wrong, so please help this newbie understand how to do it better.
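As a rough illustration of a narrower version of that low-level loop, the sketch below reads only the date column from each row group and records which rows fall in the range. It assumes Parquet.Net 4.x or later and a column literally named `CreatedDate`. The `DateColumnScan` and `FindRowsAsync` names are made up for this example. It is a sketch, not a confirmed recommended pattern.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

public static class DateColumnScan
{
    // Returns, for each row group, the row positions whose date falls
    // strictly between start and end. Only the date column is read.
    public static async Task<Dictionary<int, List<int>>> FindRowsAsync(
        string path, DateTime start, DateTime end)
    {
        var hits = new Dictionary<int, List<int>>();

        await using FileStream fs = File.OpenRead(path);
        using ParquetReader reader = await ParquetReader.CreateAsync(fs);

        // "CreatedDate" is an assumed column name.
        DataField dateField = reader.Schema.GetDataFields()
            .First(f => f.Name == "CreatedDate");

        for (int g = 0; g < reader.RowGroupCount; g++)
        {
            using ParquetRowGroupReader rowGroup = reader.OpenRowGroupReader(g);
            DataColumn dates = await rowGroup.ReadColumnAsync(dateField);

            var rows = new List<int>();
            for (int i = 0; i < dates.Data.Length; i++)
            {
                // GetValue boxes the element, so this works for both
                // DateTime[] and DateTime?[] column arrays.
                if (dates.Data.GetValue(i) is DateTime d && d > start && d < end)
                    rows.Add(i);
            }

            // Row groups with no matches never need their other columns read.
            if (rows.Count > 0)
                hits[g] = rows;
        }

        return hits;
    }
}
```

With the matching positions in hand, the remaining columns would only need `ReadColumnAsync` calls for the row groups that appear in the result, which is the "pre-filter" effect being asked about.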
Since the API ultimately returns JSON, I am willing to consider an option where Parquet.Net returns JSON directly, if that would eliminate the need for deserialization.
TIA