"Pre-Filter" Before Deserialization #569

puppydog1973 · 2024-11-12T17:24:03Z

I am admittedly new to Parquet files and a little confused by how differently they approach data compared with what I'm used to in the .NET world.

So with that in mind, here is what I'm trying to accomplish. We have an API that needs to retrieve (read only / select) information specific to a client / user. That's it: SELECT * FROM Table WHERE CreatedDate > StartDate AND CreatedDate < EndDate

One of the constraints being imposed on us is that we HAVE to take this from the Gold Layer in Databricks.
Another is that we have to stay in the .NET world, so using Python is not an option.

The data team put a sample table out there for us to access and test with from our .NET API.

Using the Parquet.Net library I was easily able to deserialize (ParquetSerializer.DeserializeAsync) the Parquet file into a List, which I can then query with LINQ to extract the date range I want.
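For reference, a minimal sketch of that load-everything-then-filter approach. The `SampleRow` class, its properties, the file path, and the date window are all hypothetical placeholders, not the actual table; it also assumes Parquet.Net 4.x or later and a non-nullable `CreatedDate` column.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Parquet.Serialization;

// Hypothetical date window; in the real API these would come from the request.
DateTime startDate = new DateTime(2024, 1, 1);
DateTime endDate = new DateTime(2024, 2, 1);

// "sample.parquet" is a placeholder path for the sample table.
using FileStream stream = File.OpenRead("sample.parquet");

// Every row in the file is materialized as an object before any filtering happens.
IList<SampleRow> rows = await ParquetSerializer.DeserializeAsync<SampleRow>(stream);

// The date filter only runs after the full load has completed.
List<SampleRow> inRange = rows
    .Where(r => r.CreatedDate > startDate && r.CreatedDate < endDate)
    .ToList();

Console.WriteLine($"{inRange.Count} of {rows.Count} rows fall inside the window");

// Hypothetical row type; property names are assumed to match the Parquet column names.
public class SampleRow
{
    public DateTime CreatedDate { get; set; }
    public string? ClientId { get; set; }
}
```

The cost is dominated by the `DeserializeAsync` call, since the LINQ filter has nothing to work with until the whole file is in memory.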

However, the deserialization took 30 seconds to load this sample table into memory. I would imagine that with production data it might take several minutes.

Is there a way to "pre-filter" and only deserialize the records I want?

When I tried using the low-level reader, it felt like I was looping through all the columns in a row group, then having to retrieve the data for each column and "rebuild" the data structure myself. I fully recognize that I may be using the low-level reader wrong, so please help this newbie understand how to do it better.
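The low-level loop described above looks roughly like the following sketch. It assumes Parquet.Net 4.x or later (where `DataField` lives in `Parquet.Schema`) and uses a placeholder file path; it only reports how many values each column holds rather than rebuilding rows.

```csharp
using System;
using System.IO;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

// "sample.parquet" is a placeholder path for the sample table.
using FileStream stream = File.OpenRead("sample.parquet");
using ParquetReader reader = await ParquetReader.CreateAsync(stream);

// All leaf columns declared in the file schema.
DataField[] fields = reader.Schema.GetDataFields();

for (int i = 0; i < reader.RowGroupCount; i++)
{
    using ParquetRowGroupReader rowGroup = reader.OpenRowGroupReader(i);

    foreach (DataField field in fields)
    {
        // Each call returns one whole column of this row group as an array.
        DataColumn column = await rowGroup.ReadColumnAsync(field);
        Console.WriteLine($"row group {i}, column {field.Name}: {column.Data.Length} values");
    }
}
```

Because the data comes back one column at a time, turning it into row objects means indexing into each column's array at the same position, which is the "rebuilding" step mentioned above.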

Since the API is ultimately returning JSON, I am willing to consider an option where Parquet.Net returns JSON if that would eliminate the need for deserialization.

TIA

Replies: 0 comments
