
fix: Reuse online store for materialization writes #166

Merged

merged 2 commits into master from reuse_connections on Jan 31, 2025

Conversation

omirandadev
Collaborator

What this PR does / why we need it:

This PR allows instances of the OnlineStore to be reused from within the Spark materialization engine, so a new connection to the store does not need to be created for each write.
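A minimal sketch of the pattern this enables (illustrative names only, not the actual Feast engine code): create the store once per worker and reuse it for every batch write, instead of reconnecting on each write.

_store = None

def get_online_store(make_store):
    """Lazily create and cache one store instance per worker process."""
    global _store
    if _store is None:
        _store = make_store()  # connection is established only once
    return _store

def write_batches(batches, make_store):
    store = get_online_store(make_store)
    for batch in batches:
        store.write(batch)  # every write reuses the same connection

class FakeStore:
    """Stand-in for a real online store, for demonstration only."""
    def __init__(self):
        print("connection opened")
    def write(self, batch):
        print(f"wrote {len(batch)} rows")

write_batches([[1, 2], [3]], FakeStore)
write_batches([[4]], FakeStore)  # no second "connection opened"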

Which issue(s) this PR fixes:

Misc

"""Load pandas df to online store"""
for pdf in iterator:
pdf_row_count = pdf.shape[0]
start_time = time.time()
# convert to pyarrow table
if pdf_row_count == 0:
print("INFO!!! Dataframe has 0 records to process")
return
continue
Collaborator

Shouldn't we break instead of continue? If the row count is zero, we should stop looping, which is essentially what the return was doing.

Collaborator Author

The reason I made this change is that I'm not certain an empty pdf necessarily implies the remaining pdfs in the partition are also empty. The pull_latest_from_table_or_query method applies a filter within its SQL query, namely

WHERE {timestamp_field} BETWEEN TIMESTAMP('{start_date_str}') AND TIMESTAMP('{end_date_str}')

and that filter makes me think it is possible for one of the pdfs to be empty.

Collaborator

We can leave it as return. Empty partitions in Spark may lead to empty data frames in the iterator (when you have less data than the number of partitions).

If you have a data frame with X partitions, mapInPandas converts each partition to an Iterator[DataFrame]. Each DataFrame in the iterator has a default length of 10,000 rows. If the partition has more than 10K records, the iterator lets you iterate over each DataFrame in turn.
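For readers, a minimal runnable sketch of this batching behavior (not the PR's code; the batch size is governed by spark.sql.execution.arrow.maxRecordsPerBatch, which defaults to 10,000):

from typing import Iterator

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(25_000).repartition(2)  # 2 partitions of ~12.5K rows each

def count_batches(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Each partition arrives as a stream of DataFrames of up to 10K rows.
    # An early return here skips only the rest of the current partition;
    # the other partitions are processed independently.
    for batch in iterator:
        yield pd.DataFrame({"batch_rows": [batch.shape[0]]})

# Expect rows like 10000 and 2500 per partition (exact splits may vary).
df.mapInPandas(count_batches, schema="batch_rows long").show()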

Collaborator Author

Okay, changed it back.

@omirandadev omirandadev requested a review from piket January 30, 2025 22:54
@omirandadev omirandadev merged commit b5f14f2 into master Jan 31, 2025
23 checks passed
@omirandadev omirandadev deleted the reuse_connections branch January 31, 2025 20:39