
First version of bulk export capability #3446

Open
gaffer01 opened this issue Oct 10, 2024 · 0 comments
Labels: enhancement (new feature or request), parent-issue (an issue that is or should be split into multiple sub-issues)

gaffer01 commented Oct 10, 2024

Background

This is split off from #1393. This issue is to specify what's needed for a minimum viable version of bulk export.

Description

We want a user to be able to submit a request for an entire Sleeper table to be written out to Parquet files, with one output file per leaf partition. Each file should contain all of the data for its leaf partition in sorted order.
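The output invariant can be sketched as follows. This is a minimal illustration, not Sleeper's actual implementation: the partition tree is stood in for by a sorted list of hypothetical split points, rows are plain (key, value) tuples, and the returned dict stands in for one Parquet file per leaf partition.

```python
import bisect

def export_table(rows, split_points):
    """Route every row to its leaf partition, then sort each partition.

    Leaf partitions are the key ranges delimited by split_points
    (a hypothetical stand-in for Sleeper's partition tree). The
    returned dict maps partition index -> sorted rows, modelling
    one sorted output file per leaf partition.
    """
    outputs = {i: [] for i in range(len(split_points) + 1)}
    for row in rows:
        # bisect_right finds which key range the row's key falls into.
        outputs[bisect.bisect_right(split_points, row[0])].append(row)
    # Each partition's file holds all of its data in sorted order.
    return {i: sorted(part) for i, part in outputs.items()}
```

For example, with a single split point at key 4, rows with keys below or equal to 4 land in the first partition's file and the rest in the second, each sorted independently.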

Analysis

There will need to be sub-issues for the different components of this. The following list describes some of the work involved:

  • A new optional stack called the BulkExportStack. This will need to contain a queue for the export request. This request will be picked up by a lambda, which will act similarly to the query planner, i.e. break the request up into sub-export requests, one for each leaf partition. We will then need an ECS cluster to run tasks to process these sub requests. The scaling up of tasks can happen in the same way we scale up tasks in other situations, e.g. compactions.
  • A container to receive messages from the queue and execute the job, i.e. perform a query over the whole leaf partition that exports all of its data.
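The fan-out step performed by the planning lambda can be sketched as below. All names here (`Partition`, `SubExportRequest`, `plan_bulk_export`, the output path layout) are hypothetical and chosen for illustration only; Sleeper's real partition and message types differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Partition:
    """Hypothetical stand-in for a node in the partition tree."""
    partition_id: str
    is_leaf: bool

@dataclass(frozen=True)
class SubExportRequest:
    """Hypothetical message sent to the sub-export queue."""
    export_id: str
    table_name: str
    partition_id: str
    output_file: str

def plan_bulk_export(export_id, table_name, partitions, output_prefix):
    """Split a whole-table export into one sub-request per leaf partition.

    Mirrors the query planner's behaviour described above: non-leaf
    partitions are skipped, and each leaf gets exactly one output file.
    """
    return [
        SubExportRequest(
            export_id=export_id,
            table_name=table_name,
            partition_id=p.partition_id,
            # One output Parquet file per leaf partition.
            output_file=f"{output_prefix}/{p.partition_id}.parquet",
        )
        for p in partitions
        if p.is_leaf
    ]
```

Each resulting sub-request would then be placed on a queue for an ECS task to process, so the fan-out width scales with the number of leaf partitions rather than table size.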

There will be other future improvements to this capability, such as the ability to specify additional filters to restrict the data that is returned, and execution of the output using DataFusion. But those will be added once the basic functionality exists.

Sub tasks
