
First version of bulk export capability #3446

Open
gaffer01 opened this issue Oct 10, 2024 · 0 comments
Labels: enhancement (new feature or request), parent-issue (an issue that is or should be split into multiple sub-issues)

gaffer01 commented Oct 10, 2024

Background

This is split off from #1393. This issue is to specify what's needed for a minimum viable version of bulk export.

Description

We want a user to be able to submit a request for an entire Sleeper table to be written out to Parquet files, with one output file per leaf partition. Each file should contain all of the data for its leaf partition in sorted order.
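The output invariant can be sketched as follows. This is a minimal illustration, not Sleeper's actual implementation: the partition tree is stood in for by a sorted list of hypothetical split points, rows are plain (key, value) tuples, and the returned dict stands in for one Parquet file per leaf partition.

```python
import bisect

def export_table(rows, split_points):
    """Route every row to its leaf partition, then sort each partition.

    Leaf partitions are the key ranges delimited by split_points
    (a hypothetical stand-in for Sleeper's partition tree). The
    returned dict maps partition index -> sorted rows, modelling
    one sorted output file per leaf partition.
    """
    outputs = {i: [] for i in range(len(split_points) + 1)}
    for row in rows:
        # bisect_right finds which key range the row's key falls into.
        outputs[bisect.bisect_right(split_points, row[0])].append(row)
    # Each partition's file holds all of its data in sorted order.
    return {i: sorted(part) for i, part in outputs.items()}
```

For example, with a single split point at key 4, rows with keys below or equal to 4 land in the first partition's file and the rest in the second, each sorted independently.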

Analysis

There will need to be sub-issues for the different components of this. The following list describes some of the work involved:

  • A new optional stack called the BulkExportStack. This will need to contain a queue for the export request. This request will be picked up by a lambda, which will act similarly to the query planner, i.e. break the request up into sub-export requests, one for each leaf partition. We will then need an ECS cluster to run tasks to process these sub requests. The scaling up of tasks can happen in the same way we scale up tasks in other situations, e.g. compactions.
  • A container to receive messages from the queue and execute the job, i.e. perform a query over the whole leaf partition that exports all of its data.
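The fan-out step performed by the planning lambda can be sketched as below. All names here (`Partition`, `SubExportRequest`, `plan_bulk_export`, the output path layout) are hypothetical and chosen for illustration only; Sleeper's real partition and message types differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Partition:
    """Hypothetical stand-in for a node in the partition tree."""
    partition_id: str
    is_leaf: bool

@dataclass(frozen=True)
class SubExportRequest:
    """Hypothetical message sent to the sub-export queue."""
    export_id: str
    table_name: str
    partition_id: str
    output_file: str

def plan_bulk_export(export_id, table_name, partitions, output_prefix):
    """Split a whole-table export into one sub-request per leaf partition.

    Mirrors the query planner's behaviour described above: non-leaf
    partitions are skipped, and each leaf gets exactly one output file.
    """
    return [
        SubExportRequest(
            export_id=export_id,
            table_name=table_name,
            partition_id=p.partition_id,
            # One output Parquet file per leaf partition.
            output_file=f"{output_prefix}/{p.partition_id}.parquet",
        )
        for p in partitions
        if p.is_leaf
    ]
```

Each resulting sub-request would then be placed on a queue for an ECS task to process, so the fan-out width scales with the number of leaf partitions rather than table size.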

There will be other future improvements to this capability, such as the ability to specify additional filters to restrict the data that is returned, and execution of the output using DataFusion. But those will be added once the basic functionality exists.

Sub tasks
