Upgrade to a version of EMR/Hadoop using AWS SDK v2 #3897

Open
patchwork01 opened this issue Dec 10, 2024 · 0 comments
Labels
version-upgrades Issues to upgrade dependencies


Background

Required for:

Description

We'd like to upgrade to an EMR release that bundles a version of Hadoop whose AWS integration uses AWS SDK v2.

This seems to be necessary in order to upgrade Sleeper to use AWS SDK v2.

Analysis

Here's the documentation listing each EMR release and the versions of Spark and Hadoop it bundles. The versions we build against must match these:

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html

We previously tried to upgrade to AWS SDK v2 before EMR used a version of Hadoop built on v2 of the SDK, and it made our jars too big to fit in a Lambda. This is because we use the Hadoop AWS integration to interact with Parquet files in S3, which is how we store all Sleeper table data, so we need the Hadoop AWS integration in a lot of Lambdas. We also need the version of Hadoop we use to match the version used in EMR, so that we can use EMR for bulk import and still interact with Sleeper records in Parquet with the same code we use elsewhere.
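
As a rough illustration of the two constraints above, here is a minimal Python sketch that checks a candidate EMR release against them. It is not part of Sleeper. The function names are hypothetical, and the version strings are supplied by the caller after looking them up in the release guide. It assumes Apache Hadoop 3.4.0 is the first upstream release whose S3A connector is built on AWS SDK v2; Amazon's EMR builds of Hadoop may differ from upstream, so treat the threshold as a starting point to verify.

```python
import re

# Assumption: upstream Apache Hadoop moved its S3A connector to AWS SDK v2 in 3.4.0.
# EMR ships its own Hadoop builds, so verify this against the EMR release notes.
SDK_V2_MIN_HADOOP = (3, 4, 0)


def parse_version(version):
    """Return the leading major.minor.patch of a version string as a tuple of ints.

    Any suffix after the three numeric parts (for example a vendor build tag)
    is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognised version: {version!r}")
    return tuple(int(part) for part in match.groups())


def check_upgrade(project_hadoop, emr_hadoop):
    """Return a list of problems with a candidate EMR release (empty if none).

    project_hadoop: the Hadoop version the project builds against.
    emr_hadoop: the Hadoop version bundled in the candidate EMR release.
    """
    problems = []
    project = parse_version(project_hadoop)
    emr = parse_version(emr_hadoop)
    if project != emr:
        problems.append(
            f"Hadoop mismatch: project uses {project_hadoop}, EMR bundles {emr_hadoop}"
        )
    if emr < SDK_V2_MIN_HADOOP:
        problems.append(
            f"EMR Hadoop {emr_hadoop} is older than the assumed first SDK v2 release"
        )
    return problems


if __name__ == "__main__":
    # Example inputs only; look up real values in the EMR release guide.
    for problem in check_upgrade("3.3.6", "3.3.6-amzn-1"):
        print(problem)
```

An empty result means the candidate release satisfies both conditions under the stated assumption. Comparing only the leading `major.minor.patch` is a deliberate simplification so that a vendor suffix on the EMR side does not count as a mismatch.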
