Upgrade to a version of EMR/Hadoop using AWS SDK v2 #3897

patchwork01 · 2024-12-10T12:16:28Z

Background

Required for:

Standardise on using AWS SDK version 2 everywhere #1389

Description

We'd like to upgrade to a version of EMR whose bundled version of Hadoop uses AWS SDK v2 for its AWS integration.

This seems to be necessary in order to upgrade Sleeper to use AWS SDK v2.

Analysis

Here's the documentation listing each version of EMR with the versions of Spark and Hadoop it includes. The versions of Spark and Hadoop we use must match the ones in the EMR release:

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html

When we tried to upgrade to AWS SDK v2 before EMR offered a version of Hadoop built on v2 of the SDK, our jars became too big to fit in a Lambda. This is because we use the Hadoop AWS integration to interact with Parquet files in S3, which is how we store all Sleeper table data, so a lot of our Lambdas need the Hadoop AWS integration. The version of Hadoop we use also needs to match the version used in EMR, so that we can use EMR for bulk import and still interact with Sleeper records in Parquet with the same code we use elsewhere.
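As a rough illustration of the two constraints above, here is a minimal sketch of a check that a candidate Hadoop version both uses AWS SDK v2 and matches the Hadoop version in an EMR release. It assumes Hadoop 3.4.0 is the first release whose S3A connector is built on AWS SDK v2. The class name, method names and version strings are hypothetical and are not part of Sleeper.

```java
import java.util.Arrays;

public class HadoopSdkCheck {

    // Assumption: Hadoop 3.4.0 is the first release whose S3A connector uses AWS SDK v2.
    private static final int[] FIRST_SDK_V2_HADOOP = {3, 4, 0};

    /** Parses "3.3.6" or "3.4.0-amzn-1" into its numeric major, minor and patch parts. */
    public static int[] parse(String version) {
        // Drop any vendor suffix after the first hyphen, e.g. "-amzn-1".
        String numeric = version.split("-", 2)[0];
        String[] parts = numeric.split("\\.");
        int[] result = new int[3]; // missing parts default to 0
        for (int i = 0; i < result.length && i < parts.length; i++) {
            result[i] = Integer.parseInt(parts[i]);
        }
        return result;
    }

    /** True if the given Hadoop version is at or above the assumed first SDK v2 release. */
    public static boolean usesAwsSdkV2(String hadoopVersion) {
        return Arrays.compare(parse(hadoopVersion), FIRST_SDK_V2_HADOOP) >= 0;
    }

    /** True if the project's Hadoop version has the same major.minor.patch as the one in EMR. */
    public static boolean matchesEmr(String projectHadoopVersion, String emrHadoopVersion) {
        return Arrays.equals(parse(projectHadoopVersion), parse(emrHadoopVersion));
    }

    public static void main(String[] args) {
        // Hypothetical version strings, for illustration only.
        String projectHadoop = "3.3.6";
        String emrHadoop = "3.4.0-amzn-1";
        System.out.println("Project Hadoop uses SDK v2: " + usesAwsSdkV2(projectHadoop));
        System.out.println("EMR Hadoop uses SDK v2: " + usesAwsSdkV2(emrHadoop));
        System.out.println("Versions match: " + matchesEmr(projectHadoop, emrHadoop));
    }
}
```

The real Hadoop versions for each EMR release should be taken from the EMR release guide linked above rather than from the example strings here.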

patchwork01 added the version-upgrades label (Issues to upgrade dependencies) Dec 10, 2024

patchwork01 mentioned this issue Dec 10, 2024

Standardise on using AWS SDK version 2 everywhere #1389

Open

patchwork01 modified the milestone: 0.30.0 Dec 10, 2024
