Background
Required for:
Description
We'd like to upgrade to a version of EMR that uses a version of Hadoop where the AWS integration uses AWS SDK v2.
This seems to be necessary in order to upgrade Sleeper to use AWS SDK v2.
Analysis
Here's the documentation for versions of EMR, and the corresponding versions of Spark and Hadoop, which must match:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html
When we tried to upgrade to AWS SDK v2 before EMR used a version of Hadoop with v2 of the SDK, it made our jars too big to fit in a lambda. This is because we use the Hadoop AWS integration to interact with Parquet files in S3, which is how we store all Sleeper table data. That means we need the Hadoop AWS integration in a lot of lambdas. We also need the version of Hadoop we use to match the version used in EMR, so that we can use EMR for bulk import and still interact with Sleeper records in Parquet with the same code we use elsewhere.