Extend Alluxio support to Iceberg connector #20602

Merged
raunaqmorarka merged 1 commit into trinodb:master from the oss_alluxio_cache_2 branch on Feb 14, 2024

Conversation

amoghmargoor
Member

@amoghmargoor amoghmargoor commented Feb 6, 2024

Description

This patch extends the Alluxio caching changes that currently exist for the Delta Lake connector to also cover the Iceberg connector.

Additional context and related issues

Fixes: #19829

Part of #20550

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Iceberg
* Improve performance of scans by adding the ability to cache data files on local SSDs ({issue}`20602`)


cla-bot bot commented Feb 6, 2024

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Amogh Margoor.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@amoghmargoor amoghmargoor changed the title Enable Alluxio for Iceberg connector Extend Alluxio support to Iceberg connector Feb 6, 2024

cla-bot bot commented Feb 6, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@github-actions github-actions bot added the iceberg Iceberg connector label Feb 6, 2024
@mosabua
Member

mosabua commented Feb 6, 2024

@cla-bot check


cla-bot bot commented Feb 6, 2024

The cla-bot has been summoned, and re-checked this pull request!

@mosabua
Member

mosabua commented Feb 6, 2024

CLA is good now. Thanks @amoghmargoor

Contributor

@jkylling jkylling left a comment

In the Delta Lake connector we added a variant of the base connector test with caching, but it might be unnecessary.

I assume the product test and the Iceberg tests are still work in progress. The rest looks good.

P.S.
If it is of help, these commands were the inner part of my slow development loop when developing the product tests:

./mvnw -pl ':trino-product-tests-launcher'  install -nsu -DskipTests -Dskip.npm -Dskip.yarn -T 1C
./mvnw -pl ':trino-product-tests'  install -nsu -DskipTests -Dskip.npm -Dskip.yarn -T 1C
testing/bin/ptl test run --environment multinode-minio-data-lake-caching -- -t io.trino.tests.product.deltalake.TestDeltaLakeAlluxioCaching

You can probably extend the EnvMultinodeMinioDataLakeCaching for the product test.

@raunaqmorarka
Member

raunaqmorarka commented Feb 8, 2024

When I tried to benchmark this PR, I ran into planning timeouts on the coordinator. This likely indicates an issue with caching during metadata reads on the coordinator. I disabled caching on the coordinator in a branch and ran the benchmarks on that.
[Alluxio iceberg sf1k parquet partitioned.pdf](https://github.com/trinodb/trino/files/14213187/Alluxio.iceberg.sf1k.parquet.partitioned.pdf)
Alluxio iceberg sf1k parquet unpartitioned.pdf
Partitioned parquet sf1k
Screenshot 2024-02-08 at 11 50 56 PM
Unpartitioned parquet sf1k
Screenshot 2024-02-08 at 11 50 26 PM
Significant improvements in wall time and some improvement in CPU time

@amoghmargoor
Member Author

I have rebased it and added a couple of product tests. It's still enabled on the coordinator, which I will try to remove now.

@amoghmargoor
Member Author

@raunaqmorarka I have removed the Alluxio module from being bound on the coordinator for now. However, the Alluxio configs still need to be bound on the coordinator, otherwise they throw a "property not used" exception. Please take a look at that logic. There might be a few product test errors; ignore those for now, I will fix them.
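
To illustrate the split described here, below is a minimal sketch, not the actual wiring in this PR. It assumes a hypothetical `NodeCacheWiringSketch` class and placeholder property names (`cache.enabled`, `cache.directories`). The point it tries to show: every node still recognizes the cache properties, so none is reported as unused, while the cache itself is only installed on workers.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; this is not Trino's actual module or config binding code.
class NodeCacheWiringSketch
{
    // Placeholder property names, chosen only for illustration.
    static final Set<String> KNOWN_PROPERTIES = Set.of("cache.enabled", "cache.directories");

    // Runs on every node, coordinator included, so that cache properties
    // present in the catalog file are always recognized.
    static void validate(Map<String, String> properties)
    {
        for (String name : properties.keySet()) {
            if (!KNOWN_PROPERTIES.contains(name)) {
                throw new IllegalArgumentException("Configuration property '" + name + "' was not used");
            }
        }
    }

    // The cache is installed only when it is enabled and the node is a worker.
    static boolean installCache(Map<String, String> properties, boolean coordinator)
    {
        validate(properties);
        boolean enabled = Boolean.parseBoolean(properties.getOrDefault("cache.enabled", "false"));
        return enabled && !coordinator;
    }
}
```

With the same properties, `installCache(props, true)` returns false on the coordinator and `installCache(props, false)` returns true on a worker, and neither call rejects the cache properties.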

@github-actions github-actions bot added the delta-lake Delta Lake connector label Feb 12, 2024
@amoghmargoor amoghmargoor force-pushed the oss_alluxio_cache_2 branch 6 times, most recently from f5bde1f to 0bcfea4 Compare February 13, 2024 11:54
Member

@raunaqmorarka raunaqmorarka left a comment

lgtm % minor comments

public Optional<String> getCacheKey(TrinoInputFile delegate)
        throws IOException
{
    return Optional.of(delegate.location().path());
Member

I'm assuming that the Iceberg spec guarantees that all data files in Iceberg are immutable.

Member Author

Spec says: "Once written, data and metadata files are immutable until they are deleted".
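
As a rough illustration of why immutability matters for a path-only cache key, here is a minimal sketch. `PathKeyedCacheSketch` and its methods are hypothetical stand-ins, not Trino or Alluxio APIs. It assumes, as the quoted spec sentence states, that a file at a given path is never rewritten in place.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical in-memory stand-in for a file cache; not a Trino or Alluxio API.
class PathKeyedCacheSketch
{
    private final Map<String, byte[]> cache = new HashMap<>();
    private int loads;

    // Same idea as the reviewed hunk: the path alone identifies the content.
    // This is only safe if a file is never modified after it is written.
    static Optional<String> cacheKey(String path)
    {
        return Optional.of(path);
    }

    // Returns cached bytes when the key is known, otherwise calls the loader once.
    byte[] read(String path, Function<String, byte[]> loader)
    {
        String key = cacheKey(path).orElseThrow();
        return cache.computeIfAbsent(key, ignored -> {
            loads++;
            return loader.apply(path);
        });
    }

    int loads()
    {
        return loads;
    }
}
```

A second `read` of the same path is served from the cache without calling the loader again. If files could change in place, the key would also need something like a modification time or length to avoid returning stale bytes.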

@amoghmargoor amoghmargoor force-pushed the oss_alluxio_cache_2 branch 2 times, most recently from b1d9d72 to 5783ce1 Compare February 13, 2024 14:21
@raunaqmorarka raunaqmorarka merged commit 102ce9d into trinodb:master Feb 14, 2024
64 checks passed
@github-actions github-actions bot added this to the 439 milestone Feb 14, 2024
Labels
cla-signed delta-lake Delta Lake connector iceberg Iceberg connector performance
Development

Successfully merging this pull request may close these issues.

Enabling Alluxio for Iceberg connector
6 participants