Skip to content

add_docs_and_backport_max_files_rewrite_option #13082

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
May 20, 2025

Conversation

coderfender
Copy link
Contributor

@coderfender coderfender commented May 16, 2025

Backport changes to flink 1.19 , 2.0 and spark 3.4, 4.0 and update spark docs to provide more info about this option
Original PR : #12824

@coderfender coderfender changed the title backport_max_files_rewrite_option add_docs_and_backport_max_files_rewrite_option May 16, 2025
@github-actions github-actions bot added the docs label May 16, 2025
@coderfender coderfender marked this pull request as ready for review May 16, 2025 23:31
@coderfender
Copy link
Contributor Author

@pvary , @RussellSpitzer . Please take a look whenever you get a chance

| `delete-ratio-threshold` | 0.3 | Minimum deletion ratio that needs to be associated with a data file for it to be considered for rewriting |
| `output-spec-id` | current partition spec id | Identifier of the output partition spec. Data will be reorganized during the rewrite to align with the output partitioning. |
| `remove-dangling-deletes` | false | Remove dangling position and equality deletes after rewriting. A delete file is considered dangling if it does not apply to any live data files. Enabling this will generate an additional commit for the removal. |
| Name | Default Value | Description |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think most of the changes is just formatting. Could we revert the formatting only changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did try to revert the documentation changes but for some reason git picks up changes which it isnt supposed to

@pvary
Copy link
Contributor

pvary commented May 19, 2025

@coderfender: When backporting a PR, it is usually helpful if the contributor helps the reviewer by mentioning any specific changes which are specifically needed above the original ones, or say that this is a clean backport.

Am I right, that this is a clean backport, and only the documentation changes are specific?

Thanks for the backport,
Peter

@coderfender coderfender force-pushed the add_rewrite_data_files_option_backport branch from 95feed1 to e2223b7 Compare May 19, 2025 17:45
Iceberg tables with partition statistics files are not currently supported for path rewrite.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary newline change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah seems like my IDE is adding these

Copy link
Contributor Author

@coderfender coderfender May 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pvary . Retried the same approach through online git file edit. from UI as well. It seems like adding the new option is automatically showing line 1059 as a change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to take guidance to fix this

Copy link
Member

@RussellSpitzer RussellSpitzer May 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an extra " " at the end of this line that you are removing.

"for path rewrite " vs "for path rewrite"

Nvm, seems like it vanished? You could always reset the file and apply the other changes to see if it persists

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double never mind, there it is
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RussellSpitzer I tried editing the line from Github repo , sublime etc but for some reason my commits wouldnt work :|

@coderfender coderfender force-pushed the add_rewrite_data_files_option_backport branch from d6b12cc to aa8936b Compare May 19, 2025 20:37
* for more details.
*
* @param maxFilesToRewrite maximum files to rewrite
*/
public Builder maxFilesToRewrite(int maxFilesToRewrite) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pvary . The only change unrelated to backporting and docs is the method signature JAVA docs I added for flink API

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep the endline fix as well

@RussellSpitzer
Copy link
Member

I'll let @pvary comment if he has any final comments, otherwise I'll merge tomorrow

@coderfender
Copy link
Contributor Author

coderfender commented May 19, 2025

@RussellSpitzer , @pvary the endline fix is unfortunately not recognized by Git (but it would let me commit after removing the whitespace though) . Ref : b87bbfc , c1939bb

@pvary pvary merged commit 09a5317 into apache:main May 20, 2025
36 checks passed
@pvary
Copy link
Contributor

pvary commented May 20, 2025

Merged to main.
Thanks for the backport @coderfender and @RussellSpitzer for the review!
Also thanks for the fight with the whitespaces 😄

@coderfender, @RussellSpitzer: Do you think that the maxRewriteBytes configuration option would be useful for Spark as well?

/**
* Configures the maximum byte size of the rewrites for one scheduled compaction. This could be
* used to limit the resources used by the compaction.
*
* @param newMaxRewriteBytes to limit the size of the rewrites
*/
public Builder maxRewriteBytes(long newMaxRewriteBytes) {

If you find the feature would be useful in Spark as well, then we could move the filtering from Flink to the core. @coderfender : If you have time, I would be happy to review it, or we could find someone else who has more time to do this.

@coderfender
Copy link
Contributor Author

@pvary , I am not sure if I understand this right ? I initially wrote it as part of SparkRewriteActions but after your refactor is merged, I had to move it upstream to the core planner logic. Are you saying I should add this option similar to flink maintenance API and perhaps add some more tests ? Please let me know if I am missing something here .

Thank you for merging in to main @pvary and thank you for the review / approval @RussellSpitzer

@pvary
Copy link
Contributor

pvary commented May 20, 2025

There is a feature in Flink compaction, where the user can limit the number of bytes to rewrite in a compaction (maxRewriteBytes). The question is: Would this be interesting for Spark as well? Then we should move the feature to the core so both Flink and Spark could use it.

@coderfender
Copy link
Contributor Author

coderfender commented May 20, 2025

Ah now I get it . I was confused between maxRewriteBytes vs maxFilesToRewrite :) . I would love to get that feature ported to core from Flink so that we can use in both Spark and Flink engines . Let me create an issue and assign it to myself . Thank you for the suggestion @pvary

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants