add_docs_and_backport_max_files_rewrite_option #13082
Conversation
@pvary, @RussellSpitzer: please take a look whenever you get a chance.
docs/docs/spark-procedures.md
Outdated
| Name | Default Value | Description |
|------|---------------|-------------|
| `delete-ratio-threshold` | 0.3 | Minimum deletion ratio that needs to be associated with a data file for it to be considered for rewriting |
| `output-spec-id` | current partition spec id | Identifier of the output partition spec. Data will be reorganized during the rewrite to align with the output partitioning. |
| `remove-dangling-deletes` | false | Remove dangling position and equality deletes after rewriting. A delete file is considered dangling if it does not apply to any live data files. Enabling this will generate an additional commit for the removal. |
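For reference, a minimal sketch of passing these option keys through the Java rewrite action rather than the SQL procedure. It assumes the keys shown in the table above, and that the option this PR documents and backports is spelled `max-files-to-rewrite`:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;

public class RewriteOptionsSketch {
  // Sketch only: the option strings mirror the docs table above;
  // "max-files-to-rewrite" is the cap this PR documents and backports (name assumed).
  static RewriteDataFiles.Result rewrite(Table table) {
    return SparkActions.get()
        .rewriteDataFiles(table)
        .option("delete-ratio-threshold", "0.3")    // minimum delete ratio for a file to be picked
        .option("remove-dangling-deletes", "true")  // extra commit that drops dangling delete files
        .option("max-files-to-rewrite", "500")      // upper bound on files rewritten in one run
        .execute();
  }
}
```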
I think most of the changes are just formatting. Could we revert the formatting-only changes?
I did try to revert the documentation changes, but for some reason git picks up changes it isn't supposed to.
@coderfender: When backporting a PR, it is usually helpful if the contributor helps the reviewer by calling out any changes needed beyond the original ones, or stating that it is a clean backport. Am I right that this is a clean backport, and only the documentation changes are new? Thanks for the backport.
Force-pushed from 95feed1 to e2223b7.
Iceberg tables with partition statistics files are not currently supported for path rewrite.
Unnecessary newline change?
Yeah, it seems like my IDE is adding these.
@pvary: I retried the same approach through the online git file editor (from the UI) as well. It seems like adding the new option automatically shows line 1059 as a change.
Happy to take guidance to fix this
There is an extra " " at the end of this line that you are removing.
"for path rewrite " vs "for path rewrite"
Nvm, seems like it vanished? You could always reset the file and apply the other changes to see if it persists
@RussellSpitzer I tried editing the line from the GitHub repo, Sublime, etc., but for some reason my commits wouldn't work :|
Force-pushed from d6b12cc to aa8936b.
```java
 * for more details.
 *
 * @param maxFilesToRewrite maximum files to rewrite
 */
public Builder maxFilesToRewrite(int maxFilesToRewrite)
```
@pvary: the only change unrelated to backporting and docs is the Javadoc I added for the Flink API method signature.
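For context, a short sketch of how the builder option above would be set when configuring the Flink maintenance rewrite task. Only `RewriteDataFiles.builder()` and the new `maxFilesToRewrite(int)` from this diff are used; the import path is an assumption, and the surrounding `TableMaintenance` job wiring (table loader, lock factory, scheduling) is deliberately left out:

```java
import org.apache.iceberg.flink.maintenance.api.RewriteDataFiles;

public class FlinkRewriteTaskSketch {
  // Sketch only: configure the Flink maintenance rewrite task with the new
  // file-count cap added in this PR; plug the builder into a TableMaintenance job.
  static RewriteDataFiles.Builder rewriteTask() {
    return RewriteDataFiles.builder()
        .maxFilesToRewrite(100); // rewrite at most 100 files per compaction run
  }
}
```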
Co-authored-by: Russell Spitzer <[email protected]>
I think we should keep the endline fix as well
I'll let @pvary comment if he has any final comments, otherwise I'll merge tomorrow.
fix whitespace
@RussellSpitzer, @pvary: the endline fix is unfortunately not recognized by Git (though it would let me commit after removing the whitespace). Ref: b87bbfc, c1939bb
Merged to main. @coderfender, @RussellSpitzer: Do you think that the filtering at lines 79 to 85 in 09a5317 would be useful in Spark as well? If so, we could move the filtering from Flink to the core. @coderfender: if you have time, I would be happy to review it, or we could find someone else who has more time to do this.
@pvary, I am not sure I understand this right. I initially wrote it as part of SparkRewriteActions, but after your refactor was merged I had to move it upstream to the core planner logic. Are you saying I should add this option similar to the Flink maintenance API and perhaps add some more tests? Please let me know if I am missing something here. Thank you for merging into main @pvary, and thank you for the review / approval @RussellSpitzer.
There is a feature in Flink compaction where the user can limit the number of bytes to rewrite in a compaction (`maxRewriteBytes`).
Ah, now I get it. I was confused between maxRewriteBytes vs maxFilesToRewrite.
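To make the distinction concrete, here is a small, self-contained sketch (hypothetical helpers, not Iceberg code) of how a byte-based cap like `maxRewriteBytes` differs from the file-count cap that `maxFilesToRewrite` adds when trimming the list of files selected for rewrite:

```java
import java.util.ArrayList;
import java.util.List;

public class RewriteLimitSketch {
  // Hypothetical stand-in for a data file selected for compaction.
  record CandidateFile(String path, long sizeBytes) {}

  // File-count cap: keep at most maxFiles candidates (what maxFilesToRewrite controls).
  static List<CandidateFile> limitByFileCount(List<CandidateFile> files, int maxFiles) {
    return files.subList(0, Math.min(maxFiles, files.size()));
  }

  // Byte cap: keep at least one file, then stop before the cumulative size
  // exceeds maxBytes (what a maxRewriteBytes-style limit controls).
  static List<CandidateFile> limitByBytes(List<CandidateFile> files, long maxBytes) {
    List<CandidateFile> kept = new ArrayList<>();
    long total = 0L;
    for (CandidateFile file : files) {
      if (!kept.isEmpty() && total + file.sizeBytes() > maxBytes) {
        break;
      }
      kept.add(file);
      total += file.sizeBytes();
    }
    return kept;
  }
}
```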
Backport changes to Flink 1.19 and 2.0 and Spark 3.4 and 4.0, and update the Spark docs to provide more information about this option.
Original PR: #12824