add_docs_and_backport_max_files_rewrite_option #13082

Merged
merged 10 commits into from
May 20, 2025
docs/docs/spark-procedures.md (3 changes: 2 additions & 1 deletion)
@@ -533,6 +533,7 @@ Dangling deletes are always filtered out during rewriting.
| `min-input-files` | 5 | Any file group exceeding this number of files will be rewritten regardless of other criteria |
| `rewrite-all` | false | Force rewriting of all provided files overriding other options |
| `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data that should be rewritten in a single file group. The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into file-groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. |
+| `max-files-to-rewrite` | null | This option sets an upper limit on the number of eligible files that will be rewritten. If this option is not specified, all eligible files will be rewritten. |
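
To make the documented behavior concrete, here is a minimal Java sketch of the cap described in the `max-files-to-rewrite` row above. It is not Iceberg's planner code: the class `MaxFilesToRewriteSketch`, the method `selectFiles`, and the choice to keep the first files in list order are all assumptions made for illustration. A `null` limit stands in for the option being unspecified.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Iceberg's planner implementation.
final class MaxFilesToRewriteSketch {

  /**
   * Returns at most maxFilesToRewrite of the eligible files, or all of them when the
   * limit is null (option not specified). Keeping the first files in list order is an
   * assumption of this sketch, not documented behavior.
   */
  static List<String> selectFiles(List<String> eligibleFiles, Integer maxFilesToRewrite) {
    if (maxFilesToRewrite == null) {
      return new ArrayList<>(eligibleFiles);
    }
    // Clamp the limit to the valid range so subList never goes out of bounds.
    int limit = Math.max(0, Math.min(maxFilesToRewrite, eligibleFiles.size()));
    return new ArrayList<>(eligibleFiles.subList(0, limit));
  }

  public static void main(String[] args) {
    List<String> eligible = List.of("a.parquet", "b.parquet", "c.parquet");
    System.out.println(selectFiles(eligible, null)); // no limit: all three files
    System.out.println(selectFiles(eligible, 2));    // limit 2: first two files
  }
}
```

With three eligible files, a limit of 2 selects two of them, while leaving the limit unset selects all three.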

#### Output

@@ -1055,4 +1056,4 @@ metadata files and data files to the target location.
Lastly, the [register_table](#register_table) procedure can be used to register the copied table in the target location with a catalog.

!!! warning
-    Iceberg tables with partition statistics files are not currently supported for path rewrite. 
+    Iceberg tables with partition statistics files are not currently supported for path rewrite.
Contributor

Unnecessary newline change?

Contributor Author

Yeah seems like my IDE is adding these

Contributor Author
@coderfender, May 19, 2025

@pvary, I retried the same approach through the online file editor in the GitHub UI as well. It seems that adding the new option automatically shows line 1059 as a change.

Contributor Author

Happy to take guidance to fix this

Member
@RussellSpitzer, May 19, 2025

There is an extra " " at the end of this line that you are removing.

"for path rewrite " vs "for path rewrite"

Nvm, seems like it vanished? You could always reset the file and apply the other changes to see if it persists

Member

Double never mind, there it is

Contributor Author

@RussellSpitzer I tried editing the line from the GitHub repo, Sublime, etc., but for some reason my commits wouldn't work :|

@@ -170,6 +170,18 @@ public Builder maxFileGroupSizeBytes(long maxFileGroupSizeBytes) {
       return this;
     }

+    /**
+     * Configures max files to rewrite. See {@link BinPackRewriteFilePlanner#MAX_FILES_TO_REWRITE}
+     * for more details.
+     *
+     * @param maxFilesToRewrite maximum files to rewrite
+     */
+    public Builder maxFilesToRewrite(int maxFilesToRewrite) {
+      this.rewriteOptions.put(
+          BinPackRewriteFilePlanner.MAX_FILES_TO_REWRITE, String.valueOf(maxFilesToRewrite));
+      return this;
+    }
+
     /**
      * The input is a {@link DataStream} with {@link Trigger} events and every event should be
      * immediately followed by a {@link Watermark} with the same timestamp as the event.
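
The builder method in this hunk follows a simple pattern: a typed setter converts its argument to a string and stores it in an options map under a shared key. The self-contained sketch below shows only that pattern. `RewriteBuilderSketch` and `options()` are hypothetical names, and the key string `"max-files-to-rewrite"` is an assumption based on the option name in the docs table; the diff references the constant but does not show its value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the Flink builder; only the option-map pattern mirrors the diff.
final class RewriteBuilderSketch {
  // Assumed value of BinPackRewriteFilePlanner.MAX_FILES_TO_REWRITE (not shown in the diff).
  static final String MAX_FILES_TO_REWRITE = "max-files-to-rewrite";

  private final Map<String, String> rewriteOptions = new HashMap<>();

  /** Stores the typed limit as a string option and returns the builder for chaining. */
  RewriteBuilderSketch maxFilesToRewrite(int maxFilesToRewrite) {
    this.rewriteOptions.put(MAX_FILES_TO_REWRITE, String.valueOf(maxFilesToRewrite));
    return this;
  }

  /** Returns an immutable snapshot of the options collected so far. */
  Map<String, String> options() {
    return Map.copyOf(rewriteOptions);
  }

  public static void main(String[] args) {
    RewriteBuilderSketch builder = new RewriteBuilderSketch().maxFilesToRewrite(100);
    System.out.println(builder.options()); // {max-files-to-rewrite=100}
  }
}
```

Keeping every option as a string in one map lets the same keys flow through to the planner regardless of which engine-specific builder set them.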
@@ -170,6 +170,12 @@ public Builder maxFileGroupSizeBytes(long maxFileGroupSizeBytes) {
       return this;
     }

+    /**
+     * Configures max files to rewrite. See {@link BinPackRewriteFilePlanner#MAX_FILES_TO_REWRITE}
+     * for more details.
+     *
+     * @param maxFilesToRewrite maximum files to rewrite
+     */
     public Builder maxFilesToRewrite(int maxFilesToRewrite) {
Contributor Author

@pvary, the only change unrelated to backporting and docs is the Javadoc I added to this method in the Flink API.

this.rewriteOptions.put(
BinPackRewriteFilePlanner.MAX_FILES_TO_REWRITE, String.valueOf(maxFilesToRewrite));
@@ -170,6 +170,18 @@ public Builder maxFileGroupSizeBytes(long maxFileGroupSizeBytes) {
       return this;
     }

+    /**
+     * Configures max files to rewrite. See {@link BinPackRewriteFilePlanner#MAX_FILES_TO_REWRITE}
+     * for more details.
+     *
+     * @param maxFilesToRewrite maximum files to rewrite
+     */
+    public Builder maxFilesToRewrite(int maxFilesToRewrite) {
+      this.rewriteOptions.put(
+          BinPackRewriteFilePlanner.MAX_FILES_TO_REWRITE, String.valueOf(maxFilesToRewrite));
+      return this;
+    }
+
     /**
      * The input is a {@link DataStream} with {@link Trigger} events and every event should be
      * immediately followed by a {@link Watermark} with the same timestamp as the event.
@@ -78,7 +78,8 @@ public class RewriteDataFilesSparkAction
           USE_STARTING_SEQUENCE_NUMBER,
           REWRITE_JOB_ORDER,
           OUTPUT_SPEC_ID,
-          REMOVE_DANGLING_DELETES);
+          REMOVE_DANGLING_DELETES,
+          BinPackRewriteFilePlanner.MAX_FILES_TO_REWRITE);

   private static final RewriteDataFilesSparkAction.Result EMPTY_RESULT =
       ImmutableRewriteDataFiles.Result.builder().rewriteResults(ImmutableList.of()).build();
@@ -78,7 +78,8 @@ public class RewriteDataFilesSparkAction
           USE_STARTING_SEQUENCE_NUMBER,
           REWRITE_JOB_ORDER,
           OUTPUT_SPEC_ID,
-          REMOVE_DANGLING_DELETES);
+          REMOVE_DANGLING_DELETES,
+          BinPackRewriteFilePlanner.MAX_FILES_TO_REWRITE);

   private static final RewriteDataFilesSparkAction.Result EMPTY_RESULT =
       ImmutableRewriteDataFiles.Result.builder().rewriteResults(ImmutableList.of()).build();
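
The Spark change above adds the new key to a set of option constants. A plausible reason, stated here as an assumption because the hunk does not show how the set is used, is that the action checks user-supplied options against this set and rejects unknown keys. The sketch below illustrates that kind of allow-list check in isolation. `ValidOptionsSketch`, `validate`, and the literal key strings are hypothetical; the real constants' values are not visible in this diff.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical allow-list check; not the actual RewriteDataFilesSparkAction logic.
final class ValidOptionsSketch {
  // Assumed key strings standing in for the constants listed in the diff.
  static final Set<String> VALID_OPTIONS =
      Set.of(
          "use-starting-sequence-number",
          "rewrite-job-order",
          "output-spec-id",
          "remove-dangling-deletes",
          "max-files-to-rewrite");

  /** Throws IllegalArgumentException if any supplied option key is not in the allow-list. */
  static void validate(Map<String, String> options) {
    Set<String> invalid = new TreeSet<>(options.keySet());
    invalid.removeAll(VALID_OPTIONS);
    if (!invalid.isEmpty()) {
      throw new IllegalArgumentException("Unsupported options: " + invalid);
    }
  }

  public static void main(String[] args) {
    validate(Map.of("max-files-to-rewrite", "10")); // accepted once the key is in the set
    try {
      validate(Map.of("not-an-option", "1"));
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Under this assumption, an option that is implemented in the planner but missing from the allow-list would be rejected before planning starts, which is why a backport needs to touch the set as well.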