-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(ingest/s3): enhance readability #12686
base: master
Are you sure you want to change the base?
refactor(ingest/s3): enhance readability #12686
Conversation
0fc14e6
to
cade3f4
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Continue to review full report in Codecov by Sentry.
|
@@ -847,7 +847,7 @@ def get_folder_info( | |||
path_spec: PathSpec, | |||
bucket: "Bucket", | |||
prefix: str, | |||
) -> List[Folder]: | |||
) -> Iterable[Folder]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed for memory efficiency
if _is_allowed_path( | ||
path_spec, self.create_s3_path(obj.bucket_name, obj.key) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if modification_time is None: | ||
logger.warning( | ||
f"Unable to find any files in the folder {key}. Skipping..." | ||
) | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this(L901~ 905) because groupby_unsorted
never returns an empty group.
for _, group in grouped_s3_objects_by_dirname: | ||
max_file = max(group, key=lambda x: x.last_modified) | ||
max_file_s3_path = self.create_s3_path(max_file.bucket_name, max_file.key) | ||
|
||
# If partition_id is None, it means the folder is not a partition | ||
partition_id = path_spec.get_partition_from_path(max_file_s3_path) | ||
|
||
yield Folder( | ||
partition_id=partition_id, | ||
is_partition=bool(partition_id), | ||
creation_time=min(obj.last_modified for obj in group), | ||
modification_time=max_file.last_modified, | ||
sample_file=max_file_s3_path, | ||
size=sum(obj.size for obj in group), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before | After |
---|---|
O(n) | O(3n) |
We got some performance loss but it is minimal. The time complexity has only increased from O(n) to O(3n).
The reason of O(3n) is this calls max(), min() and sum() in the each loop.
# If id is None, it means the folder is not a partition | ||
partitions.append( | ||
Folder( | ||
partition_id=id, | ||
is_partition=bool(id), | ||
creation_time=creation_time if creation_time else None, # type: ignore[arg-type] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need null handling. The type of creation_time
is datetime
.
creation_time
is the value ofitem.last_modified
. (L896)- The type of
item
is ObjectSummary - The type of
ObjectSummary.last_modified
isdatetime
.
- Refactor S3Source().get_folder_info() to enhance readability - Add a test to ensure that get_folder_info() returns the expected result.
cade3f4
to
5952e30
Compare
Comment to change the label |
S3Source().get_folder_info()
to enhance readabilityget_folder_info()
returns the expected result.Checklist