Benchmark amount of intermediate data stored #10
base: main
@@ -21,9 +21,33 @@ def spec_from_config_file(filepath: str) -> cubed.Spec:
     return spec_from_config(config)

+
+def get_directory_size(work_dir: str) -> float:
+    """Get the size of any data written to the given directory (local or remote)."""
+    import fsspec
+
+    total_size = 0
+
+    if work_dir.startswith('s3://'):
+        fs = fsspec.filesystem('s3')
Comment: A simpler way to do this (that I only discovered yesterday!) is `fs, _, _ = fsspec.get_fs_token_paths(work_dir)`.
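For illustration, a minimal sketch of that suggestion (untested against this repo; the bucket path is a hypothetical placeholder). `get_fs_token_paths` infers the filesystem implementation from the URL scheme, so the whole `if`/`else` scheme check can go:

```python
import fsspec

work_dir = 's3://example-bucket/cubed-tmp'  # hypothetical path for illustration

# Before: pick the filesystem implementation by hand from the URL prefix.
if work_dir.startswith('s3://'):
    fs = fsspec.filesystem('s3')
else:
    fs = fsspec.filesystem('file')

# After: let fsspec infer the filesystem from the scheme (file, s3, gs, ...).
fs, _, _ = fsspec.get_fs_token_paths(work_dir)
```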
+    else:
+        fs = fsspec.filesystem('file')
+
+    # List all files and subdirectories in the directory
+    contents = fs.glob(f'{work_dir}/**', detail=True)
Comment on lines +36 to +37: This is currently failing to discover any of the contents.
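One possible fix, offered as an assumption rather than a confirmed diagnosis: recursive `**` globbing has behaved differently across fsspec versions and backends, whereas `fs.find` always walks the directory tree and can return per-file details directly. A sketch, reusing `fs` and `work_dir` from the diff above:

```python
# Recursively list every file below work_dir; with detail=True, find()
# returns a dict mapping each path to its info record.
contents = fs.find(work_dir, detail=True)

total_size = sum(
    info['size'] for info in contents.values() if info['type'] == 'file'
)
```

If only the total is needed, `fs.du(work_dir)` returns the recursively summed size in a single call.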
+    for item in contents:
+        if item['type'] == 'file':
+            # If it's a file, add its size to the total
+            total_size += item['size']
+
+    return total_size
+
+
 def run(
     result,
     executor,
     spec,
     benchmarks,
     **kwargs
 ):
@@ -34,7 +58,7 @@ def run(
     callbacks.append(history)
     kwargs['callbacks'] = callbacks

-    with benchmarks(history):
+    with benchmarks(history, spec.work_dir):
Comment: The … I can see that each such directory is uniquely labeled, which is nice, but I'm not sure exactly how to get this label from the computation. Should I import …?

Comment: So you're right - you want … We should probably make Cubed store its intermediate data in a directory named …

Comment: Thanks for the explanation! I would like to do that - I can make a PR to cubed.

Comment: (Continuing discussion from cubed-dev/cubed#413) I tried looking at the … I'm also wondering if there is any subtlety here with the optimized vs unoptimized plan?

Comment: The plan only has the op names now. Optimized vs unoptimized shouldn't make a difference. Try something like (untested) `array_names = [name for name in dag if name.startswith("array-")]`, where …

Comment: I can't see a way to avoid passing extra information to the benchmarks context manager - the history does not contain this information.
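Pulling the suggestions in this thread together, a hedged sketch of measuring only the intermediate array directories under `work_dir` (the `dag` argument, the `array-` name prefix, and the one-directory-per-array layout are assumptions taken from the comments above, not verified against the current Cubed API):

```python
import fsspec


def get_intermediate_data_size(work_dir: str, dag) -> int:
    """Total bytes written for the intermediate arrays of one computation."""
    # Node names of materialised arrays, as suggested above (assumed prefix).
    array_names = [name for name in dag if name.startswith('array-')]

    # Let fsspec pick the filesystem implementation from the URL scheme.
    fs, _, _ = fsspec.get_fs_token_paths(work_dir)

    total = 0
    for name in array_names:
        path = f'{work_dir}/{name}'  # assumed layout: one directory per array
        if fs.exists(path):
            # du() recursively sums the sizes of all files under the path.
            total += fs.du(path)
    return total
```

If the layout turns out to be different, falling back to `fs.du(work_dir)` on the computation's top-level directory still gives the overall total, at the cost of not separating individual arrays.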

         computed_result = result.compute(
             executor=executor,
Comment: Seems redundant.