Benchmark amount of intermediate data stored #10
base: main
Conversation
```python
executor=spec.executor,
spec=spec,
```
Seems redundant
```diff
@@ -34,7 +58,7 @@ def run(
     callbacks.append(history)
     kwargs['callbacks'] = callbacks

-    with benchmarks(history):
+    with benchmarks(history, spec.work_dir):
```
The `spec.work_dir` isn't the right thing to pass - it's not specific enough. I want the directory containing only the intermediate results written during the last cubed run.

I can see that each such directory is uniquely labeled, which is nice, but I'm not sure exactly how to get this label from the computation. Should I import `cubed.core.plan.CONTEXT_ID`, or is there a simpler way?
The `event` object passed to `on_compute_start` (and `on_compute_end`) in the callback contains the `context_id` (see https://github.com/cubed-dev/cubed/blob/fda2f1e7bedec5389cae00ee751a54076ccf28e3/cubed/runtime/types.py#L28-L49). The `HistoryCallback` doesn't capture this - perhaps it could be changed, or maybe it's simpler to write a custom callback?
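A custom callback along these lines might look like the following minimal sketch. Note the `Callback` base class and the id attribute on the event object are assumptions based on `cubed/runtime/types.py`, not verified against the current API; the class below is plain Python so the sketch is self-contained.

```python
from types import SimpleNamespace

# Hypothetical sketch: a callback that records the id from the start event.
# In cubed this would subclass cubed.runtime.types.Callback.
class ComputeIdCallback:
    def __init__(self):
        self.compute_id = None

    def on_compute_start(self, event):
        # 'compute_id' is assumed to be an attribute of the event object,
        # per cubed/runtime/types.py (unverified)
        self.compute_id = event.compute_id

# usage with a stand-in event object in place of a real cubed event
cb = ComputeIdCallback()
cb.on_compute_start(SimpleNamespace(compute_id="compute-0001"))
```

The callback instance would then be passed in the `callbacks` list, and the recorded id read back after the computation finishes.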
In `cubed.core.plan` there is a `CONTEXT_ID` and a `compute_id`. The former seems to be used to make temporary directories, whilst the latter is what's stored in the `event` object. What's the difference between them?
`CONTEXT_ID` is the directory name created for the duration of the client Python process. The `compute_id` is a recent addition used for creating directories to store history data in, and is unique to the call to `compute`.

So you're right - you want `CONTEXT_ID`, but the intermediate data for all computations run in the pytest session will be there, so it's hard to pick out the arrays for a particular computation.

We should probably make Cubed store its intermediate data in a directory named `{CONTEXT_ID}/{compute_id}`, but that's a bit more work.
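Under that scheme, the per-computation intermediate directory would simply be the two ids joined. A trivial sketch, with hypothetical id values (the real formats may differ):

```python
import posixpath

# Hypothetical id values, for illustration only
context_id = "cubed-20240101T120000-abc123"
compute_id = "compute-0001"

# proposed layout: {CONTEXT_ID}/{compute_id}
intermediate_dir = posixpath.join(context_id, compute_id)
```

Everything written during one `compute` call would then live under `intermediate_dir`, making it easy to measure in isolation.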
Thanks for the explanation!

> We should probably make Cubed store its intermediate data in a directory named `{CONTEXT_ID}/{compute_id}`, but that's a bit more work.

I would like to do that - I can make a PR to cubed.
(Continuing discussion from cubed-dev/cubed#413)

> I think the easier way to solve the original problem in #10 would be to just get the intermediate array paths from the DAG

So then the benchmark context managers need to know about the plan object, right? Or can we add it to what's saved in `history.plan`?

I tried looking at the `plan` object, but if this information is stored in there I can't seem to find it. I thought looking at `target` would give me what I need, but for the smallest quad means example I just get `None` for each node, when clearly something was written.

I'm also wondering if there is any subtlety here with the optimized vs unoptimized plan?
The plan only has the op names now. Optimized vs unoptimized shouldn't make a difference. Try something like (untested):

```python
array_names = [name for name in dag if name.startswith("array-")]
```

where `dag` comes from the callback object.
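As a self-contained illustration of that filter - the node names below are made up, standing in for the real DAG node names from the callback:

```python
# Hypothetical node names; in practice `dag` comes from the callback object
dag = ["array-001", "op-002", "array-003", "op-004"]

# keep only the nodes that represent intermediate arrays
array_names = [name for name in dag if name.startswith("array-")]
```

Those names could then be joined onto the work directory to locate each intermediate array's storage.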
```diff
@@ -34,7 +58,7 @@ def run(
     callbacks.append(history)
     kwargs['callbacks'] = callbacks

-    with benchmarks(history):
+    with benchmarks(history, spec.work_dir):
```
I can't see a way to avoid passing extra information to the benchmarks context manager - the history does not contain this information.
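A minimal sketch of what the extra argument could look like - the signature and the measurement step here are assumptions for illustration, not the project's actual implementation:

```python
from contextlib import contextmanager

# Hypothetical sketch: the context manager receives work_dir explicitly,
# since the history object alone does not carry it
@contextmanager
def benchmarks(history, work_dir):
    stats = {}
    try:
        yield stats
    finally:
        # after the run completes, work_dir can be inspected for
        # intermediate data; here we just record it
        stats["work_dir"] = work_dir

# usage
with benchmarks(None, "/tmp/cubed-work") as stats:
    pass  # the computation would run here
```

The `try/finally` ensures the measurement step runs even if the computation raises.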
```python
# List all files and subdirectories in the directory
contents = fs.glob(f'{work_dir}/**', detail=True)
```
This is currently failing to discover any of the contents.
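One way to sanity-check the globbing is against a plain local directory. This stdlib-only sketch (assuming a local `work_dir`, so not covering the S3 case) sums file sizes recursively with `os.walk`; with fsspec, `fs.find(work_dir, detail=True)` may be another way to list files recursively, though that is untested here.

```python
import os
import tempfile

def dir_total_size(work_dir):
    """Sum the sizes of all files under work_dir, recursively."""
    total = 0
    for root, _dirs, files in os.walk(work_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# demonstrate on a throwaway directory with two files (5 + 3 bytes)
with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "sub"))
    with open(os.path.join(d, "a.bin"), "wb") as f:
        f.write(b"12345")
    with open(os.path.join(d, "sub", "b.bin"), "wb") as f:
        f.write(b"123")
    size = dir_total_size(d)
```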
```python
total_size = 0

if work_dir.startswith('s3://'):
    fs = fsspec.filesystem('s3')
```
A simpler way to do this (that I only discovered yesterday!) is:

```python
fs, _, _ = fsspec.get_fs_token_paths(work_dir)
```
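The convenience is that `get_fs_token_paths` infers the filesystem from the URL's protocol, so the manual `startswith('s3://')` branch goes away. The scheme-sniffing it replaces can be illustrated with just the stdlib (illustration only, not how fsspec actually implements it):

```python
from urllib.parse import urlsplit

# Illustration: infer a protocol name from a path the way an
# fsspec-style helper might, defaulting to local files
def protocol_of(path):
    scheme = urlsplit(path).scheme
    return scheme if scheme else "file"
```

Any fsspec-supported scheme (`s3://`, `gs://`, ...) then resolves to the right filesystem without per-protocol special cases in the benchmark code.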
WIP: Attempt to record the amount of intermediate data stored, just by looking at the work directory.