Benchmark amount of intermediate data stored #10

Open
TomNicholas wants to merge 2 commits into main

Conversation

@TomNicholas (Member) commented Mar 4, 2024

WIP: an attempt to record the amount of intermediate data stored, just by looking at the work directory.
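As a rough sketch of the idea (a hypothetical helper, not the code in this PR): after a computation has run, resolve a filesystem for the work directory and total up the bytes stored beneath it.

import fsspec

def intermediate_bytes(work_dir: str) -> int:
    """Total size in bytes of everything under work_dir (local path or e.g. s3:// URL)."""
    # Resolve the filesystem implementation matching the URL scheme.
    fs, _, _ = fsspec.get_fs_token_paths(work_dir)
    # du(total=True) walks the directory tree and sums the file sizes.
    return fs.du(work_dir, total=True)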

Comment on lines 72 to +73
executor=spec.executor,
spec=spec,

@TomNicholas (Member, Author):

Passing both executor=spec.executor and spec seems redundant.

@@ -34,7 +58,7 @@ def run(
callbacks.append(history)
kwargs['callbacks'] = callbacks

- with benchmarks(history):
+ with benchmarks(history, spec.work_dir):

@TomNicholas (Member, Author):

The spec.work_dir isn't the right thing to pass - it's not specific enough. I want the directory containing only the intermediate results written during the last cubed run.

I can see that each such directory is uniquely labeled, which is nice, but I'm not sure exactly how to get this label from the computation. Should I import cubed.core.plan.CONTEXT_ID, or is there a simpler way?

Member:

The event object passed to on_compute_start (and on_compute_end) in the callback contains the context_id (see https://github.com/cubed-dev/cubed/blob/fda2f1e7bedec5389cae00ee751a54076ccf28e3/cubed/runtime/types.py#L28-L49). The HistoryCallback doesn't capture this - perhaps it could be changed, or maybe it's simpler to write a custom callback?
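A minimal sketch of such a custom callback, assuming the Callback base class that HistoryCallback builds on, and that the start event exposes an identifier for the computation (attribute names follow the linked types.py and may differ between cubed versions):

from cubed.runtime.types import Callback  # same module as the event classes linked above

class ComputeIdCallback(Callback):
    """Record an id for each computation so its work subdirectory can be located afterwards."""

    def __init__(self):
        self.compute_ids = []

    def on_compute_start(self, event):
        # The start event carries an id for this call to compute()
        # (compute_id in recent cubed versions; see the discussion below).
        self.compute_ids.append(event.compute_id)

It would be passed alongside the existing HistoryCallback via the callbacks argument.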

@TomNicholas (Member, Author) commented Mar 5, 2024

In cubed.core.plan there is a CONTEXT_ID and a compute_id. The former seems to be used to make temporary directories, whilst the latter is what's stored in the event object. What's the difference between them?

Member:

CONTEXT_ID is the name of the directory created for the duration of the client Python process. The compute_id is a recent addition, used for creating directories to store history data; it is unique to each call to compute.

So you're right - you want CONTEXT_ID, but the intermediate data for all computations run in the pytest session will be there, so it's hard to pick out the arrays for a particular computation.

We should probably make Cubed store its intermediate data in a directory named {CONTEXT_ID}/{compute_id}, but that's a bit more work.

@TomNicholas (Member, Author):

Thanks for the explanation!

> We should probably make Cubed store its intermediate data in a directory named {CONTEXT_ID}/{compute_id}, but that's a bit more work.

I would like to do that - I can make a PR to cubed.

@TomNicholas (Member, Author):

(Continuing discussion from cubed-dev/cubed#413)

> I think the easier way to solve the original problem in #10 would be to just get the intermediate array paths from the DAG

So then the benchmarks context manager needs to know about the plan object, right? Or can we add it to what's saved in history.plan?

I tried looking at the plan object, but if this information is stored in there I can't seem to find it. I thought looking at target would give me what I need, but for the smallest quad means example I just get None for each node, even though something was clearly written.

I'm also wondering if there is any subtlety here with the optimized vs unoptimized plan?

Member:

The plan only has the op names now. Optimized vs unoptimized shouldn't make a difference. Try something like (untested)

array_names = [name for name in dag if name.startswith("array-")]

Where dag comes from the callback object.
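For example (untested, and assuming the dag is available from the compute-start event and that each intermediate array is materialised under a path containing its node name), per-array sizes could then be collected roughly like this:

import fsspec

array_names = [name for name in dag if name.startswith("array-")]

fs, _, _ = fsspec.get_fs_token_paths(work_dir)
all_files = fs.find(work_dir, detail=True)  # recursive listing: path -> info dict

# Attribute each file to the array whose node name appears in its path.
# The exact directory layout is an assumption here, not documented cubed behaviour.
sizes = {
    name: sum(info["size"] for path, info in all_files.items() if name in path)
    for name in array_names
}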

@@ -34,7 +58,7 @@ def run(
callbacks.append(history)
kwargs['callbacks'] = callbacks

- with benchmarks(history):
+ with benchmarks(history, spec.work_dir):

@TomNicholas (Member, Author):

I can't see a way to avoid passing extra information to the benchmarks context manager - the history does not contain this information.
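For reference, a sketch of the shape that change might take (an illustration only, not this repository's actual benchmarks implementation, which presumably records results rather than printing them):

from contextlib import contextmanager
import time

import fsspec

@contextmanager
def benchmarks(history, work_dir):
    """Time the computation, then record how much data was left behind in work_dir."""
    # history is kept in the signature to match the call site above; unused in this sketch.
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        fs, _, _ = fsspec.get_fs_token_paths(work_dir)
        intermediate = fs.du(work_dir, total=True)
        print(f"took {elapsed:.1f}s, left {intermediate} bytes of intermediate data")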

Comment on lines +36 to +37
# List all files and subdirectories in the directory
contents = fs.glob(f'{work_dir}/**', detail=True)

@TomNicholas (Member, Author):

This is currently failing to discover any of the contents
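One thing worth trying (an untested suggestion): recursive ** globbing has behaved differently across fsspec versions, whereas find() always lists files recursively:

# find() recurses unconditionally and returns {path: info}; each info dict has a "size" entry.
contents = fs.find(work_dir, detail=True)
total_size = sum(info["size"] for info in contents.values())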

total_size = 0

if work_dir.startswith('s3://'):
    fs = fsspec.filesystem('s3')

Member:

A simpler way to do this (that I only discovered yesterday!) is

fs, _, _ = fsspec.get_fs_token_paths(work_dir)
