
rfc20: need partial release guidance #230

Open
garlick opened this issue Feb 29, 2020 · 22 comments

@garlick
Member

garlick commented Feb 29, 2020

RFC 20 describes the form of R version 1.

One thing that is missing is a discussion of how the exec system generates R fragments to support partial release of resources.

The exec system will operate at the granularity of shell instances, which currently map 1:1 with broker ranks or "execution targets". Since this is the same unit as the R_lite "rank", splitting this portion of R into fragments should not be challenging.
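Since R_lite entries are keyed by execution target, splitting works out to filtering a list. A minimal sketch (ranks are modeled as plain integers here; real R_lite encodes them as idset strings such as "0-3", which would need expansion first):

```python
# Hedged sketch: split an RV1-style R object into per-rank fragments for
# partial release. Integer "rank" values stand in for R_lite idset strings.

def split_r_by_ranks(r, release_ranks):
    """Return (released_fragment, retained_fragment) built from R_lite."""
    released, retained = [], []
    for entry in r["execution"]["R_lite"]:
        (released if entry["rank"] in release_ranks else retained).append(entry)
    def frag(r_lite):
        return {"version": 1, "execution": {"R_lite": r_lite}}
    return frag(released), frag(retained)

r = {"version": 1, "execution": {"R_lite": [
    {"rank": 0, "children": {"core": "0-3"}},
    {"rank": 1, "children": {"core": "0-3"}},
    {"rank": 2, "children": {"core": "0-3"}},
]}}
freed, kept = split_r_by_ranks(r, {2})
```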

What if anything should be done with the (optional) scheduler dict?

If there is no way for the exec system to release portions of JGF, then is there a compelling reason to expose it directly in R? If it's only for scheduler bookkeeping, could the scheduler include some abbreviated id instead?

Edit: regardless of the details, the main issue is that this important use case of R isn't covered in the RFC.

@dongahn
Member

dongahn commented Mar 1, 2020

Thanks. Tagging @milroy, as he will probably want to pay special attention to this per his research.

@dongahn
Member

dongahn commented Mar 1, 2020

If there is no way for the exec system to release portions of JGF, then is there a compelling reason to expose it directly in R? If it's only for scheduler bookkeeping, could the scheduler instead include some abbreviated id instead?

I don't follow. Are you suggesting the scheduler add "rank" (or execution target) into each resource vertex?

@dongahn
Member

dongahn commented Mar 1, 2020

I don't follow. Are you suggesting the scheduler add "rank" (or execution target) into each resource vertex?

Looking at the code, JGF already adds "rank" to each vertex.

@dongahn
Member

dongahn commented Mar 1, 2020

The exec system will operate at the granularity of shell instances, which currently map 1:1 with broker ranks or "execution targets". Since this is the same unit as the R_lite "rank", splitting this portion of R into fragments should not be challenging.

I think an important design point would be to determine how exactly we deal with the R object on a partial release.

For the free RPC, it seems the pair of jobid and the releasing execution target list should be sufficient for flux-sched to do its resource deallocation. (Need some testing to see how easy or difficult this is to do, though.)

But we probably don't want to keep the original R as-is on such a partial release. If a partially freed job continues to run across a scheduler reload event, this will lead the newly loaded scheduler to reconstruct the full allocation, not the partial allocation.

So it seems we have two choices:

  1. Upon partial release, manipulate the original R to carve out the released portion from either or both of the R_lite and JGF keys.

  2. Augment R (or introduce other metadata) to encode the released execution targets.

Now I suspect flux-core probably doesn't want to write graph code to manipulate JGF. Perhaps we can add a "released" key or similar somewhere, with the list of released execution targets as its value?

On a scheduler reload, such an augmented R (or R + the released execution target metadata) can be passed to the hello callback to assist the scheduler with correct state reconstruction?
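A minimal sketch of that idea, assuming a hypothetical "released" key (the key name and the integer-rank modeling are illustrative, not part of RV1):

```python
# Hypothetical sketch: record partial release by annotating R with a
# "released" list of execution targets instead of carving up JGF.
# The "released" key is assumed for illustration, not an RV1 field.

def mark_released(r, ranks):
    released = set(r.get("released", []))
    released.update(ranks)
    r["released"] = sorted(released)
    return r

def live_ranks(r):
    # Ranks still held by the job: every R_lite rank not yet released.
    all_ranks = {e["rank"] for e in r["execution"]["R_lite"]}
    return all_ranks - set(r.get("released", []))

r = {"version": 1,
     "execution": {"R_lite": [{"rank": i, "children": {"core": "0-3"}}
                              for i in range(4)]}}
mark_released(r, {2})
```

On a reload, a hello callback could then allocate only `live_ranks(r)` rather than the full R_lite set.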

@dongahn
Member

dongahn commented Mar 1, 2020

BTW, having the JGF in R is absolutely necessary for flux-sched to be able to reconstruct the scheduler state on reload and to construct the nested scheduler state when we use a reader beyond hwloc. For high end systems, I believe we will need more than hwloc.

@garlick
Member Author

garlick commented Mar 1, 2020

My main thought was that perhaps the integer exec targets (ranks) could be stand-ins for JGF subgraphs in R if the scheduler maintained a consistent internal mapping, including across restarts.
The advantage would be allowing exec to work in a scheduler-neutral manner, and keeping R objects lean and operations on them simple.

As far as what appears in the KVS, I don't think we can modify the original R since that should remain intact for provenance (what node did I run on?). We had discussed dropping R fragments into the KVS (maybe in a "shrink" subdirectory) as chunks are freed... Then the complementary "grow" directory could contain chunks that are added.

I dunno about passing resources down to a subinstance, but it seems like the common case for every job may not be to bootstrap a graph scheduler instance, and we care about job throughput, so should we think about other options?

If I've wandered into the weeds, apologies!

@dongahn
Member

dongahn commented Mar 1, 2020

My main thought was that perhaps the integer exec targets (ranks) could be stand-ins for JGF subgraphs in R if the scheduler maintained a consistent internal mapping, including across restarts.

I don't think ranks will work because node-local resources can be allocated. Graph vertex and edge IDs may become such stand-ins at the expense of added complexity. But this probably won't serve your need?

We had discussed dropping R fragments into the KVS (maybe in a "shrink" subdirectory) as chunks are freed... Then the complementary "grow" directory could contain chunks that are added.

R fragments can only contain R_lite. Or this can even be a simpler form of "ranks" given the current granularity. Maybe we can revise the R_lite key so that a rank-list form is also a valid R. Then Original R - Freed Rs + Extended R can represent your current state, and can be used in scheduler state reconstruction as well as elastic scheduling, I suppose.
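At execution-target granularity, that bookkeeping reduces to set arithmetic (a sketch with plain integer ranks standing in for idset strings):

```python
# Sketch: Original R - Freed Rs + Extended Rs, at execution-target
# granularity. Each argument models the rank content of one or more
# R fragments as sets of integers (real fragments would carry idsets).

def effective_ranks(original, freed_fragments, extended_fragments):
    """Return the job's current execution targets."""
    ranks = set(original)
    for freed in freed_fragments:
        ranks -= set(freed)
    for extended in extended_fragments:
        ranks |= set(extended)
    return ranks
```

For example, `effective_ranks({0, 1, 2, 3}, [{2}], [{8, 9}])` yields `{0, 1, 3, 8, 9}`.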

@dongahn
Member

dongahn commented Mar 1, 2020

I dunno about passing resources down to a subinstance, but it seems like the common case for every job may not be to bootstrap a graph scheduler instance, and we care about job throughput, so should we think about other options?

Yeah, this is an important case. My thought on this has been to leverage our concept of "scheduler specialization" again.

Here I think it makes sense to use the full JGF-included R writer only at the system and internal-level instances. At the leaf level, where large numbers of jobs need to be scheduled and run, users can specialize the scheduler's emit behavior to R_lite only.

My conjecture is that using JGF at the internal levels will actually increase throughput compared to relying on the hwloc reader.

@dongahn
Member

dongahn commented Mar 1, 2020

The advantage would be allowing exec to work in a scheduler-neutral manner

If we go with the proposed Original R - Freed Rs + Extended R approach, would the optional scheduling key prevent exec from working consistently across different schedulers, though?


@dongahn
Member

dongahn commented Mar 1, 2020

If I've wandered into the weeds, apologies!

These are really essential points to discuss at this point. Please keep your comments coming.

@garlick
Member Author

garlick commented Mar 1, 2020

After reading my comments and your responses again, I just realized an error in my thinking: I was suggesting an execution target could be a stand-in for a JGF subtree in R, but that only works if whole nodes are allocated. Without more information in R, the scheduler receiving it during FREE or HELLO wouldn't know what subset of the execution target's resources were allocated to that job. Sorry about that.

R fragments can only contain R_lite. Or this can even be a simpler form of "ranks" given the current granularity. Maybe we can revise R_lite key to make rank list form also a valid R.

It is good if JGF does not need to be repeated in R fragments during "shrink", and if exec / job manager can just ignore it. That was one question I wanted to get clarified in the RFC.

In fact if that is all that is needed during partial release, the release events logged to the job eventlog in the KVS already contain this information and might be sufficient?
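For illustration, pulling the freed targets out of such an eventlog could look like this (the "release" event name, the "ranks" context key, and the idset strings are assumptions for the sketch, not confirmed RFC fields):

```python
import json

# Illustrative only: accumulate the freed execution targets recorded by
# "release" events in a job eventlog (one JSON object per line).
eventlog = "\n".join([
    json.dumps({"timestamp": 1.0, "name": "alloc"}),
    json.dumps({"timestamp": 2.0, "name": "release",
                "context": {"ranks": "2", "final": False}}),
    json.dumps({"timestamp": 3.0, "name": "release",
                "context": {"ranks": "0-1,3", "final": True}}),
])

def released_idsets(log):
    """Return the idset strings from release events, in log order."""
    freed = []
    for line in log.splitlines():
        event = json.loads(line)
        if event["name"] == "release":
            freed.append(event["context"]["ranks"])
    return freed
```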

@dongahn
Member

dongahn commented Mar 1, 2020

After reading my comments and your responses again, I just realized an error in my thinking: I was suggesting an execution target could be a stand-in for a JGF subtree in R, but that only works if whole nodes are allocated. Without more information in R, the scheduler receiving it during FREE or HELLO wouldn't know what subset of the execution target's resources were allocated to that job. Sorry about that.

Exactly!

It is good if JGF does not need to be repeated in R fragments during "shrink", and if exec / job manager can just ignore it. That was one question I wanted to get clarified in the RFC.

Yes.

In fact if that is all that is needed during partial release, the release events logged to the job eventlog in the KVS already contain this information and might be sufficient?

I think we still need this info passed to FREE and also the hello callback for the scheduler reload event. If the eventlog is the ultimate source of this info serving these two calls, this should be sufficient, I think. Like I said before, I need to do some work to see how easy or difficult this is to do at flux-sched. But I am pretty positive that this can be done.

@dongahn
Member

dongahn commented Mar 1, 2020

Then the complementary "grow" directory could contain chunks that are added.

While we are here, maybe we can also hash this out a bit, as this is what @milroy will soon need.

I don't think adding an additional R is difficult. But what is currently difficult is how to do this under the original JOBID. In particular, flux job submit will always generate a new JOBID. Do you think there is an easy path to generating a new R under the same JOBID using the flux job submit|flux mini interface?

@garlick
Member Author

garlick commented Mar 1, 2020

Could we keep this issue focused on what needs to be updated in RFC 20 to implement partial release, and open another issue for the grow case?

@dongahn
Member

dongahn commented Mar 1, 2020

Yes. Indeed that's what I was going to suggest anyway.

@garlick
Member Author

garlick commented Mar 1, 2020

I think we still need this info passed to FREE and also hello callback for scheduler reload event. If the eventlog is the ultimate source of this info which serves this two calls, this should be sufficient I think. Like I said before I need to do some work to see how easy or difficult to do this this at flux-sched. But I am pretty positive that this can be done.

How about if we just send a free request to the scheduler as we do now, let libschedutil look up the full R before calling each free() callback as it does now (or cache it as an optimization), and add an idset to the request which describes which exec target ranks are being released in that free message? The scheduler would then need to take the intersection of the idset and the original R to decide what to free internally.

For the hello handshake, we could also add an idset to the hello() callback that indicates which exec target ranks are still allocated if a subset? The scheduler would take the intersection of the idset and R to decide what to allocate internally.
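A toy model of that handshake, with assumed callback names (not real libschedutil APIs) and plain integer ranks in place of idset strings:

```python
# Toy model of the proposed free/hello scheme: messages carry an idset
# of execution targets, and the scheduler intersects it with the job's
# R to decide what to free or allocate. Names/signatures are assumed.

class MockScheduler:
    def __init__(self):
        self.allocated = {}  # jobid -> set of execution target ranks

    def hello(self, jobid, r_ranks, still_allocated=None):
        # On reload, allocate the intersection of R's ranks and the
        # still-allocated idset (all of R when no idset accompanies hello).
        ranks = set(r_ranks)
        if still_allocated is not None:
            ranks &= set(still_allocated)
        self.allocated[jobid] = ranks

    def free(self, jobid, r_ranks, free_idset=None):
        # Free the intersection of R's ranks and the idset in the free
        # message; a missing idset means a full release.
        ranks = set(r_ranks) if free_idset is None else set(r_ranks) & set(free_idset)
        self.allocated[jobid] -= ranks
        return ranks
```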

Is it reasonable to change RV1 to describe partial release in terms of exec targets, not in terms of R subsets; and indicate that the scheduler dict is optional, opaque scheduler-specific data not to be tampered with or depended on by other system components? (Thus avoiding the implication that it has to be subdivided for partial release)

This is probably not a long term solution since it only covers a coarse granularity "shrink", but it may get us well past our TOSS milestone. I think the point of version 1 was to get the bare minimum defined, and it is a hard production requirement to be able to tolerate a hung node without tying up unrelated resources, so I think it fits.

@grondo
Contributor

grondo commented Mar 1, 2020

Is it reasonable to change RV1 to describe partial release in terms of exec targets, not in terms of R subsets; and indicate that the scheduler dict is optional, opaque scheduler-specific data not to be tampered with or depended on by other system components?

Sorry to jump in late, but @garlick I feel your scheme posted above makes the most sense for an initial solution. It allows each consumer of R to make use of any ancillary data stored in the format, as long as we restrict partial release to one or more execution target ids, anyway.

Note, one of the early use cases for partial release (releasing resources from execution targets as soon as the epilog completes, rather than waiting for the entire job to finish) means flux jobs or other tools will need to display the currently allocated resources for jobs. We should strive for a partial release/shrink format that makes this operation straightforward. Also, this might have implications for the accounting service, e.g. do sites want to charge for resources while the job epilog is running? If so, the accounting module may need to integrate over the shrink operations of a partial release.

Here's a crazy idea: what if R was an eventlog instead of a static object? It could eventually have grow and shrink events and the exec system could watch this eventlog.

@garlick
Member Author

garlick commented Mar 1, 2020

for flux jobs or other tools to display the currently allocated resources for jobs. We should strive for a partial release/shrink format that makes this operation straightforward.

Good point.

Here's a crazy idea: what if R was an eventlog instead of a static object? It could eventually have grow and shrink events and the exec system could watch this eventlog.

I always kind of thought of the alloc and free events in the main job eventlog as having an implicit, indirectly referenced R operand, and that grow would be represented by more allocs with explicit operands, and shrink by more frees with explicit operands. So in a way we already have something like that?

@grondo
Contributor

grondo commented Mar 1, 2020

an implicit, indirectly referenced R operand

I guess my point was that an individual R is then not meaningful on its own. Tools that want to determine the resources allocated to any job at any given time would have to parse the job eventlog and load multiple Rs from the KVS to put together something sane.

However, I agree it is mostly the same conceptually.

Edit: Also, I guess it is the mechanism of taking the "implicitly referenced" R and making it explicit for all R users that we're trying to figure out in this issue? I guess I was thinking it would be nice if the R format itself could be directly amended by an append.

@garlick
Member Author

garlick commented Mar 1, 2020

Makes sense. Or maybe R and its deltas just get pulled into the eventlog alloc/free contexts?

Free range coffee discussion indicated!

@grondo
Contributor

grondo commented Mar 1, 2020

Perhaps we could put a pin in this to be taken up when we're ready to do the actual work involved? Or is this issue on critical path for upcoming (within the week) milestone?

@garlick
Member Author

garlick commented Mar 1, 2020

Yes, pinned!
