rfc20: need partial release guidance #230
Comments
Thanks. Tagging @milroy, as he probably wants to pay special attention to this per his research.
I don't follow. Are you suggesting the scheduler add "rank" (or execution target) into each resource vertex?
Looking at the code, JGF already adds "rank" to each vertex.
I think an important design point would be to determine how exactly we deal with the free RPC. For the free RPC, it seems the pair of the jobid and the list of releasing execution targets should be sufficient for flux-sched to do its resource deallocation. (Some testing is needed to see how difficult or easy this is, though.) But we probably don't want to keep the original R on such a partial release. If this partially freed job continues to run across a scheduler reload event, this will lead the newly loaded scheduler to reconstruct the full allocation, not the partial allocation. So it seems we have two choices:
Or
Now I suspect flux-core probably doesn't want to write graph code to manipulate JGF. Perhaps we can add a "released" key or similar somewhere and make the list of releasing execution targets its value? On a scheduler reload, such an augmented
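A toy sketch of what that could look like, in Python for illustration only — the "released" key name and its placement under "execution" are assumptions here, not anything specified in RFC 20:

```python
# Hypothetical: annotate an RV1 object with a "released" key listing
# execution targets already freed, so a reloaded scheduler could
# reconstruct only the remaining allocation.
import json

R = {
    "version": 1,
    "execution": {
        "R_lite": [
            {"rank": "0-3", "children": {"core": "0-7"}},
        ],
    },
}

def mark_released(R, targets):
    """Return a copy of R recording released execution targets (idset string)."""
    new = json.loads(json.dumps(R))  # cheap deep copy
    new["execution"]["released"] = targets
    return new

R2 = mark_released(R, "2-3")
```

The original R is left untouched, which would preserve it for provenance.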
BTW, having the JGF in
My main thought was that perhaps the integer exec targets (ranks) could be stand-ins for JGF subgraphs in R if the scheduler maintained a consistent internal mapping, including across restarts.

As far as what appears in the KVS, I don't think we can modify the original R, since that should remain intact for provenance (what node did I run on?). We had discussed dropping R fragments into the KVS (maybe in a "shrink" subdirectory) as chunks are freed... Then the complementary "grow" directory could contain chunks that are added.

I dunno about passing resources down to a subinstance, but it seems like the common case for every job may not be to bootstrap a graph scheduler instance, and we care about job throughput, so should we think about other options?

If I've wandered into the weeds, apologies!
I don't think
R fragments can only contain
Yeah, this is an important case. My thought on this has been to leverage our concept of "scheduler specialization" again. Here I think it makes sense that we use the full JGF-included R. My conjecture is that using JGF at the internal levels will actually increase throughput compared to relying on the hwloc reader.
If we go with the proposed
These are really essential points to discuss. Please keep your comments coming.
After reading my comments and your responses again, I just realized an error in my thinking: I was suggesting an execution target could be a stand-in for a JGF subtree in R, but that only works if whole nodes are allocated. Without more information in R, the scheduler receiving it during FREE or HELLO wouldn't know what subset of the execution target's resources was allocated to that job. Sorry about that.
It is good if JGF does not need to be repeated in R fragments during "shrink", and if the exec / job manager can just ignore it. That was one question I wanted to get clarified in the RFC. In fact, if that is all that is needed during partial release, the
Exactly!
Yes.
I think we still need this info passed to FREE and also to the hello callback for the scheduler reload event. If the eventlog is the ultimate source of this info serving these two calls, this should be sufficient I think. Like I said before, I need to do some work to see how easy or difficult this is to do in flux-sched. But I am pretty positive that this can be done.
While we are here, maybe we can also hash this out a bit, as this is what @milroy will soon need. I don't think adding an additional
Could we keep this issue focused on what needs to be updated in RFC 20 to implement partial release, and open another issue for the grow case?
Yes. Indeed that's what I was going to suggest anyway.
How about if we just send a free request to the scheduler as we do now, and let libschedutil look up the full R before calling each callback?

For the hello handshake, we could also add an idset to the

Is it reasonable to change RV1 to describe partial release in terms of exec targets, not in terms of R subsets, and indicate that the

This is probably not a long-term solution since it only covers a coarse-granularity "shrink", but it may get us well past our TOSS milestone. I think the point of version 1 was to get the bare minimum defined, and it is a hard production requirement to be able to tolerate a hung node without tying up unrelated resources, so I think it fits.
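To make the scheme concrete, here is a rough Python sketch of shrinking an RV1 object by an idset of execution targets. The idset helpers are minimal stand-ins written for this example (flux-core's libidset is the real thing), and whole-target granularity is assumed, matching the proposal above; none of these names are real API:

```python
# Sketch: partial release expressed as an idset of execution targets
# (broker ranks), removing whole targets from R_lite.

def idset_decode(s):
    """Parse an idset string like "0-2,5" into a set of ints."""
    out = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            out.update(range(int(lo), int(hi) + 1))
        else:
            out.add(int(part))
    return out

def idset_encode(ids):
    """Encode a set of ints into a compact idset string like "0-1,3-5"."""
    ids = sorted(ids)
    runs, start = [], None
    for i, n in enumerate(ids):
        if start is None:
            start = n
        if i + 1 == len(ids) or ids[i + 1] != n + 1:
            runs.append(str(start) if start == n else f"{start}-{n}")
            start = None
    return ",".join(runs)

def remove_ranks(R, free_ranks):
    """Return a copy of R with the given execution targets dropped from R_lite."""
    freed = idset_decode(free_ranks)
    new_rlite = []
    for entry in R["execution"]["R_lite"]:
        keep = idset_decode(entry["rank"]) - freed
        if keep:
            new_rlite.append({**entry, "rank": idset_encode(keep)})
    return {**R, "execution": {**R["execution"], "R_lite": new_rlite}}

R = {"version": 1,
     "execution": {"R_lite": [{"rank": "0-3", "children": {"core": "0-7"}}]}}
R_after = remove_ranks(R, "2-3")
```

A scheduler receiving the free request (jobid plus idset) could do the equivalent internally without R ever being rewritten in the KVS.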
Sorry to jump in late, but @garlick I feel your scheme posted above makes the most sense for an initial solution. It allows each consumer of R to make use of any ancillary data stored in the format, as long as we restrict partial release to one or more execution target ids, anyway. Note, one of the early use cases for partial release (release of resources from execution targets as soon as the epilog completes, rather than waiting for the entire job to finish) will be for

Here's a crazy idea: what if R were an eventlog instead of a static object? It could eventually have
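For illustration only, a toy fold of the "R as an eventlog" idea — the event names ("init", "shrink", "grow") and context fields below are invented, not any existing eventlog schema:

```python
# Toy illustration: R as a sequence of events that can be folded into
# the currently allocated resource set, instead of one static object.
eventlog = [
    {"name": "init",   "context": {"ranks": [0, 1, 2, 3]}},
    {"name": "shrink", "context": {"ranks": [2]}},
    {"name": "shrink", "context": {"ranks": [3]}},
]

def current_ranks(log):
    """Fold the eventlog into the set of currently allocated ranks."""
    held = set()
    for ev in log:
        if ev["name"] in ("init", "grow"):
            held |= set(ev["context"]["ranks"])
        elif ev["name"] == "shrink":
            held -= set(ev["context"]["ranks"])
    return held
```

Provenance falls out for free: replaying a prefix of the log gives the allocation at any point in the job's life.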
Good point.
I always kind of thought of the
I guess my point was that an individual R is then not meaningful. Tools that want to determine the resources allocated to any job at any given time would have to parse the job eventlog and load multiple Rs from the KVS to put together something sane. However, I agree it is mostly the same conceptually.

Edit: Also, I guess it is the mechanism of taking the "implicitly referenced" and making it explicit for all R users that we're trying to figure out in this issue? I guess I was thinking it would be nice if the R format itself could be directly amended by an append.
Makes sense. Or maybe R and its deltas just get pulled into the eventlog alloc/free contexts? Free-range coffee discussion indicated!
Perhaps we could put a pin in this, to be taken up when we're ready to do the actual work involved? Or is this issue on the critical path for the upcoming (within the week) milestone?
Yes, pinned!
RFC 20 describes the form of R version 1.
One thing that is missing is a discussion of how the exec system generates R fragments to support partial release of resources.
The exec system will operate at the granularity of shell instances, which currently map 1:1 with broker ranks or "execution targets". Since this is the same unit as the R_lite "rank", splitting this portion of R into fragments should not be challenging.

What, if anything, should be done with the (optional) scheduler dict? If there is no way for the exec system to release portions of JGF, then is there a compelling reason to expose it directly in R? If it's only for scheduler bookkeeping, could the scheduler include some abbreviated id instead?
Edit: regardless of the details, the main issue is that this important use case of R isn't covered in the RFC.
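As a rough illustration of the R_lite splitting described above — assuming for simplicity that each R_lite entry covers a single execution target; the fragment layout mirrors RV1 but is not normative:

```python
# Sketch: an exec system emitting one R fragment per execution target,
# so each fragment can be released independently.

def fragment_by_target(R):
    """Yield one R fragment per R_lite entry (assumed one rank per entry)."""
    for entry in R["execution"]["R_lite"]:
        yield {
            "version": R["version"],
            "execution": {"R_lite": [dict(entry)]},
        }

R = {"version": 1,
     "execution": {"R_lite": [
         {"rank": "0", "children": {"core": "0-7"}},
         {"rank": "1", "children": {"core": "0-7"}},
     ]}}
fragments = list(fragment_by_target(R))
```

Note the open question above still applies: these fragments carry no scheduler dict, so JGF (if present in the original R) would either be omitted from fragments or need some abbreviated id.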