JobManagement
sirepo.job_api is the entry point into the job system for the GUI. This consists of HTTP requests to the supervisor, which runs as a separate Tornado process.
The supervisor sends Ops (search: class _Op) via drivers, which will start a job_agent process.
Ops are sent asynchronously over a single Websocket connection to each job_agent. An Op has at least one response, and sometimes multiple responses (repeated status updates). This means that Ops are a "state" to be managed.
Some API calls result in multiple Ops, but most map one-to-one to a single Op.
Because of how the GUI has historically worked, some APIs have modal responses. Specifically, runSimulation results either in a status response or in a result (for sequential executions only).
Ops can be queued or pending (already sent to job_agent). API replies are always synchronous.
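The Op lifecycle described above can be pictured as a small state machine. The following is a minimal sketch, not Sirepo's implementation: the Op and OpState names, the DONE state, and the final flag are assumptions made for illustration.

```python
import dataclasses
import enum


class OpState(enum.Enum):
    QUEUED = "queued"    # held by the supervisor, not yet sent
    PENDING = "pending"  # already sent to job_agent, replies expected
    DONE = "done"        # hypothetical terminal state after the last reply


@dataclasses.dataclass
class Op:
    name: str
    state: OpState = OpState.QUEUED
    replies: list = dataclasses.field(default_factory=list)

    def send(self):
        # An Op may only be sent once, from the queued state.
        if self.state is not OpState.QUEUED:
            raise RuntimeError(f"{self.name} is not queued")
        self.state = OpState.PENDING

    def receive(self, reply, final):
        # An Op has at least one reply and sometimes several status updates.
        if self.state is not OpState.PENDING:
            raise RuntimeError(f"{self.name} is not pending")
        self.replies.append(reply)
        if final:
            self.state = OpState.DONE


op = Op("OP_RUN")
op.send()
op.receive({"state": "running"}, final=False)    # repeated status update
op.receive({"state": "completed"}, final=True)   # last reply
```

Tracking replies on the Op itself is one way to make the "Ops are a state to be managed" point concrete: the supervisor has to remember each Op until its last reply arrives.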
Issue #2164 caused us to step back and evaluate the state transitions. The following maps state transitions due to API events when a job has a particular Op pending or queued.
OP_ANALYSIS is a single entry point, which results in a subprocess call to job_cmd with an entry point for each jobCmd: cancel, compute, get_simulation_frame, prepare_simulation, sbatch_status, sequential_result. These will be referred to by name with the prefix job_cmd. to imply OP_ANALYSIS. More confusing is that OP_RUN results in job_cmd.compute and sometimes also a job_cmd.sbatch_status, but for state transitions in the Supervisor, a pending OP_RUN is sufficient.
The tables below link to actions that are defined further down.
Op | Action |
---|---|
None | send_or_not_found |
OP_CANCEL | queue |
OP_RUN | send_or_not_found |
job_cmd.get_data_file | queue |
job_cmd.get_simulation_frame | queue |
job_cmd.sequential_result | queue |
Special API: HTTP request to return state of supervisor synchronously, independent of a particular job.
Op | Action |
---|---|
None | reply_canceled |
OP_CANCEL | reply_canceled |
OP_RUN | cancel |
job_cmd.get_data_file | reply_normally |
job_cmd.get_simulation_frame | reply_normally |
job_cmd.sequential_result | discard_and_cancel |
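A table such as the one immediately above can be read as a lookup from the Op that is currently pending or queued to the name of an action. The sketch below encodes that one table as a dictionary; the ACTIONS and action_for names are hypothetical and are not taken from the Sirepo source.

```python
# Keys are the pending or queued Op (None means no Op is outstanding);
# values are the action names from the table above.
ACTIONS = {
    None: "reply_canceled",
    "OP_CANCEL": "reply_canceled",
    "OP_RUN": "cancel",
    "job_cmd.get_data_file": "reply_normally",
    "job_cmd.get_simulation_frame": "reply_normally",
    "job_cmd.sequential_result": "discard_and_cancel",
}


def action_for(pending_op):
    """Return the action name for the Op that is pending or queued."""
    return ACTIONS[pending_op]


print(action_for("OP_RUN"))
```

Keeping each table as data makes it easy to check that every Op has exactly one action per API.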
Op | Action |
---|---|
None | run_or_result |
OP_CANCEL | queue |
OP_RUN | status_or_collision |
job_cmd.get_data_file | queue_or_reply |
job_cmd.get_simulation_frame | queue_or_reply |
job_cmd.sequential_result | queue_or_reply |
Op | Action |
---|---|
None | status_or_result |
OP_CANCEL | status |
OP_RUN | status |
job_cmd.get_data_file | status |
job_cmd.get_simulation_frame | status |
job_cmd.sequential_result | status |
All other APIs should be queued until the sbatchLogin completes.
If sbatchLogin comes in while there are ops pending or queued, reply with an error, because the agent is already alive.
Op | Action |
---|---|
None | send_or_not_found |
OP_CANCEL | queue |
OP_RUN | send_or_not_found |
job_cmd.get_data_file | queue |
job_cmd.get_simulation_frame | queue |
job_cmd.sequential_result | queue |
If the computeJobHash and computeJobSerial are valid or the job has already exited, set the state to canceled. Send an OP_CANCEL to the agent and wait for the reply.
If an OP_RUN is in the queue or pending, discard_and_cancel.
reply_canceled, and discard the normal reply when it comes in.
If the op is queued, discard the op, and reply_canceled.
Put the translated op in the driver's queue (ops_pending_send). e.g. runCancel enqueues OP_CANCEL.
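The queue action can be sketched as follows. Only the ops_pending_send name and the runCancel to OP_CANCEL translation come from the text; the Driver class, its methods, and the API_TO_OP table are hypothetical.

```python
import collections

# From the text: runCancel is translated to OP_CANCEL. Other APIs are omitted.
API_TO_OP = {"runCancel": "OP_CANCEL"}


class Driver:
    """Hypothetical stand-in for a supervisor driver."""

    def __init__(self):
        # Ops waiting to be sent to the job_agent, in arrival order.
        self.ops_pending_send = collections.deque()

    def queue(self, api_name):
        # Translate the API call to an Op and put it at the back of the queue.
        op = API_TO_OP[api_name]
        self.ops_pending_send.append(op)
        return op

    def next_to_send(self):
        # Take the oldest queued Op, or None if nothing is queued.
        if self.ops_pending_send:
            return self.ops_pending_send.popleft()
        return None


driver = Driver()
driver.queue("runCancel")
```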
If the request has a valid hash and computeJobSerial or force, queue the Op.
Otherwise, reply collision.
Reply canceled (even if the hash/serial is invalid).
If parallel, wait for the Op to complete and return the reply as received (no need to check hash).
A cancel will not discard a queued API bound to this action.
If sequential, discard_and_cancel.
If the computeJobSerial and computeJobHash are valid, reply immediately with the status of the ComputeJob.
Otherwise, reply missing.
If running or pending and force or the computeJobHash/Serial does not match, reply collision. The job is already running from another GUI.
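The collision condition above combines two tests. The sketch below assumes the grouping (running or pending) and (force or mismatch), which is one reading of the sentence; the function and parameter names are hypothetical.

```python
def is_collision(status, force, hash_and_serial_match):
    """True when the job is busy and this request would start a different run."""
    # Assumed grouping: the job must be busy, and the request must either
    # force a rerun or carry a computeJobHash/Serial that does not match.
    busy = status in ("running", "pending")
    return busy and (force or not hash_and_serial_match)


# A second GUI with a stale hash asks to run while the job is running.
print(is_collision("running", force=False, hash_and_serial_match=False))
```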
If force or mismatch or not in completed state, run the simulation.
Otherwise, status_or_result.
If force or computeJobHash/Serial does not match, send an OP_RUN and reply with status.
Otherwise, status_or_result.
Send the translated op and wait for the reply.
If parallel, the computeJobHash/Serial matches, and the status is running, pending, canceled, or completed:
If sequential and the status is completed, send, unless there is an OP_ANALYSIS ahead of this API, in which case queue.
Otherwise, reply not found, because the job is still running or does not match.
If the computeJobSerial and computeJobHash are valid, reply status for parallel jobs, else send job_cmd.sequential_result and reply with value received.
Otherwise, reply missing.
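The rule above has three outcomes, which the following sketch spells out as a decision function. It returns a description of the outcome rather than doing any work; the function name, parameters, and return strings are hypothetical.

```python
def status_or_result(hash_and_serial_valid, is_parallel):
    """Decide the outcome for a request with no Op pending or queued."""
    if not hash_and_serial_valid:
        # The request does not refer to the current ComputeJob.
        return "reply missing"
    if is_parallel:
        # Parallel jobs only ever get a status reply here.
        return "reply status"
    # Sequential jobs fetch the result and pass it back to the caller.
    return "send job_cmd.sequential_result"


print(status_or_result(hash_and_serial_valid=True, is_parallel=False))
```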
Sequential jobs begin running immediately when a user visits the simulation page. The computeModel for a sequential job is a report. On the frontend, they are known as transient.
Parallel jobs run when a user presses a "Start new simulation" button. The computeModel for a parallel job is an animation. On the frontend, they are known as persistent.
License: http://www.apache.org/licenses/LICENSE-2.0.html
Copyright ©️ 2015–2020 RadiaSoft LLC. All Rights Reserved.