Debugging the job system
Debugging the job system involves looking at logs for the Flask server, the job supervisor, and the agent(s). Agents in the local, Docker, and sbatch execution modes all write their log messages to different locations, but interpretation of the logs is generally the same.
Below, you'll find:
- Instructions for debugging the job system independent of the type of agents running, and
- Specific instructions for the different types of agents.
Regardless of the type of agent in use, the Flask server and the job supervisor are debugged in the same manner. The keys you'll use to track the flow of messages across the system are also the same. These ids identify a compute job (computeJid), a user (uid), a location where a simulation is run (runDir), an operation (opId), and an agent (agentId).
Across the system we try to use the pkdebug_str() method (ex for a ServerReq) to format the string representation of objects. This method is similar to __str__ except that it passes through our truncation and redaction logging system.
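As a rough sketch of that convention (the class and fields here are made up for illustration; see sirepo.job_supervisor for the real implementations, which carry more state):

```python
# Hypothetical illustration only; not the real ServerReq from sirepo/job_supervisor.py.
class ExampleReq:
    def __init__(self, api, compute_jid):
        self.api = api
        self.compute_jid = compute_jid

    def pkdebug_str(self):
        # Compact, log-friendly representation, similar in spirit to what shows
        # up in supervisor logs, ex "ExampleReq(api_runSimulation, vn9m3eZW-i9btqxdR-fluxReport)"
        return "{}({}, {})".format(type(self).__name__, self.api, self.compute_jid)
```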
Every user has a globally unique 8-character id (ex vn9m3eZW).
Each simulation corresponds to a globally unique computeJid. The supervisor keeps a record of every computeJid and uses them as the first level of separation for requests. Compute jids are composed of a user id, simulation id, and compute model, each separated by a dash (ex vn9m3eZW-i9btqxdR-fluxReport).
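Because of that structure, you can split a computeJid back into its keys when reading logs. A minimal sketch (the helper name is ours, not part of Sirepo):

```python
def split_compute_jid(compute_jid):
    """Split a computeJid into its uid, simulation id, and compute model."""
    uid, sim_id, compute_model = compute_jid.split("-", 2)
    return uid, sim_id, compute_model


print(split_compute_jid("vn9m3eZW-i9btqxdR-fluxReport"))
# ('vn9m3eZW', 'i9btqxdR', 'fluxReport')
```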
AgentIds uniquely identify an instance of an agent. They are assigned to each agent for the life of the job supervisor; even if an agent is killed (or crashes) and is restarted, it will still have the same agentId. Every incoming message carries an agentId, letting the supervisor know which agent the message is from.
AgentIds are a secret. They uniquely identify an agent and are used to route incoming messages; if they were exposed, anyone with the id could impersonate that agent.
AgentIds are 32-character randomly generated strings (ex VrwfCGHEGFf9mHdIEwYKsTmVy1XSBqI3), but in log messages we truncate the id to the first four characters (ex VrwfCGHEGFf9mHdIEwYKsTmVy1XSBqI3 becomes Vrwf), which is enough for identification in logs.
OpIds uniquely identify an instance of an op. OpIds are the primary key used to identify messages between the supervisor and the agent. An op is an "operation" (ex run a simulation) which may result in one or more messages (ex the simulation is started, and a subsequent message reports its status). Regardless of how many messages there are, they all carry the same opId.
OpIds are similar to agentIds in that they are 32-character uuids that are truncated to 6 characters in the logs.
RunDirs are filesystem locations in the db where a simulation is run. In the docker and local cases the runDir is on the host the simulation runs on; in the sbatch case it is on the sbatch host. You can examine the runDir to see the files that have been used in the simulation (ex in.json, parameters.py). A runDir has a few components that are useful to pull information out of.
For example, imagine this runDir: /home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/srw/ZaU3VvUy/multiElectronAnimation. You will find the user id (.../user/<user-id>; EPRSqtUW in the example), the simulation type (.../<user-id>/<simulation-type>; srw in the example), the simulation id (.../<simulation-type>/<simulation-id>; ZaU3VvUy in the example), and the compute model (.../<simulation-id>/<compute-model>; multiElectronAnimation in the example). With these keys you can find other pieces of information about the job (ex the computeJid).
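A minimal sketch of pulling these keys out of a runDir and rebuilding the computeJid (the helper name is ours, not part of Sirepo):

```python
from pathlib import PurePosixPath


def parse_run_dir(run_dir):
    """Extract uid, simulation type, simulation id, and compute model from a runDir."""
    # .../user/<user-id>/<simulation-type>/<simulation-id>/<compute-model>
    uid, sim_type, sim_id, compute_model = PurePosixPath(run_dir).parts[-4:]
    return dict(uid=uid, sim_type=sim_type, sim_id=sim_id, compute_model=compute_model)


d = parse_run_dir(
    "/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/srw/ZaU3VvUy/multiElectronAnimation"
)
# The computeJid is the uid, simulation id, and compute model joined by dashes.
print("-".join((d["uid"], d["sim_id"], d["compute_model"])))
# EPRSqtUW-ZaU3VvUy-multiElectronAnimation
```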
The Flask server is usually the least interesting/buggy place to look when debugging the job system. The bulk of its work is to translate requests into properly formatted messages for the job supervisor. It is the place where some keys used throughout the system (ex computeJid and uid) are put on the request.
In the logs you will see messages like:
api=api_simulationFrame runDir=/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/srw/ZaU3VvUy/multiElectronAnimation
These tell you what api was requested (ex api_simulationFrame) and the runDir in use (ex /home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/srw/ZaU3VvUy/multiElectronAnimation).
The job supervisor is the broker of all messages in the system. It is likely the most important place to read logs when debugging the job system.
For each simulation the supervisor uses the computeJid (compute job id) as the "primary key" to keep track of the simulation.
The supervisor keeps a record of every computeJid and metadata about it (ex start time, status). It persists this data to the database in json files; the filename for each file is the computeJid. The files also contain a list of historic runs of the computeJid with a few of the most important metadata fields.
The files are persisted in <srdb-root>/supervisor-job/<compute-jid>.json.
These files are most useful when you want to know what information the supervisor has about a computeJid (ex the status of the simulation).
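A minimal sketch of loading that record (the srdb root below is the usual dev location; the field names inside the json aren't documented here, so the sketch just prints the keys):

```python
import json
from pathlib import Path

# Assumed srdb root for a dev install; substitute your own <srdb-root>.
SRDB_ROOT = Path("/home/vagrant/src/radiasoft/sirepo/run")


def supervisor_record(compute_jid):
    """Load the supervisor's json record for a computeJid."""
    return json.loads((SRDB_ROOT / "supervisor-job" / (compute_jid + ".json")).read_text())


r = supervisor_record("EPRSqtUW-IVh7w74d-heightWeightReport")
# Print the top-level keys to see which metadata fields (status, times, history) are stored.
print(sorted(r.keys()))
```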
The supervisor log messages are crucial for debugging the system. In development they are written to stdout, and in production they are written to journald.
In the journald case you can read the logs using journalctl -u sirepo_job_supervisor. This is configured using rsconf.
Each agent type writes its logs to a different location.
Local agents write their logs to the same stdout stream as the supervisor. This makes local development easier because the supervisor and agent logs are interleaved in order.
Docker agents write their logs to journald in development and production. To find the logs you need two pieces of information. First, you need the container name, which is of the form srj-<s || p>-<user-id> (s stands for sequential and p for parallel). Second, you need the host the docker agent is running on. In development this is always your local machine, so the host isn't important, but in production we run the agents on multiple machines and you need to ssh into the specific host to get the logs. To find the host, search the supervisor logs for the uid and the keywords "running" and "run" (ex EPRS.*running.* run). This will yield a log line like:
DockerDriver(a=fgTm k=parallel u=EPRS [_Op(run, j2FX)]) running: docker --host=tcp://localhost.localdomain:2376 --tlsverify --tlscacert=/home/vagrant/src/radiasoft/sirepo/run/docker_tls/localhost.localdomain/cacert.pem --tlscert=/home/vagrant/src/radiasoft/sirepo/run/docker_tls/localhost.localdomain/cert.pem --tlskey=/home/vagrant/src/radiasoft/sirepo/run/docker_tls/localhost.localdomain/key.pem run --attach=stdin --cpus=2 --init --interactive --memory=1g --name=srj-p-EPRSqtUW --network=host --rm --ulimit=core=0 --ulimit=nofile=1024 --user=1000 --volume=/home/vagrant/src:/home/vagrant/src:ro --volume=/home/vagrant/.pyenv:/home/vagrant/.pyenv:ro --volume=/home/vagrant/.local:/home/vagrant/.local:ro --volume=/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW:/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW radiasoft/sirepo:dev /bin/bash -l
This shows the docker run command that was used to start the agent. The --host flag shows the host the container is run on (ex --host=tcp://localhost.localdomain:2376). You'll need to ssh into that host to access the logs.
Once on the host you can retrieve logs for a container using journald and the container name. For example, journalctl CONTAINER_NAME=srj-p-EPRSqtUW.
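A small sketch that builds the container name and journalctl command from a uid (the helper is ours; run the resulting command on the host you found above):

```python
def docker_agent_logs_cmd(uid, parallel=False):
    """Build the journalctl command for a docker agent's logs.

    Container names are srj-p-<user-id> for parallel agents and
    srj-s-<user-id> for sequential agents.
    """
    return "journalctl CONTAINER_NAME=srj-{}-{}".format("p" if parallel else "s", uid)


print(docker_agent_logs_cmd("EPRSqtUW", parallel=True))
# journalctl CONTAINER_NAME=srj-p-EPRSqtUW
```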
Sbatch agents are the trickiest to debug. Because agents run as the user who requested the agent, we don't have access to their logs. If you are debugging a job you ran as yourself on NERSC, you can ssh into Cori and view your own log files. The sbatch agent's working directory is $SCRATCH/sirepo-<env> (for example, $SCRATCH/sirepo-dev). A log file ($SCRATCH/sirepo-<env>/job_agent.log) will be created and logs written to it.
When an sbatch agent is started, it checks whether there is an existing job_agent.log. If there is one, it copies it over to the supervisor db and writes it to a file named <datetime>-remote.log. In addition, if there is any output created when starting the agent (there shouldn't be, but there might be on error), it stores that output in a file named <datetime>-start.log. Both of these files are placed in <srdb-root>/user/<user-id>/agent-sbatch/<sbatch-hostname>/<datetime>-<remote||start>.log.
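A minimal sketch for listing those copied-over logs (the srdb root and sbatch hostname below are assumptions for illustration; substitute your own):

```python
from pathlib import Path

# Assumed srdb root for a dev install.
SRDB_ROOT = Path("/home/vagrant/src/radiasoft/sirepo/run")


def sbatch_agent_logs(uid, sbatch_hostname):
    """List the <datetime>-remote.log and <datetime>-start.log files for a user."""
    d = SRDB_ROOT / "user" / uid / "agent-sbatch" / sbatch_hostname
    return sorted(d.glob("*-remote.log")) + sorted(d.glob("*-start.log"))


# Hypothetical hostname used only as an example.
for p in sbatch_agent_logs("EPRSqtUW", "cori.nersc.gov"):
    print(p)
```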
Below is a walkthrough of a run-simulation request and how to track its progress across the system. For this demo I'll use a local agent. For Docker and sbatch agents the only difference is that you'll need to locate the logs for those specific agent types as described above.
- An incoming request to the Flask server.
api=api_runSimulation runDir=/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/myapp/IVh7w74d/heightWeightReport
- An incoming request to the job supervisor (the runDir and computeJid match the Flask request).
Feb 24 21:39:11 14458 0 sirepo/pkcli/job_supervisor.py:173:_incoming class=<class 'sirepo.job_supervisor.ServerReq'> content={'analysisModel': 'heightWeightReport', 'api': 'api_runSimulation', 'computeJid': 'LoIEQ6fa-Mtk0hZyj-heightWeightReport', 'computeJobHash': 'ae629e4c20b23a65c1ca5bcea480334a', 'computeJobSerial': 0, 'computeModel': 'heightWeightReport', 'data': {<SNIP>}, 'isParallel': False, 'jobRunMode': 'sequential', 'reqId': 'ZMCMHyAFVrgMResV45Tm98PfUXTpas0w', 'runDir': '/home/vagrant/src/radiasoft/sirepo/run/user/LoIEQ6fa/myapp/Mtk0hZyj/heightWeightReport', 'serverSecret':<REDACTED>, 'simulationId': 'Mtk0hZyj', 'simulationType': 'myapp', 'simulation_lib_dir': '/home/vagrant/src/radiasoft/sirepo/run/user/LoIEQ6fa/myapp/lib', 'uid': 'LoIEQ6fa', 'userDir': '/home/vagrant/src/radiasoft/sirepo/run/user/LoIEQ6fa'}
- The creation of a ServerReq (computeJid matches incoming request).
ServerReq(api_runSimulation, EPRSqtUW-IVh7w74d-heightWeightReport)
- The creation of an op (runDir matches).
_Op(run, pb5u) runDir=/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/myapp/IVh7w74d/heightWeightReport
- The start of an agent (uid matches).
LocalDriver(a=WUt4 k=sequential u=EPRS [])
- An incoming message from the agent letting us know that it is started (agentId matches).
LocalDriver(a=WUt4 k=sequential u=EPRS [_Op(run, pb5u)]) opName=alive o=None
- The sending of the run op that was created above (opId matches).
LocalDriver(a=WUt4 k=sequential u=EPRS [_Op(run, pb5u)]) _Op(run, pb5u) runDir=/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/myapp/IVh7w74d/heightWeightReport
- The agent received the op (again opId matches).
opName=run o=pb5u runDir=/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/myapp/IVh7w74d/heightWeightReport
- The agent replying to the run op (agentId and opId match).
LocalDriver(a=WUt4 k=sequential u=EPRS [_Op(run, pb5u)]) opName=run o=pb5u
- The start of the job_cmd in the runDir (runDir matches).
cmd=('/bin/bash', '-l') stdin=b"source $HOME/.bashrc
set -e
mkdir -p '.'
cd '.'
pyenv shell py2
export SIREPO_MPI_CORES='1'
export SIREPO_SIM_DATA_LIB_FILE_URI=''
export PYKERN_PKCONFIG_CHANNEL='dev'
export PYKERN_PKDEBUG_CONTROL='.*'
export PYKERN_PKDEBUG_OUTPUT=''
export PYKERN_PKDEBUG_REDIRECT_LOGGING=''
export PYKERN_PKDEBUG_WANT_PID_TIME='1'
export SIREPO_FEATURE_CONFIG_JOB='1'
export PYTHONPATH=''
export PYTHONSTARTUP=''
export PYTHONUNBUFFERED='1'
export SIREPO_AUTH_LOGGED_IN_USER='EPRSqtUW'
export SIREPO_JOB_VERIFY_TLS='False'
export SIREPO_JOB_MAX_MESSAGE_SIZE='100000000'
export SIREPO_JOB_PING_INTERVAL_SECS='120'
export SIREPO_JOB_PING_TIMEOUT_SECS='240'
export SIREPO_SRDB_ROOT='/home/vagrant/src/radiasoft/sirepo/run'
exec 'sirepo' 'job_cmd' '/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/myapp/IVh7w74d/heightWeightReport/in-rqBKJ5c09rBg3xVqTRV6WXWqaL9gDwHn.json'
"
- The agent reading in from the job_cmd that the run is complete (jid, op_id, and run_dir match).
cmd=_Cmd(jid=EPRSqtUW-IVh7w74d-heightWeightReport o=pb5u job_cmd=compute run_dir=/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/myapp/IVh7w74d/heightWeightReport) stdout=b'{"state":"completed"
- The message received by the supervisor that the run is complete (agentId and opId match).
LocalDriver(a=WUt4 k=sequential u=EPRS [_Op(run, pb5u)]) opName=run o=pb5u
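Because the opId ties all of these messages together, a quick way to review a run after the fact is to filter the supervisor log for it. A minimal sketch (reads a saved log file; in production you would filter journalctl output instead):

```python
import sys


def grep_op(log_path, op_id):
    """Print every supervisor log line that mentions the (truncated) opId."""
    with open(log_path, errors="replace") as f:
        for line in f:
            if op_id in line:
                sys.stdout.write(line)


# Example: all of the messages for the run op in the walkthrough above.
# grep_op("supervisor.log", "pb5u")
```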
License: http://www.apache.org/licenses/LICENSE-2.0.html
Copyright ©️ 2015–2020 RadiaSoft LLC. All Rights Reserved.