
Debugging the job system


Debugging the job system involves looking at logs for the Flask server, job supervisor, and agent(s). Agents in the local, Docker, and Sbatch execution modes all write their log messages to different locations, but interpretation of the logs is generally the same.

Below, you'll find:

  1. Instructions for debugging the job system independent of the type of agents running, and
  2. Specific instructions for the different types of agents.

General debugging of the job system

Regardless of the type of agent in use, the Flask server and the job supervisor are debugged in the same manner. The keys you'll use to track the flow of messages across the system are the same: ids that identify a compute job (computeJid), a user (uid), a location where a simulation is run (runDir), an operation (opId), and an agent (agentId).

Across the system we try to use the pkdebug_str() method (ex for a ServerReq) to format the string representation of objects. This method is similar to __str__ except that it passes through our truncation and redaction logging system.

Important ids

uid (user id)

Every user has a globally unique 8-character id (ex vn9m3eZW).

computeJid (compute job id)

Each simulation corresponds to a globally unique computeJid. The supervisor keeps a record of every computeJid and uses them as the first level of separation for requests. Compute jids are composed of a user id, simulation id, and compute model, each separated by a dash (ex vn9m3eZW-i9btqxdR-fluxReport).
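
For quick reference, the parts can be pulled back out of a computeJid with plain bash; this is only an illustration, not a Sirepo command:

# Split the example jid from above into its parts.
jid=vn9m3eZW-i9btqxdR-fluxReport
IFS=- read -r uid sim_id compute_model <<< "$jid"
echo "uid=$uid simulationId=$sim_id computeModel=$compute_model"
# prints: uid=vn9m3eZW simulationId=i9btqxdR computeModel=fluxReport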

agentId

AgentIds uniquely identify an instance of an agent. They are assigned to each agent for the life of the job supervisor; even if an agent is killed (or crashes) and is restarted, it will still have the same agentId. Every incoming message has an agentId, letting the supervisor know which agent the message is from.

AgentIds are a secret. They uniquely identify an agent and are used to route incoming messages. If they were exposed, anyone with the id could impersonate that agent.

AgentIds are randomly generated 32-character strings (ex VrwfCGHEGFf9mHdIEwYKsTmVy1XSBqI3), but in log messages we truncate the id to the first four characters (ex VrwfCGHEGFf9mHdIEwYKsTmVy1XSBqI3 becomes Vrwf), which is enough for identification in logs.

opId (operation id)

OpIds uniquely identify an instance of an op and are the primary key used to identify messages between the supervisor and the agent. An op is an "operation" (ex run a simulation) which may result in one or more messages (ex one message says the simulation was started and a subsequent message carries its status). Regardless of how many messages there are, they will all have the same opId.

OpIds are similar to agentIds in that they are 32-character uuids, truncated to 6 characters in the logs.

runDir (run directory)

Run dirs are filesystem locations in the db where a simulation is run. In the Docker and local cases the runDirs are on the host the simulation is run on. In the sbatch case they are on the sbatch host. You can examine the runDir to see the files that have been used in the simulation (ex in.json, parameters.py). A runDir has a few components that can be useful to pull information out of.

For example, imagine this runDir: /home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/srw/ZaU3VvUy/multiElectronAnimation. You will find the user id (.../user/<user-id>; EPRSqtUW in the example), the simulation type (.../<user-id>/<simulation-type>; srw in the example), the simulation id (.../<simulation-type>/<simulation-id>; ZaU3VvUy in the example), and the compute model (.../<simulation-id>/<compute-model>; multiElectronAnimation in the example). With these keys you can find other pieces of information about the job (ex computeJid).
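
A plain-bash sketch of that last point (again, not a Sirepo command; the path is the example above):

# Reassemble the computeJid from the example runDir.
run_dir=/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/srw/ZaU3VvUy/multiElectronAnimation
uid=$(basename "$(dirname "$(dirname "$(dirname "$run_dir")")")")
sid=$(basename "$(dirname "$run_dir")")
model=$(basename "$run_dir")
echo "${uid}-${sid}-${model}"
# prints: EPRSqtUW-ZaU3VvUy-multiElectronAnimation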

The Flask server

The Flask server is usually the least interesting/buggy place to look when debugging the job system. The bulk of its work is to translate requests into properly formatted messages for the job supervisor. It is also the place where some keys used throughout the system (ex computeJid and uid) are put on the request.

In the logs you will see messages like:

api=api_simulationFrame runDir=/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/srw/ZaU3VvUy/multiElectronAnimation

These tell you what api was requested (ex api_simulationFrame) and the runDir (ex /home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/srw/ZaU3VvUy/multiElectronAnimation) in use.
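
In development, if you have the server's output captured in a file (flask.log below is just an assumed redirect of stdout, not something Sirepo creates for you), a grep on the uid narrows things down quickly:

# Show all api requests whose runDir belongs to the example user EPRSqtUW.
grep 'api=' flask.log | grep 'user/EPRSqtUW'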

The job supervisor

The job supervisor is the broker of all messages in the system. It is likely the most important place to read logs when debugging the job system.

Supervisor db files

The supervisor uses the computeJid (compute job id) as the "primary key" for keeping track of each simulation.

The supervisor keeps a record of every computeJid and metadata about it (ex start time, status). It persists this data to the database in json files; the filename for each file is the computeJid. The files also contain a list of historic runs of the computeJid with a few of the most important metadata fields.

The files are persisted in <srdb-root>/supervisor-job/<compute-jid>.json.

These files are most useful when you want to know what information the supervisor has about a computeJid (ex the status of the simulation).
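
For example, to pretty-print the record for the compute job used in the walkthrough further down (the srdb root here is the development default that appears elsewhere on this page; adjust for your install):

# Pretty-print the supervisor's record for one computeJid.
python -m json.tool /home/vagrant/src/radiasoft/sirepo/run/supervisor-job/EPRSqtUW-IVh7w74d-heightWeightReport.json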

Supervisor log messages

The supervisor log messages are crucial in debugging the system. In development they are written to stdout and in production they are written to journald.

In the journald case you can read the logs using journalctl -u sirepo_job_supervisor. This is configured using rsconf.
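
A couple of common variations (standard journalctl flags, nothing Sirepo-specific):

# Follow the supervisor log live.
journalctl -u sirepo_job_supervisor -f
# Or limit to a recent window and one user.
journalctl -u sirepo_job_supervisor --since '1 hour ago' | grep EPRS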

Agents

Each type of agent writes its logs to a different location.

Local

Local agents write their logs to the same stdout stream as the supervisor. This makes local development easier because the supervisor and agent logs are interleaved in the correct order.

Docker

Docker agents write their logs to journald in both development and production. To find the logs you need two pieces of information. First, you need the container name. This is in the form srj-<s || p>-<user-id> (s stands for sequential and p for parallel). Second, you need the host the docker agent is running on. In development this is always your local machine, so the host isn't important, but in production we run the agents on multiple machines and you need to ssh into the specific host to get the logs. To find the host, search the supervisor logs for the uid and the keywords "running" and "run" (ex EPRS.*running.* run ). This will yield a log line like:

DockerDriver(a=fgTm k=parallel u=EPRS [_Op(run, j2FX)]) running: docker --host=tcp://localhost.localdomain:2376 --tlsverify --tlscacert=/home/vagrant/src/radiasoft/sirepo/run/docker_tls/localhost.localdomain/cacert.pem --tlscert=/home/vagrant/src/radiasoft/sirepo/run/docker_tls/localhost.localdomain/cert.pem --tlskey=/home/vagrant/src/radiasoft/sirepo/run/docker_tls/localhost.localdomain/key.pem run --attach=stdin --cpus=2 --init --interactive --memory=1g --name=srj-p-EPRSqtUW --network=host --rm --ulimit=core=0 --ulimit=nofile=1024 --user=1000 --volume=/home/vagrant/src:/home/vagrant/src:ro --volume=/home/vagrant/.pyenv:/home/vagrant/.pyenv:ro --volume=/home/vagrant/.local:/home/vagrant/.local:ro --volume=/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW:/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW radiasoft/sirepo:dev /bin/bash -l

This shows the docker run command that was used to start the agent. The --host flag will show you the host that the container is being run on (ex --host=tcp://localhost.localdomain:2376). You'll need to ssh into that host to access the logs.

Once on the host you can retrieve logs for a container from journald using the container name. For example, journalctl CONTAINER_NAME=srj-p-EPRSqtUW.
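
Putting the Docker steps together, a typical production session looks roughly like this (the host and container names are the examples from above; substitute your own):

# 1. On the supervisor host: find which machine is running the user's agent.
journalctl -u sirepo_job_supervisor | grep -E 'EPRS.*running.* run '
# 2. ssh to the host named in the --host flag of the matched line.
ssh localhost.localdomain
# 3. Read (or follow) the container's logs by name.
journalctl CONTAINER_NAME=srj-p-EPRSqtUW -f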

Sbatch (NERSC)

Sbatch is the trickiest type of agent to debug. Because agents run as the user who requested them, we don't have access to their logs. If you are debugging a job you ran as yourself on NERSC, you can ssh into Cori and view your own log files. The sbatch agent's working directory is $SCRATCH/sirepo-<env> (for example, $SCRATCH/sirepo-dev). A log file ($SCRATCH/sirepo-<env>/job_agent.log) is created there and logs are written to it.

When an sbatch agent is started, it checks for an existing job_agent.log. If there is one, it copies it over to the supervisor db as a file named <datetime>-remote.log. In addition, if any output is produced while starting the agent (there shouldn't be, but there may be on error), that output is stored in a file named <datetime>-start.log. Both of these files are placed in <srdb-root>/user/<user-id>/agent-sbatch/<sbatch-hostname>/<datetime>-<remote||start>.log.
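
So, roughly (the hostname and channel are assumptions for illustration; use your own NERSC login and environment):

# On the sbatch host, as yourself: watch the agent's live log.
ssh cori.nersc.gov
tail -f $SCRATCH/sirepo-dev/job_agent.log
# Back on the supervisor host: the copies saved when the agent was (re)started.
ls <srdb-root>/user/EPRSqtUW/agent-sbatch/<sbatch-hostname>/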

Tracking the flow of a request

Below is a walkthrough of a run-simulation request and how to track its progress across the system. For this demo I'll use a local agent; for Docker and sbatch the only difference is that you'll need to locate the agent logs for those agent types as described above. (A grep sketch for pulling the whole exchange out of one log follows the walkthrough.)

  1. An incoming request to the Flask server.
api=api_runSimulation runDir=/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/myapp/IVh7w74d/heightWeightReport
  2. An incoming request to the job supervisor (runDir and computeJid match the Flask request).
Feb 24 21:39:11 14458     0 sirepo/pkcli/job_supervisor.py:173:_incoming class=<class 'sirepo.job_supervisor.ServerReq'> content={'analysisModel': 'heightWeightReport', 'api': 'api_runSimulation', 'computeJid': 'LoIEQ6fa-Mtk0hZyj-heightWeightReport', 'computeJobHash': 'ae629e4c20b23a65c1ca5bcea480334a', 'computeJobSerial': 0, 'computeModel': 'heightWeightReport', 'data': {<SNIP>}, 'isParallel': False, 'jobRunMode': 'sequential', 'reqId': 'ZMCMHyAFVrgMResV45Tm98PfUXTpas0w', 'runDir': '/home/vagrant/src/radiasoft/sirepo/run/user/LoIEQ6fa/myapp/Mtk0hZyj/heightWeightReport', 'serverSecret':<REDACTED>, 'simulationId': 'Mtk0hZyj', 'simulationType': 'myapp', 'simulation_lib_dir': '/home/vagrant/src/radiasoft/sirepo/run/user/LoIEQ6fa/myapp/lib', 'uid': 'LoIEQ6fa', 'userDir': '/home/vagrant/src/radiasoft/sirepo/run/user/LoIEQ6fa'}
  3. The creation of a ServerReq (computeJid matches the incoming request).
ServerReq(api_runSimulation, EPRSqtUW-IVh7w74d-heightWeightReport)
  4. The creation of an op (runDir matches).
_Op(run, pb5u) runDir=/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/myapp/IVh7w74d/heightWeightReport
  5. The start of an agent (uid matches).
LocalDriver(a=WUt4 k=sequential u=EPRS [])
  6. An incoming message from the agent letting us know that it has started (agentId matches).
LocalDriver(a=WUt4 k=sequential u=EPRS [_Op(run, pb5u)]) opName=alive o=None
  7. The sending of the run op that was created above (opId matches).
LocalDriver(a=WUt4 k=sequential u=EPRS [_Op(run, pb5u)]) _Op(run, pb5u) runDir=/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/myapp/IVh7w74d/heightWeightReport
  8. The agent receiving the op (again, the opId matches).
opName=run o=pb5u runDir=/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/myapp/IVh7w74d/heightWeightReport
  9. The agent replying to the run op (agentId and opId match).
LocalDriver(a=WUt4 k=sequential u=EPRS [_Op(run, pb5u)]) opName=run o=pb5u
  10. The start of the job_cmd in the runDir (runDir matches).
cmd=('/bin/bash', '-l') stdin=b"source $HOME/.bashrc
set -e
mkdir -p '.'
cd '.'
pyenv shell py2
export SIREPO_MPI_CORES='1'
export SIREPO_SIM_DATA_LIB_FILE_URI=''
export PYKERN_PKCONFIG_CHANNEL='dev'
export PYKERN_PKDEBUG_CONTROL='.*'
export PYKERN_PKDEBUG_OUTPUT=''
export PYKERN_PKDEBUG_REDIRECT_LOGGING=''
export PYKERN_PKDEBUG_WANT_PID_TIME='1'
export SIREPO_FEATURE_CONFIG_JOB='1'
export PYTHONPATH=''
export PYTHONSTARTUP=''
export PYTHONUNBUFFERED='1'
export SIREPO_AUTH_LOGGED_IN_USER='EPRSqtUW'
export SIREPO_JOB_VERIFY_TLS='False'
export SIREPO_JOB_MAX_MESSAGE_SIZE='100000000'
export SIREPO_JOB_PING_INTERVAL_SECS='120'
export SIREPO_JOB_PING_TIMEOUT_SECS='240'
export SIREPO_SRDB_ROOT='/home/vagrant/src/radiasoft/sirepo/run'
exec 'sirepo' 'job_cmd' '/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/myapp/IVh7w74d/heightWeightReport/in-rqBKJ5c09rBg3xVqTRV6WXWqaL9gDwHn.json'
"
  11. The agent reading from the job_cmd that the run is complete (jid, op_id, and run_dir match).
cmd=_Cmd(jid=EPRSqtUW-IVh7w74d-heightWeightReport o=pb5u job_cmd=compute run_dir=/home/vagrant/src/radiasoft/sirepo/run/user/EPRSqtUW/myapp/IVh7w74d/heightWeightReport) stdout=b'{"state":"completed"
  12. The message received by the supervisor that the run is complete (agentId and opId match).
LocalDriver(a=WUt4 k=sequential u=EPRS [_Op(run, pb5u)]) opName=run o=pb5u
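
Once a walkthrough like this has given you the ids, a single grep is usually enough to reassemble the whole exchange from one log. A sketch, assuming the interleaved development output was redirected to dev.log:

# Every supervisor/agent message for this run carries the opId pb5u;
# the simulation id picks up the Flask request and runDir lines as well.
grep -E 'pb5u|IVh7w74d' dev.log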