
Debugging the job system



Debugging the job system involves looking at logs for the Flask server, the job supervisor, and the agent(s). Agents in the local, Docker, and Sbatch execution modes all write their log messages to different locations, but interpreting the logs is generally the same. Below you'll find instructions for debugging the job system that apply regardless of which type of agent is running, followed by specific instructions for the different types of agents.

General debugging of the job system

Regardless of the type of agent in use, the Flask server and the job supervisor are debugged in the same manner. In addition, the keys you'll use to track the flow of messages across the system are the same. These ids identify a compute job (computeJid), a user (uid), a location where a simulation is run (runDir), an operation (opId), and an agent (agentId).

Across the system we try to use the pkdebug_str() method (ex on a ServerReq) to format the string representation of objects. This method is similar to __str__ except that it passes through our truncation and redaction logging system.
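
As a rough illustration only (the constructor and fields below are assumptions for the sketch; the real ServerReq carries much more state), a class that participates in this convention might look like:

```python
# Hypothetical sketch: how an object in the job system might expose a
# pkdebug_str() method. The fields are illustrative only.
class ServerReq:
    def __init__(self, compute_jid, uid):
        self.compute_jid = compute_jid
        self.uid = uid

    def pkdebug_str(self):
        # Compact, log-friendly representation; the logging layer may further
        # truncate or redact values.
        return f"ServerReq(computeJid={self.compute_jid} uid={self.uid})"


print(ServerReq("vn9m3eZW-i9btqxdR-fluxReport", "vn9m3eZW").pkdebug_str())
```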

Important ids

uid (user id)

Every user has a globally unique 8 character id (ex vn9m3eZW).

computeJid (compute job id)

Each simulation corresponds to a globally unique computeJid. The supervisor keeps a record of every computeJid and uses them as the first level of separation for requests. Compute jids are composed of a user id, a simulation id, and a compute model, each separated by a dash (ex vn9m3eZW-i9btqxdR-fluxReport.json).
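
If you need to pull the pieces out of a computeJid, a small helper like the following works. It is an illustrative sketch, not part of Sirepo, and it strips a trailing .json in case you are starting from a supervisor db filename:

```python
# Illustrative helper: split a computeJid into its three dash-separated parts.
def parse_compute_jid(jid):
    if jid.endswith(".json"):
        jid = jid[: -len(".json")]
    uid, sim_id, compute_model = jid.split("-", 2)
    return {"uid": uid, "simulationId": sim_id, "computeModel": compute_model}


print(parse_compute_jid("vn9m3eZW-i9btqxdR-fluxReport.json"))
# {'uid': 'vn9m3eZW', 'simulationId': 'i9btqxdR', 'computeModel': 'fluxReport'}
```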

agentId

Agent ids uniquely identify an instance of an agent and are assigned for the life of the system: even if an agent is killed (or crashes) and is restarted, it will still have the same agentId. Every incoming message carries an agentId, letting the supervisor know which agent the message is from.

Agent ids are 32-character uuids (ex VrwfCGHEGFf9mHdIEwYKsTmVy1XSBqI3), but in log messages we truncate the id to the first six characters (ex VrwfCGHEGFf9mHdIEwYKsTmVy1XSBqI3 becomes VrwfCG), which is enough for identification in the logs.

opId (operation id)

Op ids uniquely identify an instance of an op and are the primary key used to correlate messages between the supervisor and the agent. An op is an "operation" (ex run a simulation) which may result in one or more messages (ex one message reports that the simulation started and a subsequent message carries its status). However many messages an op produces, they all carry the same opId.

Op ids are similar to agent ids in that they are 32-character uuids that are truncated to 6 characters in the logs.
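
Because only the truncated prefix appears in the logs, a quick way to collect every message for one op is to filter on that prefix. A minimal sketch, assuming you have captured the supervisor output to a file (the path is an assumption):

```python
# Illustrative sketch: print every captured supervisor log line that mentions
# a given opId, matching on the 6-character prefix used in log output.
def lines_for_op(log_path, op_id):
    prefix = op_id[:6]
    with open(log_path) as f:
        return [line for line in f if prefix in line]


# "3r3MYv" is the truncated opId from the docker example later on this page.
for line in lines_for_op("supervisor.log", "3r3MYv"):
    print(line, end="")
```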

runDir (run directory)

Run dirs are filesystem locations in the db where a simulation is run. You can use the runDir to go to that location in the db and see the files that have been used in the simulation (ex in.json, parameters.py). A runDir has a few components from which useful information can be pulled.

For example, imagine this runDir: /home/vagrant/src/radiasoft/sirepo/run/user/vn9m3eZW/srw/i9btqxdR/multiElectronAnimation. In it you will find the user id (.../user/<user-id>; vn9m3eZW in the example), the simulation type (.../<user-id>/<simulation-type>; srw in the example), the simulation id (.../<simulation-type>/<simulation-id>; i9btqxdR in the example), and the compute model (.../<simulation-id>/<compute-model>; multiElectronAnimation in the example). With these keys you can find other pieces of information about the job (ex the computeJid).
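
A minimal sketch of pulling those keys out of a runDir and rebuilding the computeJid from the dash-separated format described above (the helper is illustrative, not part of Sirepo):

```python
# Illustrative helper: extract the useful keys from a runDir path and
# reconstruct the computeJid from them.
from pathlib import Path


def parse_run_dir(run_dir):
    parts = Path(run_dir).parts
    i = parts.index("user")
    uid, sim_type, sim_id, compute_model = parts[i + 1:i + 5]
    return {
        "uid": uid,
        "simulationType": sim_type,
        "simulationId": sim_id,
        "computeModel": compute_model,
        "computeJid": f"{uid}-{sim_id}-{compute_model}",
    }


print(parse_run_dir(
    "/home/vagrant/src/radiasoft/sirepo/run/user/vn9m3eZW/srw/i9btqxdR/multiElectronAnimation"
))
```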

The Flask server

The Flask server is usually the least interesting/buggy place to look when debugging the job system. The bulk of its work is to translate requests into properly formatted messages for the job supervisor. It is also the place where some keys used throughout the system (ex computeJid and uid) are put on the request.

In the logs you will see messages like:

Feb 24 15:50:33  3412   126 ../sirepo/job_api.py:182:_request api=api_simulationFrame runDir=/home/vagrant/src/radiasoft/sirepo/run/user/vn9m3eZW/srw/i9btqxdR/multiElectronAnimation

These tell you which api was requested (ex api_simulationFrame) and which runDir is in use (ex /home/vagrant/src/radiasoft/sirepo/run/user/vn9m3eZW/srw/i9btqxdR/multiElectronAnimation).
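
If you want to grab those two fields programmatically from a captured log line, a simple regex is enough (the line below is the example from above):

```python
# Illustrative sketch: extract the api and runDir fields from a job_api log line.
import re

line = (
    "Feb 24 15:50:33  3412   126 ../sirepo/job_api.py:182:_request "
    "api=api_simulationFrame "
    "runDir=/home/vagrant/src/radiasoft/sirepo/run/user/vn9m3eZW/srw/i9btqxdR/multiElectronAnimation"
)
m = re.search(r"api=(\S+)\s+runDir=(\S+)", line)
if m:
    print("api:", m.group(1))
    print("runDir:", m.group(2))
```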

The job supervisor

The job supervisor is the broker of all messages in the system. It is likely the most important place to read logs when debugging the job system.

Supervisor db files

For each simulation the supervisor uses the computeJid (compute job id) as the "primary key" to keep track of the simulation (see the computeJid section above for more details on this key; TODO link to doc).

The supervisor keeps a record of every computeJid and metadata about it (ex start time, status). It persists this data to the database in json files; the filename for each file is the computeJid. The files also contain a list of historic runs of the computeJid with a few of the most important metadata fields.

The files are persisted in <srdb-root>/supervisor-job/<compute-jid>.json.

These files are most useful when you want to know what information the supervisor has about a computeJid (ex the status of the simulation).
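
Since the files are plain json, dumping one is often the quickest check. A sketch, assuming a dev install where <srdb-root> is /home/vagrant/src/radiasoft/sirepo/run (adjust the path and filename for your installation):

```python
# Illustrative sketch: dump the supervisor's record for one compute job.
import json
from pathlib import Path

srdb_root = Path("/home/vagrant/src/radiasoft/sirepo/run")  # assumption for a dev install
db_file = srdb_root / "supervisor-job" / "vn9m3eZW-i9btqxdR-fluxReport.json"
with open(db_file) as f:
    print(json.dumps(json.load(f), indent=2))
```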

Supervisor log messages

The supervisor log messages are crucial in debugging the system. In development they are written to stdout and in production they are written to journald (TODO how to find them?).

For the purpose of this guide we will also treat any messages written by the drivers (ex this) as "supervisor" messages since they are written to the same stdout stream as the supervisor.

Agents

Each type of agent writes its logs to a different location.

Local

Local agents write their logs to the same stdout stream as the supervisor. This makes local development easier because the supervisor and agent logs are interleaved in the correct sequence.

Docker

Docker agents write their logs to journald in development and production. To find the logs you need two pieces of information. First, you need the container name, which is of the form srj-<s || p>-<uid> (s stands for sequential and p for parallel; you can search the logs for cname to find container names). Second, you need the host the docker agent is running on. In development this is always your local machine, so the host isn't important, but in production we run the agents on multiple machines. We don't aggregate logs to a single place, so you need to ssh into the specific host to get the logs. To find the host, search the supervisor logs for the uid together with the keywords "running" and "run" (ex vn9m3eZW.*running.* run). This will yield a log line like:

Feb 24 20:07:49  9207     0 sirepo/job_driver/docker.py:183:_cmd self=DockerDriver(agentId=3gr1Uu kind=sequential uid=vn9m3eZW ops=[_Op(run, 3r3MYv)]) running: docker --host=tcp://localhost.localdomain:2376 --tlsverify --tlscacert=/home/vagrant/src/radiasoft/sirepo/run/docker_tls/localhost.localdomain/cacert.pem --tlscert=/home/vagrant/src/radiasoft/sirepo/run/docker_tls/localhost.localdomain/cert.pem --tlskey=/home/vagrant/src/radiasoft/sirepo/run/docker_tls/localhost.localdomain/key.pem run --attach=stdin --cpus=1 --init --interactive --memory=1g --name=srj-s-vn9m3eZW --network=host --rm --ulimit=core=0 --ulimit=nofile=1024 --user=1000 --volume=/home/vagrant/src:/home/vagrant/src:ro --volume=/home/vagrant/.pyenv:/home/vagrant/.pyenv:ro --volume=/home/vagrant/.local:/home/vagrant/.local:ro --volume=/home/vagrant/src/radiasoft/sirepo/run/user/vn9m3eZW:/home/vagrant/src/radiasoft/sirepo/run/user/vn9m3eZW radiasoft/sirepo:dev /bin/bash -l

This shows the docker run command that was used to start the agent. The --host flag will show you the host that the container is being run on (ex --host=tcp://localhost.localdomain:2376). You'll need to ssh into that host to access the logs.

Once on the host you can retrieve logs for a container using journald and the container name. For example, journalctl CONTAINER_NAME=srj-s-vn9m3eZW.
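
Putting those two steps together, here is a short sketch that pulls the host and container name out of the supervisor log line and prints the commands to run (the log line is abbreviated from the example above):

```python
# Illustrative sketch: pull the --host and --name values out of the docker
# run line in the supervisor log and print the commands you would run to
# get at the container's logs.
import re

log_line = (
    "docker --host=tcp://localhost.localdomain:2376 --tlsverify ... run ... "
    "--name=srj-s-vn9m3eZW ..."
)
host = re.search(r"--host=tcp://([^:\s]+)", log_line).group(1)
cname = re.search(r"--name=(\S+)", log_line).group(1)
print(f"ssh {host}")
print(f"journalctl CONTAINER_NAME={cname}")
```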

Sbatch (NERSC)

Sbatch agents are the trickiest to debug. Because the agents run as the user who requested them, we don't have access to their logs. If you are debugging while running as yourself on NERSC, you can ssh into Cori and view your own log files. The sbatch agent's working directory is /global/cscratch1/sd/<sbatch_user>/sirepo-<env>, for example /global/cscratch1/sd/rms/sirepo-dev. A log file (job_agent.log) will be created in that directory and agent logs written to it.

When an Sbatch agent is started it checks whether there is an existing job_agent.log. If there is, it copies that file over to the supervisor db into a file named <datetime>-remote.log. In addition, if any output is created when starting the agent (there shouldn't be, but there might be on error), it stores that output in a file named <datetime>-start.log. Both of these files are placed in <srdb-root>/user/<user-id>/agent-sbatch/<sbatch-hostname>/<datetime-remote||start>.log.
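
A quick way to see what the supervisor has collected so far is to list that directory. A minimal sketch; the srdb-root and hostname values here are assumptions for the example:

```python
# Illustrative sketch: list the sbatch agent logs that the supervisor has
# copied into its db for one user and host.
from pathlib import Path

srdb_root = Path("/home/vagrant/src/radiasoft/sirepo/run")  # assumption for a dev install
uid = "vn9m3eZW"
sbatch_hostname = "cori.nersc.gov"  # assumption; use the actual sbatch host
for log in sorted((srdb_root / "user" / uid / "agent-sbatch" / sbatch_hostname).glob("*.log")):
    print(log)
```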
