@Richard14916, might it be a good idea to instrument RIFT to collect some basic performance data while it runs, and then report, post-facto, the CPU and GPU utilization of each run as part of its results? (E.g., dumping it into a text file in the output.)
PyCBC did this a long time ago and it's been enormously helpful in understanding and improving actual real-world workflow performance. Beyond simply allowing ad-hoc inspection of a job's performance after the fact, it would enable all sorts of nifty things: setting alarms (or aborting) if things go outside expected bounds (e.g., CPU utilization is ~zero, indicating that a job is wedged for some reason), running reports on RIFT performance over time, automated performance regression tests between RIFT versions, etc.
I'd be happy to help, if this is something you think you might want to implement.
I'll add that you can start with something spectacularly unsophisticated like running a command-line tool to append two integers for CPU and GPU utilization to a file every N seconds... you can always add more fancy-pants features if/when they seem useful.
If you think this would be a boon to performance, I'd be happy to put in time to work on it. That said, I'm not sure I'm the best qualified to do so, since I'm still working on running even basic jobs on OSG (which is actually what I'm planning to start on next), so things would likely go much more smoothly if someone else is able to take the lead.
Right now our bulk profiling is via condor_userlog, usually performed at the end. As far as I know we don't have a native tool that tracks integrated GPU utilization (i.e., total number of cycles). We could add something real-time to the DAG, but my sense right now is that it is simplest to diagnose by hand, running condor_userlog (or another script) intermittently over all of a user's running jobs.