@Richard14916, might it be a good idea to instrument RIFT to collect some basic performance data while it runs, and then report, post-facto, the CPU and GPU utilization of each run as part of its results? (E.g., dumping it into a text file in the output.)
PyCBC did this a long time ago and it's been enormously helpful in understanding and improving actual real-world workflow performance. Beyond simply allowing ad-hoc inspection of a job's performance after the fact, it would enable all sorts of nifty things: setting alarms (or aborting) if things go outside expected bounds (e.g., CPU utilization is ~zero, indicating that a job is wedged for some reason), running reports on RIFT performance over time, automated performance regression tests between RIFT versions, etc.
I'd be happy to help, if this is something you think you might want to implement.
I'll add that you can start with something spectacularly unsophisticated like running a command-line tool to append two integers for CPU and GPU utilization to a file every N seconds... you can always add more fancy-pants features if/when they seem useful.
If you think this would be a boon to performance, I'd be happy to put in time to work on it. That said, I'm not sure I'm the best qualified to do so, since I'm still working on running even basic jobs on OSG (which is actually what I'm planning to start on next), so things would likely go much more smoothly if someone else is able to take the lead.
Right now our bulk profiling is via condor_userlog, usually performed at the end. As far as I know we don't have a native tool that tracks integrated GPU utilization (i.e., total number of cycles). We could add something real-time to the DAG, but my sense right now is that it is simplest to diagnose by hand, running condor_userlog (or another script) intermittently over all of a user's running jobs.