
idea: collect and record runtime performance data #16

Open
couvares opened this issue Jun 25, 2019 · 3 comments

Labels
enhancement New feature or request

Comments

@couvares

@Richard14916, might it be a good idea to instrument RIFT to collect some basic performance data while it runs, and then report, post facto, the CPU and GPU utilization of each run as part of its results? (E.g., dumping it into a text file in the output.)

PyCBC did this a long time ago, and it's been enormously helpful in understanding and improving actual real-world workflow performance. Beyond simply allowing ad-hoc inspection of a job's performance after the fact, it would enable all sorts of nifty things: setting alarms (or aborting) when things go outside expected bounds (e.g., CPU utilization is ~zero, indicating that a job is wedged for some reason), running reports on RIFT performance over time, automated performance regression tests between RIFT versions, etc.
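A watchdog along those lines could be quite small. Here's a minimal sketch in Python, assuming psutil is available; the PID, thresholds, and interval are all illustrative placeholders, not anything RIFT ships:

```python
#!/usr/bin/env python3
"""Sketch of the alarm/abort idea: watch a job's CPU utilization and
terminate it if it looks wedged. All constants below are hypothetical."""
import os
import signal
import time

import psutil

JOB_PID = 12345          # hypothetical PID of the job being watched
THRESHOLD = 2.0          # percent CPU below which a sample counts as "idle"
MAX_IDLE_SAMPLES = 30    # consecutive idle samples before aborting
INTERVAL = 10            # seconds between samples

proc = psutil.Process(JOB_PID)
idle = 0
while True:
    cpu = proc.cpu_percent(interval=1)  # percent over a 1 s window
    idle = idle + 1 if cpu < THRESHOLD else 0
    if idle >= MAX_IDLE_SAMPLES:
        # CPU has been ~zero for too long: assume the job is wedged.
        os.kill(JOB_PID, signal.SIGTERM)
        break
    time.sleep(INTERVAL)
```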

I'd be happy to help, if this is something you think you might want to implement.

@couvares (Author)

I'll add that you can start with something spectacularly unsophisticated like running a command-line tool to append two integers for CPU and GPU utilization to a file every N seconds... you can always add more fancy-pants features if/when they seem useful.
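Something spectacularly unsophisticated along those lines might look like the sketch below. psutil and nvidia-smi are illustrative choices (not anything RIFT already uses), and the output file name and interval are made up:

```python
#!/usr/bin/env python3
"""Minimal utilization logger: append one CPU and one GPU utilization
integer to a file every N seconds. Purely a sketch of the idea above."""
import subprocess
import time

import psutil

LOG_PATH = "utilization.log"  # hypothetical output file
INTERVAL = 10                 # seconds between samples


def gpu_utilization():
    """Return GPU utilization (percent) via nvidia-smi, or -1 if unavailable."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        return int(out.splitlines()[0])
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        return -1


with open(LOG_PATH, "a") as log:
    while True:
        cpu = int(psutil.cpu_percent(interval=1))  # percent over a 1 s window
        gpu = gpu_utilization()
        log.write(f"{time.time():.0f} {cpu} {gpu}\n")
        log.flush()
        time.sleep(INTERVAL)
```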

@Richard14916 (Contributor)

If you think this would be a boon to performance, I'd be happy to put in time to work on it. That said, I'm not sure I'm the best qualified to do so, since I'm still working on running even basic jobs on OSG (which is actually what I plan to start on next), so things would likely go much more smoothly if someone else is able to take the lead.

@oshaughn (Owner)

Right now our bulk profiling is via condor_userlog, usually run at the end. As far as I know we don't have a native tool that tracks integrated GPU utilization (i.e., total number of cycles). We could add something real-time to the DAG, but my sense right now is that it is simplest to diagnose by hand, running condor_userlog (or another script) intermittently over all of a user's running jobs.
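For reference, that by-hand approach could be wrapped in a small polling script like the sketch below. condor_q -af UserLog and condor_userlog are real HTCondor tools; the wrapper loop and interval are just illustrative:

```python
#!/usr/bin/env python3
"""Sketch: periodically summarize all of a user's running jobs by running
condor_userlog over each job's user log. The polling wrapper is hypothetical."""
import subprocess
import time

INTERVAL = 600  # seconds between summaries

while True:
    # Collect the user-log path of every job currently in the queue.
    logs = subprocess.check_output(
        ["condor_q", "-af", "UserLog"], text=True
    ).split()
    for log in sorted(set(logs)):
        if log == "undefined":  # jobs submitted without a user log
            continue
        # condor_userlog summarizes wall time and CPU usage from the job log.
        subprocess.run(["condor_userlog", log])
    time.sleep(INTERVAL)
```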
