Sonar is slow without --batchless because of how we get slurm job IDs #97
That sounds very slow. This is built with …
Yes. The problem, I think, is that we moved process filtering much later in the pipeline, so this task is run much more often than it used to be. The two obvious approaches to fixing this are to not compute the job ID until we need it (though I don't know how helpful that will be) and to avoid the shell pipeline for what is, after all, the very simple job of extracting some text from a file, which does not actually need a regular expression.
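A minimal sketch of what a pipeline-free lookup might look like (this is not sonar's actual code; the `/job_` path component is an assumption about how Slurm lays out its cgroup paths, e.g. `.../slurm/uid_1000/job_123456/...`):

```rust
use std::fs;

// Read /proc/<pid>/cgroup directly and scan it with plain string
// operations -- no shell, no external processes, no regex.
fn get_slurm_job_id(pid: u32) -> Option<usize> {
    let contents = fs::read_to_string(format!("/proc/{}/cgroup", pid)).ok()?;
    for line in contents.lines() {
        if let Some(ix) = line.find("/job_") {
            let rest = &line[ix + "/job_".len()..];
            let end = rest
                .find(|c: char| !c.is_ascii_digit())
                .unwrap_or(rest.len());
            if end > 0 {
                return rest[..end].parse().ok();
            }
        }
    }
    None
}

fn main() {
    // Outside a Slurm job this prints "None".
    println!("{:?}", get_slurm_job_id(std::process::id()));
}
```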
OK. Indeed we need to fix this. It needs to run well below 1 second per poll, ideally in milliseconds. I think we might need to do both: avoid the shell pipeline and delay the computation as much as we can.
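The "delay it" half could look something like this hedged sketch: a hypothetical per-process record that computes the job ID at most once, and only when it is actually asked for (`get_slurm_job_id` here is the direct file scan from the sketch above):

```rust
use std::cell::OnceCell;

// Hypothetical process record: the cgroup lookup runs lazily, at most
// once, and only for processes that survive filtering.
struct ProcInfo {
    pid: u32,
    slurm_job_id: OnceCell<Option<usize>>,
}

impl ProcInfo {
    fn new(pid: u32) -> Self {
        ProcInfo { pid, slurm_job_id: OnceCell::new() }
    }

    fn slurm_job_id(&self) -> Option<usize> {
        *self.slurm_job_id.get_or_init(|| get_slurm_job_id(self.pid))
    }
}
```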
I'll take a look after lunch, since technically I introduced this bug :-)
Avoiding the pipeline brings the time of the slow version down to the time of the fast version, and this is on a system where the matching line is never found because there is no batch job system, so the entire cgroup file has to be read and parsed. I'm not sure how well I'll be able to test this locally (yet) but I'll look into that.
Another factor that I don't know how to think about yet is that, at least on the compute nodes on Fox, once I've found a slurm ID for one process, it could look like all the processes on the node that have a slurm ID have the same slurm ID. This would be a tricky invariant to rely on, and it's probably not low-hanging fruit. Let's see what the profile looks like after we've fixed #86, #87, and #88.
Running a sonar release build just now on a lightly-loaded ML node (ml7, a beefy AMD system), it runs in 0.27s real time with `--batchless` and in 2.5s real time without `--batchless` (about 10x). The difference is even more stark on my development system (a slightly older Xeon tower): 0.03s vs 1.63s (about 50x).

I run with `--exclude-users=root --exclude-system-jobs --rollup` to keep the amount of output to a minimum, so that we know it's not output generation that's the main problem.

Running `perf` on this, it is clear that the problem is in `get_slurm_job_id`: every profiling hit in the first several pages of profiler output is in the pipeline that that function runs to get the job ID. We can probably do much better here (and we'll need to).
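For reference, a profile like this can be captured along these lines (a sketch, assuming the flags above are passed to sonar's `ps` subcommand and the release binary lives in `target/release`):

```sh
perf record -g ./target/release/sonar ps --exclude-users=root --exclude-system-jobs --rollup
perf report
```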