
Compute cross-node load and utilization #24

Closed
6 tasks done
Tracked by #6
lars-t-hansen opened this issue Aug 4, 2023 · 5 comments
Labels
task:enhancement New feature or request

Comments

lars-t-hansen (Collaborator) commented Aug 4, 2023

Currently the load and utilization computations relate only to a single host, but for multi-node jobs we must take into account the capabilities of all the nodes involved.

Depends on:

lars-t-hansen commented:

An interesting question here (cf. #6 (comment)) is what "utilization" means in a multi-node system. Say the program requests 4 GPUs, runs flat-out on one of them, and ignores the other three. Utilization seen as a whole is 25%. But this is not the only view. If we take that number at face value we may start to look for, e.g., communication bottlenecks or stalls for I/O, while the real problem is that there's zero scalability. The user may believe that doubling the number of GPUs would speed the program up some, but this is not true: the program will never run faster, except on a faster GPU.

In other words, the analysis of utilization in a multi-node system may be more complicated than aggregating data across the nodes.
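To make the ambiguity concrete: one GPU at 100% and three idle GPUs average out to 25%, which is the same aggregate you would get from four GPUs each running at 25%, even though the two situations call for completely different remedies.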

lars-t-hansen commented Aug 17, 2023

There are probably several overlapping use cases here.

For the use cases current_utilization and historical_utilization (see upcoming change to README.md), we're initially interested in cross-node utilization, i.e., computed as sums or averages relative to the capacity of the individual nodes. The current structure of the code suffices for that, modulo that host name sets are not aggregated properly at present. (Nor GPU sets, I think.) That's easily fixed and fairly easily printed, and it fits in with the jobs verb.

For other use cases such as verify_resource_use, verify_scalability, and thin_pipe, we do want to know resource use per node. But importantly these are per-job queries. It is possible that this calls for a new verb, utilization, with various switches that aggregate not cross-node but by-node and present by-node data (similar to jobgraph). A fair amount of code can probably be shared with jobs, too.
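To make the cross-node mode from the first paragraph concrete, here is a minimal sketch (illustrative only; the types and names are invented, not sonalyze's internal representation): per-node CPU averages are summed and then expressed relative to the combined capacity of the participating nodes.

```rust
// Illustrative sketch only -- invented types, not sonalyze's data structures.
struct NodeUsage {
    host: String,
    cpu_avg: f64, // average CPU use on this node; 100.0 == one core fully busy
    cores: u32,   // core capacity of this node
}

// Cross-node utilization: usage summed across nodes, relative to the combined
// capacity of the individual nodes, expressed as a percentage.
fn cross_node_utilization(nodes: &[NodeUsage]) -> f64 {
    let used: f64 = nodes.iter().map(|n| n.cpu_avg).sum();
    let capacity: f64 = nodes.iter().map(|n| 100.0 * n.cores as f64).sum();
    if capacity == 0.0 {
        0.0
    } else {
        100.0 * used / capacity
    }
}
```

The by-node presentation discussed in the second paragraph would, by contrast, print the per-node records without collapsing them.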

lars-t-hansen commented:

The key to computing cross-node values for the aggregated use case in the previous comment is to synthesize an event stream that takes into account the activity on all nodes, and then process that in the normal manner. See also #43.
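As a rough illustration of what that synthesis could look like (a sketch only, under the assumption that each node contributes a time-ordered stream of samples with aligned timestamps; not the actual implementation): merge the per-node streams by timestamp, summing readings that coincide, and feed the merged stream into the existing single-node processing.

```rust
use std::collections::BTreeMap;

// Hypothetical sample record; the real data carry more fields.
#[derive(Clone, Copy)]
struct Sample {
    timestamp: i64, // seconds since the epoch, assumed aligned across nodes
    cpu: f64,
    gpu: f64,
}

// Synthesize one event stream from per-node streams by summing samples that
// share a timestamp, so the result can be analyzed like a single-node job.
fn merge_streams(per_node: &[Vec<Sample>]) -> Vec<Sample> {
    let mut merged: BTreeMap<i64, Sample> = BTreeMap::new();
    for stream in per_node {
        for s in stream {
            let e = merged.entry(s.timestamp).or_insert(Sample {
                timestamp: s.timestamp,
                cpu: 0.0,
                gpu: 0.0,
            });
            e.cpu += s.cpu;
            e.gpu += s.gpu;
        }
    }
    merged.into_values().collect()
}
```

In practice timestamps won't line up exactly across nodes, so real code would bucket samples into the sampling interval before summing.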

lars-t-hansen commented:

Cross-node utilization (first paragraph in #24 (comment)) was implemented by fa366cc, which adds a switch, --batch, that merges job numbers across nodes and computes appropriate aggregates.

lars-t-hansen commented:

It's not clear to me that the by-node load data aren't already adequately presented (second paragraph in #24 (comment)) by the --job= switch. Consider these data from a Fox job (command line abbreviated for clarity):

$ sonalyze jobs --job=281495 --fmt=job,user,duration,host,cpu,mem,cmd
job     user        duration  host   cpu-avg  cpu-peak  mem-avg  mem-peak  cmd      
281495  ec-larstha  0d 0h 2m  c1-20  84       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-5   85       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-14  85       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-13  85       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-12  85       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-26  84       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-21  85       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-23  85       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-28  84       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-8   85       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-10  85       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-27  84       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-24  84       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-9   85       100       1        1         tsp_mpi  

Combined with summary data (same command line along with -b):

job     user        duration  host                                 cpu-avg  cpu-peak  mem-avg  mem-peak  cmd      
281495  ec-larstha  0d 0h 2m  c1-[5,8-10,12-14,20-21,23-24,26-28]  1184     1400      2        2         tsp_mpi  

we have a pretty good idea about what's going on here. As we add monitoring capabilities for, e.g., communication, we can see system utilization and perform a simple scalability analysis pretty broadly.
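As a sanity check on the --batch aggregation: the fourteen per-node cpu-avg values above sum to 1185, matching the batch cpu-avg of 1184 up to rounding, and the batch cpu-peak of 1400 corresponds to 14 nodes × 100%.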

For example, there is an interesting question about the above job (which takes about 3 minutes), because it should be 100% CPU bound with a quick communication phase at the beginning and end. Is the 85% average CPU utilization due to system effects, or is my program doing something wrong?

Let's explore (output is abbreviated but all nodes look the same):

$ sonalyze load --job=281495 --none
HOST: c1-10
date        time   cpu  mem  gpu  gpumem  
2023-08-23  09:38  0    1    0    0       
2023-08-23  09:38  94   1    0    0       
2023-08-23  09:39  100  1    0    0       
2023-08-23  09:39  100  1    0    0       
2023-08-23  09:39  100  1    0    0       
2023-08-23  09:39  100  1    0    0       
2023-08-23  09:40  100  1    0    0

That's a clue: things are slow during startup, and the job takes a minute to get going. More investigation is needed, but it's not clear that sonalyze needs much more specialized functionality.

(Arguably the first "0" record is really a sentinel and pollutes the averages. I'll file a bug.)
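For the record, the seven c1-10 samples shown average (0 + 94 + 5 × 100) / 7 ≈ 85, while dropping the leading zero gives 594 / 6 = 99, so the startup samples alone account for essentially all of the gap between the observed ~85% average and full utilization.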
