
Compute cross-node load and utilization #24

Closed
6 tasks done
Tracked by #6
lars-t-hansen opened this issue Aug 4, 2023 · 5 comments
Labels
task:enhancement New feature or request

Comments

lars-t-hansen (Collaborator) commented Aug 4, 2023

Currently the load and utilization computations relate only to a single host, but for multi-node jobs we must take into account the capabilities of all the nodes involved.

Depends on:

lars-t-hansen commented:

An interesting question here (cf. #6 (comment)) is what "utilization" means in a multi-node system. Say the program requests 4 GPUs, runs flat-out on one of them, and ignores the other three. Utilization seen as a whole is 25%. But this is not the only view. If we take that number at face value we may start to look for, e.g., communication bottlenecks or stalls for I/O, while the real problem is that there's zero scalability. The user may believe that doubling the number of GPUs would speed the program up some, but this is not true: the program will never run faster, except on a faster GPU.

In other words, the analysis of utilization in a multi-node system may be more complicated than aggregating data across the nodes.
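To make the ambiguity concrete: one GPU at 100% and three idle GPUs average out to 25%, which is the same aggregate you would get from four GPUs each running at 25%, even though the two situations call for completely different remedies.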

lars-t-hansen commented Aug 17, 2023

There are probably several overlapping use cases here.

For the use cases current_utilization and historical_utilization (see upcoming change to README.md), we're initially interested in cross-node utilization, i.e., computed as sums or averages relative to the capacity of the individual nodes. The current structure of the code suffices for that, modulo that host name sets are not aggregated properly at present. (Nor GPU sets, I think.) That's easily fixed and fairly easily printed, and it fits in with the jobs verb.

For other use cases such as verify_resource_use, verify_scalability, and thin_pipe, we do want to know resource use per node. But importantly these are per-job queries. It is possible that this calls for a new verb, utilization, with various switches that aggregate not cross-node but by-node and present by-node data (similar to jobgraph). A fair amount of code can probably be shared with jobs, too.
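To make the cross-node mode from the first paragraph concrete, here is a minimal sketch (illustrative only; the types and names are invented, not sonalyze's internal representation): per-node CPU averages are summed and then expressed relative to the combined capacity of the participating nodes.

```rust
// Illustrative sketch only -- invented types, not sonalyze's data structures.
struct NodeUsage {
    host: String,
    cpu_avg: f64, // average CPU use on this node; 100.0 == one core fully busy
    cores: u32,   // core capacity of this node
}

// Cross-node utilization: usage summed across nodes, relative to the combined
// capacity of the individual nodes, expressed as a percentage.
fn cross_node_utilization(nodes: &[NodeUsage]) -> f64 {
    let used: f64 = nodes.iter().map(|n| n.cpu_avg).sum();
    let capacity: f64 = nodes.iter().map(|n| 100.0 * n.cores as f64).sum();
    if capacity == 0.0 {
        0.0
    } else {
        100.0 * used / capacity
    }
}
```

The by-node presentation discussed in the second paragraph would, by contrast, print the per-node records without collapsing them.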

lars-t-hansen commented:

The key to computing cross-node values for the aggregated use case in the previous comment is to synthesize an event stream that takes into account the activity on all nodes, and then process that in the normal manner. See also #43.
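As a rough illustration of what that synthesis could look like (a sketch only, under the assumption that each node contributes a time-ordered stream of samples with aligned timestamps; not the actual implementation): merge the per-node streams by timestamp, summing readings that coincide, and feed the merged stream into the existing single-node processing.

```rust
use std::collections::BTreeMap;

// Hypothetical sample record; the real data carry more fields.
#[derive(Clone, Copy)]
struct Sample {
    timestamp: i64, // seconds since the epoch, assumed aligned across nodes
    cpu: f64,
    gpu: f64,
}

// Synthesize one event stream from per-node streams by summing samples that
// share a timestamp, so the result can be analyzed like a single-node job.
fn merge_streams(per_node: &[Vec<Sample>]) -> Vec<Sample> {
    let mut merged: BTreeMap<i64, Sample> = BTreeMap::new();
    for stream in per_node {
        for s in stream {
            let e = merged.entry(s.timestamp).or_insert(Sample {
                timestamp: s.timestamp,
                cpu: 0.0,
                gpu: 0.0,
            });
            e.cpu += s.cpu;
            e.gpu += s.gpu;
        }
    }
    merged.into_values().collect()
}
```

In practice timestamps won't line up exactly across nodes, so real code would bucket samples into the sampling interval before summing.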

lars-t-hansen commented:

Cross-node utilization (first paragraph in #24 (comment)) was implemented by fa366cc, which adds a switch, --batch, that merges job numbers across nodes and computes appropriate aggregates.

lars-t-hansen commented:

It's not clear to me that the by-node load data aren't already adequately presented (second paragraph in #24 (comment)) by the --job= switch. Consider these data from a Fox job (command line abbreviated for clarity):

$ sonalyze jobs --job=281495 --fmt=job,user,duration,host,cpu,mem,cmd
job     user        duration  host   cpu-avg  cpu-peak  mem-avg  mem-peak  cmd      
281495  ec-larstha  0d 0h 2m  c1-20  84       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-5   85       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-14  85       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-13  85       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-12  85       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-26  84       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-21  85       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-23  85       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-28  84       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-8   85       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-10  85       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-27  84       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-24  84       100       1        1         tsp_mpi  
281495  ec-larstha  0d 0h 2m  c1-9   85       100       1        1         tsp_mpi  

Combined with summary data (same command line along with -b):

job     user        duration  host                                 cpu-avg  cpu-peak  mem-avg  mem-peak  cmd      
281495  ec-larstha  0d 0h 2m  c1-[5,8-10,12-14,20-21,23-24,26-28]  1184     1400      2        2         tsp_mpi  

we have a pretty good idea about what's going on here. As we add monitoring capabilities for, e.g., communication, we can see system utilization and perform a simple scalability analysis pretty broadly.
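As a sanity check on the --batch aggregation: the fourteen per-node cpu-avg values above sum to 1185, matching the batch cpu-avg of 1184 up to rounding, and the batch cpu-peak of 1400 corresponds to 14 nodes × 100%.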

For example, there is an interesting question about the above job (which takes about 3 minutes), because it should be 100% CPU bound with a quick communication phase at the beginning and end. Is the 85% average CPU utilization due to system effects, or is my program doing something wrong?

Let's explore (output is abbreviated but all nodes look the same):

$ sonalyze load --job=281495 --none
HOST: c1-10
date        time   cpu  mem  gpu  gpumem  
2023-08-23  09:38  0    1    0    0       
2023-08-23  09:38  94   1    0    0       
2023-08-23  09:39  100  1    0    0       
2023-08-23  09:39  100  1    0    0       
2023-08-23  09:39  100  1    0    0       
2023-08-23  09:39  100  1    0    0       
2023-08-23  09:40  100  1    0    0

That's a clue: things are slow during startup, and the job takes a minute to get going. More investigation is needed, but it's not clear that sonalyze needs much more specialized functionality.

(Arguably the first "0" record is really a sentinel and pollutes the averages. I'll file a bug.)
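For the record, the seven c1-10 samples shown average (0 + 94 + 5 × 100) / 7 ≈ 85, while dropping the leading zero gives 594 / 6 = 99, so the startup samples alone account for essentially all of the gap between the observed ~85% average and full utilization.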
