
use mixin class for osu #222

Open · wants to merge 7 commits into base: main

Conversation

@smoors (Collaborator) commented Dec 30, 2024

EESSI_Mixin class:

  • change measure_mem_usage() to @run_before('setup', always_last=True) to avoid perf_variables being overwritten by the hpctestlib (a sketch of this hook ordering follows below)
  • set default num_tasks_per_compute_unit

OSU test:

  • use mixin class
  • create base class to minimize code duplication
  • reduce required memory per node to 1GB per task
  • add partial node scales for the pt2pt GPU test and skip the ones that don't have exactly 2 GPUs

fixes #145
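As a minimal sketch of the always_last hook ordering mentioned in the first bullet (illustrative only: the class, helper and perf-variable names below are made up and are not the actual EESSI_Mixin code):

    import reframe as rfm
    import reframe.utility.sanity as sn
    from reframe.core.builtins import run_before


    class MeasureMemUsageSketch(rfm.RegressionMixin):
        """With always_last=True this hook runs as the last pre-setup hook,
        so a hpctestlib base class that also fills perf_variables in an
        earlier pre-setup hook can no longer overwrite the entry added here."""

        @run_before('setup', always_last=True)
        def measure_mem_usage(self):
            # hypothetical perf variable: parse a max-RSS value (in MiB) from a log file
            self.perf_variables['memory'] = sn.make_performance_function(
                sn.extractsingle(r'max_rss:\s*(?P<rss>\S+)', 'memory.log', 'rss', float),
                'MiB',
            )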

@smoors smoors added this to the v0.4.1 milestone Dec 30, 2024
from eessi.testsuite.utils import find_modules, log


def filter_scales_pt2pt():
    """
    Filtering function for filtering scales for the pt2pt OSU test
    returns all scales with either 2 cores, 1 full node, or 2 full nodes
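The diff context above is cut off inside the docstring; a sketch of what such a filter might look like, assuming the SCALES dictionary from eessi.testsuite.constants uses num_nodes, node_part and num_cpus_per_node keys (a reconstruction for illustration, not necessarily the code under review):

    from eessi.testsuite.constants import SCALES


    def filter_scales_pt2pt():
        """
        Filtering function for filtering scales for the pt2pt OSU test
        returns all scales with either 2 cores, 1 full node, or 2 full nodes
        """
        return [
            k for (k, v) in SCALES.items()
            # either 2 cores on a single node ...
            if (v['num_nodes'] == 1 and v.get('num_cpus_per_node') == 2)
            # ... or 1 or 2 full nodes (node_part == 1 means a whole node)
            or (v['num_nodes'] in (1, 2) and v.get('node_part') == 1)
        ]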
Collaborator

I was a bit surprised that it runs on 2 cores, a full node, and 2 full nodes, but not e.g. on 2 GPUs (which on Snellius happens to be 1_2_node). This was already the case before your changes here, so it's not introduced in this PR, but maybe @satishskamath can comment on why this is the case? I do seem to remember him saying he thought exclusive nodes would be good for reproducibility for these tests, but by that argument the 2-cores size would also have been filtered out - I'd say having 2 GPUs is the equivalent.

Now, I do think it is difficult to filter the GPUs down to 2-GPU setups, as you don't know this until after the setup phase. I.e. we'd basically have to accept everything single-node in the init phase, and then skip the ones that would provide more than 2 GPUs.

Collaborator Author

will fix this in a follow-up PR, see #224

Collaborator Author

edit: I decided to add it to this PR anyway.

Collaborator

My initial attempt at this kept the 2_cores and 1cpn_2nodes cases for the GPU point-to-point tests because these cases are valid: one needs at least one host CPU to assign to a device. But these scales are mainly meant for CPUs and can therefore be a bit misleading, and I do lean towards the view that network-based tests are best performed on two full exclusive nodes or 1 full node, which is the objective of this test as well.

@casparvl (Collaborator) left a comment

Tested, runs fine. My main question would be the required_mem_per_node remark; otherwise, I think this is ready to be merged.

@@ -157,3 +184,9 @@ def set_compute_unit(self):
            DEVICE_TYPES[GPU]: COMPUTE_UNIT[GPU],
        }
        self.compute_unit = device_to_compute_unit.get(self.device_type)

    @run_after('setup')
    def skip_test_1gpu(self):
Collaborator

Didn't we agree that if it's a one-node test, we would check if we had exactly 2 GPUs and skip otherwise? I mean, the downside is that for nodes with very weird GPU counts (3, or 5), none of the standard scales would have this. The downside of the current situation is that I now essentially run the same test twice on a 4-GPU node: once at the 1_node scale, and once at the 1_2_node scale.

On the other hand, we also discussed the value of exclusive allocation, so that means we would not skip 1_node in any case. So I guess the only nodes where this would make a difference are those with 8 GPUs per node. In that case, you might want to run 1_node and 1_4_node, but skip 1_2_node.

Well, let me know what you think. It might also be too messy to implement, in which case I'm happy to keep the current setup.

Collaborator Author

(this comment is on the collective test, but you probably meant to add it to the pt2pt test ;))

Yes, I had the same thoughts, and I agree neither choice is ideal.

I am now thinking that maybe we should not include the 1_node scale and instead leave it up to the user to request an exclusive node (with sbatch --exclusive) if they want a full node.

There is also the question of what we should do with the 2_nodes scale. A solution could be to introduce extra scales 1_gpu, 2_gpus, and 1gpn_2nodes (which we may want to keep out of the default SCALES list?), which would automatically set the number of cores so that the default number of cores per GPU is respected, based on the info in the site config file and autodetection. This solution would also avoid having to add the skip_if statements, because we can then select only the scales with 2 GPUs.

What do you think of this solution?
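To make the proposal above concrete, hypothetical additions to constants.py could look roughly like this (the scale names come from the comment; the keys and the derivation of the core count are assumptions, not an actual implementation):

    # Hypothetical GPU-oriented scales; whether they belong in the default
    # SCALES list is exactly the open question raised above.
    GPU_SCALES = {
        '1_gpu': {'num_nodes': 1, 'num_gpus_per_node': 1},
        '2_gpus': {'num_nodes': 1, 'num_gpus_per_node': 2},
        '1gpn_2nodes': {'num_nodes': 2, 'num_gpus_per_node': 1},
    }

    # The core count per task would then be derived per partition from the site
    # config and CPU autodetection, e.g. cores_per_gpu = cores_in_node // gpus_in_node,
    # so that the default cores-per-GPU ratio is respected and no skip_if is needed:
    # only scales that yield exactly 2 GPUs would be selected for the pt2pt test.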

@casparvl (Collaborator) commented Jan 14, 2025

I'm wondering what the impact will be on existing tests.

I.e. suppose you have a GPU test that just runs at any scale (say: GROMACS). Having both the scales 1_4_node and 1_gpu will generate two essentially identical test instances on a partition with 4 GPUs per node. This means that as an end-user, you pretty much have to select a set of tags. Whereas right now, it is completely reasonable to run the test suite without any tags, if you just want to run anything. I would like to keep that behavior tbh: if you just want to run everything (once), just run ReFrame without tags. I mean, that's essentially why I was complaining about the fact that this would run 2 essentially identical test instances in the first place.

Similarly, suppose you have CPU tests. You then have to filter out all of those GPU-related scales. That makes a straightforward case (a CPU test that runs at any scale) suddenly more complex, since every CPU test would need to implement some filtering (though we could probably define a filtered list in constants.py as well).

I don't know, maybe we should just leave it as you implemented it now; I feel the cure is getting worse than the disease :). Sure, the 1_2_node and 1_node scales are nearly identical on a 4-GPU node (with the exception that the second is, indeed, exclusive, which has some value at least). So I guess only on an 8-GPU node would you really get multiple tests that provide zero extra information (1_4_node and 1_2_node would be 'identical' there: only using 2 GPUs, but no exclusive node in either case).

@casparvl (Collaborator)

I found an issue with the OSU test that I didn't realize was there before. It does NOT pop up if you have the ALWAYS_REQUEST_GPUS feature set. But if you don't, then test.num_gpus_per_node doesn't get set anywhere, and that results in this error:

[2025-01-14T11:45:49] info: reframe:   * Reason: type error: ../test-suite/eessi/testsuite/tests/apps/osu.py:151: unsupported operand type(s) for *: 'NoneType' and 'int'
        num_gpus = self.num_gpus_per_node * self.num_nodes

[2025-01-14T11:45:49] verbose: reframe: Traceback (most recent call last):
  File "/gpfs/home4/casparl/EESSI/reframe/reframe/frontend/executors/__init__.py", line 318, in _safe_call
    return fn(*args, **kwargs)
  File "/gpfs/home4/casparl/EESSI/reframe/reframe/core/hooks.py", line 111, in _fn
    getattr(obj, h.__name__)()
  File "/gpfs/home4/casparl/EESSI/reframe/reframe/core/hooks.py", line 38, in _fn
    func(*args, **kwargs)
  File "/gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/osu.py", line 151, in skip_test_1gpu
    num_gpus = self.num_gpus_per_node * self.num_nodes
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

Now, it makes total sense to use

compute_unit = COMPUTE_UNIT[NODE]

and set

    @run_after('init')
    def set_num_tasks_per_compute_unit(self):
        """ Setting number of tasks per compute unit and cpus per task. This sets num_cpus_per_task
        for 1 node and 2 node options where the request is for full nodes."""
        if SCALES.get(self.scale).get('num_nodes') == 1:
            self.num_tasks_per_compute_unit = 2

in order to make sure the right number of tasks is launched per node. So, I'm a bit at a loss where we messed this up. Should we somehow have this test set test.num_gpus_per_node = test.default_num_gpus_per_node manually or something, now that the hook won't do it? Or should we consider this a 'bug' in the hook? I don't think it is a bug in the hook tbh: it had no specific reason to explicitly ask for a num_gpus_per_node, as we didn't define `ALWAYS_REQUEST_GPUS`. I have the feeling it's the test's responsibility to say 'yeah, I'm asking for NODE as a compute unit, BUT I still want to see how many GPUs per node I'm expecting to get, and skip if that's only 1'.

@casparvl (Collaborator)

Note that it is not just important for the skip_if: we need to set

    @run_after('setup')
    def set_gpus_per_node(self):
        self.num_gpus_per_node = self.default_num_gpus_per_node

in order for those GPUs to even be requested by the job... otherwise, you'd get a CPU-only allocation on a GPU node.

Even more accurate might be:

    @run_after('setup')
    def skip_test_1gpu(self):
        num_gpus = self.default_num_gpus_per_node * self.num_nodes
        self.skip_if(
            num_gpus != 2 and self.scale not in ['1_node', '2_nodes'],
            f"Skipping test : {num_gpus} GPU(s) available for this test case, need exactly 2"
        )

    @run_after('setup')
    def set_gpus_per_node(self):
        if self.scale not in ['1_node', '2_nodes']:
            self.num_gpus_per_node = 2  # make sure to allocate 2 GPUs
        elif self.scale in ['1_node']:
            self.num_gpus_per_node = self.default_num_gpus_per_node  # just allocate exclusively
        else:
            self.num_gpus_per_node = 1  # for multinode tests, we use 1 GPU per node

For collectives, there is no problem, because there we map the DEVICE_TYPES[GPU] to COMPUTE_UNIT[GPU], thus defining the self.num_gpus_per_node just fine.

@casparvl (Collaborator)

Ok, this doesn't account for the fact that the node could have only 1 GPU per node. We should do:

    @run_after('setup')
    def skip_test_1gpu(self):
        num_gpus = self.default_num_gpus_per_node * self.num_nodes
        # On a partial node allocation, run this test only if exactly 2 GPUs are allocated
        if self.scale not in ['1_node', '2_nodes']:
            self.skip_if(
                num_gpus != 2,
                f"Skipping test : {num_gpus} GPU(s) available for this test case, need exactly 2"
            )
        # If the scale is 1_node, make sure there are at least 2 GPUs
        elif self.scale == '1_node':
            self.skip_if(num_gpus < 2, "Skipping GPU test : only 1 GPU available for this test case")

    @run_after('setup')
    def set_gpus_per_node(self):
        if self.scale not in ['1_node', '2_nodes']:
            self.num_gpus_per_node = 2  # make sure to allocate 2 GPUs
        elif self.scale == '1_node':
            self.num_gpus_per_node = self.default_num_gpus_per_node  # just allocate exclusively
        else:
            self.num_gpus_per_node = 1  # for multinode tests, we use 1 GPU per node

@casparvl (Collaborator)

Note: I made a PR to your feature branch @smoors

@satishskamath (Collaborator) commented Jan 16, 2025

(quoting @casparvl's comment above about test.num_gpus_per_node not being set, and the resulting TypeError, when ALWAYS_REQUEST_GPUS is not enabled)

@casparvl In the version in the main branch, I used to set this in the test itself.
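For reference, a minimal sketch of what setting this in the test itself (rather than relying on the hook) could look like, along the lines of the set_gpus_per_node suggestion above, assuming default_num_gpus_per_node has been populated by the EESSI hooks by the end of the setup phase (hypothetical mixin, not the actual main-branch code):

    import reframe as rfm
    from reframe.core.builtins import run_after


    class RequestGPUsSketch(rfm.RegressionMixin):
        """Hypothetical sketch, not the main-branch implementation."""

        @run_after('setup')
        def set_gpus_per_node(self):
            # Without ALWAYS_REQUEST_GPUS, a NODE compute unit leaves
            # num_gpus_per_node unset; request the GPUs explicitly so the job
            # does not end up with a CPU-only allocation on a GPU node.
            if self.num_gpus_per_node is None:
                self.num_gpus_per_node = self.default_num_gpus_per_node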

Development

Successfully merging this pull request may close these issues.

OSU pt2pt test don't run on node-parts for GPUs.
3 participants