Checks filtering using valid programs shows only the first partition defined in the configuration #3371

Open
victorusu opened this issue Jan 30, 2025 · 1 comment · Fixed by #3377

TL;DR

I am using the valid_systems = [r'%scheduler=slurm', r'%scheduler=squeue'] syntax described in the documentation to select which tests to run. However, ReFrame selects only the tests that match the first partition listed in site_configuration.

ReFrame version

4.8.0-dev.3+cf670fe1 (latest as of today)

Steps to reproduce the error

1. Create system configuration files

Create a configuration file for a given cluster, defining two different partitions. In my case, I have used local and slurm.
Copy the file into a different one and change the order of appearance of the partitions inside the partitions key in the systems list.

I have used the two configuration files below.

File daint-local-partition-first-config.py:

site_configuration = {
    'systems': [
        {
            'name' : 'daint',
            'descr' : 'Piz Daint vCluster',
            'hostnames' : ['daint'],
            'partitions': [
                {
                    'name': 'login',
                    'scheduler': 'local',
                    'time_limit': '10m',
                    'environs': [
                        'builtin',
                    ],
                    'descr': 'Login nodes',
                    'max_jobs': 4,
                    'launcher': 'local'
                },
                {
                    'name': 'normal',
                    'descr': 'GH200',
                    'scheduler': 'slurm',
                    'time_limit': '10m',
                    'environs': [
                        'builtin',
                    ],
                    'max_jobs': 100,
                    'launcher': 'srun',
                },
            ]
        }
    ],
}

File daint-slurm-partition-first-conf.py:

site_configuration = {
    'systems': [
        {
            'name' : 'daint',
            'descr' : 'Piz Daint vCluster',
            'hostnames' : ['daint'],
            'partitions': [
                {
                    'name': 'normal',
                    'descr': 'GH200',
                    'scheduler': 'slurm',
                    'time_limit': '10m',
                    'environs': [
                        'builtin',
                    ],
                    'max_jobs': 100,
                    'launcher': 'srun',
                },
                {
                    'name': 'login',
                    'scheduler': 'local',
                    'time_limit': '10m',
                    'environs': [
                        'builtin',
                    ],
                    'descr': 'Login nodes',
                    'max_jobs': 4,
                    'launcher': 'local'
                },
            ]
        }
    ],
}

2. Define two tests

One test should set valid_systems = [r'%scheduler=slurm'] and the other valid_systems = [r'%scheduler=local'].

I am using the following two tests.

import os

import reframe as rfm
import reframe.utility.sanity as sn

SLEEPCMD = '/bin/sleep'


@rfm.simple_test
class sleep_submit_job_check(rfm.RunOnlyRegressionTest):
    executable = SLEEPCMD
    # run only when slurm is the workload manager
    valid_systems = [r'%scheduler=slurm']
    valid_prog_environs = ['builtin']
    executable_opts = ['1']

    @sanity_function
    def assert_sanity(self):
        return True


@rfm.simple_test
class sleep_local_job_check(rfm.RunOnlyRegressionTest):
    executable = SLEEPCMD
    # run in the local scheduler
    valid_systems = [r'%scheduler=local']
    valid_prog_environs = ['builtin']
    executable_opts = ['1']

    @sanity_function
    def assert_sanity(self):
        return sn.all([
            sn.assert_eq(os.stat(sn.evaluate(self.stdout)).st_size, 0,
                         msg=f'file {self.stdout} is not empty'),
            sn.assert_eq(os.stat(sn.evaluate(self.stderr)).st_size, 0,
                         msg=f'file {self.stderr} is not empty'),
            ])

3. The output

When the local partition is defined as the first entry, ReFrame selects only the test that sets valid_systems = [r'%scheduler=local'].

$ reframe -C daint-local-partition-first-config.py -c mini-reproducer.py -l
[ReFrame Setup]
  version:           4.8.0-dev.3+cf670fe1
...
[List of matched checks]
- sleep_local_job_check /7370cc85
Found 1 check(s)
...

When the slurm partition is defined as the first entry, ReFrame selects only the test that sets valid_systems = [r'%scheduler=slurm'].

$ reframe -C daint-slurm-partition-first-conf.py -c mini-reproducer.py -l
[ReFrame Setup]
  version:           4.8.0-dev.3+cf670fe1
...
[List of matched checks]
- sleep_submit_job_check /4d2777d3
Found 1 check(s)
....

4. The expected output

ReFrame should select both tests independently of the order in the site_configuration variable.

Thus, this invocation

$ reframe -C daint-local-partition-first-config.py -c mini-reproducer.py -l
[ReFrame Setup]
  version:           4.8.0-dev.3+cf670fe1
...
[List of matched checks]
- sleep_local_job_check /7370cc85
- sleep_submit_job_check /4d2777d3
Found 2 check(s)
...

should have the same output as below.

$ reframe -C daint-slurm-partition-first-conf.py -c mini-reproducer.py -l
[ReFrame Setup]
  version:           4.8.0-dev.3+cf670fe1
...
[List of matched checks]
- sleep_local_job_check /7370cc85
- sleep_submit_job_check /4d2777d3
Found 2 check(s)
....
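
The expectation in step 4 can be phrased as an order-invariance check. The sketch below is not ReFrame code: `matches` and `select_tests` are hypothetical helpers, and they handle only a single `%key=value` constraint, whereas the real valid_systems syntax is richer. It only shows that a test should be selected when any partition satisfies its constraint, regardless of where that partition appears in the list.

```python
def matches(spec, partition):
    """Check one '%key=value' spec against a partition dict (simplified)."""
    key, _, value = spec.lstrip('%').partition('=')
    return str(partition.get(key)) == value


def select_tests(tests, partitions):
    """Return the names of tests that match at least one partition."""
    return sorted(
        name for name, valid_systems in tests.items()
        if any(matches(spec, part)
               for spec in valid_systems for part in partitions)
    )


# Reduced versions of the two partitions from the configuration files above.
partitions = [
    {'name': 'login', 'scheduler': 'local'},
    {'name': 'normal', 'scheduler': 'slurm'},
]
tests = {
    'sleep_submit_job_check': ['%scheduler=slurm'],
    'sleep_local_job_check': ['%scheduler=local'],
}

forward = select_tests(tests, partitions)
backward = select_tests(tests, list(reversed(partitions)))
print(forward)              # both tests are listed
print(forward == backward)  # True: partition order does not matter
```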
@victorusu victorusu added the bug label Jan 30, 2025
@vkarak vkarak added triage and removed bug labels Jan 30, 2025
@vkarak vkarak moved this to Todo in ReFrame Backlog Jan 30, 2025
@vkarak vkarak added this to the ReFrame 4.8 milestone Jan 30, 2025
@vkarak vkarak modified the milestones: ReFrame 4.8, ReFrame 4.7.4 Feb 1, 2025
@teojgo teojgo self-assigned this Feb 1, 2025
@teojgo teojgo added bug and removed triage labels Feb 1, 2025

teojgo commented Feb 1, 2025

It was really tricky to understand what is happening here:

  1. Since no extras are defined in any partition, the first partition examined retrieves the default value from the schema via:
    return _match_option(default_key, self._schema['defaults'])
  2. Then the extras is populated/mutated using the following:
  3. The second partition tries to fetch the default value for the extras and gets the one already set in steps 1-2, so when its set_default is called for the scheduler and launcher, the extras already contains default values for them.
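
The mechanism described in the steps above is the classic shared mutable default. The following is a minimal sketch of it, not ReFrame's actual code: `build_extras`, the `defaults` dictionary and the `copy_default` flag are all hypothetical names, and the deep copy is just one common remedy, not necessarily the one adopted in the fix.

```python
import copy


def build_extras(partitions, defaults, copy_default):
    """Populate per-partition extras from a schema-like default (hypothetical)."""
    result = {}
    for part in partitions:
        extras = defaults['extras']
        if copy_default:
            # Give each partition its own object instead of the shared default.
            extras = copy.deepcopy(extras)
        # setdefault writes only when the key is absent, so values left behind
        # by an earlier partition take precedence over this partition's own.
        extras.setdefault('scheduler', part['scheduler'])
        extras.setdefault('launcher', part['launcher'])
        result[part['name']] = extras
    return result


partitions = [
    {'name': 'login', 'scheduler': 'local', 'launcher': 'local'},
    {'name': 'normal', 'scheduler': 'slurm', 'launcher': 'srun'},
]

shared = build_extras(partitions, {'extras': {}}, copy_default=False)
copied = build_extras(partitions, {'extras': {}}, copy_default=True)

print(shared['normal']['scheduler'])  # local: inherited from the first partition
print(copied['normal']['scheduler'])  # slurm: each partition keeps its own value
```

With the shared default, both partitions end up pointing at the same dictionary, which would explain why only tests matching the first partition's scheduler are selected.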

@vkarak vkarak moved this from Todo to In Progress in ReFrame Backlog Feb 4, 2025
@vkarak vkarak moved this from In Progress to Merge To Develop in ReFrame Backlog Feb 7, 2025