Added AWS config file #53

Merged
merged 19 commits into from Aug 24, 2023
Conversation

@casparvl (Collaborator)

Added a config file for use in the EESSI AWS build and test environment. It's not complete yet, it still needs core counts for each of the partitions, but I want to see if I can use ReFrame's autodetect feature to autodetect these.
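For context, a ReFrame site configuration for a system like this roughly takes the following shape. This is an illustrative sketch only: the partition name, access flags and environment below are placeholders, and the real definitions are the ones in config/aws_citc.py.

site_configuration = {
    'systems': [
        {
            'name': 'citc',
            'descr': 'Cluster in the Cloud build and test environment',
            'hostnames': ['.*'],
            'partitions': [
                {
                    'name': 'c4.2xlarge',  # placeholder partition name
                    'scheduler': 'slurm',
                    'launcher': 'mpirun',
                    'access': ['--constraint=shape=c4.2xlarge'],
                    'environs': ['default'],
                    'max_jobs': 4,
                    # Core counts deliberately left out: the plan is to let
                    # ReFrame's autodetect feature fill in the processor info.
                },
            ],
        },
    ],
    'environments': [
        {
            'name': 'default',
        },
    ],
}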

casparvl added 3 commits June 13, 2023 15:11
@casparvl (Collaborator, Author)

Autodetect feature seems to do something:

$ reframe -C test-suite/config/aws_citc.py -c test-suite/eessi/testsuite/tests/apps/gromacs.py -t 1_node -l
Detecting topology of remote partition 'citc:c4.2xlarge (haswell)': this may take some time...

It's doing it one by one, though. Since provisioning through the cloud API takes quite a bit of time, this command may take very long to complete the first time around. It should store the processor info, however, so the next run should be fast.

Interesting that it also does remote detection if you just list the tests with -l. Still, it's quite convenient for checking whether the remote detection works. I'll report back once it's completed, but I think the config is ready to be used.
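(For reference: remote autodetection is controlled by the remote_detect setting in the general section of a ReFrame configuration, so presumably this config enables it along these lines:)

'general': [
    {
        'remote_detect': True,
    },
],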

@casparvl marked this pull request as ready for review June 13, 2023 15:50
@casparvl (Collaborator, Author)

Hm, I need to dive further into this. It seems ReFrame submits an rfm-detect-job, but always to fair-mastodon-c6g-2xlarge-0003 (whereas it should submit one for every node type). Then, I also get

$ reframe -C test-suite/config/aws_citc.py -c test-suite/eessi/testsuite/tests/apps/gromacs.py -t 1_node -l
Detecting topology of remote partition 'citc:c4.2xlarge (haswell)': this may take some time...
WARNING: failed to retrieve remote processor info: [Errno 2] No such file or directory: 'topo.json'
Detecting topology of remote partition 'citc:c4.4xlarge (haswell)': this may take some time...
WARNING: failed to retrieve remote processor info: [Errno 2] No such file or directory: 'topo.json'
Detecting topology of remote partition 'citc:c5a.2xlarge (ZEN2)': this may take some time...

in the output. Finally, the files should end up in ~/.reframe/topology/{system}-{partition}/processor.json, but I don't see those. Something is going wrong here...

@casparvl (Collaborator, Author)

Full error:

detecting topology info for citc:c4.2xlarge (haswell)
> no topology file found; auto-detecting...
Detecting topology of remote partition 'citc:c4.2xlarge (haswell)': this may take some time...
submitting detection script
--- /mnt/shared/home/casparvl/rfm.7lnb11np/rfm-detect-job.sh ---
#!/bin/bash
#SBATCH --job-name="rfm-detect-job"
#SBATCH --ntasks=1
#SBATCH --output=rfm-detect-job.out
#SBATCH --error=rfm-detect-job.err

_onerror()
{
    exitcode=$?
    echo "-reframe: command \`$BASH_COMMAND' failed (exit code: $exitcode)"
    exit $exitcode
}

trap _onerror ERR

./bootstrap.sh
mpirun -np 1 ./bin/reframe --detect-host-topology=topo.json

--- /mnt/shared/home/casparvl/rfm.7lnb11np/rfm-detect-job.sh ---
job finished
--- /mnt/shared/home/casparvl/rfm.7lnb11np/rfm-detect-job.out ---
==> python3 -m ensurepip --root /mnt/shared/home/casparvl/rfm.7lnb11np/external/ --default-pip
Requirement already satisfied: setuptools in /mnt/shared/home/casparvl/reframe_421/lib/python3.6/site-packages
Requirement already satisfied: pip in /mnt/shared/home/casparvl/reframe_421/lib/python3.6/site-packages
==> python3 -m pip install --no-cache-dir -q --upgrade pip --target=external/
==> python3 -m pip install --no-cache-dir -q -r requirements.txt --target=external/ --upgrade
-reframe: command `mpirun -np 1 ./bin/reframe --detect-host-topology=topo.json' failed (exit code: 127)

--- /mnt/shared/home/casparvl/rfm.7lnb11np/rfm-detect-job.out ---
--- /mnt/shared/home/casparvl/rfm.7lnb11np/rfm-detect-job.err ---
/var/spool/slurmd/job04947/slurm_script: line 17: mpirun: command not found

--- /mnt/shared/home/casparvl/rfm.7lnb11np/rfm-detect-job.err ---
WARNING: failed to retrieve remote processor info: [Errno 2] No such file or directory: 'topo.json'
Traceback (most recent call last):
  File "/mnt/shared/home/casparvl/reframe/reframe/frontend/autodetect.py", line 156, in _remote_detect
    topo_info = json.loads(_contents('topo.json'))
  File "/mnt/shared/home/casparvl/reframe/reframe/frontend/autodetect.py", line 30, in _contents
    with open(filename) as fp:
FileNotFoundError: [Errno 2] No such file or directory: 'topo.json'

> device auto-detection is not supported

Ok, so the problem seems to be that it tries to use the configured launcher, but clearly that is only available if some MPI module is loaded. I'll try to configure it with srun, just to see if that makes the autodetection pass. We can always decide later on if we still prefer mpirun, and if so, how we make sure the autodetect script loads the correct modules for it.
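A minimal sketch of that launcher switch in the partition definition (field names as in ReFrame's configuration schema; the actual partitions live in config/aws_citc.py):

partition = {
    'name': 'c4.2xlarge',
    'scheduler': 'slurm',
    # 'launcher': 'mpirun',  # only works once an MPI module is loaded
    'launcher': 'srun',      # available without modules, so the
                             # self-contained detect job can use it
}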

@casparvl (Collaborator, Author)

Works!

$ cat .reframe/topology/citc-c4.2xlarge\ \(haswell\)/processor.json
{
  "arch": "haswell",
  "topology": {
    "numa_nodes": [
      "0x00ff"
    ],
    "sockets": [
      "0x00ff"
    ],
    "cores": [
      "0x0011",
      "0x0022",
      "0x0044",
      "0x0088"
    ],
    "caches": [
      {
        "type": "L2",
        "size": 262144,
        "linesize": 64,
        "associativity": 8,
        "num_cpus": 2,
        "cpusets": [
          "0x0011",
          "0x0022",
          "0x0044",
          "0x0088"
        ]
      },
      {
        "type": "L1",
        "size": 32768,
        "linesize": 64,
        "associativity": 8,
        "num_cpus": 2,
        "cpusets": [
          "0x0011",
          "0x0022",
          "0x0044",
          "0x0088"
        ]
      },
      {
        "type": "L3",
        "size": 26214400,
        "linesize": 64,
        "associativity": 20,
        "num_cpus": 8,
        "cpusets": [
          "0x00ff"
        ]
      }
    ]
  },
  "num_cpus": 8,
  "num_cpus_per_core": 2,
  "num_cpus_per_socket": 8,
  "num_sockets": 1

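For anyone puzzling over the hex strings: each cpuset is a bitmask over logical CPU ids. A small illustrative Python helper (not part of this PR) to decode them:

def cpus_in_mask(mask):
    """Return the logical CPU ids set in a hex cpuset mask."""
    value = int(mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]

for core in ('0x0011', '0x0022', '0x0044', '0x0088'):
    print(core, cpus_in_mask(core))

The first line printed is 0x0011 [0, 4]: CPUs 0 and 4 are the two hardware threads of one physical core, consistent with num_cpus=8 and num_cpus_per_core=2.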
@casparvl (Collaborator, Author)

I'll rename the partition names in the config to be without spaces, since the topology files are put in directories with those names (and I'd prefer to avoid Linux directory names with spaces :))

…irectories when autodetecting CPUs. Changed launcher to srun, since the CPU autodetect script doesn't work with mpirun, as there is no 'system' mpirun command
@casparvl (Collaborator, Author)

Requested a feature to make the launcher for the remote detection job configurable: reframe-hpc/reframe#2926. If that is implemented, we can alter this config file to specify local as the launcher for the remote detection job, and go back to mpirun as the launcher for other tests.

…nment on the batch node to get the right architecture prefix
@boegel added this to the 0.1 milestone Jun 15, 2023
@casparvl (Collaborator, Author)

It's running new CPU autodetects, and afterwards it will run all single-node GROMACS tests. I'll update here once (if) that completes successfully, just to check that this config indeed works :)

…orted in job steps, otherwise srun doesn't work
@casparvl (Collaborator, Author)

Ok, moving back to mpirun, as I now remember why I used that. srun only works in one of two scenarios:

  • Using PMI2 (i.e. srun --mpi=pmi2), if the MPI library has been properly configured with PMI2 integration. Typically, this requires linking against the exact same libpmi2.so as Slurm was built with - anything else typically results in "An error occurred in MPI_Init" or something similar. Clearly, the OpenMPI in EESSI is too generic for this, as the libpmi2.so used to build the host Slurm is host-specific.
  • Using PMIx (i.e. srun --mpi=pmix), if Slurm supports PMIx. CITC's Slurm is not built with PMIx support, however.

I'll drop a comment in the config referring to this explanation. For now, it means we'll use mpirun as the default launcher on CITC. Only if we need to do CPU autodetection will we (temporarily) have to switch it to srun.
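(For the record: which MPI plugin types the host Slurm supports, e.g. pmi2 or pmix, can be checked with srun itself:)

$ srun --mpi=list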

casparvl added 2 commits June 20, 2023 14:21
… otherwise the job is effectively submitted without constraint
@casparvl (Collaborator, Author)

Hm, I ran into issues with the CPU autodetection. I think it started when we added --export=NONE.

Digging into this, it seems the ReFrame autodetect job hangs in the bootstrap, specifically when executing

python3 -m pip install --no-cache-dir --upgrade pip -q --target=external/

Trying this manually with the system Python: indeed, it hangs. When trying it with the python3 from a virtual environment (which is based on the system Python!), I seem to have no issue. The venv is very simple:

rm -rf /tmp/reframe_421
python3 -m venv /tmp/reframe_421
source /tmp/reframe_421/bin/activate
python3 -m pip install reframe-hpc==4.2.1

No clue why it hangs. For now, we can work around it by taking out the --export=NONE temporarily, so that the venv is exported to the job. Note that the CPU autodetect job doesn't respect prepare_cmds, so loading the venv in prepare_cmds doesn't work.
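Concretely, the temporary workaround amounts to toggling the access options in the partition definition, along these lines (a sketch; the exact flags are the ones in config/aws_citc.py):

'access': [
    '--constraint=shape=c4.2xlarge',  # placeholder constraint
    # '--export=NONE',  # comment out while running CPU autodetection, so
                        # that the login node's venv is exported to the job
],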

@casparvl (Collaborator, Author)

Ok, we'll just have to swap back and forth a bit if you want to autodetect the CPU features. That's a bit annoying, but it's the way it is for now. I've not experienced the pip --upgrade hanging issue on other systems, only on AWS.

@casparvl requested a review from boegel June 23, 2023 11:17
@casparvl (Collaborator, Author)

Ugh, issue on the Gravitons: the Python being invoked there by ReFrame's bootstrap script when --export=NONE is enabled is just the system Python again - i.e. the one that hangs. I'm not sure how I got this to work the first time. The issue is that I cannot make the CPU autodetect use a different python3 on the Gravitons - my virtualenv was built in /tmp, and even if I built it on a shared FS, it would be an x86_64 executable, which wouldn't work on Graviton.

I guess I will, after all, need to figure out why

python3 -m pip install --no-cache-dir --upgrade pip -q --target=external/

hangs and just fix that...

@casparvl (Collaborator, Author)

python3 -m pip install --no-cache-dir --upgrade pip -vvv --target=external/

Hangs at

Installing collected packages: pip

  Creating /tmp/tmprlinfxyx/bin
  changing mode of /tmp/tmprlinfxyx/bin/pip to 755
  changing mode of /tmp/tmprlinfxyx/bin/pip3 to 755
  changing mode of /tmp/tmprlinfxyx/bin/pip3.6 to 755
Successfully installed pip-21.3.1
Cleaning up...
Starting new HTTPS connection (1): pypi.python.org
https://pypi.python.org:443 "GET /pypi/pip/json HTTP/1.1" 301 122
Starting new HTTPS connection (1): pypi.org
https://pypi.org:443 "GET /pypi/pip/json HTTP/1.1" 200 37694

I.e. it is essentially done, but then makes a new connection for some reason, and hangs there.

@casparvl (Collaborator, Author)

Potentially https://stackoverflow.com/a/64605497
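If the hang really is pip's post-install self version check phoning home to pypi.org, as that answer suggests, then adding --disable-pip-version-check might avoid it. Untested here, though, and the command lives inside ReFrame's bootstrap script, so it is not trivial to change:

$ python3 -m pip install --no-cache-dir --upgrade pip -q --target=external/ --disable-pip-version-check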

@casparvl (Collaborator, Author)

It's really strange that it completes successfully as long as I just make a virtualenv out of it first...

[casparvl@fair-mastodon-c6g-2xlarge-0001 tmp]$ python3 -m venv bla
[casparvl@fair-mastodon-c6g-2xlarge-0001 tmp]$ source bla/bin/activate
(bla) [casparvl@fair-mastodon-c6g-2xlarge-0001 tmp]$ python3 -m pip install --no-cache-dir --upgrade pip -vvv --target=external/

Completes fine.

@boegel (Contributor) commented Jun 28, 2023

@casparvl What do we actually gain from the auto-detect stuff?

Doesn't it make more sense to not use auto-detect in places where we know the CPU architecture?

@casparvl (Collaborator, Author) commented Jul 5, 2023

Doesn't it make more sense to not use auto-detect in places where we know the CPU architecture?

The autodetected CPU topology has a standard format, and if we recommend this as a best practice, all tests can rely on those fields being set. E.g. it also sets num_cpus_per_core, which is 2 if hyperthreading is enabled. This then allows us to launch e.g. different task counts on systems with hyperthreading.

Anyway, the uniformity and predictability of the topology information is the key point here.
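As an illustration of why that uniformity matters (my example, not code from this PR), a test can size its tasks from the autodetected info through ReFrame's standard processor attributes:

import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class ProcInfoExample(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'true'

    @run_after('setup')
    def set_task_count(self):
        # Autodetected topology, e.g. from processor.json:
        proc = self.current_partition.processor
        # One task per physical core, independent of hyperthreading:
        self.num_tasks_per_node = proc.num_cpus // proc.num_cpus_per_core

    @sanity_function
    def assert_passed(self):
        return sn.assert_true(True)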

@casparvl (Collaborator, Author)

Put in a ticket with ReFrame regarding the hang of the bootstrap script on the Graviton nodes, to see if they know more: reframe-hpc/reframe#2950

@casparvl (Collaborator, Author) commented Aug 9, 2023

If this works, then #65 is not needed anymore and can be closed

@laraPPr (Collaborator) commented Aug 22, 2023

The script that ReFrame generates is failing: when it tries to load the module, it cannot find it, even though the output says that the EESSI environment has successfully been set up.

output:

Found EESSI pilot repo @ /cvmfs/pilot.eessi-hpc.org/versions/2021.12!
archspec says x86_64/intel/skylake_avx512
Using x86_64/intel/skylake_avx512 as software subdirectory.
Using /cvmfs/pilot.eessi-hpc.org/versions/2021.12/software/linux/x86_64/intel/skylake_avx512/modules/all as the directory to be added to MODULEPATH.
Found Lmod configuration file at /cvmfs/pilot.eessi-hpc.org/versions/2021.12/software/linux/x86_64/intel/skylake_avx512/.lmod/lmodrc.lua
Initializing Lmod...
Prepending /cvmfs/pilot.eessi-hpc.org/versions/2021.12/software/linux/x86_64/intel/skylake_avx512/modules/all to $MODULEPATH...
Environment set up to use EESSI pilot software stack, have fun!

error:

Lmod has detected the following error: The following module(s) are unknown:
"GROMACS/2021.3-foss-2021a"

@laraPPr (Collaborator) commented Aug 23, 2023

I ran the test suite on the EESSI repository /cvmfs/pilot.eessi-hpc.org/latest/init/bash with the ReFrame that is available in the software layer.
To make sure that the autodetection worked, we had to clone the ReFrame repository and add it to the PYTHONPATH. This is due to a bug in ReFrame, which you can find out more about in the following issue.

git clone git@github.com:reframe-hpc/reframe.git
export PYTHONPATH=$PWD:$PYTHONPATH

The autodetection for the x86_64 CPU targets worked by following the instructions in the config file. For the aarch64 CPU targets, however, only the launcher had to be changed from mpirun to srun.

After the autodetection succeeded, I ran three tests with the following parameters, which all passed successfully:

  • reframe --config-file config/aws_citc.py --checkpath eessi/testsuite/tests/apps -R --tag 1_node --tag CI --system citc:aarch64-graviton2-8c-16gb --run --performance-report
  • reframe --config-file config/aws_citc.py --checkpath eessi/testsuite/tests/apps -R --tag 1_node --tag CI --system citc:x86_64-skylake-cascadelake-8c-16gb --run --performance-report
  • reframe --config-file config/aws_citc.py --checkpath eessi/testsuite/tests/apps -R --tag 1_node --tag CI --system citc:x86_64-haswell-8c-15gb --run --performance-report

@casparvl (Collaborator, Author)

From discussion on Slack, Lara had this error at some point:

--- /mnt/shared/home/laraPPr/eessi/test-suite/rfm._xzvpdjt/rfm-detect-job.sh ---
#!/bin/bash
#SBATCH --job-name="rfm-detect-job"
#SBATCH --ntasks=1
#SBATCH --output=rfm-detect-job.out
#SBATCH --error=rfm-detect-job.err
#SBATCH --constraint=shape=c6g.8xlarge

_onerror()
{
    exitcode=$?
    echo "-reframe: command \`$BASH_COMMAND' failed (exit code: $exitcode)"
    exit $exitcode
}

trap _onerror ERR

./bootstrap.sh
srun ./bin/reframe --detect-host-topology=topo.json

--- /mnt/shared/home/laraPPr/eessi/test-suite/rfm.9v9st15f/rfm-detect-job.sh ---
job finished
--- /mnt/shared/home/laraPPr/eessi/test-suite/rfm.9v9st15f/rfm-detect-job.out ---
-reframe: command `./bootstrap.sh' failed (exit code: 126)

--- /mnt/shared/home/laraPPr/eessi/test-suite/rfm.9v9st15f/rfm-detect-job.out ---
--- /mnt/shared/home/laraPPr/eessi/test-suite/rfm.9v9st15f/rfm-detect-job.err ---
./bootstrap.sh: line 76: /cvmfs/pilot.eessi-hpc.org/versions/2021.12/compat/linux/x86_64/bin/sed: cannot execute binary file: Exec format error
./bootstrap.sh: line 76: /cvmfs/pilot.eessi-hpc.org/versions/2021.12/compat/linux/x86_64/usr/bin/python3: cannot execute binary file: Exec format error

--- /mnt/shared/home/laraPPr/eessi/test-suite/rfm.9v9st15f/rfm-detect-job.err ---
WARNING: failed to retrieve remote processor info: [Errno 2] No such file or directory: 'topo.json'
Traceback (most recent call last):
  File "/mnt/shared/home/laraPPr/eessi/reframe/reframe/frontend/autodetect.py", line 156, in _remote_detect
    topo_info = json.loads(_contents('topo.json'))
  File "/mnt/shared/home/laraPPr/eessi/reframe/reframe/frontend/autodetect.py", line 30, in _contents
    with open(filename) as fp:
FileNotFoundError: [Errno 2] No such file or directory: 'topo.json'

That was due to the environment from the login node being exported: that environment contained the x86_64 compat layer, which of course doesn't work on aarch64 nodes.

I then suggested trying with --export=NONE (in the partition access options) and with a source /cvmfs/.../init/bash. However, it turns out the latter is not needed, and that makes sense: the whole idea of the CPU autodetect script is that it is self-contained. It bootstraps an installation, which is performed on the batch node, and then it uses that (bootstrapped) installation to do the CPU autodetect.

Thinking about this, I'm suddenly not sure why I needed to remove the --export=NONE when I ran this with a ReFrame installation from a virtual environment. Did I need something from that environment to also be available on the batch node? Maybe, but I can't remember what.

So: I'll retest it myself as well, and assuming that goes well, take out the instruction to remove --export=NONE.

@casparvl (Collaborator, Author)

Ah, now I remember why I removed --export=NONE; it was in this comment. Not sure if it is still relevant, though. I've started a run with --export=NONE, and it seems to work fine for now - and I guess Lara saw the same. Maybe the python3 got updated; maybe it was related to the same cache issue that broke things for me on the Graviton nodes.

Switching to srun will still be needed, as there is no system mpirun command.

@laraPPr (Collaborator) left a comment

lgtm

@laraPPr merged commit 012e74c into EESSI:main Aug 24, 2023
9 checks passed