Added AWS config file #53

Merged
merged 19 commits into from Aug 24, 2023
Conversation

@casparvl (Collaborator)

Added a config file for use in the EESSI AWS build and test environment. It's not complete yet, it still needs core counts for each of the partitions, but I want to see if I can use ReFrame's autodetect feature to autodetect these.
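For context, a ReFrame site configuration for a system like this roughly takes the following shape. This is an illustrative sketch only: the partition name, access flags and environment below are placeholders, and the real definitions are the ones in config/aws_citc.py.

site_configuration = {
    'systems': [
        {
            'name': 'citc',
            'descr': 'Cluster in the Cloud build and test environment',
            'hostnames': ['.*'],
            'partitions': [
                {
                    'name': 'c4.2xlarge',  # placeholder partition name
                    'scheduler': 'slurm',
                    'launcher': 'mpirun',
                    'access': ['--constraint=shape=c4.2xlarge'],
                    'environs': ['default'],
                    'max_jobs': 4,
                    # Core counts deliberately left out: the plan is to let
                    # ReFrame's autodetect feature fill in the processor info.
                },
            ],
        },
    ],
    'environments': [
        {
            'name': 'default',
        },
    ],
}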

casparvl added 3 commits June 13, 2023 15:11
@casparvl (Collaborator, Author)

Autodetect feature seems to do something:

$ reframe -C test-suite/config/aws_citc.py -c test-suite/eessi/testsuite/tests/apps/gromacs.py -t 1_node -l
Detecting topology of remote partition 'citc:c4.2xlarge (haswell)': this may take some time...

It's doing it one by one, though. Since provisioning through the cloud API takes quite a bit of time, this command may take very long to complete the first time around. It should store the processor info, however, so the next run should be fast.

Interesting that it also does remote detection if you just list the tests with -l. Still, it's quite convenient for checking whether the remote detection works. I'll report back once it's completed, but I think the config is ready to be used.
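(For reference: remote autodetection is controlled by the remote_detect setting in the general section of a ReFrame configuration, so presumably this config enables it along these lines:)

'general': [
    {
        'remote_detect': True,
    },
],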

@casparvl marked this pull request as ready for review June 13, 2023 15:50
@casparvl (Collaborator, Author)

Hm, I need to dive further into this. It seems ReFrame submits an rfm-detect-job, but always to fair-mastodon-c6g-2xlarge-0003 (whereas it should submit one for every node type). Then, I also get

$ reframe -C test-suite/config/aws_citc.py -c test-suite/eessi/testsuite/tests/apps/gromacs.py -t 1_node -l
Detecting topology of remote partition 'citc:c4.2xlarge (haswell)': this may take some time...
WARNING: failed to retrieve remote processor info: [Errno 2] No such file or directory: 'topo.json'
Detecting topology of remote partition 'citc:c4.4xlarge (haswell)': this may take some time...
WARNING: failed to retrieve remote processor info: [Errno 2] No such file or directory: 'topo.json'
Detecting topology of remote partition 'citc:c5a.2xlarge (ZEN2)': this may take some time...

in the output. Finally, the files should end up in ~/.reframe/topology/{system}-{partition}/processor.json, but I don't see those. Something is going wrong here...

@casparvl (Collaborator, Author)

Full error:

detecting topology info for citc:c4.2xlarge (haswell)
> no topology file found; auto-detecting...
Detecting topology of remote partition 'citc:c4.2xlarge (haswell)': this may take some time...
submitting detection script
--- /mnt/shared/home/casparvl/rfm.7lnb11np/rfm-detect-job.sh ---
#!/bin/bash
#SBATCH --job-name="rfm-detect-job"
#SBATCH --ntasks=1
#SBATCH --output=rfm-detect-job.out
#SBATCH --error=rfm-detect-job.err

_onerror()
{
    exitcode=$?
    echo "-reframe: command \`$BASH_COMMAND' failed (exit code: $exitcode)"
    exit $exitcode
}

trap _onerror ERR

./bootstrap.sh
mpirun -np 1 ./bin/reframe --detect-host-topology=topo.json

--- /mnt/shared/home/casparvl/rfm.7lnb11np/rfm-detect-job.sh ---
job finished
--- /mnt/shared/home/casparvl/rfm.7lnb11np/rfm-detect-job.out ---
==> python3 -m ensurepip --root /mnt/shared/home/casparvl/rfm.7lnb11np/external/ --default-pip
Requirement already satisfied: setuptools in /mnt/shared/home/casparvl/reframe_421/lib/python3.6/site-packages
Requirement already satisfied: pip in /mnt/shared/home/casparvl/reframe_421/lib/python3.6/site-packages
==> python3 -m pip install --no-cache-dir -q --upgrade pip --target=external/
==> python3 -m pip install --no-cache-dir -q -r requirements.txt --target=external/ --upgrade
-reframe: command `mpirun -np 1 ./bin/reframe --detect-host-topology=topo.json' failed (exit code: 127)

--- /mnt/shared/home/casparvl/rfm.7lnb11np/rfm-detect-job.out ---
--- /mnt/shared/home/casparvl/rfm.7lnb11np/rfm-detect-job.err ---
/var/spool/slurmd/job04947/slurm_script: line 17: mpirun: command not found

--- /mnt/shared/home/casparvl/rfm.7lnb11np/rfm-detect-job.err ---
WARNING: failed to retrieve remote processor info: [Errno 2] No such file or directory: 'topo.json'
Traceback (most recent call last):
  File "/mnt/shared/home/casparvl/reframe/reframe/frontend/autodetect.py", line 156, in _remote_detect
    topo_info = json.loads(_contents('topo.json'))
  File "/mnt/shared/home/casparvl/reframe/reframe/frontend/autodetect.py", line 30, in _contents
    with open(filename) as fp:
FileNotFoundError: [Errno 2] No such file or directory: 'topo.json'

> device auto-detection is not supported

Ok, so the problem seems to be that it tries to use the configured launcher, but clearly that is only available if some MPI module is loaded. I'll try to configure it with srun, just to see if that makes the autodetection pass. We can always decide later on if we still prefer mpirun, and if so, how we make sure the autodetect script loads the correct modules for it.
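A minimal sketch of that launcher switch in the partition definition (field names as in ReFrame's configuration schema; the actual partitions live in config/aws_citc.py):

partition = {
    'name': 'c4.2xlarge',
    'scheduler': 'slurm',
    # 'launcher': 'mpirun',  # only works once an MPI module is loaded
    'launcher': 'srun',      # available without modules, so the
                             # self-contained detect job can use it
}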

@casparvl (Collaborator, Author)

Works!

$ cat .reframe/topology/citc-c4.2xlarge\ \(haswell\)/processor.json
{
  "arch": "haswell",
  "topology": {
    "numa_nodes": [
      "0x00ff"
    ],
    "sockets": [
      "0x00ff"
    ],
    "cores": [
      "0x0011",
      "0x0022",
      "0x0044",
      "0x0088"
    ],
    "caches": [
      {
        "type": "L2",
        "size": 262144,
        "linesize": 64,
        "associativity": 8,
        "num_cpus": 2,
        "cpusets": [
          "0x0011",
          "0x0022",
          "0x0044",
          "0x0088"
        ]
      },
      {
        "type": "L1",
        "size": 32768,
        "linesize": 64,
        "associativity": 8,
        "num_cpus": 2,
        "cpusets": [
          "0x0011",
          "0x0022",
          "0x0044",
          "0x0088"
        ]
      },
      {
        "type": "L3",
        "size": 26214400,
        "linesize": 64,
        "associativity": 20,
        "num_cpus": 8,
        "cpusets": [
          "0x00ff"
        ]
      }
    ]
  },
  "num_cpus": 8,
  "num_cpus_per_core": 2,
  "num_cpus_per_socket": 8,
  "num_sockets": 1

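For anyone puzzling over the hex strings: each cpuset is a bitmask over logical CPU ids. A small illustrative Python helper (not part of this PR) to decode them:

def cpus_in_mask(mask):
    """Return the logical CPU ids set in a hex cpuset mask."""
    value = int(mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]

for core in ('0x0011', '0x0022', '0x0044', '0x0088'):
    print(core, cpus_in_mask(core))

The first line printed is 0x0011 [0, 4]: CPUs 0 and 4 are the two hardware threads of one physical core, consistent with num_cpus=8 and num_cpus_per_core=2.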
@casparvl (Collaborator, Author)

I'll rename the partition names in the config to be without spaces, since the topology files are put in directories with those names (and I'd prefer to avoid Linux directory names with spaces :))

…irectories when autodetecting CPUs. Changed launcher to srun, since the CPU autodetect script doesn't work with mpirun, as there is no 'system' mpirun command
@casparvl (Collaborator, Author)

Requested a feature to make the launcher for the remote detection job configurable: reframe-hpc/reframe#2926. If that is implemented, we can alter this config file to specify local as the launcher for the remote detection job, and go back to mpirun as the launcher for other tests.

…nment on the batch node to get the right architecture prefix
@boegel added this to the 0.1 milestone Jun 15, 2023
@casparvl (Collaborator, Author)

It's running new CPU autodetects, and afterwards it will run all single-node GROMACS tests. I'll update here once (if) that completes successfully, just to check that this config indeed works :)

…orted in job steps, otherwise srun doesn't work
@casparvl (Collaborator, Author)

Ok, moving back to mpirun, as I now remember why I used that. srun only works in one of two scenarios:

  • Using PMI2 (i.e. srun --mpi=pmi2), if the MPI library has been properly configured with PMI2 integration. Typically, this requires linking against the exact same libpmi2.so as Slurm was built with - anything else typically results in "An error occurred in MPI_Init" or something similar. Clearly, the OpenMPI in EESSI is too generic for this, as the libpmi2.so used to build the host Slurm is host-specific.
  • Using PMIx (i.e. srun --mpi=pmix), if Slurm supports PMIx. CITC's Slurm is not built with PMIx support, however.

I'll drop a comment in the config referring to this explanation. For now, it means we'll use mpirun as the default launcher on CITC. Only if we need to do CPU autodetection will we (temporarily) have to switch it to srun.
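(For the record: which MPI plugin types the host Slurm supports, e.g. pmi2 or pmix, can be checked with srun itself:)

$ srun --mpi=list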

casparvl added 2 commits June 20, 2023 14:21
… otherwise the job is effectively submitted without constraint
@casparvl (Collaborator, Author)

Hm, I ran into issues with the CPU autodetection. I think it started when we added --export=NONE.

Digging into this, it seems the ReFrame autodetect job hangs in the bootstrap, specifically when executing

python3 -m pip install --no-cache-dir --upgrade pip -q --target=external/

Trying this manually with the system Python: indeed, it hangs. When trying it with the python3 from a virtual environment (which is based on the system Python!), I seem to have no issue. The venv is very simple:

rm -rf /tmp/reframe_421
python3 -m venv /tmp/reframe_421
source /tmp/reframe_421/bin/activate
python3 -m pip install reframe-hpc==4.2.1

No clue why it hangs. For now, we can work around it by taking out the --export=NONE temporarily, so that the venv is exported to the job. Note that the CPU autodetect job doesn't respect prepare_cmds, so loading the venv in prepare_cmds doesn't work.
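Concretely, the temporary workaround amounts to toggling the access options in the partition definition, along these lines (a sketch; the exact flags are the ones in config/aws_citc.py):

'access': [
    '--constraint=shape=c4.2xlarge',  # placeholder constraint
    # '--export=NONE',  # comment out while running CPU autodetection, so
                        # that the login node's venv is exported to the job
],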

@casparvl (Collaborator, Author)

Ok, we'll just have to swap back and forth a bit if you want to autodetect the CPU features. That's a bit annoying, but it's the way it is for now. I've not experienced the pip --upgrade hanging issue on other systems, only on AWS.

@casparvl requested a review from boegel June 23, 2023 11:17
@casparvl (Collaborator, Author)

Ugh, issue on the Gravitons: the Python being invoked there by ReFrame's bootstrap script when --export=NONE is enabled is just the system Python again - i.e. the one that hangs. I'm not sure how I got this to work the first time. The issue is that I cannot make the CPU autodetect use a different python3 on the Gravitons - my virtualenv was built in /tmp, and even if I built it on a shared FS, it would be an x86_64 executable, which wouldn't work on Graviton.

I guess I will, after all, need to figure out why

python3 -m pip install --no-cache-dir --upgrade pip -q --target=external/

hangs and just fix that...

@casparvl (Collaborator, Author)

python3 -m pip install --no-cache-dir --upgrade pip -vvv --target=external/

Hangs at

Installing collected packages: pip

  Creating /tmp/tmprlinfxyx/bin
  changing mode of /tmp/tmprlinfxyx/bin/pip to 755
  changing mode of /tmp/tmprlinfxyx/bin/pip3 to 755
  changing mode of /tmp/tmprlinfxyx/bin/pip3.6 to 755
Successfully installed pip-21.3.1
Cleaning up...
Starting new HTTPS connection (1): pypi.python.org
https://pypi.python.org:443 "GET /pypi/pip/json HTTP/1.1" 301 122
Starting new HTTPS connection (1): pypi.org
https://pypi.org:443 "GET /pypi/pip/json HTTP/1.1" 200 37694

I.e. it is essentially done, but then makes a new connection for some reason, and hangs there.

@casparvl (Collaborator, Author)

Potentially https://stackoverflow.com/a/64605497
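If the hang really is pip's post-install self version check phoning home to pypi.org, as that answer suggests, then adding --disable-pip-version-check might avoid it. Untested here, though, and the command lives inside ReFrame's bootstrap script, so it is not trivial to change:

$ python3 -m pip install --no-cache-dir --upgrade pip -q --target=external/ --disable-pip-version-check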

@casparvl (Collaborator, Author)

It's really strange that it completes successfully as long as I just make a virtualenv out of it first...

[casparvl@fair-mastodon-c6g-2xlarge-0001 tmp]$ python3 -m venv bla
[casparvl@fair-mastodon-c6g-2xlarge-0001 tmp]$ source bla/bin/activate
(bla) [casparvl@fair-mastodon-c6g-2xlarge-0001 tmp]$ python3 -m pip install --no-cache-dir --upgrade pip -vvv --target=external/

Completes fine.

@boegel (Contributor) commented Jun 28, 2023

@casparvl What do we actually gain from the auto-detect stuff?

Doesn't it make more sense to not use auto-detect in places where we know the CPU architecture?

@casparvl (Collaborator, Author) commented Jul 5, 2023

Doesn't it make more sense to not use auto-detect in places where we know the CPU architecture?

The autodetected CPU topology has a standard format, and if we recommend this as a best practice, all tests can rely on those fields being set. E.g. it also sets num_cpus_per_core, which is 2 if hyperthreading is enabled. This then allows us to launch e.g. different task counts on systems with hyperthreading.

Anyway, the uniformity and predictability of the topology information is the key point here.
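As an illustration of why that uniformity matters (my example, not code from this PR), a test can size its tasks from the autodetected info through ReFrame's standard processor attributes:

import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class ProcInfoExample(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'true'

    @run_after('setup')
    def set_task_count(self):
        # Autodetected topology, e.g. from processor.json:
        proc = self.current_partition.processor
        # One task per physical core, independent of hyperthreading:
        self.num_tasks_per_node = proc.num_cpus // proc.num_cpus_per_core

    @sanity_function
    def assert_passed(self):
        return sn.assert_true(True)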

@casparvl (Collaborator, Author)

Put in a ticket with ReFrame regarding the hang of the bootstrap script on the Graviton nodes, to see if they know more: reframe-hpc/reframe#2950

@casparvl (Collaborator, Author) commented Aug 9, 2023

If this works, then #65 is not needed anymore and can be closed

@laraPPr (Collaborator) commented Aug 22, 2023

The script that ReFrame generates is failing: when it tries to load the module, it cannot find it, even though the output says that the EESSI environment has successfully been set up.

output:

Found EESSI pilot repo @ /cvmfs/pilot.eessi-hpc.org/versions/2021.12!
archspec says x86_64/intel/skylake_avx512
Using x86_64/intel/skylake_avx512 as software subdirectory.
Using /cvmfs/pilot.eessi-hpc.org/versions/2021.12/software/linux/x86_64/intel/skylake_avx512/modules/all as the directory to be added to MODULEPATH.
Found Lmod configuration file at /cvmfs/pilot.eessi-hpc.org/versions/2021.12/software/linux/x86_64/intel/skylake_avx512/.lmod/lmodrc.lua
Initializing Lmod...
Prepending /cvmfs/pilot.eessi-hpc.org/versions/2021.12/software/linux/x86_64/intel/skylake_avx512/modules/all to $MODULEPATH...
Environment set up to use EESSI pilot software stack, have fun!

error:

Lmod has detected the following error: The following module(s) are unknown:
"GROMACS/2021.3-foss-2021a"

@laraPPr (Collaborator) commented Aug 23, 2023

I ran the test suite on the EESSI repository /cvmfs/pilot.eessi-hpc.org/latest/init/bash with the ReFrame that is available in the software layer.
To make sure that the autodetection worked, we had to clone the ReFrame repository and add it to the PYTHONPATH. This is due to a bug in ReFrame, which you can find out more about in the following issue.

git clone git@github.com:reframe-hpc/reframe.git
export PYTHONPATH=$PWD:$PYTHONPATH

The autodetection for the x86_64 CPU targets worked by following the instructions in the config file. For the aarch64 CPU targets, however, only the launcher had to be changed from mpirun to srun.

After the autodetection succeeded, I ran three tests with the following parameters, which all passed successfully:

  • reframe --config-file config/aws_citc.py --checkpath eessi/testsuite/tests/apps -R --tag 1_node --tag CI --system citc:aarch64-graviton2-8c-16gb --run --performance-report
  • reframe --config-file config/aws_citc.py --checkpath eessi/testsuite/tests/apps -R --tag 1_node --tag CI --system citc:x86_64-skylake-cascadelake-8c-16gb --run --performance-report
  • reframe --config-file config/aws_citc.py --checkpath eessi/testsuite/tests/apps -R --tag 1_node --tag CI --system citc:x86_64-haswell-8c-15gb --run --performance-report

@casparvl (Collaborator, Author)

From discussion on Slack, Lara had this error at some point:

--- /mnt/shared/home/laraPPr/eessi/test-suite/rfm._xzvpdjt/rfm-detect-job.sh ---
#!/bin/bash
#SBATCH --job-name="rfm-detect-job"
#SBATCH --ntasks=1
#SBATCH --output=rfm-detect-job.out
#SBATCH --error=rfm-detect-job.err
#SBATCH --constraint=shape=c6g.8xlarge

_onerror()
{
    exitcode=$?
    echo "-reframe: command \`$BASH_COMMAND' failed (exit code: $exitcode)"
    exit $exitcode
}

trap _onerror ERR

./bootstrap.sh
srun ./bin/reframe --detect-host-topology=topo.json

--- /mnt/shared/home/laraPPr/eessi/test-suite/rfm.9v9st15f/rfm-detect-job.sh ---
job finished
--- /mnt/shared/home/laraPPr/eessi/test-suite/rfm.9v9st15f/rfm-detect-job.out ---
-reframe: command `./bootstrap.sh' failed (exit code: 126)

--- /mnt/shared/home/laraPPr/eessi/test-suite/rfm.9v9st15f/rfm-detect-job.out ---
--- /mnt/shared/home/laraPPr/eessi/test-suite/rfm.9v9st15f/rfm-detect-job.err ---
./bootstrap.sh: line 76: /cvmfs/pilot.eessi-hpc.org/versions/2021.12/compat/linux/x86_64/bin/sed: cannot execute binary file: Exec format error
./bootstrap.sh: line 76: /cvmfs/pilot.eessi-hpc.org/versions/2021.12/compat/linux/x86_64/usr/bin/python3: cannot execute binary file: Exec format error

--- /mnt/shared/home/laraPPr/eessi/test-suite/rfm.9v9st15f/rfm-detect-job.err ---
WARNING: failed to retrieve remote processor info: [Errno 2] No such file or directory: 'topo.json'
Traceback (most recent call last):
  File "/mnt/shared/home/laraPPr/eessi/reframe/reframe/frontend/autodetect.py", line 156, in _remote_detect
    topo_info = json.loads(_contents('topo.json'))
  File "/mnt/shared/home/laraPPr/eessi/reframe/reframe/frontend/autodetect.py", line 30, in _contents
    with open(filename) as fp:
FileNotFoundError: [Errno 2] No such file or directory: 'topo.json'

That was due to the environment from the login node being exported: that environment contained the x86_64 compat layer, which of course doesn't work on aarch64 nodes.

I then suggested trying with --export=NONE (in the partition access options) and with a source /cvmfs/.../init/bash. However, it turns out the latter is not needed, and that makes sense: the whole idea of the CPU autodetect script is that it is self-contained. It bootstraps an installation, which is performed on the batch node, and then it uses that (bootstrapped) installation to do the CPU autodetect.

Thinking about this, I'm suddenly not sure why I needed to remove the --export=NONE when I ran this with a ReFrame installation from a virtual environment. Did I need something from that environment to also be available on the batch node? Maybe, but I can't remember what.

So: I'll retest it myself as well, and assuming that goes well, take out the instruction to remove --export=NONE.

@casparvl (Collaborator, Author)

Ah, now I remember why I removed --export=NONE; it was in this comment. Not sure if it is still relevant, though. I've started a run with --export=NONE, and it seems to work fine for now - and I guess Lara saw the same. Maybe the python3 got updated; maybe it was related to the same cache issue that broke things for me on the Graviton nodes.

Switching to srun will still be needed, as there is no system mpirun command.

@laraPPr (Collaborator) left a comment

lgtm

@laraPPr merged commit 012e74c into EESSI:main Aug 24, 2023
9 checks passed