add Tensorflow test #38

Merged: 36 commits into EESSI:main on Jul 31, 2023

Conversation

casparvl (Collaborator)

No description provided.

Caspar van Leeuwen and others added 8 commits May 22, 2023 18:24
…nted into functions. Also separated the training (which is timed now) and evaluation (which is not timed). Clearly print computational performance and accuracy at the end, to make it easy for a ReFrame test to pick up in sanity and performance functions.
…sks etc is still hard-coded. Also still a todo: make sure that proper binding is used
@casparvl (Collaborator Author)

The main issue with this PR is currently twofold:

  1. All threads are incorrectly bound to the first core in the allocation as soon as either OMP_PLACES or OMP_PROC_BIND is set. For details, see "OMP_PROC_BIND or OMP_PLACES either ignored or respected incorrectly" (tensorflow/tensorflow#60843).
  2. On a multi-GPU run, one of the ranks does a large amount of work on the CPU (instead of the GPU), while the GPUs seem underutilized. This problem is exacerbated by binding, leading to a slowdown.

Note that I've been testing this with our locally built TensorFlow module, TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0.

Issue 1

For details, see tensorflow/tensorflow#60843

Issue 2

Running

mpirun -np 4 --bind-to none python tf_test.py --device gpu --intra-op-parallelism 18 --inter-op-parallelism 1

I see quite a lot of CPU usage. Note that I upped the batch size to 4096 to get the GPUs to >90% utilization. Although not very clearly visible from the screenshot, pretty much all of the CPU activity is from the first python process (the one corresponding to rank 0):

[screenshot]

From the screenshot, it is also clear that one of the GPUs seems to be underutilized. The GPU that is underutilized also seems to change per epoch:

[screenshot]

Total throughput for this run is 324557.2348381291 img/s.

Note that when using 1 GPU:

mpirun -np 1 --bind-to none python tf_test.py --device gpu --intra-op-parallelism 18 --inter-op-parallelism 1

I don't see this high CPU utilization. I guess it is related to the communication with the other workers (maybe wait time?).

[screenshot]

For reference, this run resulted in Performance: 305804.27216238284 img/s. That means there is hardly any increase in throughput with 4 GPUs (324557 vs 305804 img/s, i.e. only about 6% more throughput for four times the GPUs). This could be due to management overhead, or because the run is bottlenecked by the one GPU that is not well utilized.

The high CPU utilization wouldn't be a huge issue if I didn't see a reduced performance when e.g. binding:

mpirun -np 4 --map-by node:PE=18 python tf_test.py --device gpu --intra-op-parallelism 18 --inter-op-parallelism 1

Utilization is clearly lower:

[screenshot]

and indeed performance goes down, to Performance: 182420.67007894177 img/s.

@casparvl (Collaborator Author) commented Jun 19, 2023

I've tried to reproduce this with tf-nightly. For some reason, tf-nightly seems to hang in NCCL initialization. Turning off NCCL as the communication option:

#   communication_options = tf.distribute.experimental.CommunicationOptions(
#       implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)
#   strategy = tf.distribute.MultiWorkerMirroredStrategy(communication_options=communication_options)
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

makes the code run. Then I see a nicely balanced load over the GPUs (though occupancy isn't great), and no strangely high CPU usage for the first rank:

[screenshot]

Performance is pretty similar to before: Performance: 338331.0040426434 img/s.

Let's see if I can replicate this with my TensorFlow module if I turn off NCCL:

[screenshot]

Ugh, I guess that's a no: I'm still seeing the high CPU load, and GPU utilization is terrible if I don't select the NCCL communicator. Unsurprisingly, performance is terrible too: Performance: 95309.03952265615 img/s.

Of course, this could still be a version difference: nightly is somewhere beyond the latest release, which is version 2.12, whereas my local module is 2.11.

Ugh... TensorFlow is so complicated - and multinode TensorFlow doubly so.

@casparvl (Collaborator Author)

With a pip-installed version 2.11 of TensorFlow, running:

XLA_FLAGS="--xla_gpu_cuda_data_dir=$CUDA_HOME" mpirun -np 4 --map-by node:PE=18 python tf_test.py --device gpu --intra-op-parallelism 18 --inter-op-parallelism 1

[screenshot]

Performance is similar to the unbound case we had before: Performance: 313995.7833386784 img/s. However, here the processes are bound, and we don't see the ridiculously high CPU usage.

Note that this is without NCCL as communication_option. With NCCL, I see the same strange behaviour of unbalanced GPU usage (still no high CPU usage though):

[screenshot]

Performance is slightly higher than for the other cases: Performance: 362982.86494292156 img/s.

It makes me wonder: maybe the high GPU utilization on some GPUs simply indicates wait cycles on the GPU. The GPU with the lowest utilization is simply the bottleneck in that epoch, and the rest are waiting for it? I'm not sure, it's just a theory.

@satishskamath (Collaborator)

If you bypass NCCL, does it also bypass NVLink and do all the communication via the host? That could be the reason for the worse performance with TF nightly. Regarding the utilization going from 1 to 4 GPUs, that is a bigger mystery; a deeper profile is required.

@casparvl (Collaborator Author)

"That could be the reason for the worse performance with TF nightly"

tf-nightly (without NCCL) actually performs pretty decently, at around 330k img/s. It was our local TF module without NCCL that was terrible :P

@casparvl (Collaborator Author)

The more I think about it, the more I believe the test is actually done. It currently does:

  • Process binding, but not thread binding, to avoid issue (1); see the sketch after this list. This is a pretty reasonable default. It might not give the absolute highest performance, but it should be fairly decent in all cases. And: this is a test, not a benchmark; we don't need it to achieve the absolute highest performance possible.
  • NCCL communication on GPU. With process binding plus the high CPU load, this actually shows sub-par performance. However, that is exactly what this test should be doing: it shows that something is off with our module installation. If anything, that makes it a good test.
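
Concretely, the binding choice amounts to setting launcher-level binding variables and deliberately leaving the OpenMP binding variables alone. A minimal sketch, mirroring the environment variables visible in the generated job scripts below (not the literal hook code; 'test' is the ReFrame test instance inside the hook):

    # Process binding via the MPI launcher only (Intel MPI and Open MPI shown)
    test.env_vars['I_MPI_PIN_DOMAIN'] = '%s:compact' % test.num_cpus_per_task
    test.env_vars['OMPI_MCA_rmaps_base_mapping_policy'] = 'node:PE=%s' % test.num_cpus_per_task
    # Deliberately NOT setting OMP_PROC_BIND / OMP_PLACES, to avoid
    # tensorflow/tensorflow#60843 (all threads pinned to the first core).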

Maybe it's simply time to take the next step and have some others test it on their TensorFlow module. Unfortunately, there is no GPU-capable TF module in EESSI yet, so I can't test with that (and maybe worrying about the GPU part of this test is a bit premature anyway). @smoors @boegel any chance you could run this on your own clusters, with your own TensorFlow modules, and see whether you observe the same things I see above (i.e. high CPU utilization for rank 0, somewhat sub-par performance)?

@casparvl (Collaborator Author)

Hm, I overlooked something in the hooks:

[     FAIL ] ( 1/33) TENSORFLOW_EESSI %scale=1_core %module_name=TensorFlow/2.11.0-foss-2022a %device_type=cpu /af8226d5 @snellius:thin+default
==> test failed during 'run': test staged in '/scratch-shared/casparl/reframe_output/staging/snellius/thin/default/TENSORFLOW_EESSI_af8226d5'
[     FAIL ] ( 2/33) TENSORFLOW_EESSI %scale=1_core %module_name=TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0 %device_type=cpu /7feefe35 @snellius:thin+default
==> test failed during 'run': test staged in '/scratch-shared/casparl/reframe_output/staging/snellius/thin/default/TENSORFLOW_EESSI_7feefe35'

The jobscript looks like this:

#!/bin/bash
#SBATCH --job-name="rfm_job"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=0
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p thin --exclusive
source $EESSI_INIT
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
export I_MPI_PIN_DOMAIN=0:compact
export OMPI_MCA_rmaps_base_mapping_policy=node:PE=0
export SLURM_CPU_BIND=q
mpirun -np 2 python tf_test.py --device cpu --intra-op-parallelism 0 --inter-op-parallelism 1

The issue here is that the hook still generates a fixed number of tasks per node, equal to the socket count. The cpus-per-task is then calculated by taking the default CPU count (1 for this scale) and dividing by the task count, rounding down.
The real error is of course the tasks-per-node count, which should simply be 1 for these cases. I'll fix this in the hook.
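
Roughly, the fix I have in mind looks like this (a sketch only; the function name and exact attributes are placeholders, not the actual hook API):

    def set_tasks_for_scale(test, num_cpus_in_scale):
        """Hypothetical sketch of the corrected hook logic.

        For scales smaller than one socket (1_core, 2_cores, 4_cores, ...) we want
        a single task that gets all requested cores, instead of one task per
        socket with a cpus-per-task that gets rounded down to 0.
        """
        num_cpus_per_socket = test.current_partition.processor.num_cpus_per_socket
        if num_cpus_in_scale < num_cpus_per_socket:
            test.num_tasks_per_node = 1
            test.num_cpus_per_task = num_cpus_in_scale
        else:
            # One task per socket, each task getting a full socket's worth of cores
            test.num_tasks_per_node = num_cpus_in_scale // num_cpus_per_socket
            test.num_cpus_per_task = num_cpus_per_socket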

@casparvl (Collaborator Author)

Yep, that looks better. For 1_core:

#!/bin/bash
#SBATCH --job-name="rfm_job"
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p thin --exclusive
source $EESSI_INIT
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
export I_MPI_PIN_DOMAIN=1:compact
export OMPI_MCA_rmaps_base_mapping_policy=node:PE=1
export SLURM_CPU_BIND=q
mpirun -np 1 python tf_test.py --device cpu --intra-op-parallelism 1 --inter-op-parallelism 1

for 2_cores:

#!/bin/bash
#SBATCH --job-name="rfm_job"
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p thin --exclusive
source $EESSI_INIT
module load TensorFlow/2.11.0-foss-2022a
export I_MPI_PIN_DOMAIN=2:compact
export OMPI_MCA_rmaps_base_mapping_policy=node:PE=2
export SLURM_CPU_BIND=q
mpirun -np 1 python tf_test.py --device cpu --intra-op-parallelism 2 --inter-op-parallelism 1

for 1/8 node:

#!/bin/bash
#SBATCH --job-name="rfm_job"
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p thin --exclusive
source $EESSI_INIT
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
export I_MPI_PIN_DOMAIN=16:compact
export OMPI_MCA_rmaps_base_mapping_policy=node:PE=16
export SLURM_CPU_BIND=q
mpirun -np 1 python tf_test.py --device cpu --intra-op-parallelism 16 --inter-op-parallelism 1

for 1/2 node:

#!/bin/bash
#SBATCH --job-name="rfm_job"
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=64
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p thin --exclusive
source $EESSI_INIT
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
export I_MPI_PIN_DOMAIN=64:compact
export OMPI_MCA_rmaps_base_mapping_policy=node:PE=64
export SLURM_CPU_BIND=q
mpirun -np 1 python tf_test.py --device cpu --intra-op-parallelism 64 --inter-op-parallelism 1

So this seems to work for both the core-count scales and the node-fraction scales defined in SCALES.

Caspar van Leeuwen added 2 commits June 20, 2023 16:15
…rdinated between workers. This would result in lines being broken off and sanity patterns not matching. All printing is now done by rank 0
@casparvl (Collaborator Author)

Ugh, ran it on Vega, and got:

--- rfm_job.err (first 10 lines) ---
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE:IF-SUPPORTED
   Node:        cn0768
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
--- rfm_job.err ---

I know where this is coming from: Vega has hyperthreading enabled. Thus, their processor config looks like this:

...
  "num_cpus": 256,
  "num_cpus_per_core": 2,
  "num_cpus_per_socket": 128,
  "num_sockets": 2

We should really make the hooks that set the one-process-per-X task count aware of the difference between physical cores and hardware threads. For one-process-per-CPU, it should really launch one process per physical core. That would probably also resolve this binding issue.
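
For illustration, something along these lines, using ReFrame's processor info (a sketch, not the final hook code):

    # Count physical cores rather than hardware threads when deriving the
    # per-node task count. On Vega: 256 num_cpus / 2 num_cpus_per_core = 128 cores.
    proc = test.current_partition.processor
    num_physical_cores = proc.num_cpus // proc.num_cpus_per_core
    test.num_tasks_per_node = num_physical_cores  # for a one-process-per-core policy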

For context, the submitted job on Vega looks like this:

#!/bin/bash
#SBATCH --job-name="rfm_job"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=128
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p cpu
#SBATCH --export=None
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash
export SLURM_EXPORT_ENV=ALL
module load TensorFlow/2.3.1-foss-2020a-Python-3.8.2
export I_MPI_PIN_DOMAIN=128:compact
export OMPI_MCA_rmaps_base_mapping_policy=node:PE=128
export SLURM_CPU_BIND=q
mpirun -np 2 python tf_test.py --device cpu --intra-op-parallelism 128 --inter-op-parallelism 1

@casparvl (Collaborator Author)

Oh, yet another challenge: I need to convince SLURM to ask for 128 'cpus' (-c 64), but in the mpirun command only launch 64 tasks...
I might be able to set both to 64 if I also set --threads-per-core=1... The question is: should the test do this? And if it is the test: how do I set this from ReFrame? It probably has to go through options or extras, or whatever it is called in ReFrame. What if a ReFrame config doesn't define that 'extra'? Will the test still run and simply ignore the request for extras? Otherwise, setting it in the test essentially requires that everyone sets this in their config...
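
If the test were to set it itself, one way would be to append the flag to the generated job options from a hook. A hedged sketch (class and method names are made up; this is not the actual EESSI test, and it hard-codes a Slurm-specific flag, which is exactly the portability concern above):

    import reframe as rfm
    import reframe.utility.sanity as sn


    class TensorFlowHyperthreadingSketch(rfm.RunOnlyRegressionTest):
        """Illustrative only; not the actual EESSI test."""
        valid_systems = ['*']
        valid_prog_environs = ['*']
        executable = 'python'
        executable_opts = ['tf_test.py', '--device', 'cpu']

        @run_before('run')
        def request_one_thread_per_core(self):
            # self.job exists after the setup stage; anything appended to
            # job.options becomes an extra scheduler directive, i.e.
            # '#SBATCH --threads-per-core=1' on a Slurm backend. Other
            # schedulers would receive an option they don't understand.
            self.job.options += ['--threads-per-core=1']

        @sanity_function
        def assert_ran(self):
            return sn.assert_found(r'Performance:', self.stdout)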

@casparvl (Collaborator Author)

Ok, I think using cores as processing elements makes the most sense. That means changing the hook as follows:

    # If hyperthreading is enabled, we need to change our process binding
    num_cpus_per_core = test.current_partition.processor.num_cpus_per_core
    if num_cpus_per_core is None:
        raise AttributeError(PROCESSOR_INFO_MISSING)

    # Do binding for Intel MPI's and Open MPI's mpirun, and for srun
    # Other launchers may or may not do the correct binding
    test.env_vars['I_MPI_PIN_CELL'] = 'core'  # Don't bind to hyperthreads, only to physical cores
    test.env_vars['I_MPI_PIN_DOMAIN'] = '%s:compact' % (test.num_cpus_per_task // num_cpus_per_core)
    test.env_vars['OMPI_MCA_rmaps_base_mapping_policy'] = 'node:PE=%s' % (test.num_cpus_per_task // num_cpus_per_core)

By dividing by num_cpus_per_core, we count cores instead of hardware threads; on Vega, for example, a full-node task with num_cpus_per_task=256 then maps to PE=128 physical cores.

We'll still need to figure out how this will work for e.g. pure MPI programs. On systems with hyperthreading, you'd want those to set test.num_tasks_per_node equal to the number of physical cores (I suppose), and test.num_cpus_per_task equal to two, to make sure that SLURM makes a large enough allocation (since SLURM considers each hyperthread a 'cpu'). We'll worry about that later...
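
For a pure MPI program on such a system, the idea would be roughly the following (again just a sketch under those assumptions):

    # Pure MPI on a node with hyperthreading: one rank per physical core, but
    # request both hardware threads of each core so SLURM allocates the full node.
    proc = test.current_partition.processor
    test.num_tasks_per_node = proc.num_cpus // proc.num_cpus_per_core  # e.g. 128 on Vega
    test.num_cpus_per_task = proc.num_cpus_per_core                    # e.g. 2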

@casparvl (Collaborator Author)

Fun fact: the EESSI TensorFlow version is too old for this test... Well, that's a temporary problem; I'm not going to change the test to fit our extremely old 2.3 version of TF :) One can test it on newer (local) modules.

@casparvl marked this pull request as ready for review June 28, 2023 15:51
@casparvl (Collaborator Author)

Ok, I now checked on my system whether at least the binding and process spawning are done properly:

reframe -C test-suite/config/surf_snellius.py -c test-suite/eessi/testsuite/tests/apps/tensorflow/ -r -n /5b5c1cb6
...
[----------] start processing checks
[ RUN      ] TENSORFLOW_EESSI %scale=1_node %module_name=TensorFlow/2.11.0-foss-2022a %device_type=cpu /5b5c1cb6 @snellius:thin+default
[       OK ] (1/1) TENSORFLOW_EESSI %scale=1_node %module_name=TensorFlow/2.11.0-foss-2022a %device_type=cpu /5b5c1cb6 @snellius:thin+default
P: perf: 58705.772664812015 img/s (r:0, l:None, u:None)
[----------] all spawned checks have finished

[  PASSED  ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped)

The generated job script looked like this:

#!/bin/bash
#SBATCH --job-name="rfm_job"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=64
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p thin
#SBATCH --export=None
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash
module load TensorFlow/2.11.0-foss-2022a
export I_MPI_PIN_CELL=core
export I_MPI_PIN_DOMAIN=64:compact
export OMPI_MCA_rmaps_base_mapping_policy=node:PE=64
export SLURM_CPU_BIND=q
mpirun -np 2 python tf_test.py --device cpu --intra-op-parallelism 64 --inter-op-parallelism 1

Which is as expected: we indeed have 2 sockets, so we want 2 tasks. We have 64 cores per socket, so we expect --cpus-per-task=64 as well as --intra-op-parallelism 64.

Checking the binding during the run shows the tasks were correctly bound to their own set of cores:

[casparl@tcn359 ~]$ taskset -pc 3654012
pid 3654012's current affinity list: 64-127
[casparl@tcn359 ~]$ taskset -pc 3654011
pid 3654011's current affinity list: 0-63

I think this is ready to be reviewed :)

@casparvl changed the title from "WIP: Tensorflow" to "Tensorflow" on Jun 28, 2023
@smoors (Collaborator) left a comment

I tested this on CPU and GPU nodes with --exclusive set, to make sure the binding is correct.

It seems to work as intended, with tasks bound to sockets.
The number of active threads is slightly higher than requested with cpus-per-task, which is expected for TF.

For the GPU node, it actually ran faster with 1 GPU than with 2; is that due to the small size of the system?

I did not check whether the GPUs were actually used.

Review threads (resolved) on:
  • config/izum_vega.py
  • eessi/testsuite/tests/apps/tensorflow/src/tf_test.py (2 threads)
  • eessi/testsuite/tests/apps/tensorflow/tensorflow.py
  • eessi/testsuite/hooks.py (3 threads)
@casparvl requested a review from smoors July 28, 2023 15:48
@smoors (Collaborator) left a comment

LGTM. There are still some formatting issues, but we can deal with those later.

@smoors merged commit 313645f into EESSI:main Jul 31, 2023
9 checks passed
@casparvl (Collaborator Author) commented Aug 1, 2023

Yeah, we should do a PR fixing all formatting issues in this repo, and introducing a style check in the CI... It's the best way to ensure proper code style.

Note: I tried to do better by installing black in my VS Code, but for some reason it doesn't work properly. I have a bit of a complicated setup with remote-ssh, so maybe that's why...

@boegel mentioned this pull request Aug 9, 2023
@boegel changed the title from "Tensorflow" to "add Tensorflow test" on Sep 21, 2023
@casparvl deleted the tensorflow branch September 4, 2024 18:40