spektral (+ horovod) #390

Open
9 tasks
laraPPr opened this issue Aug 20, 2024 · 11 comments
Assignees
Labels
difficulty: easy software that should be easy to support easyconfig Easyconfig is available GPU priority: ASAP Python site:ugent Software installation request for UGent Tier-2 update

Comments

@laraPPr
Collaborator

laraPPr commented Aug 20, 2024

  • link to support ticket: #2024082060007794
  • website: https://graphneural.network/
  • installation docs: https://graphneural.network/
  • toolchain: foss/2023a or older
  • easyblock to use: PythonPackage
  • required dependencies:
    • Python
    • lxml
    • SciPy-bundle
    • networkx
    • tqdm
    • scikit-learn
    • CUDA
    • TensorFlow
  • optional dependencies:
    • ...
  • notes:
    • ...
  • effort: (TBD)
  • other install methods
    • conda: yes (link?) / no
    • container image: yes (link?) / no
    • pre-built binaries (RHEL8 Linux x86_64): yes (link?) / no
    • easyconfig outside EasyBuild: yes / no
@laraPPr laraPPr added difficulty: easy software that should be easy to support priority: ASAP Python update site:ugent Software installation request for UGent Tier-2 GPU easyconfig Easyconfig is available labels Aug 20, 2024
@boegel boegel self-assigned this Aug 22, 2024
@boegel
Contributor

boegel commented Aug 22, 2024

I'm looking into this myself, seems to be pretty trivial

@boegel
Contributor

boegel commented Aug 22, 2024

@boegel
Contributor

boegel commented Aug 22, 2024

I'm also looking into updating Horovod for TensorFlow 2.15.1 with foss/2023a, but it's being a PITA:

  /apps/gent/RHEL8/cascadelake-ampere-ib/software/TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/tensorflow/include/tensorflow/compiler/tf2xla/host_compute_metadata.pb.h:17:2: error: #error This file was generated by an older version of protoc which is
     17 | #error This file was generated by an older version of protoc which is
        |  ^~~~~
  /apps/gent/RHEL8/cascadelake-ampere-ib/software/TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/tensorflow/include/tensorflow/compiler/tf2xla/host_compute_metadata.pb.h:18:2: error: #error incompatible with your Protocol Buffer headers. Please
     18 | #error incompatible with your Protocol Buffer headers. Please
        |  ^~~~~
  /apps/gent/RHEL8/cascadelake-ampere-ib/software/TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/tensorflow/include/tensorflow/compiler/tf2xla/host_compute_metadata.pb.h:19:2: error: #error regenerate this file with a newer version of protoc.
     19 | #error regenerate this file with a newer version of protoc.
        |  ^~~~~

@boegel
Contributor

boegel commented Aug 22, 2024

PR for spektral with fosscuda/2020b (since getting Horovod working with foss/2023a is proving to be difficult):

@boegel boegel closed this as completed Aug 22, 2024
@Flamefire

I'm also looking into updating Horovod for TensorFlow 2.15.1 with foss/2023a, but it's being a PITA:

  /apps/gent/RHEL8/cascadelake-ampere-ib/software/TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/tensorflow/include/tensorflow/compiler/tf2xla/host_compute_metadata.pb.h:17:2: error
     17 | #error This file was generated by an older version of protoc which is
     18 | #error incompatible with your Protocol Buffer headers. Please
     19 | #error regenerate this file with a newer version of protoc.

This header is generated by the protobuf compiler (protoc) during the TF build using this rule: https://github.com/tensorflow/tensorflow/blob/c16161a1cb6ecdef55bf8fc4a2074a5aa8bd4ed0/tensorflow/compiler/tf2xla/BUILD#L106-L114

Protobuf (like too many other dependencies) is downloaded during the TF build but not installed: only the generated files are required. There may also be a runtime library, but I'm not sure.

According to this they use protobuf 3.21.9

Hence I guess the solution to the error is to use the same protobuf version as a build dependency in the EC that causes the above error (which one was that?)

It might also be worth looking into using our protobuf as a "SYSTEM_LIB" during the TF build but I expect some patch to be required. At least we can update the issue if we still run into "File already exists in database" and hope they finally answer it.

@laraPPr laraPPr reopened this Aug 27, 2024
@laraPPr laraPPr closed this as completed Aug 27, 2024
@boegel
Contributor

boegel commented Aug 27, 2024

Keeping this open for now, would like to look into the Horovod issue again...

@boegel boegel reopened this Aug 27, 2024
@boegel
Contributor

boegel commented Aug 27, 2024

I looked into trying with protobuf 3.21.9 as build dependency for Horovod, but didn't get very far since something already depends on a different version of protobuf, leading to:

A different version of the 'protobuf' module is already loaded (see output of 'ml').
You should load another 'protobuf-python' module for that is compatible with the currently loaded version of 'protobuf'.
Use 'ml spider protobuf-python' to get an overview of the available versions.


If you don't understand the warning or error, contact the helpdesk at [email protected]
While processing the following module(s):
    Module fullname                           Module Filename
    ---------------                           ---------------
    protobuf-python/4.24.0-GCCcore-12.3.0     /modules/all/protobuf-python/4.24.0-GCCcore-12.3.0.lua
    grpcio/1.57.0-GCCcore-12.3.0              /modules/all/grpcio/1.57.0-GCCcore-12.3.0.lua
    TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1  /modules/all/TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1.lua

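That's because we have Lmod configured with LMOD_DISABLE_SAME_NAME_AUTOSWAP, so loading a module while another version with the same name is already loaded results in an error rather than an automatic swap.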
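To illustrate the behaviour described above, here is a toy model in plain Python. It is only a sketch of the idea, not how Lmod is implemented: the function `load_module`, the exception `SameNameConflict`, and the loaded protobuf version string are all made up for illustration.

```python
class SameNameConflict(Exception):
    """Raised when a different version of an already loaded module is requested."""


def load_module(loaded, fullname, disable_autoswap=True):
    """Return a new name -> version mapping after 'loading' fullname.

    loaded:   dict mapping module name to the currently loaded version
    fullname: module in '<name>/<version>' form
    """
    name, _, version = fullname.partition('/')
    result = dict(loaded)
    current = result.get(name)
    if current is not None and current != version and disable_autoswap:
        # mimics the effect of LMOD_DISABLE_SAME_NAME_AUTOSWAP: refuse instead of swapping
        raise SameNameConflict("%s/%s is already loaded" % (name, current))
    # either nothing with this name was loaded, or the old version is swapped out
    result[name] = version
    return result


# hypothetical starting state; the version actually loaded on the system may differ
loaded = {'protobuf': '24.0-GCCcore-12.3.0'}

try:
    load_module(loaded, 'protobuf/3.21.9-GCCcore-12.3.0')
except SameNameConflict as err:
    print("load refused:", err)

# with autoswap allowed, the requested version simply replaces the loaded one
print(load_module(loaded, 'protobuf/3.21.9-GCCcore-12.3.0', disable_autoswap=False))
```

With autoswap disabled, the only way to end up with the other version is an explicit swap, which is what the workaround further down in this thread relies on.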

@boegel
Contributor

boegel commented Aug 27, 2024

I got the installation of Horovod working, but only through a nasty hack in the easyconfig, since listing an alternative protobuf version in builddependencies doesn't work:

easyblock = 'PythonBundle'

name = 'Horovod'
version = '0.28.1'
local_tf_version = '2.15.1'
local_cuda_suffix = '-CUDA-%(cudaver)s'
versionsuffix = local_cuda_suffix + '-TensorFlow-%s' % local_tf_version

homepage = 'https://github.com/uber/horovod'
description = "Horovod is a distributed training framework for TensorFlow."

toolchain = {'name': 'foss', 'version': '2023a'}

builddependencies = [
    ('CMake', '3.26.3'),
    # ('protobuf', '3.21.9'),
]
dependencies = [
    ('Python', '3.11.3'),
    ('PyYAML', '6.0'),
    ('CUDA', '12.1.1', '', SYSTEM),
    ('NCCL', '2.18.3', local_cuda_suffix),
    ('TensorFlow', local_tf_version, local_cuda_suffix),
]

use_pip = True
sanity_pip_check = True

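# hack: swap in the protobuf version that matches TensorFlow's generated headers before building,
# since listing it in builddependencies clashes with the protobuf module that is already loaded
preinstallopts = 'module swap protobuf/3.21.9-GCCcore-12.3.0 && HOROVOD_WITH_MPI=1 HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL '
preinstallopts += 'HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 '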

exts_list = [
    ('cloudpickle', '2.2.1', {
        'checksums': ['d89684b8de9e34a2a43b3460fbca07d09d6e25ce858df4d5a44240403b6178f5'],
    }),
    ('horovod', version, {
        'patches': ['Horovod-0.28.1_support_flatbuffers_2.0.6.patch'],
        'checksums': [
            '92a43f5a94c43907a56805bad15f19700c62ffc83b7ca483f9e104e229f67ef0',
            '9696ffb3b2bad1d6dd5a9f37bc58078ca7c585f933bcbec037036ad9fc0b297d',
        ],
    }),
]

sanity_check_paths = {
    'files': ['bin/horovodrun'],
    'dirs': ['lib/python%(pyshortver)s/site-packages'],
}

sanity_check_commands = ["horovodrun --help"]

moduleclass = 'tools'

Note the module swap command in preinstallopts.

In order to do this properly, we would need to add support to the EasyBuild framework to swap in a particular module for a (build) dependency rather than just loading it...
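For reference, the shell prefix used in the hack can be put together from its parts as shown below. This is just a sketch in plain Python: `build_preinstallopts` is a hypothetical helper, not something the EasyBuild framework provides, and the module name and variables are the ones from the easyconfig above.

```python
def build_preinstallopts(swap_module, env):
    """Return a shell prefix that swaps in a module and then sets build variables.

    swap_module: full name of the module to swap in
    env:         dict of environment variables for the install command
    """
    assignments = ' '.join('%s=%s' % (key, value) for key, value in env.items())
    # the trailing space matters, the install command is appended right after this prefix
    return 'module swap %s && %s ' % (swap_module, assignments)


horovod_env = {
    'HOROVOD_WITH_MPI': '1',
    'HOROVOD_GPU_ALLREDUCE': 'NCCL',
    'HOROVOD_GPU_BROADCAST': 'NCCL',
    'HOROVOD_WITH_TENSORFLOW': '1',
    'HOROVOD_WITHOUT_PYTORCH': '1',
    'HOROVOD_WITHOUT_MXNET': '1',
}

preinstallopts = build_preinstallopts('protobuf/3.21.9-GCCcore-12.3.0', horovod_env)
print(preinstallopts)
```

The swap has to come first and be chained with `&&`, so the build only starts once the module environment has actually changed.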

@Flamefire

I'm wondering if this can cause further problems as there is a runtime library for protobuf. Let's hope they are "compatible enough"

I.e. the TF header expects protobuf 3.21.0-3.21.9, but we use 4.24.0 (at runtime). And I found:

The runtime library must have the same version with the protocol compiler you use.

So I guess we are actually confined by the protobuf version used by TensorFlow and need to use the same one in all dependents and dependencies. The best approach is likely to fix and use our protobuf when building TF. Otherwise we need to always check the version in the TF sources before deciding on one for the toolchain.
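A version range like the one mentioned above can be read from the guards in a generated header. The snippet below is a minimal sketch of that, assuming the header uses the `PROTOBUF_VERSION` / `PROTOBUF_MIN_PROTOC_VERSION` guard layout that protoc 3.x emits. The sample text is a hand-written stand-in rather than a copy of the TensorFlow header, and both function names are made up here.

```python
import re


def decode_protobuf_version(number):
    """Split an integer such as 3021009 into (major, minor, patch)."""
    return number // 1000000, (number // 1000) % 1000, number % 1000


def expected_protobuf_range(header_text):
    """Return ((major, minor, patch), (major, minor, patch)) or None if no guards are found.

    The first tuple is the oldest protobuf headers the generated file accepts,
    the second one is the version of the protoc that generated the file.
    """
    lower = re.search(r'#if\s+PROTOBUF_VERSION\s*<\s*(\d+)', header_text)
    upper = re.search(r'#if\s+(\d+)\s*<\s*PROTOBUF_MIN_PROTOC_VERSION', header_text)
    if lower is None or upper is None:
        return None
    return (decode_protobuf_version(int(lower.group(1))),
            decode_protobuf_version(int(upper.group(1))))


# hand-written stand-in for the guards at the top of a generated .pb.h file
SAMPLE_HEADER = """
#if PROTOBUF_VERSION < 3021000
#error This file was generated by a newer version of protoc which is
#endif
#if 3021009 < PROTOBUF_MIN_PROTOC_VERSION
#error This file was generated by an older version of protoc which is
#endif
"""

print(expected_protobuf_range(SAMPLE_HEADER))  # ((3, 21, 0), (3, 21, 9))
```

Running something like this on a header from the TensorFlow installation would be one way to check the version in the TF build before deciding on a protobuf version for the toolchain.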

@boegel
Contributor

boegel commented Aug 28, 2024

Isn't the protobuf used by TensorFlow "baked in", so that whatever protobuf module is loaded doesn't really matter for TensorFlow itself?

@Flamefire

Not sure how it could be. Possibly it is statically linked, can't remember. But even when linking statically there can be symbol clashes if a shared protobuf is loaded (e.g. by grpcio), can't there?
