spektral (+ horovod) #390
Comments
I'm looking into this myself, seems to be pretty trivial |
I'm also looking into updating
|
PR for |
This header is generated by the protobuf compiler ( Protobuf (like too many others) is downloaded during the build of TF but not installed. Only the generated files are required; possibly there is also a runtime library, but I'm not sure. According to this they use protobuf 3.21.9. Hence I guess the solution to the error is to use the same protobuf version as a build dependency in the EC that causes the above error (which one was that?). It might also be worth looking into using our protobuf as a "SYSTEM_LIB" during the TF build, but I expect some patch would be required. At least we can update the issue if we still run into "File already exists in database" and hope they finally answer it. |
Keeping this open for now, would like to look into the Horovod issue again... |
I looked into trying with
That's because we have Lmod configured with |
I got the installation of

easyblock = 'PythonBundle'

name = 'Horovod'
version = '0.28.1'
local_tf_version = '2.15.1'
local_cuda_suffix = '-CUDA-%(cudaver)s'
versionsuffix = local_cuda_suffix + '-TensorFlow-%s' % local_tf_version

homepage = 'https://github.com/uber/horovod'
description = "Horovod is a distributed training framework for TensorFlow."

toolchain = {'name': 'foss', 'version': '2023a'}

builddependencies = [
    ('CMake', '3.26.3'),
    # ('protobuf', '3.21.9'),
]

dependencies = [
    ('Python', '3.11.3'),
    ('PyYAML', '6.0'),
    ('CUDA', '12.1.1', '', SYSTEM),
    ('NCCL', '2.18.3', local_cuda_suffix),
    ('TensorFlow', local_tf_version, local_cuda_suffix),
]

use_pip = True
sanity_pip_check = True

preinstallopts = 'module swap protobuf/3.21.9-GCCcore-12.3.0 && HOROVOD_WITH_MPI=1 HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL '
preinstallopts += 'HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 '

exts_list = [
    ('cloudpickle', '2.2.1', {
        'checksums': ['d89684b8de9e34a2a43b3460fbca07d09d6e25ce858df4d5a44240403b6178f5'],
    }),
    ('horovod', version, {
        'patches': ['Horovod-0.28.1_support_flatbuffers_2.0.6.patch'],
        'checksums': [
            '92a43f5a94c43907a56805bad15f19700c62ffc83b7ca483f9e104e229f67ef0',
            '9696ffb3b2bad1d6dd5a9f37bc58078ca7c585f933bcbec037036ad9fc0b297d',
        ],
    }),
]

sanity_check_paths = {
    'files': ['bin/horovodrun'],
    'dirs': ['lib/python%(pyshortver)s/site-packages'],
}

sanity_check_commands = ["horovodrun --help"]

moduleclass = 'tools'

see the
In order to do this properly, we would need to add support to EasyBuild framework to swap in a particular module for a (build) dependency rather than just loading it... |
I'm wondering if this can cause further problems, as there is a runtime library for protobuf. Let's hope they are "compatible enough", i.e. the TF header expects protobuf 3.21.0-3.21.9 but we use 4.24.0 (at runtime). And I found
So I guess we are actually confined by the protobuf version used by TensorFlow and need to use the same one in all dependents and dependencies. The best approach is likely to fix and use our protobuf when building TF. Otherwise we need to always check the version in the TF sources before deciding on one for the toolchain. |
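To make the version window from this thread concrete (the TF header expects protobuf 3.21.0-3.21.9, while 4.24.0 is loaded at runtime), here is a minimal sketch of the range check. It assumes the protobuf C++ headers encode versions as major*1000000 + minor*1000 + patch, so 3.21.9 becomes 3021009; that encoding is my understanding and should be verified against the headers actually in use. The helper names `encode_version` and `header_accepts` are hypothetical and are not part of EasyBuild, TensorFlow, or protobuf.

```python
def encode_version(version):
    """Encode 'major.minor.patch' as one integer (assumed protobuf C++ scheme)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major * 1_000_000 + minor * 1_000 + patch


def header_accepts(loaded, oldest, newest):
    """Return True if the loaded protobuf lies inside the window the header expects."""
    return encode_version(oldest) <= encode_version(loaded) <= encode_version(newest)


if __name__ == "__main__":
    # The window and the candidate versions are the ones quoted in this thread.
    for candidate in ("3.21.9", "4.24.0"):
        ok = header_accepts(candidate, "3.21.0", "3.21.9")
        print(candidate, "accepted" if ok else "rejected")
```

This only models the compile-time check on the headers. It says nothing about whether a different protobuf runtime is "compatible enough" once the extension is loaded.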
Isn't the protobuf used by TensorFlow "baked in", so that whatever protobuf module is loaded doesn't really matter for TensorFlow itself? |
Not sure how it could be. Possibly it is statically linked, can't remember. But even when linking statically there can be symbol clashes if a shared protobuf is loaded (e.g. by grpcio), can't there? |
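One way to see whether a shared protobuf actually ends up in the process is to inspect the memory mappings after importing the packages in question. This is a rough sketch that assumes Linux and the usual `/proc/self/maps` layout, where the file path is the sixth whitespace-separated field. The function name `protobuf_libs_from_maps` is hypothetical. A statically linked protobuf will not show up this way, so an empty result does not rule out symbol clashes.

```python
import os


def protobuf_libs_from_maps(maps_text):
    """Return the distinct mapped file paths whose basename mentions protobuf."""
    paths = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        # Only mappings backed by a file carry a path as the sixth field.
        if len(fields) == 6:
            path = fields[5].strip()
            if "protobuf" in os.path.basename(path):
                paths.add(path)
    return sorted(paths)


if __name__ == "__main__":
    # Import the packages of interest first (e.g. tensorflow, grpc), then inspect.
    maps_path = "/proc/self/maps"  # Linux only
    if os.path.exists(maps_path):
        with open(maps_path) as handle:
            for path in protobuf_libs_from_maps(handle.read()):
                print(path)
```

If more than one distinct library path is printed, two protobuf runtimes are mapped into the same process, which is the situation the question above is worried about.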
foss/2023a
or olderPythonPackage