Skip to content

feat: add a megatron extra to support megatron environments #308

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

terrykong
Copy link
Collaborator

@terrykong terrykong commented May 2, 2025

Closes #276

This change is similar in spirit to how BioNeMo has chosen to work with uv + megatron.

It is now a requirement for the user to clone both megatron-core / NeMo if they wish to use --extra mcore since we need both pyproject.toml's to be there at the time uv is run (even if we don't need them b/c we're using native pytorch path). If the user clones with --recursive, they shouldn't have to do anything extra.

I've added some tests and verified this works inside and outside a container (with cuda 12.4 drivers).

The only thing that's not ideal, is the megatron submodule is based off of my fork b/c we need to add @SahilJain314 's changes (makes it easier than asking users to git apply <mcore.patch when they don't even need mcore). I'll need to update to an official codebase once that's available.

Nemo also needs a non-main branch, b/c I add in a pyproject.toml that has a slimmed down dependency set + it has @SahilJain314 's patches. At least NeMo is not on a personal fork, but for megatron, this seems like the easiest solution for now.

Not using megatron?

If the user isn't using mcore, we still install a dummy package and two py_modules named is_megatron_installed and is_nemo_installed. These can be imported and you can check if the packages were actually installed.

If you aren't using megatron and decide to start, so you do git submodule update --init --recursive, you do have to pass uv the --reinstall arg to get it to properly find the new package (that you recently downloaded). Users don't have to do this directly, since it's done in PY_EXECUTABLES.MCORE

@github-actions github-actions bot added the CI Relating to CI label May 2, 2025
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels May 2, 2025
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels May 2, 2025
@terrykong
Copy link
Collaborator Author

CI is failing for some reason. looking into it

@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels May 3, 2025
@terrykong
Copy link
Collaborator Author

I fixed the previous issue, which was due to namespace collision. Now a different issue is that the tests expect fp8 compute capability, so need to relax the TE test

@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels May 4, 2025
@SahilJain314
Copy link
Collaborator

Do you think there's any way to get rid of the clone dependency for the non-mcore path? (i.e. maybe just have a copy of the pyproject or similar)

@terrykong
Copy link
Collaborator Author

So I tried several different ways of cloning:

[email protected]:NVIDIA/NeMo.git

(
NAME=vanilla
echo CASE=$NAME; rm -rf /tmp/Nemo-$NAME 
time git clone $R /tmp/Nemo-$NAME >/tmp/${NAME}.log; du -sh /tmp/Nemo-$NAME | tee -a /tmp/${NAME}.log

NAME=depth1
echo CASE=$NAME; rm -rf /tmp/Nemo-$NAME 
time git clone --depth 1 $R /tmp/Nemo-$NAME >/tmp/${NAME}.log; du -sh /tmp/Nemo-$NAME | tee -a /tmp/${NAME}.log

NAME=rl-notshallow
echo CASE=$NAME; rm -rf /tmp/Nemo-$NAME 
time git clone --recursive -b tk/megatron-extra-notshallow [email protected]:NVIDIA/NeMo-RL.git /tmp/Nemo-$NAME >/tmp/${NAME}.log; 
du -sh /tmp/Nemo-$NAME | tee -a /tmp/${NAME}.log
du -sh /tmp/Nemo-$NAME/3rdparty/* | tee -a /tmp/${NAME}.log

NAME=rl
echo CASE=$NAME; rm -rf /tmp/Nemo-$NAME 
time git clone --recursive -b tk/megatron-extra [email protected]:NVIDIA/NeMo-RL.git /tmp/Nemo-$NAME >/tmp/${NAME}.log; 
du -sh /tmp/Nemo-$NAME | tee -a /tmp/${NAME}.log
du -sh /tmp/Nemo-$NAME/3rdparty/* | tee -a /tmp/${NAME}.log

NAME=rl-depth1
echo CASE=$NAME; rm -rf /tmp/Nemo-$NAME 
time git clone --depth 1 --recursive -b tk/megatron-extra [email protected]:NVIDIA/NeMo-RL.git /tmp/Nemo-$NAME >/tmp/${NAME}.log
du -sh /tmp/Nemo-$NAME | tee -a /tmp/${NAME}.log
du -sh /tmp/Nemo-$NAME/3rdparty/* | tee -a /tmp/${NAME}.log

NAME=rl-blobnone
echo CASE=$NAME; rm -rf /tmp/Nemo-$NAME 
time git clone --filter=blob:none --also-filter-submodules --recursive -b tk/megatron-extra [email protected]:NVIDIA/NeMo-RL.git /tmp/Nemo-$NAME >/tmp/${NAME}.log
du -sh /tmp/Nemo-$NAME | tee -a /tmp/${NAME}.log
du -sh /tmp/Nemo-$NAME/3rdparty/* | tee -a /tmp/${NAME}.log

) 2>&1 | tee log | egrep "CASE|[^']/tmp|real" | tee summary
CASE=vanilla
real    0m27.641s
579M    /tmp/Nemo-vanilla
CASE=depth1
real    0m6.661s
188M    /tmp/Nemo-depth1
CASE=rl-notshallow
real    0m27.311s
644M    /tmp/Nemo-rl-notshallow
CASE=rl
real    0m20.366s
490M    /tmp/Nemo-rl
CASE=rl-depth1
real    0m29.037s
349M    /tmp/Nemo-rl-depth1
CASE=rl-blobnone
real    0m14.638s
242M    /tmp/Nemo-rl-blobnone

So the smallest variation would look like this to clone:

git clone --filter=blob:none --also-filter-submodules --recursive [email protected]:NVIDIA/NeMo-RL.git
git backfill  # backfill to restore the blobs in NeMo-RL

@SahilJain314 are you okay with ^? the other workflows i can think of aren't very user friendly and require folks to know that they need to update the submodule before running (so may lead to silent errors); maybe okay for long time users, but maybe not so obvious for first time users.

@terrykong terrykong force-pushed the tk/megatron-extra branch 2 times, most recently from 669a624 to 88b445a Compare May 7, 2025 02:33
@SahilJain314
Copy link
Collaborator

I'm leaning 'default' clone has no submodules (for hf users) and megatron users should init the submodules. The difference is huge even with optimizer submodule cloning (NeMo-RL is 7.5M and takes 3 sec to clone vs 242MB/15+ sec with the optimized command).
git submodule update --init --recursive --filter=blob:none seems pretty fast. Can we just fail loudly (instructing to do this clone) if a user tries to run megatron without this?

also git backfill looks like it's super new (my machine didn't have it for sure) so I don't want that to be the default.

I know mentioned we need the megatron and nemo pyprojects even if we don't use them, but maybe we can make minimal dummy ones? With #321, we won't import them unless used.

@terrykong terrykong marked this pull request as draft May 8, 2025 03:15
@terrykong
Copy link
Collaborator Author

moving into draft to iterate on this

@terrykong terrykong force-pushed the tk/megatron-extra branch 2 times, most recently from 435f620 to 0b58290 Compare May 8, 2025 07:02
@terrykong terrykong marked this pull request as ready for review May 8, 2025 07:02
@terrykong terrykong force-pushed the tk/megatron-extra branch from 0b58290 to 40f111a Compare May 17, 2025 00:20
terrykong added 7 commits May 17, 2025 17:53
Signed-off-by: Terry Kong <[email protected]>

wip

Signed-off-by: Terry Kong <[email protected]>

fix it

Signed-off-by: Terry Kong <[email protected]>

patthing fix

Signed-off-by: Terry Kong <[email protected]>

wip

Signed-off-by: Terry Kong <[email protected]>

doesn't look like i needed that

Signed-off-by: Terry Kong <[email protected]>

fix

Signed-off-by: Terry Kong <[email protected]>

revert stuff

Signed-off-by: Terry Kong <[email protected]>

make it better

Signed-off-by: Terry Kong <[email protected]>

go

Signed-off-by: Terry Kong <[email protected]>

cleanup

Signed-off-by: Terry Kong <[email protected]>

mix it up

Signed-off-by: Terry Kong <[email protected]>

touch up

Signed-off-by: Terry Kong <[email protected]>

clean

Signed-off-by: Terry Kong <[email protected]>

better

Signed-off-by: Terry Kong <[email protected]>

clean up

Signed-off-by: Terry Kong <[email protected]>

add it in

Signed-off-by: Terry Kong <[email protected]>

mcore extra

Signed-off-by: Terry Kong <[email protected]>

instructions

Signed-off-by: Terry Kong <[email protected]>

works

Signed-off-by: Terry Kong <[email protected]>

revert to 3.10, 3.12 didn't seem necessary

Signed-off-by: Terry Kong <[email protected]>

ci has to recursively clone

Signed-off-by: Terry Kong <[email protected]>

bump build workflow

Signed-off-by: Terry Kong <[email protected]>

add megatron.core import

Signed-off-by: Terry Kong <[email protected]>

potential fix for unit test on CI

Signed-off-by: Terry Kong <[email protected]>

fix the test

Signed-off-by: Terry Kong <[email protected]>

this should fix test (it was a collision of namespace)

Signed-off-by: Terry Kong <[email protected]>

remove fp8 from test

Signed-off-by: Terry Kong <[email protected]>

add shallow

Signed-off-by: Terry Kong <[email protected]>

fix base build

Signed-off-by: Terry Kong <[email protected]>

fix instructions

Signed-off-by: Terry Kong <[email protected]>

fix the messed up indenting

Signed-off-by: Terry Kong <[email protected]>

fix

Signed-off-by: Terry Kong <[email protected]>

try nesting

Signed-off-by: Terry Kong <[email protected]>

okay, got it to work

Signed-off-by: Terry Kong <[email protected]>

fix up the readme

Signed-off-by: Terry Kong <[email protected]>

ok

Signed-off-by: Terry Kong <[email protected]>

touchup

Signed-off-by: Terry Kong <[email protected]>

add the lock file back

Signed-off-by: Terry Kong <[email protected]>

got

Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
@terrykong terrykong force-pushed the tk/megatron-extra branch from e48ab8d to c5f4305 Compare May 18, 2025 00:53
terrykong added 2 commits May 17, 2025 18:00
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label May 18, 2025
Signed-off-by: Terry Kong <[email protected]>
@SahilJain314
Copy link
Collaborator

I get the following if I run the test if mcore is installed functional test script from a non-hermetic container:

      Built megatron-core @ file:///opt/NeMo-RL/3rdparty/Megatron-LM-workspace
      Built nemo-tron @ file:///opt/NeMo-RL/3rdparty/NeMo-workspace
    Updated https://github.com/NVIDIA/NeMo-Run (414f0077c648fde2c71bb1186e97ccbf96d6844c)
      Built nemo-rl @ file:///opt/NeMo-RL
      Built wget==3.2
      Built nemo-run @ git+https://github.com/NVIDIA/NeMo-Run@414f0077c648fde2c71bb1186e97ccbf96d6844c
  x Failed to build `transformer-engine-torch==1.13.0`
  |-> The build backend returned an error
  `-> Call to `setuptools.build_meta:__legacy__.build_wheel` failed (exit status: 1)

      [stdout]
      running bdist_wheel
      running build
      running build_ext
      building 'transformer_engine_torch' extension
      creating build/temp.linux-x86_64-cpython-312/csrc
      creating build/temp.linux-x86_64-cpython-312/csrc/extensions
      creating build/temp.linux-x86_64-cpython-312/csrc/extensions/multi_tensor
      c++ -pthread -fno-strict-overflow -Wsign-compare -Wunreachable-code -DNDEBUG -g -O3 -Wall -fPIC -fPIC
      -I/root/.cache/uv/sdists-v9/pypi/transformer-engine-torch/1.13.0/_02_MnTjQ9deetHtryREW/src/common_headers
      -I/root/.cache/uv/sdists-v9/pypi/transformer-engine-torch/1.13.0/_02_MnTjQ9deetHtryREW/src/common_headers/common
      -I/root/.cache/uv/sdists-v9/pypi/transformer-engine-torch/1.13.0/_02_MnTjQ9deetHtryREW/src/common_headers/common/include
      -I/root/.cache/uv/sdists-v9/pypi/transformer-engine-torch/1.13.0/_02_MnTjQ9deetHtryREW/src/csrc -I/opt/NeMo-RL/.venv/lib/python3.12/site-packages/torch/include
      -I/opt/NeMo-RL/.venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/opt/NeMo-RL/.venv/lib/python3.12/site-packages/torch/include/TH
      -I/opt/NeMo-RL/.venv/lib/python3.12/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/NeMo-RL/.venv/include
      -I/root/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/include/python3.12 -c csrc/common.cpp -o build/temp.linux-x86_64-cpython-312/csrc/common.o -O3
      -fvisibility=hidden -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\"
      -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17

      [stderr]
      /opt/NeMo-RL/.venv/lib/python3.12/site-packages/setuptools/_distutils/dist.py:289: UserWarning: Unknown distribution option: 'tests_require'
        warnings.warn(msg)
      /opt/NeMo-RL/.venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py:529: UserWarning: Attempted to use ninja as the BuildExtension backend but we could
      not find ninja.. Falling back to using the slow distutils backend.
        warnings.warn(msg.format('we could not find ninja.'))
      /opt/NeMo-RL/.venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py:448: UserWarning: The detected CUDA version (12.8) has a minor version mismatch with
      the version that was used to compile PyTorch (12.4). Most likely this shouldn't be a problem.
        warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
      /opt/NeMo-RL/.venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py:458: UserWarning: There are no c++ version bounds defined for CUDA version 12.8
        warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
      c++: internal compiler error: Segmentation fault signal terminated program cc1plus
      Please submit a full bug report, with preprocessed source (by using -freport-bug).
      See <file:///usr/share/doc/gcc-13/README.Bugs> for instructions.
      error: command '/usr/bin/c++' failed with exit code 4

      hint: This usually indicates a problem with the package or the build environment.
  help: `transformer-engine-torch` (v1.13.0) was included because `nemo-rl[mcore]` depends on `transformer-engine[pytorch]` (v1.13.0) which depends on
        `transformer-engine-torch`
        ```

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI:L1 Run doctests, unit tests, and functional tests CI Relating to CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Megatron Environment
2 participants