-
Notifications
You must be signed in to change notification settings - Fork 12
feat: add a megatron extra to support megatron environments #308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
CI is failing for some reason. looking into it |
I fixed the previous issue, which was due to namespace collision. Now a different issue is that the tests expect fp8 compute capability, so need to relax the TE test |
Do you think there's any way to get rid of the clone dependency for the non-mcore path? (i.e. maybe just have a copy of the pyproject or similar) |
So I tried several different ways of cloning:
So the smallest variation would look like this to clone: git clone --filter=blob:none --also-filter-submodules --recursive [email protected]:NVIDIA/NeMo-RL.git
git backfill # backfill to restore the blobs in NeMo-RL @SahilJain314 are you okay with ^? the other workflows i can think of aren't very user friendly and require folks to know that they need to update the submodule before running (so may lead to silent errors); maybe okay for long time users, but maybe not so obvious for first time users. |
669a624
to
88b445a
Compare
I'm leaning 'default' clone has no submodules (for hf users) and megatron users should init the submodules. The difference is huge even with optimizer submodule cloning (NeMo-RL is 7.5M and takes 3 sec to clone vs 242MB/15+ sec with the optimized command). also I know mentioned we need the megatron and nemo pyprojects even if we don't use them, but maybe we can make minimal dummy ones? With #321, we won't import them unless used. |
moving into draft to iterate on this |
435f620
to
0b58290
Compare
0b58290
to
40f111a
Compare
Signed-off-by: Terry Kong <[email protected]> wip Signed-off-by: Terry Kong <[email protected]> fix it Signed-off-by: Terry Kong <[email protected]> patthing fix Signed-off-by: Terry Kong <[email protected]> wip Signed-off-by: Terry Kong <[email protected]> doesn't look like i needed that Signed-off-by: Terry Kong <[email protected]> fix Signed-off-by: Terry Kong <[email protected]> revert stuff Signed-off-by: Terry Kong <[email protected]> make it better Signed-off-by: Terry Kong <[email protected]> go Signed-off-by: Terry Kong <[email protected]> cleanup Signed-off-by: Terry Kong <[email protected]> mix it up Signed-off-by: Terry Kong <[email protected]> touch up Signed-off-by: Terry Kong <[email protected]> clean Signed-off-by: Terry Kong <[email protected]> better Signed-off-by: Terry Kong <[email protected]> clean up Signed-off-by: Terry Kong <[email protected]> add it in Signed-off-by: Terry Kong <[email protected]> mcore extra Signed-off-by: Terry Kong <[email protected]> instructions Signed-off-by: Terry Kong <[email protected]> works Signed-off-by: Terry Kong <[email protected]> revert to 3.10, 3.12 didn't seem necessary Signed-off-by: Terry Kong <[email protected]> ci has to recursively clone Signed-off-by: Terry Kong <[email protected]> bump build workflow Signed-off-by: Terry Kong <[email protected]> add megatron.core import Signed-off-by: Terry Kong <[email protected]> potential fix for unit test on CI Signed-off-by: Terry Kong <[email protected]> fix the test Signed-off-by: Terry Kong <[email protected]> this should fix test (it was a collision of namespace) Signed-off-by: Terry Kong <[email protected]> remove fp8 from test Signed-off-by: Terry Kong <[email protected]> add shallow Signed-off-by: Terry Kong <[email protected]> fix base build Signed-off-by: Terry Kong <[email protected]> fix instructions Signed-off-by: Terry Kong <[email protected]> fix the messed up indenting Signed-off-by: Terry Kong <[email protected]> fix Signed-off-by: Terry Kong <[email protected]> try nesting Signed-off-by: Terry Kong <[email protected]> okay, got it to work Signed-off-by: Terry Kong <[email protected]> fix up the readme Signed-off-by: Terry Kong <[email protected]> ok Signed-off-by: Terry Kong <[email protected]> touchup Signed-off-by: Terry Kong <[email protected]> add the lock file back Signed-off-by: Terry Kong <[email protected]> got Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
e48ab8d
to
c5f4305
Compare
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
I get the following if I run the test if mcore is installed functional test script from a non-hermetic container:
|
Closes #276
This change is similar in spirit to how BioNeMo has chosen to work with uv + megatron.
It is now a requirement for the user to clone both megatron-core / NeMo if they wish to use
--extra mcore
since we need both pyproject.toml's to be there at the timeuv
is run (even if we don't need them b/c we're using native pytorch path). If the user clones with--recursive
, they shouldn't have to do anything extra.I've added some tests and verified this works inside and outside a container (with cuda 12.4 drivers).
The only thing that's not ideal, is the megatron submodule is based off of my fork b/c we need to add @SahilJain314 's changes (makes it easier than asking users to
git apply <mcore.patch
when they don't even need mcore). I'll need to update to an official codebase once that's available.Nemo also needs a non-main branch, b/c I add in a pyproject.toml that has a slimmed down dependency set + it has @SahilJain314 's patches. At least NeMo is not on a personal fork, but for megatron, this seems like the easiest solution for now.
Not using megatron?
If the user isn't using mcore, we still install a dummy package and two py_modules named
is_megatron_installed
andis_nemo_installed
. These can be imported and you can check if the packages were actually installed.If you aren't using megatron and decide to start, so you do
git submodule update --init --recursive
, you do have to passuv
the--reinstall
arg to get it to properly find the new package (that you recently downloaded). Users don't have to do this directly, since it's done inPY_EXECUTABLES.MCORE