
slurm script for: configs/official/OLMo-7B.yaml #699

Open
andymvp2018 opened this issue Aug 13, 2024 · 3 comments
Labels: type/question (an issue that's a question)

Comments


andymvp2018 commented Aug 13, 2024

❓ The question

do you know the slurm script for configs/official/OLMo-7B.yaml?
looking for multi-node slurm script

andymvp2018 added the type/question label on Aug 13, 2024
@2015aroras (Collaborator)

I'm not sure what exact script was used, but something like https://github.com/allenai/OLMo/blob/main/scripts/lumi/mitchish70.sh may be adaptable to your purposes. That script does not set any architecture-related settings.
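
For illustration only, a stripped-down multi-node sbatch skeleton in that spirit might look like the sketch below. This is not the script that was actually used: the node count, partition name, time limit, and paths are placeholders, and the real LUMI scripts additionally run inside a Singularity container and export a number of networking and environment variables that are omitted here.

#!/bin/bash
#SBATCH --job-name=olmo-7b
#SBATCH --nodes=8                    # placeholder node count
#SBATCH --ntasks-per-node=8          # one task per GPU
#SBATCH --gpus-per-node=8
#SBATCH --time=48:00:00
#SBATCH --partition=standard-g       # placeholder partition name

# Rendezvous info for torch.distributed; the LUMI scripts set the equivalent
# variables (plus RCCL/network settings) before launching.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
export WORLD_SIZE=$((SLURM_NNODES * SLURM_NTASKS_PER_NODE))

# Depending on how the training script initializes torch.distributed, you may
# also need to map SLURM_PROCID/SLURM_LOCALID to RANK/LOCAL_RANK per task
# (see the LUMI scripts for the full environment setup).

# Assumes the OLMo repo is checked out and its dependencies are available in
# the active environment (or baked into a container image, as on LUMI).
srun python scripts/train.py configs/official/OLMo-7B.yaml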

@andymvp2018 (Author)

Thanks @2015aroras, two questions:

  1. If I set micro_train_device batch size, will this override the global batch size?
  2. What are these?

B"$PROJECT_DIR:$PROJECT_DIR"
-B"$FLASH_DIR:$FLASH_DIR"
-B"$SCRATCH_DIR:$SCRATCH_DIR"
-B /opt/cray:/opt/cray
-B /usr/lib64/libcxi.so.1:/usr/lib64/libcxi.so.1
-B /usr/lib64/libjson-c.so.3:/usr/lib64/libjson-c.so.3
$PROJECT_DIR/containers/$OLMO_CONTAINER \

@2015aroras (Collaborator)

  1. The global batch size is the size of the batch used in the current step. We split that batch across our GPUs, so each device gets a smaller 'device' batch (global size / num devices). A GPU doesn't have enough memory to process the whole device batch in one forward + backward pass, so we split the device batch into multiple micro batches and run a separate forward + backward pass for each. After all the micro batches are done, we do the optimizer step.
    Overall, the micro batch size is just about avoiding memory issues and getting good performance; it should not affect training results. You'll want the micro batch size to be a divisor of the device batch size (a worked example is sketched after this list).

  2. Our slurm jobs run in Singularity containers (there may be ways to use other container runtimes on your system). The -B flags bind-mount directories and files from the host into the container. $PROJECT_DIR/containers/$OLMO_CONTAINER is the path of the container image itself (see the invocation sketched below).
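
To make the batch-size relationship in point 1 concrete, here is a small worked example with made-up numbers; the real values come from your config and cluster size.

# Hypothetical numbers, for illustration only.
GLOBAL_BATCH_SIZE=2048                                      # global batch size from the config
NUM_DEVICES=256                                             # e.g. 32 nodes x 8 GPUs
DEVICE_BATCH_SIZE=$((GLOBAL_BATCH_SIZE / NUM_DEVICES))      # 8 instances per GPU per step
MICRO_BATCH_SIZE=2                                          # must evenly divide DEVICE_BATCH_SIZE
GRAD_ACCUM_STEPS=$((DEVICE_BATCH_SIZE / MICRO_BATCH_SIZE))  # 4 forward+backward passes, then one optimizer step
echo "$DEVICE_BATCH_SIZE instances per device, $GRAD_ACCUM_STEPS micro batches per optimizer step"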
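
And for point 2, the quoted -B flags are part of a larger singularity call. A simplified, hypothetical version of that invocation, showing where the bind mounts and the container path sit relative to the training command, might look like this:

# Sketch only; the real LUMI script sets more options and environment variables.
# Each -B makes a host path visible at the same path inside the container (the
# /opt/cray and lib*.so lines pull in host libraries needed at runtime). The
# positional argument after the flags is the container image, followed by the
# command to run inside it.
srun singularity exec \
    -B "$PROJECT_DIR:$PROJECT_DIR" \
    -B "$FLASH_DIR:$FLASH_DIR" \
    -B "$SCRATCH_DIR:$SCRATCH_DIR" \
    -B /opt/cray:/opt/cray \
    -B /usr/lib64/libcxi.so.1:/usr/lib64/libcxi.so.1 \
    -B /usr/lib64/libjson-c.so.3:/usr/lib64/libjson-c.so.3 \
    "$PROJECT_DIR/containers/$OLMO_CONTAINER" \
    python scripts/train.py configs/official/OLMo-7B.yaml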
