Run NCCL tests on the JAX-specific base container #1284
Merged
39 commits
7e78af6 Run NCCL tests on the JAX-specific base container. (olupton)
2236693 only test nccl on amd64 (olupton)
7a696ed Merge branch 'main' into olupton/nccl-test-base-container (olupton)
fa90684 Merge branch 'main' into olupton/nccl-test-base-container (Steboss)
5cb7cbc try to reduce the number of calls to reusable workflows by employing … (Steboss)
a68080a fix the secret (Steboss)
cae8640 forgot to checkout (Steboss)
0fd23c1 test with actions + matrix (Steboss)
7551c06 trigger an action to test build-jax (Steboss)
3bbb116 Merge branch 'main' into olupton/nccl-test-base-container (Steboss)
11f0ce9 test build-jax again (Steboss)
4285156 fix bazel cache (Steboss)
1a7387b test build in parallel (Steboss)
72fd964 fix inputs and outputs (Steboss)
8fc30c8 fix needs outputs (Steboss)
c914408 fix tags (Steboss)
7f82ea4 check comments (Steboss)
ef00e85 Merge branch 'main' into olupton/nccl-test-base-container (Steboss)
b3ac7d9 Fix nccl (Steboss)
27efd72 fix nccl (Steboss)
7c40052 check if we can run build-mpi-operator within reusable action (Steboss)
fa2924b fix error (Steboss)
3602166 add the needs step (Steboss)
519e86d fix nccl test (Steboss)
9e32f77 correct typo (Steboss)
f08aa0d does this run if we fix with previous step build? (Steboss)
3e9b8cf need to find out why this step starts (Steboss)
24a8293 we do need nccl-k8s (Steboss)
f2b3744 reset to previous step (Steboss)
bc53fea try to create the real cuda image and not just the tag (Steboss)
a57c682 revert this to build-base dependency (Steboss)
8567b03 back to CUDA_IMAGE (Steboss)
f44c548 Merge branch 'main' into olupton/nccl-test-base-container (Steboss)
985a574 try with a shorter name (Steboss)
a04b4e9 fix image usage (Steboss)
acab62f fix description (Steboss)
2b31baa fix description part 2 (Steboss)
af484fd fix typo (Steboss)
66fa511 Merge branch 'main' into olupton/nccl-test-base-container (Steboss)
@@ -0,0 +1,212 @@
name: Build container

description: "Builds a Docker container image for JAX-based projects using NVIDIA's Mealkit and uploads it to GitHub Container Registry."

inputs:
  ARCHITECTURE:
    description: 'CPU architecture to build the image for, e.g. amd64, arm64'
    required: true
  BASE_IMAGE:
    description: 'Base docker image that provides JAX'
    required: false
    default: ghcr.io/nvidia/jax:mealkit
  BUILD_DATE:
    description: "Build date in YYYY-MM-DD format"
    required: false
    default: 'NOT SPECIFIED'
  ARTIFACT_NAME:
    description: 'Name of the artifact zip file, e.g. artifact-t5x-build'
    required: true
  BADGE_FILENAME:
    description: 'Name of the endpoint JSON file for shields.io badge, e.g. badge-t5x-build'
    required: true
  CONTAINER_NAME:
    description: "Container name, e.g. upstream-t5x"
    required: true
  DOCKERFILE:
    description: "Dockerfile to use, e.g. .github/container/Dockerfile.t5x"
    required: true
  DOCKER_CONTEXT:
    description: "Dockerfile context to build"
    default: '.github/container'
    required: false
  RUNNER_SIZE:
    description: "Size of the runner to use"
    required: false
    default: small
  EXTRA_BUILD_ARGS:
    description: "Extra build arguments to pass to the Docker build"
    required: false
    default: ""
  ssh-private-key:
    description: "SSH private key to use for building the image"
    required: true
    default: ""
  ssh-known-hosts:
    description: "SSH known hosts entries to use for building the image"
    required: true
    default: ""
  github-token:
    description: "GitHub token to use for authentication"
    required: true
    default: ""
  bazel-remote-cache-url:
    description: "URL of the Bazel remote cache to use for building the image"
    required: true
    default: ""

outputs:
  DOCKER_TAG_MEALKIT:
    description: "Tags of the 'mealkit' image built"
    value: ${{ steps.export.outputs.DOCKER_TAG_MEALKIT }}
  DOCKER_TAG_FINAL:
    description: "Tags of the complete image built"
    value: ${{ steps.export.outputs.DOCKER_TAG_FINAL }}

runs:
  using: 'composite'
  steps:
    - name: Set up environment variables
      shell: bash
      id: set-env
      run: |
        echo 'UPLD_IMAGE=ghcr.io/nvidia/jax-toolbox-internal' >> $GITHUB_ENV
        echo "BADGE_FILENAME_FULL=${{ inputs.BADGE_FILENAME }}-${{ inputs.ARCHITECTURE }}.json" >> $GITHUB_ENV

    - name: Setup SSH
      id: setup-ssh
      uses: ./.github/actions/setup-ssh
      with:
        ssh-private-key: ${{ inputs.ssh-private-key }}
        ssh-known-hosts: ${{ inputs.ssh-known-hosts }}

    - name: Login to GHCR
      uses: docker/login-action@v3
      with:
        registry: ghcr.io
        username: ${{ github.repository_owner }}
        password: ${{ inputs.github-token }}

    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v3
      with:
        driver-opts: |
          image=moby/buildkit:v0.12.1

    # MEALKIT BUILD
    - name: Set docker metadata - mealkit
      id: mealkit-metadata
      uses: docker/metadata-action@v5
      with:
        images: |
          ${{ env.UPLD_IMAGE }}
        flavor: |
          latest=false
        tags: |
          type=raw,value=${{ github.run_id }}-${{ inputs.CONTAINER_NAME }}-${{ inputs.ARCHITECTURE }}-mealkit
        labels:
          org.opencontainers.image.created=${{ inputs.BUILD_DATE }}

    - name: Build mealkit image
      id: mealkit-build
      uses: docker/build-push-action@v5
      with:
        context: ${{ inputs.DOCKER_CONTEXT }}
        push: true
        file: ${{ inputs.DOCKERFILE }}
        platforms: linux/${{ inputs.ARCHITECTURE }}
        target: mealkit
        tags: ${{ steps.mealkit-metadata.outputs.tags }}
        labels: ${{ steps.mealkit-metadata.outputs.labels }}
        ssh: default
        secret-files: |
          "SSH_KNOWN_HOSTS=${{ steps.setup-ssh.outputs.known-hosts-file }}"
        build-args: |
          BASE_IMAGE=${{ inputs.BASE_IMAGE }}
          BAZEL_CACHE=${{ inputs.bazel-remote-cache-url }}
          BUILD_DATE=${{ inputs.BUILD_DATE }}
          ${{ inputs.EXTRA_BUILD_ARGS }}
    # FINAL IMAGE BUILD
    - name: Set docker metadata - final
      id: final-metadata
      uses: docker/metadata-action@v5
      with:
        images: |
          ${{ env.UPLD_IMAGE }}
        flavor: |
          latest=false
        tags: |
          type=raw,value=${{ github.run_id }}-${{ inputs.CONTAINER_NAME }}-${{ inputs.ARCHITECTURE }}
        labels:
          org.opencontainers.image.created=${{ inputs.BUILD_DATE }}

    - name: Build final image
      id: final-build
      uses: docker/build-push-action@v5
      with:
        context: ${{ inputs.DOCKER_CONTEXT }}
        push: true
        file: ${{ inputs.DOCKERFILE }}
        platforms: linux/${{ inputs.ARCHITECTURE }}
        tags: ${{ steps.final-metadata.outputs.tags }}
        labels: ${{ steps.final-metadata.outputs.labels }}
        target: final
        ssh: default
        secret-files: |
          "SSH_KNOWN_HOSTS=${{ steps.setup-ssh.outputs.known-hosts-file }}"
        build-args: |
          BASE_IMAGE=${{ inputs.BASE_IMAGE }}
          BAZEL_CACHE=${{ inputs.bazel-remote-cache-url }}
          BUILD_DATE=${{ inputs.BUILD_DATE }}
          ${{ inputs.EXTRA_BUILD_ARGS }}

    # SITREP GENERATION
    - name: Generate sitrep
      if: "!cancelled()"
      shell: bash -x -e {0}
      run: |
        # bring in utility functions
        source .github/workflows/scripts/to_json.sh

        badge_label='${{ inputs.CONTAINER_NAME }} ${{ inputs.ARCHITECTURE }} build'
        tags="${{ steps.final-metadata.outputs.tags }}"
        digest="${{ steps.final-build.outputs.digest }}"
        outcome="${{ steps.final-build.outcome }}"

        if [[ ${outcome} == "success" ]]; then
          badge_message="pass"
          badge_color=brightgreen
          summary="${{ inputs.CONTAINER_NAME }} build on ${{ inputs.ARCHITECTURE }}: $badge_message"
        else
          badge_message="fail"
          badge_color=red
          summary="${{ inputs.CONTAINER_NAME }} build on ${{ inputs.ARCHITECTURE }}: $badge_message"
        fi

        to_json \
          summary \
          badge_label tags digest outcome \
        > sitrep.json

        schemaVersion=1 \
        label="${badge_label}" \
        message="${badge_message}" \
        color="${badge_color}" \
        to_json schemaVersion label message color \
        > ${{ env.BADGE_FILENAME_FULL }}

    - name: Upload sitrep and badge
      if: "!cancelled()"
      uses: actions/upload-artifact@v4
      with:
        name: ${{ inputs.ARTIFACT_NAME }}-${{ inputs.ARCHITECTURE }}
        path: |
          sitrep.json
          ${{ env.BADGE_FILENAME_FULL }}

    - name: Export outputs
      id: export
      shell: bash
      run: |
        echo "DOCKER_TAG_MEALKIT=${{ steps.mealkit-metadata.outputs.tags }}" >> "$GITHUB_OUTPUT"
        echo "DOCKER_TAG_FINAL=${{ steps.final-metadata.outputs.tags }}" >> "$GITHUB_OUTPUT"
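
The "Generate sitrep" step above sources a `to_json` helper from `.github/workflows/scripts/to_json.sh`, which is not part of this diff. The sketch below illustrates how such a helper could turn a list of variable names into the shields.io endpoint JSON, and why the `name=value ... to_json name` prefix-assignment form works in bash. The `to_json` body here is a hypothetical stand-in written for illustration, not the repository's implementation, and the sample label and outcome values are assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the repository's to_json helper (the real one is
# not shown in this diff). It takes variable NAMES and prints a flat JSON
# object mapping each name to that variable's value. Integers are emitted as
# JSON numbers, everything else as strings. Only backslashes and double
# quotes are escaped; newlines and control characters are not handled.
to_json() {
  local name value sep=""
  printf '{'
  for name in "$@"; do
    value="${!name}"            # indirect expansion: look the variable up by name
    if [[ $value =~ ^-?[0-9]+$ ]]; then
      printf '%s"%s":%s' "$sep" "$name" "$value"
    else
      value="${value//\\/\\\\}" # escape backslashes first
      value="${value//\"/\\\"}" # then escape double quotes
      printf '%s"%s":"%s"' "$sep" "$name" "$value"
    fi
    sep=","
  done
  printf '}\n'
}

# Assumed sample value; in the action this comes from steps.final-build.outcome.
outcome="success"

if [[ ${outcome} == "success" ]]; then
  badge_message="pass"
  badge_color=brightgreen
else
  badge_message="fail"
  badge_color=red
fi

# Assignments placed before a function call are visible inside that call,
# so the badge keys can use the names shields.io expects without renaming
# the script's own variables.
badge_json=$(
  schemaVersion=1 \
  label="upstream-t5x amd64 build" \
  message="${badge_message}" \
  color="${badge_color}" \
  to_json schemaVersion label message color
)
echo "${badge_json}"
```

With the sample values, this prints a single-line object with the keys `schemaVersion`, `label`, `message`, and `color`, which is the shape the action writes to the badge file.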