Skip to content

Run NCCL tests on the JAX-specific base container #1284

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 39 commits into from
Jun 6, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
7e78af6
Run NCCL tests on the JAX-specific base container.
olupton Feb 3, 2025
2236693
only test nccl on amd64
olupton Mar 26, 2025
7a696ed
Merge branch 'main' into olupton/nccl-test-base-container
olupton Apr 23, 2025
fa90684
Merge branch 'main' into olupton/nccl-test-base-container
Steboss Jun 2, 2025
5cb7cbc
try to reduce the number of calls to reusable workflows by employing …
Steboss Jun 2, 2025
a68080a
fix the secret
Steboss Jun 2, 2025
cae8640
forgot to checkout
Steboss Jun 2, 2025
0fd23c1
test with actions + matrix
Steboss Jun 2, 2025
7551c06
trigger an action to test build-jax
Steboss Jun 3, 2025
3bbb116
Merge branch 'main' into olupton/nccl-test-base-container
Steboss Jun 3, 2025
11f0ce9
test build-jax again
Steboss Jun 3, 2025
4285156
fix bazel cache
Steboss Jun 3, 2025
1a7387b
test build in parallel
Steboss Jun 3, 2025
72fd964
fix inputs and outputs
Steboss Jun 3, 2025
8fc30c8
fix needs outputs
Steboss Jun 3, 2025
c914408
fix tags
Steboss Jun 3, 2025
7f82ea4
check comments
Steboss Jun 4, 2025
ef00e85
Merge branch 'main' into olupton/nccl-test-base-container
Steboss Jun 4, 2025
b3ac7d9
Fix nccl
Steboss Jun 4, 2025
27efd72
fix nccl
Steboss Jun 4, 2025
7c40052
check if we can run build-mpi-operator within reusable action
Steboss Jun 4, 2025
fa2924b
fix error
Steboss Jun 4, 2025
3602166
add the needs step
Steboss Jun 4, 2025
519e86d
fix nccl test
Steboss Jun 4, 2025
9e32f77
correct typo
Steboss Jun 4, 2025
f08aa0d
does this run if we fix with previous step build?
Steboss Jun 4, 2025
3e9b8cf
need to find out why this step starts
Steboss Jun 4, 2025
24a8293
we do need nccl-k8s
Steboss Jun 5, 2025
f2b3744
reset to previous step
Steboss Jun 5, 2025
bc53fea
try to create the real cuda image and not just the tag
Steboss Jun 5, 2025
a57c682
revert this to build-base dependency
Steboss Jun 5, 2025
8567b03
back to CUDA_IMAGE
Steboss Jun 5, 2025
f44c548
Merge branch 'main' into olupton/nccl-test-base-container
Steboss Jun 5, 2025
985a574
try with a shorter name
Steboss Jun 5, 2025
a04b4e9
fix image usage
Steboss Jun 6, 2025
acab62f
fix description
Steboss Jun 6, 2025
2b31baa
fix description part 2
Steboss Jun 6, 2025
af484fd
fix typo
Steboss Jun 6, 2025
66fa511
Merge branch 'main' into olupton/nccl-test-base-container
Steboss Jun 6, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
212 changes: 212 additions & 0 deletions .github/actions/build-container/action.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,212 @@
name: Build container

description: "Builds a Docker container image for JAX-based projects using NVIDIA's Mealkit and uploads it to GitHub Container Registry."

inputs:
ARCHITECTURE:
description: 'CPU architecture to build the image for, e.g. amd64, arm64'
required: true
BASE_IMAGE:
description: 'Base docker image that provides JAX'
required: false
default: ghcr.io/nvidia/jax:mealkit
BUILD_DATE:
description: "Build date in YYYY-MM-DD format"
required: false
default: 'NOT SPECIFIED'
ARTIFACT_NAME:
description: 'Name of the artifact zip file, e.g. artifact-t5x-build'
required: true
BADGE_FILENAME:
description: 'Name of the endpoint JSON file for shields.io badge, e.g. badge-t5x-build'
required: true
CONTAINER_NAME:
description: "Container name, e.g. upstream-t5x"
required: true
DOCKERFILE:
description: "Dockerfile to use, e.g. .github/container/Dockerfile.t5x"
required: true
DOCKER_CONTEXT:
description: "Dockerfile context to build"
default: '.github/container'
required: false
RUNNER_SIZE:
description: "Size of the runner to use"
required: false
default: small
EXTRA_BUILD_ARGS:
description: "Extra build arguments to pass to the Docker build"
required: false
default: ""
ssh-private-key:
description: "SSH private key to use for building the image"
required: true
default: ""
ssh-known-hosts:
description: "SSH known hosts entries to use for building the image"
required: true
default: ""
github-token:
description: "GitHub token to use for authentication"
required: true
default: ""
bazel-remote-cache-url:
description: "URL of the Bazel remote cache to use for building the image"
required: true
default: ""

outputs:
DOCKER_TAG_MEALKIT:
description: "Tags of the 'mealkit' image built"
value: ${{ steps.export.outputs.DOCKER_TAG_MEALKIT }}
DOCKER_TAG_FINAL:
description: "Tags of the complete image built"
value: ${{ steps.export.outputs.DOCKER_TAG_FINAL }}

runs:
using: 'composite'
steps:
- name: Set up environment variables
shell: bash
id: set-env
run: |
echo 'UPLD_IMAGE=ghcr.io/nvidia/jax-toolbox-internal' >> $GITHUB_ENV
echo "BADGE_FILENAME_FULL=${{ inputs.BADGE_FILENAME }}-${{ inputs.ARCHITECTURE }}.json" >> $GITHUB_ENV

- name: Setup SSH
id: setup-ssh
uses: ./.github/actions/setup-ssh
with:
ssh-private-key: ${{ inputs.ssh-private-key }}
ssh-known-hosts: ${{ inputs.ssh-known-hosts }}

- name: Login to GHCR
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.repository_owner }}
password: ${{ inputs.github-token }}

- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
with:
driver-opts: |
image=moby/buildkit:v0.12.1

# MEALKIT BUILD
- name: Set docker metadata - mealkit
id: mealkit-metadata
uses: docker/metadata-action@v5
with:
images: |
${{ env.UPLD_IMAGE }}
flavor: |
latest=false
tags: |
type=raw,value=${{ github.run_id }}-${{ inputs.CONTAINER_NAME }}-${{ inputs.ARCHITECTURE }}-mealkit
labels:
org.opencontainers.image.created=${{ inputs.BUILD_DATE }}

- name: Build mealkit image
id: mealkit-build
uses: docker/build-push-action@v5
with:
context: ${{ inputs.DOCKER_CONTEXT }}
push: true
file: ${{ inputs.DOCKERFILE }}
platforms: linux/${{ inputs.ARCHITECTURE }}
target: mealkit
tags: ${{ steps.mealkit-metadata.outputs.tags }}
labels: ${{ steps.mealkit-metadata.outputs.labels }}
ssh: default
secret-files: |
"SSH_KNOWN_HOSTS=${{ steps.setup-ssh.outputs.known-hosts-file }}"
build-args: |
BASE_IMAGE=${{ inputs.BASE_IMAGE }}
BAZEL_CACHE=${{ inputs.bazel-remote-cache-url }}
BUILD_DATE=${{ inputs.BUILD_DATE }}
${{ inputs.EXTRA_BUILD_ARGS }}
# FINAL IMAGE BUILD
- name: Set docker metadata - final
id: final-metadata
uses: docker/metadata-action@v5
with:
images: |
${{ env.UPLD_IMAGE }}
flavor: |
latest=false
tags: |
type=raw,value=${{ github.run_id }}-${{ inputs.CONTAINER_NAME }}-${{ inputs.ARCHITECTURE }}
labels:
org.opencontainers.image.created=${{ inputs.BUILD_DATE }}

- name: Build final image
id: final-build
uses: docker/build-push-action@v5
with:
context: ${{ inputs.DOCKER_CONTEXT }}
push: true
file: ${{ inputs.DOCKERFILE }}
platforms: linux/${{ inputs.ARCHITECTURE }}
tags: ${{ steps.final-metadata.outputs.tags }}
labels: ${{ steps.final-metadata.outputs.labels }}
target: final
ssh: default
secret-files: |
"SSH_KNOWN_HOSTS=${{ steps.setup-ssh.outputs.known-hosts-file }}"
build-args: |
BASE_IMAGE=${{ inputs.BASE_IMAGE }}
BAZEL_CACHE=${{ inputs.bazel-remote-cache-url }}
BUILD_DATE=${{ inputs.BUILD_DATE }}
${{ inputs.EXTRA_BUILD_ARGS }}

# SITREP GENERATION
- name: Generate sitrep
if: "!cancelled()"
shell: bash -x -e {0}
run: |
# bring in utility functions
source .github/workflows/scripts/to_json.sh

badge_label='${{ inputs.CONTAINER_NAME }} ${{ inputs.ARCHITECTURE }} build'
tags="${{ steps.final-metadata.outputs.tags }}"
digest="${{ steps.final-build.outputs.digest }}"
outcome="${{ steps.final-build.outcome }}"

if [[ ${outcome} == "success" ]]; then
badge_message="pass"
badge_color=brightgreen
summary="${{ inputs.CONTAINER_NAME }} build on ${{ inputs.ARCHITECTURE }}: $badge_message"
else
badge_message="fail"
badge_color=red
summary="${{ inputs.CONTAINER_NAME }} build on ${{ inputs.ARCHITECTURE }}: $badge_message"
fi

to_json \
summary \
badge_label tags digest outcome \
> sitrep.json

schemaVersion=1 \
label="${badge_label}" \
message="${badge_message}" \
color="${badge_color}" \
to_json schemaVersion label message color \
> ${{ env.BADGE_FILENAME_FULL }}

- name: Upload sitrep and badge
if: "!cancelled()"
uses: actions/upload-artifact@v4
with:
name: ${{ inputs.ARTIFACT_NAME }}-${{ inputs.ARCHITECTURE }}
path: |
sitrep.json
${{ env.BADGE_FILENAME_FULL }}

- name: Export outputs
id: export
shell: bash
run: |
echo "DOCKER_TAG_MEALKIT=${{ steps.mealkit-metadata.outputs.tags }}" >> "$GITHUB_OUTPUT"
echo "DOCKER_TAG_FINAL=${{ steps.final-metadata.outputs.tags }}" >> "$GITHUB_OUTPUT"
Loading
Loading