Add support for HyperPod nodes #1020

Open
wants to merge 1 commit into base: master

Conversation

@surajkota surajkota commented Oct 26, 2024

Description:

SageMaker HyperPod recently launched EKS integration. This commit adds the SageMaker instance types and a toleration for running DeepHealthChecks, so customers can install the EFA helm chart without modification unless required.
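
For reviewers, a rough sketch of the kind of configuration this describes. It is not the PR diff: the instance-type values and the taint key/value shown here are assumptions for illustration and should be checked against the actual manifest and the HyperPod documentation.

```yaml
# Illustrative DaemonSet fragment only; all concrete values are assumptions.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values:
                      - trn1.32xlarge      # example of an existing EC2 instance type
                      - ml.trn1.32xlarge   # assumed SageMaker (HyperPod) name with "ml." prefix
      tolerations:
        # Assumed taint placed on a node while deep health checks are running
        - key: sagemaker.amazonaws.com/node-health-status
          operator: Equal
          value: Unschedulable
          effect: NoSchedule
```

The toleration lets the pod schedule onto nodes that carry the matching taint, and the extra instance-type values let the node affinity match HyperPod nodes.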

MANDATORY: PR needs test run output

Only adding configuration based on https://github.com/aws-neuron/neuron-helm-charts/tree/main/charts/neuron-helm-chart, which is a copy of this manifest. Not updating the DaemonSet.

Test Run Output:
Please specify the release version, instance size and type, OS type and test output.

Training tutorial:
Convergence graph for training tutorials
Performance metrics average_throughput, latency_p50, latency_p99 and MFU% if available

Please make sure this PR contains correct classification terms (Alpha, Beta, and Stable).

If possible, provide your results or a link to them for the reviewer to check your work.

Issue #, sim, or t.corp if available:

Link to RTD for my changes:
https://awsdocs-neuron-staging.readthedocs-hosted.com/en/YOUR_BRANCH_NAME/

Please recommend if there are any docs to update for this change.

PR Checklist

  • I've completely filled out the form above!
  • (If applicable) I've automated a test to safeguard my changes from regression.
  • (If applicable) I've posted test collateral to prove my change was effective and not harmful.
  • (If applicable) I've added someone from QA to the list of reviewers. Do this if you didn't make an automated test or feel it's appropriate for another reason.
  • (If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the pre-approved Amazon license list. See https://inside.amazon.com/en/services/legal/us/OpenSource/Pages/BlessedOpenSourceLicenses.aspx.

Pytest Marker Checklist

(Coming soon...)

Reviewer Checklist

  • For reviewer, before merging, please cross-check:
  • I've verified the changes render correctly on RTD (link above)
  • I've ensured the submitter completed the form
  • (If appropriate) I've run tests to verify the contents of the change

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
