
[ CI ] Fan Out Strategy #325

Closed · wants to merge 17 commits

Conversation

@robertgshaw2-redhat (Collaborator) commented on Jun 22, 2024

SUMMARY:

  • update the nm-build-test workflow to run each test group on a separate GPU (see the sketch after this list)
  • update nm-test to receive a specific test directory
  • convert the nm-test-whl action to receive one test directory per test group
  • remove the concept of skip lists
  • remove gptq models (marlin is too flaky)
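
A minimal sketch of that fan-out, using the reusable-workflow path mentioned in the review below; the job names and keys here are illustrative, not the PR's literal contents:

```yaml
# Caller workflow (e.g. nm-build-test): one job per test group, each
# invoking the reusable test workflow with its own test directory so
# each group lands on a separate GPU runner.
jobs:
  ENTRYPOINTS:
    uses: ./.github/workflows/nm-test.yml
    with:
      test_directory: entrypoints   # one directory per test group
    secrets: inherit

  KERNELS:
    uses: ./.github/workflows/nm-test.yml
    with:
      test_directory: kernels
    secrets: inherit
```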

ADVANTAGES:

  • much faster wall-clock time for the test suite
  • ability to re-run just the failed jobs when a failure is spurious
  • possible to

DISADVANTAGES:

  • testmo tracking is a bit more complex, since each test group is now a separate run
  • code coverage tracking is now very complex, since no single run covers all the tests
    ---> mitigant: we could have a single scheduled code-coverage run that executes the full test suite roughly weekly (see the sketch below)
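
A sketch of that weekly coverage mitigation; the workflow name, cron schedule, runner labels, and test command are all assumptions, not part of this PR:

```yaml
# Single scheduled workflow that runs the whole suite in one job, so one
# run covers all tests and coverage can be measured end to end.
name: nm-coverage   # hypothetical name
on:
  schedule:
    - cron: "0 6 * * 0"   # roughly weekly: Sundays at 06:00 UTC
jobs:
  COVERAGE:
    runs-on: [self-hosted, gpu]   # assumed runner labels
    steps:
      - uses: actions/checkout@v4
      - name: run full test suite with coverage
        run: pytest --cov tests/   # illustrative; requires pytest-cov
```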

FOLLOW UP PR:

  • enable DISTRIBUTED
  • randomly assign various Python versions
  • update the names of the TEST workflow so that the GitHub UI makes it easier to see which test group failed

@robertgshaw2-redhat changed the title from [ CI ] Fan Out Scaffolding to [ CI ] Fan Out Strategy on Jun 23, 2024
@dbarbuzzi commented:

Can we update the test job’s name property to include something dynamic that is relevant to that specific instance so they can be differentiated in the GitHub UI list (e.g., inputs.test_directory)? This has to happen at the job-level in the last workflow that is called (e.g., the TEST job in .github/workflows/nm-test.yml could have something like name: TEST (${{ inputs.test_directory }})).
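
For example, a minimal sketch of that suggestion (the input declaration and runner labels here are assumptions):

```yaml
# Reusable workflow: the job-level `name` interpolates the input, so each
# fan-out instance shows up distinctly in the GitHub UI checks list.
on:
  workflow_call:
    inputs:
      test_directory:
        required: true
        type: string
jobs:
  TEST:
    name: TEST (${{ inputs.test_directory }})
    runs-on: [self-hosted, gpu]   # assumed runner labels
    steps:
      - run: echo "running tests in ${{ inputs.test_directory }}"
```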

Also, they all have separate test runs in Testmo; is that the desired result, or would we want to maintain the previous behavior of having them consolidated into a single run? If using a single run, we could still submit results individually since we're already submitting results as threads, which is appropriate for the new approach.
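
A hedged sketch of that consolidated-run flow using the Testmo CLI's thread commands (the instance URL, project id, and result paths are placeholders; in a real workflow the run id would be passed between jobs via outputs or artifacts rather than a local file):

```yaml
steps:
  # Setup job: create one Testmo run for the whole fan-out.
  - name: create consolidated Testmo run
    run: |
      testmo automation:run:create \
        --instance "$TESTMO_URL" --project-id 1 \
        --name "nm-build-test" --source "ci" > run-id.txt
  # Each fan-out job: submit its group's results as a thread of that run.
  - name: submit test group results as a thread
    run: |
      testmo automation:run:submit-thread \
        --instance "$TESTMO_URL" --run-id "$(cat run-id.txt)" \
        --results test-results/*.xml
  # Final job: mark the consolidated run complete.
  - name: complete the Testmo run
    run: |
      testmo automation:run:complete \
        --instance "$TESTMO_URL" --run-id "$(cat run-id.txt)"
```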

@andy-neuma (Member) left a comment:

please hold off until the one-whl updates get merged.

@andy-neuma (Member) left a comment:

cool. after the one whl approach gets merged, let's update this to use input parameter(s) so we can avoid the explicit job enumeration.

The review comments below refer to this snippet from the workflow:

```yaml
test_directory: entrypoints
secrets: inherit

KERNELS:
```
A Member commented on the snippet above:

we should make the "fan out" dynamic so we can drive it via input parameter.
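
For instance, a sketch of a dynamic fan-out driven by an input (the JSON-list input and matrix expansion are assumptions about how this could look):

```yaml
# Caller workflow: take the test directories as one input and expand it
# with a matrix, instead of enumerating ENTRYPOINTS, KERNELS, ... by hand.
on:
  workflow_call:
    inputs:
      test_directories:
        required: true
        type: string   # JSON list, e.g. '["entrypoints", "kernels"]'
jobs:
  TEST:
    strategy:
      fail-fast: false   # let other groups finish if one group fails
      matrix:
        test_directory: ${{ fromJSON(inputs.test_directories) }}
    uses: ./.github/workflows/nm-test.yml
    with:
      test_directory: ${{ matrix.test_directory }}
    secrets: inherit
```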

The Collaborator Author replied:

sg - I'll let you finish off the PR

@robertgshaw2-redhat (Collaborator Author) replied:


@dbarbuzzi

It would be better if these could all be part of a single run (and ideally we could add the lm-eval tests to that run as well; they are not currently tracked in testmo at all). Is this something you could take on?

I think we should do this as part of a separate PR though
