
[ CI ] Fan Out Strategy #325

Closed · wants to merge 17 commits

Conversation

@robertgshaw2-redhat (Collaborator) commented on Jun 22, 2024

SUMMARY:

  • update the nm-build-test workflow to run each test group on a separate GPU (see the sketch after this list)
  • update nm-test to receive a specific test directory
  • convert the nm-test-whl action to receive one test directory per test group
  • remove the concept of skip lists
  • remove gptq models (marlin is too flaky)
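
A minimal sketch of that fan-out, using the reusable-workflow path mentioned in the review below; the job names and keys here are illustrative, not the PR's literal contents:

```yaml
# Caller workflow (e.g. nm-build-test): one job per test group, each
# invoking the reusable test workflow with its own test directory so
# each group lands on a separate GPU runner.
jobs:
  ENTRYPOINTS:
    uses: ./.github/workflows/nm-test.yml
    with:
      test_directory: entrypoints   # one directory per test group
    secrets: inherit

  KERNELS:
    uses: ./.github/workflows/nm-test.yml
    with:
      test_directory: kernels
    secrets: inherit
```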

ADVANTAGES:

  • much faster wall-clock time for the test suite
  • ability to re-run just the failed jobs when a failure is spurious
  • possible to

DISADVANTAGES:

  • testmo tracking is a bit more complex, since each test group is now a separate run
  • code coverage tracking is now very complex, since no single run covers all the tests
    ---> mitigant: we could have a single scheduled code-coverage run that executes the full test suite roughly weekly (see the sketch below)
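
A sketch of that weekly coverage mitigation; the workflow name, cron schedule, runner labels, and test command are all assumptions, not part of this PR:

```yaml
# Single scheduled workflow that runs the whole suite in one job, so one
# run covers all tests and coverage can be measured end to end.
name: nm-coverage   # hypothetical name
on:
  schedule:
    - cron: "0 6 * * 0"   # roughly weekly: Sundays at 06:00 UTC
jobs:
  COVERAGE:
    runs-on: [self-hosted, gpu]   # assumed runner labels
    steps:
      - uses: actions/checkout@v4
      - name: run full test suite with coverage
        run: pytest --cov tests/   # illustrative; requires pytest-cov
```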

FOLLOW UP PR:

  • enable DISTRIBUTED
  • randomly assign various Python versions
  • update the names of the TEST workflow so that the GitHub UI makes it easier to see which test group failed

@robertgshaw2-redhat changed the title from [ CI ] Fan Out Scaffolding to [ CI ] Fan Out Strategy on Jun 23, 2024
@dbarbuzzi commented:

Can we update the test job’s name property to include something dynamic that is relevant to that specific instance so they can be differentiated in the GitHub UI list (e.g., inputs.test_directory)? This has to happen at the job-level in the last workflow that is called (e.g., the TEST job in .github/workflows/nm-test.yml could have something like name: TEST (${{ inputs.test_directory }})).
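
For example, a minimal sketch of that suggestion (the input declaration and runner labels here are assumptions):

```yaml
# Reusable workflow: the job-level `name` interpolates the input, so each
# fan-out instance shows up distinctly in the GitHub UI checks list.
on:
  workflow_call:
    inputs:
      test_directory:
        required: true
        type: string
jobs:
  TEST:
    name: TEST (${{ inputs.test_directory }})
    runs-on: [self-hosted, gpu]   # assumed runner labels
    steps:
      - run: echo "running tests in ${{ inputs.test_directory }}"
```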

Also, they all have separate test runs in Testmo; is that the desired result, or would we want to maintain the previous behavior of having them consolidated into a single run? If using a single run, we could still submit results individually since we're already submitting results as threads, which is appropriate for the new approach.
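
A hedged sketch of that consolidated-run flow using the Testmo CLI's thread commands (the instance URL, project id, and result paths are placeholders; in a real workflow the run id would be passed between jobs via outputs or artifacts rather than a local file):

```yaml
steps:
  # Setup job: create one Testmo run for the whole fan-out.
  - name: create consolidated Testmo run
    run: |
      testmo automation:run:create \
        --instance "$TESTMO_URL" --project-id 1 \
        --name "nm-build-test" --source "ci" > run-id.txt
  # Each fan-out job: submit its group's results as a thread of that run.
  - name: submit test group results as a thread
    run: |
      testmo automation:run:submit-thread \
        --instance "$TESTMO_URL" --run-id "$(cat run-id.txt)" \
        --results test-results/*.xml
  # Final job: mark the consolidated run complete.
  - name: complete the Testmo run
    run: |
      testmo automation:run:complete \
        --instance "$TESTMO_URL" --run-id "$(cat run-id.txt)"
```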

@andy-neuma (Member) left a comment:

please hold off until the one-whl updates get merged.

@andy-neuma (Member) left a comment:

cool. after the one whl approach gets merged, let's update this to use input parameter(s) so we can avoid the explicit job enumeration.

The review comments below refer to this snippet from the workflow:

```yaml
test_directory: entrypoints
secrets: inherit

KERNELS:
```
A Member commented on the snippet above:

we should make the "fan out" dynamic so we can drive it via input parameter.
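
For instance, a sketch of a dynamic fan-out driven by an input (the JSON-list input and matrix expansion are assumptions about how this could look):

```yaml
# Caller workflow: take the test directories as one input and expand it
# with a matrix, instead of enumerating ENTRYPOINTS, KERNELS, ... by hand.
on:
  workflow_call:
    inputs:
      test_directories:
        required: true
        type: string   # JSON list, e.g. '["entrypoints", "kernels"]'
jobs:
  TEST:
    strategy:
      fail-fast: false   # let other groups finish if one group fails
      matrix:
        test_directory: ${{ fromJSON(inputs.test_directories) }}
    uses: ./.github/workflows/nm-test.yml
    with:
      test_directory: ${{ matrix.test_directory }}
    secrets: inherit
```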

The Collaborator Author replied:

sg - I'll let you finish off the PR

@robertgshaw2-redhat (Collaborator Author) replied:


@dbarbuzzi

It would be better if these could all be part of a single run (and ideally we could add the lm-eval tests to that run as well; they are not currently tracked in testmo at all). Is this something you could take on?

I think we should do this as part of a separate PR though
