
Switch eval and llama tests to use the default hip device. #725

Merged
ScottTodd merged 7 commits into nod-ai:main from ci-hip-devices on Jan 3, 2025

Conversation

ScottTodd
Member

Some of these workflows have been failing with

```
>           hal_device_id = haldriver.query_available_devices()[device_idx]["device_id"]
E           IndexError: list index out of range
sharktank/sharktank/utils/vmfb_runner.py:38: IndexError
```

Example logs: https://github.com/nod-ai/shark-ai/actions/workflows/ci_eval_short.yaml?query=branch%3Amain

Rather than assuming that self-hosted runners will have multiple GPUs available and having each workflow use a specific device index, we can use the default device and let the runners themselves choose which devices to make visible.
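
For illustration, here is a minimal sketch of the difference between the two approaches, assuming the iree.runtime Python bindings (the same query_available_devices()/device_id calls that appear in the traceback above); open_hip_device is a hypothetical helper, not code from this repo:

```python
# Hypothetical sketch (not repo code): explicit device index vs. default device.
# Runner-side visibility on AMD hosts is typically controlled with environment
# variables such as ROCR_VISIBLE_DEVICES / HIP_VISIBLE_DEVICES.
from iree import runtime as ireert


def open_hip_device(device_uri: str = "hip"):
    """Open a HIP device from a URI like "hip" (default) or "hip://2"."""
    driver = ireert.get_driver("hip")
    available = driver.query_available_devices()
    if "://" in device_uri:
        # Explicit index: raises IndexError when the runner exposes fewer
        # devices than the index assumes, the failure mode in the logs above.
        idx = int(device_uri.split("://")[1])
        return driver.create_device(available[idx]["device_id"])
    # Default device: the first device the runner chose to make visible.
    return driver.create_device(available[0]["device_id"])
```

On a runner that exposes a single GPU, the default path always succeeds, while a hard-coded index such as hip://7 hits the IndexError shown above.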

@ScottTodd marked this pull request as ready for review December 20, 2024 23:09
@archana-ramalingam
Collaborator

ci-llama-quick-tests.yaml needs to be updated too.

@archana-ramalingam requested review from aviator19941 and removed request for archana-ramalingam December 23, 2024 10:06
@archana-ramalingam
Collaborator

ci-llama-quick-tests.yaml needs to be updated too.

This can be merged once the above test is updated.

@ScottTodd
Member Author

ci-llama-quick-tests.yaml needs to be updated too.

This can be merged once the above test is updated.

Thanks for the ping. I'll sync this.

@ScottTodd
Member Author

Actually what do you mean? This is already using the default device (index 0):

pytest sharktank/tests/models/llama/benchmark_amdgpu_test.py -v -s --iree-hip-target=gfx942 --iree-device=hip://0 --run-quick-llama-test

@archana-ramalingam
Collaborator

Actually what do you mean? This is already using the default device (index 0):

pytest sharktank/tests/models/llama/benchmark_amdgpu_test.py -v -s --iree-hip-target=gfx942 --iree-device=hip://0 --run-quick-llama-test

Ok, so it can be either --iree-device=hip or --iree-device=hip://0?

@ScottTodd
Member Author

Index 0 is typically the default, yes.
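
For what it's worth, a hedged illustration of that point, assuming iree.runtime.get_device accepts HIP device URIs; on a runner with a single visible GPU both forms should resolve to the same device:

```python
# Hedged illustration: with one visible HIP device, "hip" and "hip://0"
# are expected to pick the same underlying device.
from iree import runtime as ireert

default_dev = ireert.get_device("hip")      # let the driver pick its default
indexed_dev = ireert.get_device("hip://0")  # explicitly request index 0
```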

@archana-ramalingam (Collaborator) left a comment

@saienduri can take care of the runner-side changes.

```diff
@@ -73,7 +73,7 @@ jobs:
       - name: Run llama tests
         run: |
           source ${VENV_DIR}/bin/activate
-          pytest sharktank/tests/models/llama/benchmark_amdgpu_test.py -v -s --run-nightly-llama-tests --iree-hip-target=gfx942 --iree-device=hip://7 --html=out/llm/llama/benchmark/index.html
+          pytest sharktank/tests/models/llama/benchmark_amdgpu_test.py -v -s --run-nightly-llama-tests --iree-hip-target=gfx942 --iree-device=hip://0 --html=out/llm/llama/benchmark/index.html
```
Contributor

Can we also use --iree-device=hip here? Let's not specify any index in the workflow files.

Member Author

Okay, updated both locations. I think that will still work, but I don't remember if I tried it before and it failed.

Here are the other places still using index 0, which I think are okay:
(screenshot of the other locations still using index 0)

Contributor

seems okay to me, thanks!

@ScottTodd (Member Author) commented Jan 3, 2025

https://github.com/nod-ai/shark-ai/actions/runs/12590318387/job/35091620891#step:6:121

Ah. That's why I didn't change this:

```
        else:
>           hip_device_arg = int(hip_device_id.split("://")[1])
E           IndexError: list index out of range
```
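
For context, a hedged sketch of a parse that would accept both forms; parse_hip_device_index is an illustrative name, not the actual test helper:

```python
def parse_hip_device_index(hip_device_id: str, default_index: int = 0) -> int:
    """Return N from a URI like "hip://N", or default_index for a bare "hip".

    Illustrative only: the existing test code splits on "://" unconditionally,
    which is what raises the IndexError above when no index is given.
    """
    _, sep, index = hip_device_id.partition("://")
    return int(index) if sep else default_index
```

Until something like that lands, keeping hip://0 in the workflow files sidesteps the problem.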

Member Author

Switched all workflows to use hip://0 for now.

@saienduri self-requested a review January 2, 2025 23:51
@ScottTodd
Member Author

I don't have bandwidth right now to look at the failing workflows to tell if they are preexisting errors. Can someone advise on that? (We need those all passing on main; I can't make any changes or release with confidence right now.)

@archana-ramalingam
Collaborator

archana-ramalingam commented Jan 3, 2025

I don't have bandwidth right now to look at the failing workflows to tell if they are preexisting errors. Can someone advise on that? (We need those all passing on main; I can't make any changes or release with confidence right now.)

Except for the Llama Benchmarking 8B Tests, the rest of the CIs have been failing for a while now. The team is aware and is investigating separately. @aviator19941 will look into fixing the above failing test, which is caused by the changes made here, in a separate PR.

@ScottTodd
Member Author

The team is aware and is investigating separately. @aviator19941 will look into fixing the above failing test in a separate PR.

We're going to need to get much more principled about failing tests and workflows. For example: if a test has been failing for more than two hours and a revert cannot fix it, file an issue, then disable the test or stop running that workflow on pull requests.

@ScottTodd
Member Author

Looked through the logs. This gets sharktank/tests/evaluate/perplexity_iree_test.py::PerplexityTest::test_llama3_8B_f16_decomposed closer to passing.

Recent run on main: https://github.com/nod-ai/shark-ai/actions/runs/12590318387/job/35091620891#step:6:47 (IndexError: list index out of range when trying to use that HIP device)
Run on this PR: https://github.com/nod-ai/shark-ai/actions/runs/12600786970/job/35120561465?pr=725#step:6:127 (Fatal Python error: Segmentation fault in the IREE runtime?)

@ScottTodd merged commit 644c98d into nod-ai:main Jan 3, 2025
21 of 24 checks passed
@ScottTodd deleted the ci-hip-devices branch January 3, 2025 17:22
@archana-ramalingam
Collaborator

Looked through the logs. This gets sharktank/tests/evaluate/perplexity_iree_test.py::PerplexityTest::test_llama3_8B_f16_decomposed closer to passing.

Recent run on main: https://github.com/nod-ai/shark-ai/actions/runs/12590318387/job/35091620891#step:6:47 (IndexError: list index out of range when trying to use that HIP device) Run on this PR: https://github.com/nod-ai/shark-ai/actions/runs/12600786970/job/35120561465?pr=725#step:6:127 (Fatal Python error: Segmentation fault in the IREE runtime?)

For both the Perplexity pre-submit and nightly CIs, the IndexError was caused by runners fighting for the same GPU, which this PR resolves. The segmentation fault is the actual error, which I am looking into.

monorimet pushed a commit that referenced this pull request Jan 8, 2025
Some of these workflows have been failing with
```
>           hal_device_id = haldriver.query_available_devices()[device_idx]["device_id"]
E           IndexError: list index out of range
sharktank/sharktank/utils/vmfb_runner.py:38: IndexError
```

Example logs:
https://github.com/nod-ai/shark-ai/actions/workflows/ci_eval_short.yaml?query=branch%3Amain

Rather than assume that self-hosted runners will have multiple GPUs
available and having each workflow use a specific device index, we can
use the default device and have the runners themselves choose which
devices to make visible.
monorimet pushed a commit that referenced this pull request Jan 8, 2025