
Switch eval and llama tests to use the default hip device. #725

Merged
ScottTodd merged 7 commits into nod-ai:main from ci-hip-devices on Jan 3, 2025

Conversation

ScottTodd
Member

Some of these workflows have been failing with

```
>           hal_device_id = haldriver.query_available_devices()[device_idx]["device_id"]
E           IndexError: list index out of range
sharktank/sharktank/utils/vmfb_runner.py:38: IndexError
```

Example logs: https://github.com/nod-ai/shark-ai/actions/workflows/ci_eval_short.yaml?query=branch%3Amain

Rather than assuming that self-hosted runners will have multiple GPUs available and having each workflow use a specific device index, we can use the default device and let the runners themselves choose which devices to make visible.
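
For illustration, here is a minimal sketch of the difference between the two approaches, assuming the iree.runtime Python bindings (the same query_available_devices()/device_id calls that appear in the traceback above); open_hip_device is a hypothetical helper, not code from this repo:

```python
# Hypothetical sketch (not repo code): explicit device index vs. default device.
# Runner-side visibility on AMD hosts is typically controlled with environment
# variables such as ROCR_VISIBLE_DEVICES / HIP_VISIBLE_DEVICES.
from iree import runtime as ireert


def open_hip_device(device_uri: str = "hip"):
    """Open a HIP device from a URI like "hip" (default) or "hip://2"."""
    driver = ireert.get_driver("hip")
    available = driver.query_available_devices()
    if "://" in device_uri:
        # Explicit index: raises IndexError when the runner exposes fewer
        # devices than the index assumes, the failure mode in the logs above.
        idx = int(device_uri.split("://")[1])
        return driver.create_device(available[idx]["device_id"])
    # Default device: the first device the runner chose to make visible.
    return driver.create_device(available[0]["device_id"])
```

On a runner that exposes a single GPU, the default path always succeeds, while a hard-coded index such as hip://7 hits the IndexError shown above.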

@ScottTodd marked this pull request as ready for review December 20, 2024 23:09
@archana-ramalingam
Collaborator

ci-llama-quick-tests.yaml needs to be updated too.

@archana-ramalingam requested review from aviator19941 and removed request for archana-ramalingam December 23, 2024 10:06
@archana-ramalingam
Collaborator

ci-llama-quick-tests.yaml needs to be updated too.

This can be merged once the above test is updated.

@ScottTodd
Member Author

ci-llama-quick-tests.yaml needs to be updated too.

This can be merged once the above test is updated.

Thanks for the ping. I'll sync this.

@ScottTodd
Member Author

Actually what do you mean? This is already using the default device (index 0):

pytest sharktank/tests/models/llama/benchmark_amdgpu_test.py -v -s --iree-hip-target=gfx942 --iree-device=hip://0 --run-quick-llama-test

@archana-ramalingam
Collaborator

Actually what do you mean? This is already using the default device (index 0):

pytest sharktank/tests/models/llama/benchmark_amdgpu_test.py -v -s --iree-hip-target=gfx942 --iree-device=hip://0 --run-quick-llama-test

Ok, so it can be either --iree-device=hip or --iree-device=hip://0?

@ScottTodd
Member Author

Index 0 is typically the default, yes.
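
For what it's worth, a hedged illustration of that point, assuming iree.runtime.get_device accepts HIP device URIs; on a runner with a single visible GPU both forms should resolve to the same device:

```python
# Hedged illustration: with one visible HIP device, "hip" and "hip://0"
# are expected to pick the same underlying device.
from iree import runtime as ireert

default_dev = ireert.get_device("hip")      # let the driver pick its default
indexed_dev = ireert.get_device("hip://0")  # explicitly request index 0
```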

@archana-ramalingam (Collaborator) left a comment

@saienduri can take care of the runner-side changes.

```diff
@@ -73,7 +73,7 @@ jobs:
       - name: Run llama tests
         run: |
           source ${VENV_DIR}/bin/activate
-          pytest sharktank/tests/models/llama/benchmark_amdgpu_test.py -v -s --run-nightly-llama-tests --iree-hip-target=gfx942 --iree-device=hip://7 --html=out/llm/llama/benchmark/index.html
+          pytest sharktank/tests/models/llama/benchmark_amdgpu_test.py -v -s --run-nightly-llama-tests --iree-hip-target=gfx942 --iree-device=hip://0 --html=out/llm/llama/benchmark/index.html
```
Contributor

Can we also use --iree-device=hip here? Let's not specify any index in the workflow files.

Member Author

Okay, updated both locations. I think that will still work, but I don't remember if I tried it before and it failed.

Here are the other places still using index 0, which I think are okay:
(screenshot of the other locations still using index 0)

Contributor

seems okay to me, thanks!

@ScottTodd (Member Author) commented Jan 3, 2025

https://github.com/nod-ai/shark-ai/actions/runs/12590318387/job/35091620891#step:6:121

Ah. That's why I didn't change this:

```
        else:
>           hip_device_arg = int(hip_device_id.split("://")[1])
E           IndexError: list index out of range
```
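
For context, a hedged sketch of a parse that would accept both forms; parse_hip_device_index is an illustrative name, not the actual test helper:

```python
def parse_hip_device_index(hip_device_id: str, default_index: int = 0) -> int:
    """Return N from a URI like "hip://N", or default_index for a bare "hip".

    Illustrative only: the existing test code splits on "://" unconditionally,
    which is what raises the IndexError above when no index is given.
    """
    _, sep, index = hip_device_id.partition("://")
    return int(index) if sep else default_index
```

Until something like that lands, keeping hip://0 in the workflow files sidesteps the problem.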

Member Author

Switched all workflows to use hip://0 for now.

@saienduri self-requested a review January 2, 2025 23:51
@ScottTodd
Member Author

I don't have bandwidth right now to look at the failing workflows to tell if they are preexisting errors. Can someone advise on that? (We need those all passing on main; I can't make any changes or release with confidence right now.)

@archana-ramalingam
Collaborator

archana-ramalingam commented Jan 3, 2025

I don't have bandwidth right now to look at the failing workflows to tell if they are preexisting errors. Can someone advise on that? (We need those all passing on main; I can't make any changes or release with confidence right now.)

Except for the Llama Benchmarking 8B Tests, the rest of the CIs have been failing for a while now. The team is aware and is investigating separately. @aviator19941 will look into fixing the above failing test, which is caused by the changes made here, in a separate PR.

@ScottTodd
Member Author

The team is aware and is investigating separately. @aviator19941 will look into fixing the above failing test in a separate PR.

We're going to need to get much more principled about failing tests and workflows. For example: if a test has been failing for more than two hours and a revert cannot fix it, file an issue, then disable the test or stop running that workflow on pull requests.

@ScottTodd
Member Author

Looked through the logs. This gets sharktank/tests/evaluate/perplexity_iree_test.py::PerplexityTest::test_llama3_8B_f16_decomposed closer to passing.

Recent run on main: https://github.com/nod-ai/shark-ai/actions/runs/12590318387/job/35091620891#step:6:47 (IndexError: list index out of range when trying to use that HIP device)
Run on this PR: https://github.com/nod-ai/shark-ai/actions/runs/12600786970/job/35120561465?pr=725#step:6:127 (Fatal Python error: Segmentation fault in the IREE runtime?)

@ScottTodd merged commit 644c98d into nod-ai:main Jan 3, 2025
21 of 24 checks passed
@ScottTodd deleted the ci-hip-devices branch January 3, 2025 17:22
@archana-ramalingam
Collaborator

Looked through the logs. This gets sharktank/tests/evaluate/perplexity_iree_test.py::PerplexityTest::test_llama3_8B_f16_decomposed closer to passing.

Recent run on main: https://github.com/nod-ai/shark-ai/actions/runs/12590318387/job/35091620891#step:6:47 (IndexError: list index out of range when trying to use that HIP device) Run on this PR: https://github.com/nod-ai/shark-ai/actions/runs/12600786970/job/35120561465?pr=725#step:6:127 (Fatal Python error: Segmentation fault in the IREE runtime?)

For both the Perplexity pre-submit and nightly CIs, the IndexError was caused by runners fighting for the same GPU, which this PR resolves. The segmentation fault is the actual error, which I am looking into.

monorimet pushed a commit that referenced this pull request Jan 8, 2025
Some of these workflows have been failing with
```
>           hal_device_id = haldriver.query_available_devices()[device_idx]["device_id"]
E           IndexError: list index out of range
sharktank/sharktank/utils/vmfb_runner.py:38: IndexError
```

Example logs:
https://github.com/nod-ai/shark-ai/actions/workflows/ci_eval_short.yaml?query=branch%3Amain

Rather than assume that self-hosted runners will have multiple GPUs
available and having each workflow use a specific device index, we can
use the default device and have the runners themselves choose which
devices to make visible.
monorimet pushed a commit that referenced this pull request Jan 8, 2025