Switch eval and llama tests to use the default hip device. #725
Conversation
This can be merged once the above test is updated.
Thanks for the ping. I'll sync this.
Actually, what do you mean? This is already using the default device (index 0):
Ok, so it can be either …
Index 0 is typically the default, yes.
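For context, a minimal sketch of what "the default device" means at the runtime level, using the `iree.runtime` calls already referenced by `sharktank/sharktank/utils/vmfb_runner.py` (`get_driver`, `query_available_devices`, `create_device`); the exact signatures and dict fields are assumptions rather than a definitive API reference. A bare `hip` and `hip://0` should both resolve to the first device the driver enumerates:

```python
# Sketch only: what "default HIP device" means to the IREE Python runtime.
# Call names mirror sharktank/sharktank/utils/vmfb_runner.py; treat exact
# signatures and dict fields as assumptions.
from iree import runtime as ireert

haldriver = ireert.get_driver("hip")

# One entry per GPU the driver can currently see on this machine.
available = haldriver.query_available_devices()
print(f"visible HIP devices: {len(available)}")

# "hip" (no index) and "hip://0" both select the first enumerated device.
default_device = haldriver.create_device(available[0]["device_id"])
```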
@saienduri can take care of the runner side changes.
```diff
@@ -73,7 +73,7 @@ jobs:
       - name: Run llama tests
         run: |
           source ${VENV_DIR}/bin/activate
-          pytest sharktank/tests/models/llama/benchmark_amdgpu_test.py -v -s --run-nightly-llama-tests --iree-hip-target=gfx942 --iree-device=hip://7 --html=out/llm/llama/benchmark/index.html
+          pytest sharktank/tests/models/llama/benchmark_amdgpu_test.py -v -s --run-nightly-llama-tests --iree-hip-target=gfx942 --iree-device=hip://0 --html=out/llm/llama/benchmark/index.html
```
Can we also use `--iree-device=hip` here? Let's not specify any index in the workflow files.
seems okay to me, thanks!
https://github.com/nod-ai/shark-ai/actions/runs/12590318387/job/35091620891#step:6:121
Ah. That's why I didn't change this:
```
        else:
>           hip_device_arg = int(hip_device_id.split("://")[1])
E           IndexError: list index out of range
```
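That failure is consistent with a bare `--iree-device=hip`: there is no `://` in the string, so `split("://")` returns a single-element list and index `[1]` is out of range. A hedged sketch of a more forgiving parse (the helper name and the default of 0 are hypothetical, not the test's actual code):

```python
def parse_hip_device_index(hip_device_id: str) -> int:
    """Return the index from strings like "hip://7"; default to 0 for "hip".

    Hypothetical helper: "hip".split("://") yields ["hip"], so indexing [1]
    raises IndexError, which matches the traceback above.
    """
    _, sep, index = hip_device_id.partition("://")
    return int(index) if sep else 0


assert parse_hip_device_index("hip") == 0
assert parse_hip_device_index("hip://7") == 7
```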
Switched all workflows to use `hip://0` for now.
I don't have bandwidth right now to look at the failing workflows to tell whether they are preexisting errors. Can someone advise on that? (We need those all passing on …)
Except …
We're going to need to get much more principled about failing tests and workflows. For example: if a test has been failing for more than two hours and a revert to fix it is not possible, file an issue, then disable the test or stop running that workflow on pull requests.
Looked through the logs. This gets … Recent run on …
For both Perplexity pre-submit and nightly CIs, the …
Some of these workflows have been failing with

```
>       hal_device_id = haldriver.query_available_devices()[device_idx]["device_id"]
E       IndexError: list index out of range

sharktank/sharktank/utils/vmfb_runner.py:38: IndexError
```

Example logs: https://github.com/nod-ai/shark-ai/actions/workflows/ci_eval_short.yaml?query=branch%3Amain

Rather than assuming that self-hosted runners will have multiple GPUs available and having each workflow use a specific device index, we can use the default device and let the runners themselves choose which devices to make visible.
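As a sketch of the "runners choose which devices to make visible" idea: on ROCm machines, visibility is commonly restricted with environment variables such as `ROCR_VISIBLE_DEVICES` or `HIP_VISIBLE_DEVICES` on the runner itself, after which every workflow can use the default device. The guard below is illustrative only, not the actual `vmfb_runner.py` implementation:

```python
# Illustrative sketch, not the actual vmfb_runner.py code: select a HIP
# device defensively so a runner that exposes a single GPU (for example via
# ROCR_VISIBLE_DEVICES=0 or HIP_VISIBLE_DEVICES=0) never hits IndexError.
from iree import runtime as ireert


def create_hip_device(device_idx: int = 0):
    haldriver = ireert.get_driver("hip")
    available = haldriver.query_available_devices()
    if device_idx >= len(available):
        raise RuntimeError(
            f"Requested hip://{device_idx} but only {len(available)} "
            "HIP device(s) are visible on this runner."
        )
    return haldriver.create_device(available[device_idx]["device_id"])


# With visibility handled on the runner, workflows can simply rely on hip://0.
device = create_hip_device(0)
```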