Retry on HTTP 50x errors #603
base: branch-25.04
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Looks good @TomAugspurger.
I agree, it would be good to make the list of retried status codes configurable. Or at least, define it as a constexpr somewhere.
This updates our remote IO HTTP handler to check the status code of the response. If we get a 50x error, we'll retry up to some limit.
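The retry itself is implemented in the C++ remote IO handler; purely as a hypothetical illustration of the behavior described above (not kvikio's implementation, and all names below are made up), a retry-on-50x loop with a configurable list of retryable status codes and a maximum attempt count might look like this:

```python
import time
import urllib.error
import urllib.request

# Hypothetical defaults -- kvikio keeps its own (configurable) values for these.
RETRYABLE_STATUS_CODES = (500, 502, 503, 504)
DEFAULT_MAX_ATTEMPTS = 3


def get_with_retries(url: str, max_attempts: int = DEFAULT_MAX_ATTEMPTS) -> bytes:
    """Fetch ``url``, retrying on 50x responses up to ``max_attempts`` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Retry only on server-side (50x) errors, and only while attempts remain.
            if err.code not in RETRYABLE_STATUS_CODES or attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # backoff strategy is illustrative, not kvikio's
```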
Force-pushed from 4913430 to e2934fa.
Apologies for the force-push. The commits made from my devcontainer last week weren't being signed for some reason.
The two CI failures appear to be from the 6-hour timeout on the GitHub Action: https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707434?pr=603#step:9:1556
I assume that's unrelated to the changes here. If possible, it might be best to rerun just those failed jobs?
If there are jobs with hangs, we need to diagnose those offline and not rerun them. Consuming a GPU runner for 6 hours is not good, especially with our limited supply of ARM nodes.
Makes sense. https://github.com/rapidsai/kvikio/actions/runs/13117029669 (from #1465) also took much longer than normal on these same two matrix entries. https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707886?pr=603 is one of the slow jobs. That
A bit strange that it passed on conda, though. I'll take a look.
That said, https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707434 (testing #608) also timed out after 6 hours on the same test, and it was running at around the same time. That test seems to use the
I wish I were more confident, but the hang is probably happening in kvikio/python/kvikio/tests/conftest.py, lines 22 to 24 in 74653a3 (the subprocess.call there).
However, that's not the easiest to integrate into the rest of that run_cmd fixture, since it's using blocking calls to .send() and .recv() to send test commands and receive results, and those don't have timeout parameters. If we raise a TimeoutError there, run_cmd would hang on the .recv() since the server never writes anything to the pipe.
I'd recommend two things
The two wheel test failures are from segfaults, somewhere in the call to
Looking into it. Edit: I'm not able to reproduce this locally.
kvikio/cpp/src/shim/libcurl.cpp, lines 101 to 104 in 74653a3
but we are using
b641240 updates the timeout to use threads instead.
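For context on the thread-based approach: the sketch below is a generic pattern for putting a deadline on a blocking call (such as .recv()) by running it in a worker thread and joining with a timeout. It is only an illustration of the technique, not the code in b641240 or the actual run_cmd fixture.

```python
import threading


def call_with_timeout(func, timeout, *args, **kwargs):
    """Run ``func(*args, **kwargs)`` in a worker thread and raise TimeoutError if it
    does not finish within ``timeout`` seconds. Minimal sketch: exceptions raised
    inside ``func`` are not propagated back to the caller."""
    result = {}

    def target():
        result["value"] = func(*args, **kwargs)

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        # The blocking call never returned in time; the daemon thread is abandoned.
        raise TimeoutError(f"call did not complete within {timeout} seconds")
    return result["value"]
```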
Fixed the merge commits. This should be ready for another review / good to go. Edit: must have messed up the merge. Looking into CI now.
@kingcrimsontianyu do you see anything obviously wrong with how I've structured the
Edit: perhaps because I didn't add my new file to CMakeLists.txt. Trying that now.
CI is passing now.
Looks great, thanks @TomAugspurger
Looking into the CI failure at https://github.com/rapidsai/kvikio/actions/runs/13444589202/job/37567619135?pr=603, with failures like
Which would surprise me. It's almost like the code being tested was from
https://github.com/rapidsai/kvikio/actions/runs/13444589202/job/37567618823?pr=603#step:9:359 shows we got
Looking at https://anaconda.org/rapidsai-nightly/kvikio/files?page=2, I see a package with that name from 40 hours ago: https://anaconda.org/rapidsai-nightly/kvikio/25.04.00a25/download/linux-64/kvikio-25.04.00a25-cuda11_py310_250219_g77da587_25.conda. So it does seem like we were running an older version of the code, rather than this branch. I'll keep looking into why that might be.
@TomAugspurger I think CI is failing because the packages built by this PR pin to
I don't agree with the theory that the most recent builds are failing here because of the proxy cache issues with conda packages. Across all of pip devcontainers, conda devcontainers, C++ wheel builds, and C++ conda builds, I see the same compilation error:
There was also a successful nightly test run 11 hours ago, where
Yep, I was. Fixed in ae2ea11.
Ok yeah, on a re-run it failed in the same way as #603 (comment), and
https://github.com/rapidsai/kvikio/actions/runs/13466571002/job/37633868639?pr=603
So I think @bdice was right, and this caching thing was the problem.
Approving with one note. We should merge this once CI passes -- I merged in the upstream now that the CI conda package cache is disabled (which caused problems earlier).
@contextlib.contextmanager
def set_http_max_attempts(attempts: int):
    """Context for resetting the maximum number of HTTP attempts.
It's a little confusing to say that a context "resets" a value. What is the difference between "setting" and "resetting" the value? I also think the wording of http_max_attempts_reset(attempts: int) is awkward.
However, I recognize this naming is consistent with other kvikio APIs so I don't want to block on anything here. Maybe we can look for another library API that we agree is clear about this kind of behavior (setting, getting, setting-in-context) and refactor accordingly.
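For comparison, one pattern that keeps "get", "set", and "set temporarily" distinct is a getter/setter pair plus a context manager built on top of them. The sketch below uses hypothetical names (not the names in this PR or in kvikio's API) purely to illustrate the naming scheme being discussed:

```python
import contextlib

_max_attempts = 3  # hypothetical module-level default


def get_max_attempts() -> int:
    """Return the current maximum number of HTTP attempts."""
    return _max_attempts


def set_max_attempts(attempts: int) -> None:
    """Set the maximum number of HTTP attempts."""
    global _max_attempts
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    _max_attempts = attempts


@contextlib.contextmanager
def max_attempts_context(attempts: int):
    """Temporarily override the maximum number of HTTP attempts within a block."""
    previous = get_max_attempts()
    set_max_attempts(attempts)
    try:
        yield
    finally:
        set_max_attempts(previous)
```

The context manager composes the getter and setter, so each of the three behaviors has an unambiguous name.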
Closes #601