
Retry on HTTP 50x errors #603

Open · wants to merge 37 commits into base: branch-25.04

Conversation

TomAugspurger (Author):

This updates our remote IO HTTP handler to check the status code of the response. If we get a 50x error, we'll retry up to some limit.

Closes #601
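
For illustration only, here is a minimal, Python-level sketch of the retry idea (the actual change lives in the C++ libcurl handler; the status-code set, the attempt limit, and the backoff below are assumptions for this sketch, not necessarily what the PR implements):

import time
import urllib.error
import urllib.request

RETRYABLE = {500, 502, 503, 504}  # assumed set of retryable 50x codes

def get_with_retries(url: str, max_attempts: int = 3) -> bytes:
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # give up on non-retryable codes, or once the attempt limit is reached
            if err.code not in RETRYABLE or attempt == max_attempts:
                raise
            time.sleep(2**attempt)  # simple backoff before the next attempt
    raise AssertionError("unreachable: the last attempt either returned or re-raised")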


copy-pr-bot bot commented Jan 29, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


copy-pr-bot bot commented Jan 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

TomAugspurger marked this pull request as ready for review January 29, 2025 22:31
TomAugspurger requested review from a team as code owners January 29, 2025 22:31
madsbk added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Jan 30, 2025
madsbk (Member) left a comment

Looks good @TomAugspurger.
I agree, it would be good to make the code list configurable. Or at least, define it as a constexpr somewhere.

TomAugspurger (Author):

Apologies for the force push. The commits made from my devcontainer last week weren't being signed for some reason.

TomAugspurger (Author):

The two CI failures appear to be from the 6-hour timeout on the github action: https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707434?pr=603#step:9:1556

context canceled
python/kvikio/tests/test_benchmarks.py::test_http_io[cupy] 
Error: The operation was canceled.

I assume that's unrelated to the changes here. If possible, it might be best to rerun just those failed jobs?

bdice (Contributor) commented Feb 3, 2025

If there are jobs with hangs, we need to diagnose those offline and not rerun them. Consuming a GPU runner for 6 hours is not good, especially with our limited supply of ARM nodes.

TomAugspurger (Author):

Makes sense. https://github.com/rapidsai/kvikio/actions/runs/13117029669 (from #1465) also took much longer than normal on these same two matrix entries.

https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707886?pr=603 is one of the slow jobs. Its timeline:

  • 16:23:01 started Run tests
  • 16:23:53-16:26:25 compiling numcodecs at
    • something to look into: Why are we compiling numcodecs? See if it can provide a wheel and save us some time.
  • 16:27:13 started pytest
  • 16:28:05 last successful test finished
  • 22:20:06 Run canceled while running python/kvikio/tests/test_benchmarks.py::test_http_io[cupy]
  • ~1 second ago: I realized my code is almost surely to blame :)

A bit strange it passed on conda though. I'll take a look.

TomAugspurger (Author):

That said, https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707434 (testing #608) also timed out after 6 hours on the same test, and it was running at around the same time.

That test seems to use the run_cmd fixture to run a benchmark in a subprocess. I don't think we have logs to confirm it, but it's almost surely hanging while starting that subprocess or within it. I'll look into adding a timeout mechanism to run_cmd (cc @kingcrimsontianyu, just in case your PR hits that timeout again, no need for you to investigate too).

TomAugspurger (Author) commented Feb 4, 2025

I wish I were more confident, but the hang is probably happening in

res: subprocess.CompletedProcess = subprocess.run(
    cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, cwd=cwd
)  # type: ignore
We could probably catch most of these by setting a timeout in that subprocess.run call. However, that's not the easiest to integrate into the rest of the run_cmd fixture, since it uses blocking calls to .send() and .recv() to send test commands and receive results, and those don't have timeout parameters. If we raise a TimeoutError there, run_cmd would hang on the .recv() since the server never writes anything to the pipe.
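
For reference, a minimal sketch of what that timeout would look like (the command and the 60-second limit are placeholders, and by itself this doesn't solve the .recv() problem described above):

import subprocess

cmd = ["python", "-c", "print('hello')"]  # placeholder for the benchmark command

try:
    res = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, timeout=60
    )
except subprocess.TimeoutExpired as err:
    # the child has been killed; report the hang instead of waiting forever
    print(f"benchmark timed out after {err.timeout}s")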

I'd recommend two things:

  1. Add pytest-timeout as a test dependency, and ensure that these tests have a timeout. With small timeouts and added time.sleep calls in the http_io.py file, I've confirmed that pytest-timeout does interrupt the individual tests and the test process finishes (see the sketch after this list).
  2. Investigate the cause of the hangs in the first place. I'm almost certain we should be setting CURLOPT_TIMEOUT somewhere in libcurl.cpp before we perform any requests, which means we would need to pick a default and expose it to the user as a configuration value / parameter for requests made by kvikio. That should probably be done as a separate PR (Set timeouts for HTTP requests #613).
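
A rough reproduction of the check described in item 1 (pytest-timeout is assumed to be installed; the 5-second limit and the sleep are illustrative, not the values used in CI):

import time

import pytest

@pytest.mark.timeout(5)  # pytest-timeout aborts the test after 5 seconds
def test_http_io_does_not_hang():
    time.sleep(60)  # deliberately hangs; pytest-timeout interrupts it and the run still finishes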

TomAugspurger (Author) commented Feb 4, 2025

The two wheel test failures are from segfaults, somewhere in the call to open_http while running python/kvikio/tests/test_examples.py::test_http_io: https://github.com/rapidsai/kvikio/actions/runs/13137420435/job/36656808281?pr=603#step:9:1578

Looking into it.

Edit: I'm not able to reproduce this locally. pytest-timeout works by setting a SIGALRM timer at test start and clearing it at test end. The only thing related to signals I see in kvikio is us setting CURLOPT_NOSIGNAL = 1 at

// Need CURLOPT_NOSIGNAL to support threading, see
// <https://curl.se/libcurl/c/CURLOPT_NOSIGNAL.html>
setopt(CURLOPT_NOSIGNAL, 1L);
Based on the docs, it sounds like there's a risk of clashing over the use of SIGALRM:

This option may cause libcurl to use the SIGALRM signal to timeout system calls on builds not using asynch DNS. In Unix-like systems, this might cause signals to be used unless CURLOPT_NOSIGNAL is set.

but we are using CURLOPT_NOSIGNAL so I'm not sure.

b641240 updates the timeout to use threads instead.
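
If that commit switches pytest-timeout to its thread-based method (an assumption; the commit itself isn't shown here), the marker avoids SIGALRM entirely by watching the test from a separate thread:

import pytest

@pytest.mark.timeout(60, method="thread")  # 60 s is an arbitrary placeholder
def test_http_io():
    ...

The same behavior can also be enabled globally through pytest-timeout's timeout_method setting.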

- added regression test
- adjust the initial request count to 0
- adjust the error message to print the attempted count
TomAugspurger (Author) commented Feb 18, 2025

Fixed the merge commits. This should be ready for another review / good to go.

Edit: I must have messed up the merge. Looking into CI now.

TomAugspurger (Author) commented Feb 18, 2025

@kingcrimsontianyu do you see anything obviously wrong with how I've structured the parse_http_status_codes function? I've tried to follow how you did parse_compat_mode, but my local build is failing with

$ build-all
[3/3] Linking CXX executable gtests/cpp_tests
FAILED: gtests/cpp_tests 
: && /home/coder/.conda/envs/rapids/bin/x86_64-conda-linux-gnu-c++ -fvisibility-inlines-hidden -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/coder/.conda/envs/rapids/include  -I/home/coder/.conda/envs/rapids/targets/x86_64-linux/include -O3 -DNDEBUG -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath,/home/coder/.conda/envs/rapids/lib -Wl,-rpath-link,/home/coder/.conda/envs/rapids/lib -L/home/coder/.conda/envs/rapids/lib  -L/home/coder/.conda/envs/rapids/targets/x86_64-linux/lib -L/home/coder/.conda/envs/rapids/targets/x86_64-linux/lib/stubs     -Wl,--dependency-file=tests/CMakeFiles/cpp_tests.dir/link.d tests/CMakeFiles/cpp_tests.dir/main.cpp.o tests/CMakeFiles/cpp_tests.dir/test_basic_io.cpp.o tests/CMakeFiles/cpp_tests.dir/test_defaults.cpp.o -o gtests/cpp_tests  -Wl,-rpath,/home/coder/kvikio/cpp/build/conda/cuda-12.8/release:  libkvikio.so  lib/libgmock.a  lib/libgtest.a  /home/coder/.conda/envs/rapids/lib/libcudart.so  -lpthread  -ldl  /home/coder/.conda/envs/rapids/x86_64-conda-linux-gnu/sysroot/usr/lib/librt.so && :
/home/coder/.conda/envs/rapids/bin/../lib/gcc/x86_64-conda-linux-gnu/13.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: tests/CMakeFiles/cpp_tests.dir/test_defaults.cpp.o: in function `Defaults_parse_http_status_codes_Test::TestBody()':
test_defaults.cpp:(.text._ZN37Defaults_parse_http_status_codes_Test8TestBodyEv+0x5ad): undefined reference to `kvikio::detail::parse_http_status_codes(std::basic_string_view<char, std::char_traits<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
/home/coder/.conda/envs/rapids/bin/../lib/gcc/x86_64-conda-linux-gnu/13.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: test_defaults.cpp:(.text._ZN37Defaults_parse_http_status_codes_Test8TestBodyEv+0x9ba): undefined reference to `kvikio::detail::parse_http_status_codes(std::basic_string_view<char, std::char_traits<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

Edit: perhaps because I didn't add my new file to CMakeLists.txt. Trying that now.

TomAugspurger (Author):

CI is passing now.

madsbk (Member) left a comment

Looks great, thanks @TomAugspurger

TomAugspurger (Author):

Looking into the CI failure at https://github.com/rapidsai/kvikio/actions/runs/13444589202/job/37567619135?pr=603, with failures like

    def test_http_max_attempts():
>       before = kvikio.defaults.http_max_attempts()
E       AttributeError: module 'kvikio.defaults' has no attribute 'http_max_attempts'

That would surprise me. It's almost as if the code being tested was from main or something.

TomAugspurger (Author):

https://github.com/rapidsai/kvikio/actions/runs/13444589202/job/37567618823?pr=603#step:9:359 shows we got

kvikio                    25.04.00a25     cuda11_py310_250219_g77da587_25    rapidsai-nightly

Looking at https://anaconda.org/rapidsai-nightly/kvikio/files?page=2, I see a package with that name from 40 hours ago: https://anaconda.org/rapidsai-nightly/kvikio/25.04.00a25/download/linux-64/kvikio-25.04.00a25-cuda11_py310_250219_g77da587_25.conda. So it does seem like we were running an older version of the code, rather than this branch. I'll keep looking into why that might be.

bdice (Contributor) commented Feb 21, 2025

@TomAugspurger I think CI is failing because the packages built by this PR pin to nvcomp==4.2.0.11. We are having an issue with the CI package cache not having up-to-date repodata, and we are attempting to debug that (https://github.com/nv-gha-runners/roadmap/issues/192). Because it can't find nvcomp==4.2.0.11, it falls back to the kvikio packages on rapidsai-nightly.

jameslamb (Member):

I don't agree with the theory that the most recent builds are failing here because of the proxy cache issues with conda packages.

Across all of pip devcontainers, conda devcontainers, C++ wheel builds, and C++ conda builds, I see the same compilation error:

/home/coder/kvikio/cpp/include/kvikio/http_status_codes.hpp:34:6: error: ‘vector’ in namespace ‘std’ does not name a template type
   34 | std::vector<int> parse_http_status_codes(std::string_view env_var_name,
      |      ^~~~~~
/home/coder/kvikio/cpp/include/kvikio/http_status_codes.hpp:1:1: note: ‘std::vector’ is defined in header ‘<vector>’; did you forget to ‘#include <vector>’?
  +++ |+#include <vector>
    1 | /*
ninja: build stopped: subcommand failed.

(pip devcontainer build)

pip devcontainers, for example, would be totally unaffected by any conda-specific issue.

There was also a successful nightly test run 11 hours ago, where conda-python-tests jobs pulled in nvcomp==4.2.0.11 (with the proxy cache enabled). I think builds are failing here because you're genuinely missing an #include somewhere.

TomAugspurger (Author):

I think builds are failing here because you're genuinely missing an #include somewhere

Yep, I was. Fixed in ae2ea11.

jameslamb (Member):

Ok yeah, on a re-run it failed in the same way as #603 (comment), and nvcomp==4.2.0.11 was not pulled:

nvcomp                    4.1.1.1              hf3d1f9a_0    conda-forge

https://github.com/rapidsai/kvikio/actions/runs/13466571002/job/37633868639?pr=603

So I think @bdice was right, and this caching thing was the problem.

bdice (Contributor) left a comment

Approving with one note. We should merge this once CI passes -- I merged in the upstream now that the CI conda package cache is disabled (which caused problems earlier).


@contextlib.contextmanager
def set_http_max_attempts(attempts: int):
    """Context for resetting the maximum number of HTTP attempts.
A contributor commented on this snippet:

It's a little confusing to say that a context "resets" a value. What is the difference between "setting" and "resetting" the value? I also think the wording of http_max_attempts_reset(attempts: int) is awkward.

However, I recognize this naming is consistent with other kvikio APIs so I don't want to block on anything here. Maybe we can look for another library API that we agree is clear about this kind of behavior (setting, getting, setting-in-context) and refactor accordingly.
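
For discussion, one hypothetical shape for the get / set / set-in-context trio (the names below are illustrative, not kvikio's actual API):

import contextlib

_max_attempts = 3  # module-level default, purely for illustration

def max_attempts() -> int:
    # getter: report the current value
    return _max_attempts

def set_max_attempts(attempts: int) -> None:
    # setter: change the value until it is changed again
    global _max_attempts
    _max_attempts = attempts

@contextlib.contextmanager
def max_attempts_set_to(attempts: int):
    # set-in-context: change the value for the duration of a with-block,
    # then restore whatever was there before
    previous = max_attempts()
    set_max_attempts(attempts)
    try:
        yield
    finally:
        set_max_attempts(previous)

Splitting the permanent setter from the context manager keeps "setting" and "setting-in-context" as distinct, unambiguous names.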

Labels
improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)
Development

Successfully merging this pull request may close these issues.

Remote IO should retry certain error codes
6 participants