
Retry on HTTP 50x errors #603

Open · wants to merge 37 commits into base: branch-25.04

Conversation

TomAugspurger (Author):

This updates our remote IO HTTP handler to check the status code of the response. If we get a 50x error, we'll retry up to some limit.

Closes #601
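
For illustration only, here is a minimal, Python-level sketch of the retry idea (the actual change lives in the C++ libcurl handler; the status-code set, the attempt limit, and the backoff below are assumptions for this sketch, not necessarily what the PR implements):

import time
import urllib.error
import urllib.request

RETRYABLE = {500, 502, 503, 504}  # assumed set of retryable 50x codes

def get_with_retries(url: str, max_attempts: int = 3) -> bytes:
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # give up on non-retryable codes, or once the attempt limit is reached
            if err.code not in RETRYABLE or attempt == max_attempts:
                raise
            time.sleep(2**attempt)  # simple backoff before the next attempt
    raise AssertionError("unreachable: the last attempt either returned or re-raised")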


copy-pr-bot bot commented Jan 29, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


copy-pr-bot bot commented Jan 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

TomAugspurger marked this pull request as ready for review January 29, 2025 22:31
TomAugspurger requested review from a team as code owners January 29, 2025 22:31
madsbk added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Jan 30, 2025
madsbk (Member) left a comment

Looks good @TomAugspurger.
I agree, it would be good to make the code list configurable. Or at least, define it as a constexpr somewhere.

TomAugspurger (Author):

Apologies for the force push. The commits made from my devcontainer last week weren't being signed for some reason.

TomAugspurger (Author):

The two CI failures appear to be from the 6-hour timeout on the github action: https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707434?pr=603#step:9:1556

context canceled
python/kvikio/tests/test_benchmarks.py::test_http_io[cupy] 
Error: The operation was canceled.

I assume that's unrelated to the changes here. If possible, it might be best to rerun just those failed jobs?

bdice (Contributor) commented Feb 3, 2025

If there are jobs with hangs, we need to diagnose those offline and not rerun them. Consuming a GPU runner for 6 hours is not good, especially with our limited supply of ARM nodes.

TomAugspurger (Author):

Makes sense. https://github.com/rapidsai/kvikio/actions/runs/13117029669 (from #1465) also took much longer than normal on these same two matrix entries.

https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707886?pr=603 is one of the slow jobs. Its timeline:

  • 16:23:01 started Run tests
  • 16:23:53-16:26:25 compiling numcodecs at
    • something to look into: Why are we compiling numcodecs? See if it can provide a wheel and save us some time.
  • 16:27:13 started pytest
  • 16:28:05 last successful test finished
  • 22:20:06 Run canceled while running python/kvikio/tests/test_benchmarks.py::test_http_io[cupy]
  • ~1 second ago: I realized my code is almost surely to blame :)

A bit strange it passed on conda though. I'll take a look.

TomAugspurger (Author):

That said, https://github.com/rapidsai/kvikio/actions/runs/13117029669/job/36594707434 (testing #608) also timed out after 6 hours on the same test, and it was running at around the same time.

That test seems to use the run_cmd fixture to run a benchmark in a subprocess. I don't think we have logs to confirm it, but it's almost surely hanging while starting that subprocess or within it. I'll look into adding a timeout mechanism to run_cmd (cc @kingcrimsontianyu, just in case your PR hits that timeout again, no need for you to investigate too).

TomAugspurger (Author) commented Feb 4, 2025

I wish I were more confident, but the hang is probably happening in

res: subprocess.CompletedProcess = subprocess.run(
    cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, cwd=cwd
)  # type: ignore
We could probably catch most of these by setting a timeout in that subprocess.run call. However, that's not the easiest to integrate into the rest of the run_cmd fixture, since it uses blocking calls to .send() and .recv() to send test commands and receive results, and those don't have timeout parameters. If we raise a TimeoutError there, run_cmd would hang on the .recv() since the server never writes anything to the pipe.
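
For reference, a minimal sketch of what that timeout would look like (the command and the 60-second limit are placeholders, and by itself this doesn't solve the .recv() problem described above):

import subprocess

cmd = ["python", "-c", "print('hello')"]  # placeholder for the benchmark command

try:
    res = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, timeout=60
    )
except subprocess.TimeoutExpired as err:
    # the child has been killed; report the hang instead of waiting forever
    print(f"benchmark timed out after {err.timeout}s")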

I'd recommend two things:

  1. Add pytest-timeout as a test dependency, and ensure that these tests have a timeout. With small timeouts and added time.sleep calls in the http_io.py file, I've confirmed that pytest-timeout does interrupt the individual tests and the test process finishes (see the sketch after this list).
  2. Investigate the cause of the hangs in the first place. I'm almost certain we should be setting CURLOPT_TIMEOUT somewhere in libcurl.cpp before we perform any requests, which means we would need to pick a default and expose it to the user as a configuration value / parameter for requests made by kvikio. That should probably be done as a separate PR (Set timeouts for HTTP requests #613).
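
A rough reproduction of the check described in item 1 (pytest-timeout is assumed to be installed; the 5-second limit and the sleep are illustrative, not the values used in CI):

import time

import pytest

@pytest.mark.timeout(5)  # pytest-timeout aborts the test after 5 seconds
def test_http_io_does_not_hang():
    time.sleep(60)  # deliberately hangs; pytest-timeout interrupts it and the run still finishes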

TomAugspurger (Author) commented Feb 4, 2025

The two wheel test failures are from segfaults, somewhere in the call to open_http while running python/kvikio/tests/test_examples.py::test_http_io: https://github.com/rapidsai/kvikio/actions/runs/13137420435/job/36656808281?pr=603#step:9:1578

Looking into it.

Edit: I'm not able to reproduce this locally. pytest-timeout works by setting a SIGALRM timer at test start and clearing it at test end. The only thing related to signals I see in kvikio is us setting CURLOPT_NOSIGNAL = 1 at

// Need CURLOPT_NOSIGNAL to support threading, see
// <https://curl.se/libcurl/c/CURLOPT_NOSIGNAL.html>
setopt(CURLOPT_NOSIGNAL, 1L);
Based on the docs, it sounds like there's a risk of clashing over the use of SIGALRM:

This option may cause libcurl to use the SIGALRM signal to timeout system calls on builds not using asynch DNS. In Unix-like systems, this might cause signals to be used unless CURLOPT_NOSIGNAL is set.

but we are using CURLOPT_NOSIGNAL so I'm not sure.

b641240 updates the timeout to use threads instead.
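
If that commit switches pytest-timeout to its thread-based method (an assumption; the commit itself isn't shown here), the marker avoids SIGALRM entirely by watching the test from a separate thread:

import pytest

@pytest.mark.timeout(60, method="thread")  # 60 s is an arbitrary placeholder
def test_http_io():
    ...

The same behavior can also be enabled globally through pytest-timeout's timeout_method setting.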

- added regression test
- adjust the initial request count to 0
- adjust the error message to print the attempted count
TomAugspurger (Author) commented Feb 18, 2025

Fixed the merge commits. This should be ready for another review / good to go.

Edit: I must have messed up the merge. Looking into CI now.

TomAugspurger (Author) commented Feb 18, 2025

@kingcrimsontianyu do you see anything obviously wrong with how I've structured the parse_http_status_codes function? I've tried to follow how you did parse_compat_mode, but my local build is failing with

$ build-all
[3/3] Linking CXX executable gtests/cpp_tests
FAILED: gtests/cpp_tests 
: && /home/coder/.conda/envs/rapids/bin/x86_64-conda-linux-gnu-c++ -fvisibility-inlines-hidden -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/coder/.conda/envs/rapids/include  -I/home/coder/.conda/envs/rapids/targets/x86_64-linux/include -O3 -DNDEBUG -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath,/home/coder/.conda/envs/rapids/lib -Wl,-rpath-link,/home/coder/.conda/envs/rapids/lib -L/home/coder/.conda/envs/rapids/lib  -L/home/coder/.conda/envs/rapids/targets/x86_64-linux/lib -L/home/coder/.conda/envs/rapids/targets/x86_64-linux/lib/stubs     -Wl,--dependency-file=tests/CMakeFiles/cpp_tests.dir/link.d tests/CMakeFiles/cpp_tests.dir/main.cpp.o tests/CMakeFiles/cpp_tests.dir/test_basic_io.cpp.o tests/CMakeFiles/cpp_tests.dir/test_defaults.cpp.o -o gtests/cpp_tests  -Wl,-rpath,/home/coder/kvikio/cpp/build/conda/cuda-12.8/release:  libkvikio.so  lib/libgmock.a  lib/libgtest.a  /home/coder/.conda/envs/rapids/lib/libcudart.so  -lpthread  -ldl  /home/coder/.conda/envs/rapids/x86_64-conda-linux-gnu/sysroot/usr/lib/librt.so && :
/home/coder/.conda/envs/rapids/bin/../lib/gcc/x86_64-conda-linux-gnu/13.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: tests/CMakeFiles/cpp_tests.dir/test_defaults.cpp.o: in function `Defaults_parse_http_status_codes_Test::TestBody()':
test_defaults.cpp:(.text._ZN37Defaults_parse_http_status_codes_Test8TestBodyEv+0x5ad): undefined reference to `kvikio::detail::parse_http_status_codes(std::basic_string_view<char, std::char_traits<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
/home/coder/.conda/envs/rapids/bin/../lib/gcc/x86_64-conda-linux-gnu/13.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: test_defaults.cpp:(.text._ZN37Defaults_parse_http_status_codes_Test8TestBodyEv+0x9ba): undefined reference to `kvikio::detail::parse_http_status_codes(std::basic_string_view<char, std::char_traits<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

Edit: perhaps because I didn't add my new file to CMakeLists.txt. Trying that now.

TomAugspurger (Author):

CI is passing now.

madsbk (Member) left a comment

Looks great, thanks @TomAugspurger

TomAugspurger (Author):

Looking into the CI failure at https://github.com/rapidsai/kvikio/actions/runs/13444589202/job/37567619135?pr=603, with failures like

    def test_http_max_attempts():
>       before = kvikio.defaults.http_max_attempts()
E       AttributeError: module 'kvikio.defaults' has no attribute 'http_max_attempts'

That would surprise me. It's almost as if the code being tested was from main or something.

TomAugspurger (Author):

https://github.com/rapidsai/kvikio/actions/runs/13444589202/job/37567618823?pr=603#step:9:359 shows we got

kvikio                    25.04.00a25     cuda11_py310_250219_g77da587_25    rapidsai-nightly

Looking at https://anaconda.org/rapidsai-nightly/kvikio/files?page=2, I see a package with that name from 40 hours ago: https://anaconda.org/rapidsai-nightly/kvikio/25.04.00a25/download/linux-64/kvikio-25.04.00a25-cuda11_py310_250219_g77da587_25.conda. So it does seem like we were running an older version of the code, rather than this branch. I'll keep looking into why that might be.

bdice (Contributor) commented Feb 21, 2025

@TomAugspurger I think CI is failing because the packages built by this PR pin to nvcomp==4.2.0.11. We are having an issue with the CI package cache not having up-to-date repodata, and we are attempting to debug that (https://github.com/nv-gha-runners/roadmap/issues/192). Because it can't find nvcomp==4.2.0.11, it falls back to the kvikio packages on rapidsai-nightly.

jameslamb (Member):

I don't agree with the theory that the most recent builds are failing here because of the proxy cache issues with conda packages.

Across all of pip devcontainers, conda devcontainers, C++ wheel builds, and C++ conda builds, I see the same compilation error:

/home/coder/kvikio/cpp/include/kvikio/http_status_codes.hpp:34:6: error: ‘vector’ in namespace ‘std’ does not name a template type
   34 | std::vector<int> parse_http_status_codes(std::string_view env_var_name,
      |      ^~~~~~
/home/coder/kvikio/cpp/include/kvikio/http_status_codes.hpp:1:1: note: ‘std::vector’ is defined in header ‘<vector>’; did you forget to ‘#include <vector>’?
  +++ |+#include <vector>
    1 | /*
ninja: build stopped: subcommand failed.

(pip devcontainer build)

pip devcontainers, for example, would be totally unaffected by any conda-specific issue.

There was also a successful nightly test run 11 hours ago, where conda-python-tests jobs pulled in nvcomp==4.2.0.11 (with the proxy cache enabled). I think builds are failing here because you're genuinely missing an #include somewhere.

TomAugspurger (Author):

I think builds are failing here because you're genuinely missing an #include somewhere

Yep, I was. Fixed in ae2ea11.

jameslamb (Member):

Ok yeah, on a re-run it failed in the same way as #603 (comment), and nvcomp==4.2.0.11 was not pulled:

nvcomp                    4.1.1.1              hf3d1f9a_0    conda-forge

https://github.com/rapidsai/kvikio/actions/runs/13466571002/job/37633868639?pr=603

So I think @bdice was right, and this caching thing was the problem.

bdice (Contributor) left a comment

Approving with one note. We should merge this once CI passes -- I merged in the upstream now that the CI conda package cache is disabled (which caused problems earlier).


@contextlib.contextmanager
def set_http_max_attempts(attempts: int):
    """Context for resetting the maximum number of HTTP attempts.
A contributor commented on this snippet:

It's a little confusing to say that a context "resets" a value. What is the difference between "setting" and "resetting" the value? I also think the wording of http_max_attempts_reset(attempts: int) is awkward.

However, I recognize this naming is consistent with other kvikio APIs so I don't want to block on anything here. Maybe we can look for another library API that we agree is clear about this kind of behavior (setting, getting, setting-in-context) and refactor accordingly.
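
For discussion, one hypothetical shape for the get / set / set-in-context trio (the names below are illustrative, not kvikio's actual API):

import contextlib

_max_attempts = 3  # module-level default, purely for illustration

def max_attempts() -> int:
    # getter: report the current value
    return _max_attempts

def set_max_attempts(attempts: int) -> None:
    # setter: change the value until it is changed again
    global _max_attempts
    _max_attempts = attempts

@contextlib.contextmanager
def max_attempts_set_to(attempts: int):
    # set-in-context: change the value for the duration of a with-block,
    # then restore whatever was there before
    previous = max_attempts()
    set_max_attempts(attempts)
    try:
        yield
    finally:
        set_max_attempts(previous)

Splitting the permanent setter from the context manager keeps "setting" and "setting-in-context" as distinct, unambiguous names.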

Labels
improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)
Development

Successfully merging this pull request may close these issues.

Remote IO should retry certain error codes
6 participants