Help diagnosing hang in unstable network environments #1210

pfnsec · 2024-10-30T11:43:56Z

pfnsec
Oct 30, 2024

Hi all, I am running the latest published crates and attempting to use the AWS Route53 SDK with Rust. In general, the functionality is excellent. However, I have been facing a very frustrating issue when operating on an unstable connection, such as a hotel room or coffeeshop.

Simply put, a request such as ListHostedZones or ListResourceRecordSets will hang for no apparent reason, seemingly indefinitely. I have tried all combinations of stalled stream protection and connect/operation/operation attempt/read timeout settings, to no avail. No error is returned; it just hangs forever! The debugger shows all tokio threads parked, waiting on a futex.

Frustratingly, even wrapping it in tokio::time::timeout still hangs!

                    tokio::time::timeout(
                        Duration::from_secs(5),
                        self.client
                            .list_resource_record_sets()
                            .set_hosted_zone_id(Some(hz.id.clone()))
                            .set_start_record_name(Some(name.to_string()))
                            .start_record_type(RrType::try_parse(&r#type)?)
                            .send()
                    ).await??;
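
A timeout future can only report that time has elapsed when the runtime polls it again, so if the task that owns it is never polled (or has been dropped), no error ever surfaces. As a diagnostic aid, a watchdog on a plain OS thread does not depend on the async runtime at all. The following is a minimal std-only sketch under that assumption, not something taken from the SDK or from tokio: `run_with_watchdog` is a hypothetical helper, and the closure stands in for whatever work is being timed.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical helper: runs `work` on a separate OS thread and waits at most
/// `limit` for its result. Returns `None` if nothing arrived in time; the
/// worker thread is not killed, it is simply no longer waited for.
fn run_with_watchdog<T, F>(limit: Duration, work: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // A send error only means the watchdog already gave up; ignore it.
        let _ = tx.send(work());
    });
    match rx.recv_timeout(limit) {
        Ok(value) => Some(value),
        // Either the deadline passed or the worker exited without a result.
        Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => None,
    }
}
```

If a watchdog like this reports a timeout while the in-runtime timeout stays silent, that points at the task not being polled rather than at the network request itself.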

The behaviour seems to be very nondeterministic. The number of requests I can successfully issue changes every time. Delaying my requests by any number of milliseconds to reduce traffic doesn't seem to help, either.

I have set RUST_LOG=trace, and no output is emitted after the hang occurs.

I tried analysing with Wireshark and discovered that the failure always occurs after the client sends the server a TCP reset:
(The dest IP here is confirmed to be AWS)

This leads me to believe that this is caused by flaky or unstable network conditions. Unfortunately this is a complete show-stopper for me, as I expect to be on the road for a while longer. Can you advise on how to configure the AWS SDK for these conditions? If my conclusions are incorrect, what else could be causing this?

Thank you and best regards.

ysaito1001 · 2024-10-30T16:21:55Z

ysaito1001
Oct 30, 2024
Maintainer

Thank you for reporting this! Before we proceed, I have clarifying questions:

What behavior do you expect from the SDK under those circumstances? Should it exit with an error instead of hanging, on the understanding that operations like ListHostedZones are not expected to succeed?
Given the environment the program runs in, I suspect the hang might not be specific to the SDK. If you run another Rust application that uses Tokio as its async runtime, do you see a similar hang in the same environment?

1 reply

pfnsec Oct 30, 2024
Author

Hi, thanks for the speedy reply.
I'd expect the response to succeed - the other responses succeed, and there's nothing inherently different about the particular request in the for loop that should cause it to fail. In fact, whatever response fails or succeeds seems to depend on whatever artificial delays I introduce. The only common factor seems to be that I observe a TCP RST in wireshark just before whatever request fails and hangs indefinitely.

I've never observed this failure mode with any other app that uses tokio. I'm currently at a bar in Taipei and the connection doesn't seem particularly unstable, but I'm still very consistently seeing this failure. I'm not actually able to enumerate all of my Route53 hosted zones, that is, to fetch every hosted zone and its respective record set. Every single attempt fails, quite consistently. I think this may not be an issue of network stability after all...

ip link -s [interface] doesn't even show that many dropped packets. Maybe 1 in a million? Not enough to be fragging things this badly. But I consistently see a TCP RST from the client to an IP listed as belonging to AWS before every failure.

If I add arbitrary delays before each request, the particular hosted zone/record set at which it fails is different. It gets further along in the process when I reduce the delay, but when I add a delay of 2 seconds before any request, it hits this error mode after nearly the first hosted zone's record set fetch.

Here is the end of the trace logs:

2024-10-30T16:35:50.520409Z TRACE invoke{service=route53 operation=ListResourceRecordSets sdk_invocation_id=1642773}:try_op:try_attempt: take? ("https", route53.amazonaws.com): expiration = Some(90s)
2024-10-30T16:35:50.520423Z DEBUG invoke{service=route53 operation=ListResourceRecordSets sdk_invocation_id=1642773}:try_op:try_attempt: reuse idle connection for ("https", route53.amazonaws.com)
2024-10-30T16:35:50.520464Z TRACE encode_headers: Client::encode method=GET, body=None
2024-10-30T16:35:50.520555Z DEBUG flushed 646 bytes
2024-10-30T16:35:50.520563Z TRACE flushed({role=client}): State { reading: Init, writing: KeepAlive, keep_alive: Busy }
2024-10-30T16:35:50.521389Z TRACE callback receiver has dropped
2024-10-30T16:35:50.521412Z TRACE dispatch no longer receiving messages
2024-10-30T16:35:50.521420Z TRACE State::close_read()
2024-10-30T16:35:50.521428Z TRACE State::close_write()
2024-10-30T16:35:50.521434Z TRACE flushed({role=client}): State { reading: Closed, writing: Closed, keep_alive: Disabled }
2024-10-30T16:35:50.521450Z DEBUG Sending warning alert CloseNotify    
2024-10-30T16:35:50.521513Z TRACE shut down IO complete
2024-10-30T16:35:50.521555Z TRACE pool closed, canceling idle interval
2024-10-30T16:35:50.521570Z TRACE worker polling for next message
2024-10-30T16:35:50.521622Z DEBUG Sending warning alert CloseNotify

I'll provide literally whatever trace/logging info you need for this. This is totally killing me and I can't figure out why this is happening. I've tried with single/multithreaded tokio to no avail. I was trying to set up native-tls thinking it was a rustls issue but I can't figure out how to do that either. I've been fiddling with my kernel/sysctl settings for MTU sizes and all kinds of other stuff to no avail.

pfnsec · 2024-10-31T03:27:34Z

pfnsec
Oct 31, 2024
Author

Ok, so I think this might be related to this:

hyperium/hyper#2312

So I'm trying to test that hypothesis by setting pool_max_idle_per_host(0). Except I can't figure out how to do that.

#448
This is apparently deprecated because the aws-smithy-client crate is deprecated.

Meanwhile this:
https://docs.aws.amazon.com/fr_fr/sdk-for-rust/latest/dg/hyper1.html

...doesn't expose the pool_max_idle_per_host() or much of anything else in the builder.

Nor does this:

#965

... in fact the method HyperClientBuilder::new().build() no longer even exists. So I'm not sure how one is even supposed to use that.

And this:
https://tikv.github.io/doc/reqwest/blocking/struct.ClientBuilder.html

...complains that the trait bound reqwest::blocking::Client: HttpClient is not satisfied...

So now my question becomes, how do I set up a custom http_client and set pool_max_idle_per_host on the latest version of the SDK?

5 replies

ysaito1001 Oct 31, 2024
Maintainer

Thanks for digging and providing detailed analysis.

"how do I set up a custom http_client and set pool_max_idle_per_host on the latest version of the SDK?"

I believe this code snippet should get you pretty far. But to configure pool_max_idle_per_host for the SDK, we need to prepare a hyper::client::Builder like so:

    use hyper::Client;
    use aws_smithy_runtime::client::http::hyper_014::HyperClientBuilder;

    let mut builder = Client::builder();
    let hyper_builder = builder/* possibly other config calls go here */.pool_max_idle_per_host(0);
    let hyper_builder = std::mem::take(hyper_builder); // so that the owned value can later be passed to `HyperClientBuilder::hyper_builder`

and pass hyper_builder to HyperClientBuilder in the snippet linked above. So instead of

    let hyper_client = HyperClientBuilder::new().build(tls_connector);

in the snippet, it will be

    let hyper_client = HyperClientBuilder::new().hyper_builder(hyper_builder).build(tls_connector);

pfnsec Nov 1, 2024
Author

Argh, so close!

        let mut builder = hyper::Client::builder();

        let hyper_builder = builder.pool_max_idle_per_host(0);
        let hyper_builder = std::mem::take(hyper_builder);

        let rustls_connector = hyper_rustls::HttpsConnectorBuilder::new()
            .with_webpki_roots()
            .https_only()
            .enable_http1()
            .enable_http2()
            .build();
        let hyper_client = HyperClientBuilder::new().hyper_builder(hyper_builder).build(rustls_connector);

the trait bound `HttpsConnector<hyper_util::client::legacy::connect::http::HttpConnector>: hyper::service::Service<hyper::Uri>` is not satisfied
the trait `hyper::service::Service<http::Uri>` is implemented for `HttpsConnector<hyper_util::client::legacy::connect::http::HttpConnector>`
for that trait implementation, expected `http::Uri`, found `hyper::Uri`
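
An error of the form "expected `http::Uri`, found `hyper::Uri`" is what one would typically see when two incompatible versions of the same crate end up in the dependency graph: the type names match, but the compiler treats them as unrelated types. The sketch below imitates that situation with two modules. It is an illustration only; `http_v0_2`, `http_v1`, `Connector`, and `NewConnector` are hypothetical stand-ins, not real crate items.

```rust
use std::any::TypeId;

// Stand-ins for the same type as published by two versions of one crate.
mod http_v0_2 {
    pub struct Uri(pub String);
}
mod http_v1 {
    pub struct Uri(pub String);
}

/// Hypothetical connector trait, generic over the URI type it accepts.
trait Connector<U> {
    fn connect(&self, dst: U) -> String;
}

/// A connector built against the newer "version" only.
struct NewConnector;

impl Connector<http_v1::Uri> for NewConnector {
    fn connect(&self, dst: http_v1::Uri) -> String {
        dst.0
    }
}

/// Reports whether two types are the same as far as the compiler is concerned.
fn same_type<A: 'static, B: 'static>() -> bool {
    TypeId::of::<A>() == TypeId::of::<B>()
}

/// Code written against the older "version" builds this value, but it cannot
/// be handed to `NewConnector`: `NewConnector.connect(old_uri())` is rejected
/// with an unsatisfied trait bound, much like the error above.
fn old_uri() -> http_v0_2::Uri {
    http_v0_2::Uri(String::from("https://example.com"))
}
```

In a real project, `cargo tree -d` lists crates that appear in more than one version, which is a quick way to check for this kind of mismatch.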

ysaito1001 Nov 1, 2024
Maintainer

Hmm, the snippet you provided above built successfully on my end. Did you use HyperClientBuilder from aws_smithy_runtime::client::http::hyper_014::HyperClientBuilder (the cargo features client and connector-hyper-0-14-x need to be enabled in the aws_smithy_runtime crate)? The version of hyper also needs to be 0.14.x for this to work.

pfnsec Nov 1, 2024
Author

Unfortunately, I got it to compile, but it didn't fix my issue :( I'm afraid I'm at a loss.
The messages "callback receiver has dropped" and "dispatch no longer receiving messages" don't seem to be reported anywhere else by anyone... I wonder if it is just something screwy with my setup? Running it in a docker container didn't seem to affect it. I'm at a complete loss. I no longer even think this is down to a flaky connection.

ysaito1001 Nov 1, 2024
Maintainer

Interesting. If the issue seems no longer related to a flaky connection, could you somehow provide a small reproducible example that consistently fails even in a stable connection environment?

pfnsec · 2024-11-04T16:30:24Z

pfnsec
Nov 4, 2024
Author

Hi all, my apologies for the noise. It turns out it was a complete wild goose chase. After going slightly mad, I found that the issue still persisted even when I replaced calls to the AWS Route53 SDK with a simple tokio::time::sleep().await.
Every future was being awaited correctly, and yet, the hang didn't happen with a synchronous sleep - only when there was an await point in a certain "leaf" function!
I thought this was just so ridiculous, and so in a hail-mary, migrated the app from axum to actix-web. And... the bug disappeared! I am puzzled. However, everything is now working correctly!
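
The root cause was never pinned down in this thread. One general mechanism that produces "execution stops at an await point, with no error and no panic" is a future being dropped before it completes: code after the await simply never runs, whereas a synchronous sleep has no await point at which that can happen. Whether that is what happened here is an assumption, not something the thread confirms. The std-only sketch below shows the mechanism; `leaf`, `YieldOnce`, and `NoopWake` are hypothetical names.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Returns `Pending` on the first poll and `Ready` afterwards,
/// standing in for any real await point (a sleep, a network call).
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

/// Waker that does nothing; enough for polling by hand.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical "leaf" function that records how far it got.
async fn leaf(steps: Rc<Cell<u32>>) {
    steps.set(1); // code before the await point
    YieldOnce(false).await;
    steps.set(2); // runs only if the future is polled again
}

/// Polls `leaf` once, then drops it, as a caller cancelling the work would.
fn poll_once_then_drop() -> u32 {
    let steps = Rc::new(Cell::new(0));
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(leaf(steps.clone()));
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    drop(fut); // no error, no panic: the rest of `leaf` never executes
    steps.get()
}

/// Polls `leaf` until it finishes.
fn poll_to_completion() -> u32 {
    let steps = Rc::new(Cell::new(0));
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(leaf(steps.clone()));
    while fut.as_mut().poll(&mut cx).is_pending() {}
    steps.get()
}
```

Here `poll_once_then_drop()` returns 1 and `poll_to_completion()` returns 2. Putting a guard value with a `Drop` implementation at the top of a suspect function is one way to check whether its future is being dropped early.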

I want to thank you all sincerely for your attention on this matter. My hypotheses about any kind of network issue or hyper pool parameters causing this appear to have been completely false.

Best wishes for this excellent project going forward.

0 replies

2024-11-04T20:03:13Z

github-actions[bot]
bot Nov 4, 2024

Hello! Reopening this discussion to make it searchable.

0 replies
