[TestFailure] Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Slow fails on ARM #82769
Tagging subscribers to this area: @dotnet/ncl
Issue Details
There are about 120 failures in the last 2 weeks (as of 28-02-2023).
Gives me this list:
Moving discussion:
So I have experimentally increased …
I'd really like to understand the scenario more. That commit is essentially saying to QUIC that it's OK to be stuck for 2.5 seconds before sending an ACK to the peer, even though we've told them we will respond within 25 ms. This will result in the peer sending a retransmit after the 25 ms (likely multiple times) and will result in different test behavior. Best case, the number of packets required is a little different. Worst case, depending on the configuration, the connection is killed for a different reason ("disconnect timeout").
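(Back-of-the-envelope, assuming RFC 9002 probe-timeout behavior with an initial PTO on the order of the advertised 25 ms max_ack_delay and exponential backoff: the cumulative probe time after n timeouts is roughly 25 ms × (2^n − 1), so a 2.5 s stall leaves room for about seven retransmit probes before the delayed ACK finally goes out.)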
We're testing parallel connection attempts to make sure our logic to limit the number of incoming, established connections that haven't been accepted by the user yet never lets that number outgrow the configured limit. So this test introduces parallel processing, bane of all Quic tests apparently, which in this case is necessary for the test to test what it should. Once this test gets run on a "slow" enough system, it starts failing because connections are discarded by msquic and the observed counts of accepted and rejected connections no longer correspond to the expected numbers. We have encountered this problem multiple times in our tests and it keeps resurfacing. Setting this parameter is the only way to prevent msquic from discarding connections, as far as I understand. I just want to make our tests more stable. If you know of a better way to achieve that, I'll gladly jump on it. A minimal sketch of the scenario follows below.
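For readers unfamiliar with the test, here is a minimal sketch of what it exercises (hypothetical names and numbers, assuming the .NET 7+ System.Net.Quic API in a top-level program with implicit usings; this is not the actual test code):

```csharp
// Hypothetical sketch of the backlog-limit scenario, not the actual test code.
using System.Net;
using System.Net.Quic;
using System.Net.Security;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

// Self-signed certificate for the test listener (serverAuth EKU).
using var key = RSA.Create(2048);
var request = new CertificateRequest("CN=localhost", key, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
request.CertificateExtensions.Add(new X509EnhancedKeyUsageExtension(
    new OidCollection { new Oid("1.3.6.1.5.5.7.3.1") }, critical: false));
using var serverCertificate = request.CreateSelfSigned(DateTimeOffset.UtcNow.AddDays(-1), DateTimeOffset.UtcNow.AddDays(1));

var alpn = new List<SslApplicationProtocol> { new("backlog-test") };

// Listener with a deliberately small backlog: once this many connections are
// established but not yet pulled out via AcceptConnectionAsync, further
// handshakes must be refused rather than silently dropped.
await using QuicListener listener = await QuicListener.ListenAsync(new QuicListenerOptions
{
    ListenEndPoint = new IPEndPoint(IPAddress.Loopback, 0),
    ApplicationProtocols = alpn,
    ListenBacklog = 10,
    ConnectionOptionsCallback = (_, _, _) => ValueTask.FromResult(new QuicServerConnectionOptions
    {
        DefaultStreamErrorCode = 0,
        DefaultCloseErrorCode = 0,
        ServerAuthenticationOptions = new SslServerAuthenticationOptions
        {
            ApplicationProtocols = alpn,
            ServerCertificate = serverCertificate
        }
    })
});

// Fire parallel clients and count how many handshakes succeed vs. get refused.
int accepted = 0, refused = 0;
await Task.WhenAll(Enumerable.Range(0, 100).Select(async _ =>
{
    try
    {
        await QuicConnection.ConnectAsync(new QuicClientConnectionOptions
        {
            RemoteEndPoint = listener.LocalEndPoint,
            DefaultStreamErrorCode = 0,
            DefaultCloseErrorCode = 0,
            ClientAuthenticationOptions = new SslClientAuthenticationOptions
            {
                ApplicationProtocols = alpn,
                RemoteCertificateValidationCallback = (_, _, _, _) => true
            }
        });
        Interlocked.Increment(ref accepted);
    }
    catch (QuicException)
    {
        // Attempts over the backlog limit should fail on the client side.
        Interlocked.Increment(ref refused);
    }
}));

// The invariant under test: every attempt is either accepted or refused, and
// established-but-unaccepted connections never exceed ListenBacklog. If msquic
// discards connections under load, these counts no longer add up, which is the
// failure mode described above.
Console.WriteLine($"accepted={accepted} refused={refused}");
```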
I'm wondering if we can get an MsQuic trace and/or packet captures to verify that the logic works as expected.
Can you simply use a smaller limit at your layer on these slower machines? As we've discussed before, MsQuic is operating as expected. It dynamically responds to the capabilities of the machines. A 2-CPU VM obviously isn't going to be able to handle thousands of handshakes per second. What are the exact numbers you are testing, both in terms of connections and CPUs? Based on our perf tests (no .NET included 😄) our perf machines can only handle about 250 connections per core per second, best case. I know .NET does more complicated things on top compared to our perf tests, and with VMs (not baremetal) you're likely to have even lower limits.
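One way to act on that suggestion (a hypothetical adjustment, not something the test does today) would be to derive the parallel-client count from the core count instead of hard-coding it:

```csharp
// Hypothetical scaling of the test load, using the ~250 connections/core/sec
// figure quoted above as a rough budget; the constants here are illustrative.
int parallelClients = Math.Clamp(Environment.ProcessorCount * 25, 8, 100);
```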
100 parallel connections are made at the beginning of the test and the handshake is successfully finished for all of them. We have other combinations of numbers of attempted and accepted connections, but those are the ones that were failing. EDIT: the handshake should finish for all of them, but some are discarded before that.
And we have tests with lower numbers as well. Now we have to somehow figure out where and when to suppress these more demanding tests, as well as others like #83101; see the sketch below.
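In dotnet/runtime, such suppression is usually done with the test attributes the repo already uses; a sketch of what that could look like (whether IsArm64Process is the right condition here is an assumption):

```csharp
// Hypothetical suppression: [ActiveIssue] and [OuterLoop] come from the xunit
// extensions that dotnet/runtime tests already use; the exact PlatformDetection
// condition is illustrative, not a confirmed fix.
[ActiveIssue("https://github.com/dotnet/runtime/issues/82769", typeof(PlatformDetection), nameof(PlatformDetection.IsArm64Process))]
[OuterLoop("Hundreds of parallel handshakes; too demanding for slow machines")]
[Fact]
public Task Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Slow()
{
    // ... existing test body ...
    return Task.CompletedTask;
}
```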
e.g. https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-96bd05a3f91a4661b1/System.Net.Quic.Functional.Tests/3/console.60e17a73.log?helixlogtype=result
Outerloop test. There are about 180 failures in the last 2 weeks (as of 28-02-2023).
Kusto query