[TestFailure] Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Slow fails on ARM #82769
Tagging subscribers to this area: @dotnet/ncl
Issue Details
There are about 120 failures in the last 2 weeks (as of 28-02-2023).
Gives me this list:
Moving discussion:
So I have experimentally increased …
I'd really like to understand the scenario more. That commit is essentially saying to QUIC that it's OK to be stuck for 2.5 seconds before sending an ACK to the peer, even though we've told them we will respond within 25 ms. This will result in the peer sending a retransmit after the 25 ms (likely multiple times) and will result in different test behavior. Best case, the number of packets required is a little different. Worst case, depending on the configuration, the connection is killed for a different reason ("disconnect timeout").
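(Back-of-the-envelope, assuming RFC 9002 probe-timeout behavior with an initial PTO on the order of the advertised 25 ms max_ack_delay and exponential backoff: the cumulative probe time after n timeouts is roughly 25 ms × (2^n − 1), so a 2.5 s stall leaves room for about seven retransmit probes before the delayed ACK finally goes out.)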
We're testing parallel connection attempts to make sure our logic to limit the number of incoming, established connections that haven't been accepted by the user yet never lets that number outgrow the configured limit. So this test introduces parallel processing, bane of all Quic tests apparently, which in this case is necessary for the test to test what it should. Once this test gets run on a "slow" enough system, it starts failing because connections are discarded by msquic and the observed counts of accepted and rejected connections no longer correspond to the expected numbers. We have encountered this problem multiple times in our tests and it keeps resurfacing. Setting this parameter is the only way to prevent msquic from discarding connections, as far as I understand. I just want to make our tests more stable. If you know of a better way to achieve that, I'll gladly jump on it. A minimal sketch of the scenario follows below.
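For readers unfamiliar with the test, here is a minimal sketch of what it exercises (hypothetical names and numbers, assuming the .NET 7+ System.Net.Quic API in a top-level program with implicit usings; this is not the actual test code):

```csharp
// Hypothetical sketch of the backlog-limit scenario, not the actual test code.
using System.Net;
using System.Net.Quic;
using System.Net.Security;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

// Self-signed certificate for the test listener (serverAuth EKU).
using var key = RSA.Create(2048);
var request = new CertificateRequest("CN=localhost", key, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
request.CertificateExtensions.Add(new X509EnhancedKeyUsageExtension(
    new OidCollection { new Oid("1.3.6.1.5.5.7.3.1") }, critical: false));
using var serverCertificate = request.CreateSelfSigned(DateTimeOffset.UtcNow.AddDays(-1), DateTimeOffset.UtcNow.AddDays(1));

var alpn = new List<SslApplicationProtocol> { new("backlog-test") };

// Listener with a deliberately small backlog: once this many connections are
// established but not yet pulled out via AcceptConnectionAsync, further
// handshakes must be refused rather than silently dropped.
await using QuicListener listener = await QuicListener.ListenAsync(new QuicListenerOptions
{
    ListenEndPoint = new IPEndPoint(IPAddress.Loopback, 0),
    ApplicationProtocols = alpn,
    ListenBacklog = 10,
    ConnectionOptionsCallback = (_, _, _) => ValueTask.FromResult(new QuicServerConnectionOptions
    {
        DefaultStreamErrorCode = 0,
        DefaultCloseErrorCode = 0,
        ServerAuthenticationOptions = new SslServerAuthenticationOptions
        {
            ApplicationProtocols = alpn,
            ServerCertificate = serverCertificate
        }
    })
});

// Fire parallel clients and count how many handshakes succeed vs. get refused.
int accepted = 0, refused = 0;
await Task.WhenAll(Enumerable.Range(0, 100).Select(async _ =>
{
    try
    {
        await QuicConnection.ConnectAsync(new QuicClientConnectionOptions
        {
            RemoteEndPoint = listener.LocalEndPoint,
            DefaultStreamErrorCode = 0,
            DefaultCloseErrorCode = 0,
            ClientAuthenticationOptions = new SslClientAuthenticationOptions
            {
                ApplicationProtocols = alpn,
                RemoteCertificateValidationCallback = (_, _, _, _) => true
            }
        });
        Interlocked.Increment(ref accepted);
    }
    catch (QuicException)
    {
        // Attempts over the backlog limit should fail on the client side.
        Interlocked.Increment(ref refused);
    }
}));

// The invariant under test: every attempt is either accepted or refused, and
// established-but-unaccepted connections never exceed ListenBacklog. If msquic
// discards connections under load, these counts no longer add up, which is the
// failure mode described above.
Console.WriteLine($"accepted={accepted} refused={refused}");
```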
I'm wondering if we can get an MsQuic trace and/or packet captures to verify that the logic works as expected.
Can you simply use a smaller limit at your layer on these slower machines? As we've discussed before, MsQuic is operating as expected. It dynamically responds to the capabilities of the machines. A 2-CPU VM obviously isn't going to be able to handle thousands of handshakes per second. What are the exact numbers you are testing, both in terms of connections and CPUs? Based on our perf tests (no .NET included 😄) our perf machines can only handle about 250 connections per core per second, best case. I know .NET does more complicated things on top compared to our perf tests, and with VMs (not baremetal) you're likely to have even lower limits.
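One way to act on that suggestion (a hypothetical adjustment, not something the test does today) would be to derive the parallel-client count from the core count instead of hard-coding it:

```csharp
// Hypothetical scaling of the test load, using the ~250 connections/core/sec
// figure quoted above as a rough budget; the constants here are illustrative.
int parallelClients = Math.Clamp(Environment.ProcessorCount * 25, 8, 100);
```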
100 parallel connections are made at the beginning of the test and the handshake is successfully finished for all of them. We have other combinations of numbers of attempted and accepted connections, but those are the ones that were failing. EDIT: the handshake should finish for all of them, but some are discarded before that.
And we have tests with lower numbers as well. Now we have to somehow figure out where and when to suppress these more demanding tests, as well as others like #83101; see the sketch below.
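In dotnet/runtime, such suppression is usually done with the test attributes the repo already uses; a sketch of what that could look like (whether IsArm64Process is the right condition here is an assumption):

```csharp
// Hypothetical suppression: [ActiveIssue] and [OuterLoop] come from the xunit
// extensions that dotnet/runtime tests already use; the exact PlatformDetection
// condition is illustrative, not a confirmed fix.
[ActiveIssue("https://github.com/dotnet/runtime/issues/82769", typeof(PlatformDetection), nameof(PlatformDetection.IsArm64Process))]
[OuterLoop("Hundreds of parallel handshakes; too demanding for slow machines")]
[Fact]
public Task Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Slow()
{
    // ... existing test body ...
    return Task.CompletedTask;
}
```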
e.g. https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-96bd05a3f91a4661b1/System.Net.Quic.Functional.Tests/3/console.60e17a73.log?helixlogtype=result
Outerloop test. There are about 180 failures in the last 2 weeks (as of 28-02-2023).
Kusto query