Fix TestOnStartWithArgsThenStop and make service tests more reliable #39153
Conversation
@dotnet/dnceng the failure here seems to be a timeout in the result upload process - the tests passed.
Hrm. I believe work item timeout is totally independent of log / result / dump upload; taking a look.
@danmosemsft ah, I see: "upload" in this context means AzDO reporting and XUnit reporting (which is done in the context of the work item), and it indeed seems to have hung inside the Azure Python SDK. I'll create an issue for this on the arcade backlog; if you see a bunch more, please let me know and we can bump the priority. If you're not actively using XUnit result ingestion in Kusto, you could remove usage of the reporter.
Thanks @MattGal
@ViktorHofer can you help answer that?
I know that none of our tools rely on it, but maybe @wfurt's dashboard does?
I also expect that, while this seems to be a bug in the azure-storage-blob Python SDK, this specific failure is going to be pretty rare. Either way, if you actively use and want the Kusto XUnit support, I'd appreciate a quick comment in dotnet/arcade#5786 for when we discuss it at Thursday's triage.
I may not understand the comment properly, @MattGal. I don't think we care about the reporter, but we do use test results from Kusto: for the dashboard, and also for the process @alnikola put together to monitor and analyze test failures so we can stay on top of them. It is also handy for spotting trends and closing old test failures.
If you like Xunit Facts in your Kusto, you like xunit-reporter.py. Some folks on our team are less enthusiastic about it because it can be quite expensive to put that much data into Kusto. If you just care about work items and their exit codes, you don't need the reporter. Ping me on Teams or set up a quick call if you want to go deeper.
ping @Anipik |
Fixes #38945
There were two bugs:
Bug 1: The test service does not mutually synchronize the connect and start messages, but one of the tests expects them to arrive in order. This is the cause of the failure. I experimented with making the start-message write a continuation of the connect-message write, but this problem only occurs for these two messages; all other messages can be handled sequentially, because the test can wait on the matching SCM status. The simplest approach is to recognize that these two messages are unordered and allow either ordering in the test, as sketched below.
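A minimal sketch of that test-side change, assuming the test reads one status byte per message from the pipe (the helper, enum, and class names here are hypothetical, not the real test's API):

```csharp
using System;
using System.Collections.Generic;
using Xunit;

// Hypothetical message ids; the real test defines its own constants.
internal enum PipeMessage : byte { Connected = 1, Start = 2 }

internal static class OrderingSketch
{
    // readByteFromPipe stands in for however the test reads one status
    // byte from the named pipe; its exact shape is an assumption here.
    internal static void AssertConnectedAndStartInAnyOrder(Func<byte> readByteFromPipe)
    {
        // Read both messages first, then assert on the set, so that either
        // ordering -- (Connected, Start) or (Start, Connected) -- passes.
        var received = new HashSet<byte> { readByteFromPipe(), readByteFromPipe() };
        Assert.Contains((byte)PipeMessage.Connected, received);
        Assert.Contains((byte)PipeMessage.Start, received);
    }
}
```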
Bug 2: The test service did not wait for the client to connect before writing to the pipe. The reason: when test services are torn down, the test service installer issues the Stop command via the SCM, which makes the test service write a stop message to a pipe that no longer has a client planning to connect to it. To avoid hanging there, a previous change stopped waiting on the connection entirely, which introduced the flakiness. Fix: wait on the client connection unless the message is a stop message. (Tests that wait on a stop message all wait on some previous message before it, so the test service always has a connected pipe when it writes to such tests.) A sketch of this rule follows.
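A sketch of the service-side write under that rule, assuming the service talks to the test over a NamedPipeServerStream and identifies messages by a single byte (the class name and stop-message id are assumptions):

```csharp
using System.IO.Pipes;

internal static class PipeWriteSketch
{
    // Hypothetical stop id; the real service uses its own constant.
    private const byte StopMessage = 0;

    internal static void WriteMessage(NamedPipeServerStream server, byte message)
    {
        // Wait for the test client before writing, so ordinary messages
        // are never written into a pipe that has no reader yet...
        if (message != StopMessage && !server.IsConnected)
        {
            server.WaitForConnection();
        }

        // ...but never wait for a stop message written during teardown,
        // when the installer stops the service and no client will connect.
        if (server.IsConnected)
        {
            server.WriteByte(message);
        }
    }
}
```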
Using a debugger to figure out what is happening is painful because of the various threads and processes involved; tracing is far more convenient and useful, so I added a bunch of tracing to the test code. With luck this is the last timing problem in these tests, but we've had a series of such problems, so I've left the tracing in for next time, disabled by default.
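A minimal sketch of that kind of opt-in tracing, assuming a simple append-to-file helper (the class name, flag, and log location are all assumptions, not the PR's actual helper):

```csharp
using System;
using System.Diagnostics;
using System.IO;

internal static class TestTraceSketch
{
    // Disabled by default so normal runs pay no cost; flip this locally
    // when diagnosing cross-process timing issues in the service tests.
    private static readonly bool s_enabled = false;

    private static readonly string s_path =
        Path.Combine(Path.GetTempPath(), "servicecontroller-tests.log");

    public static void Trace(string source, string message)
    {
        if (!s_enabled)
            return;

        // Stamp each line with a timestamp and process id so interleaved
        // output from the test process and the test service can be correlated.
        int pid = Process.GetCurrentProcess().Id;
        File.AppendAllText(s_path,
            $"{DateTime.Now:HH:mm:ss.fff} [{pid}] {source}: {message}{Environment.NewLine}");
    }
}
```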
Also added a comment explaining Dispose, whose behavior was confusing.