Fix TestOnStartWithArgsThenStop and make service tests more reliable #39153
Conversation
@dotnet/dnceng the failure here seems to be a timeout in the result upload process - the tests passed.
Hrm. I believe work item timeout is totally independent of log / result / dump upload; taking a look.
@danmosemsft ah, I see: "upload" in this context means AzDO reporting and XUnit reporting (which is done in the context of the work item), and it indeed seems to have hung inside the Azure Python SDK. I'll create an issue for this on the arcade backlog; if you see a bunch more, please let me know and we can bump the priority. If you're not actively using XUnit result ingestion in Kusto, you could remove usage of the reporter.
Thanks @MattGal
@ViktorHofer can you help answer that?
I know that none of our tools rely on it, but maybe @wfurt's dashboard does?
I also expect that, while this seems to be a bug in the azure-storage-blob Python SDK, this specific failure is going to be pretty rare. Either way, if you actively use and want the Kusto XUnit support, I'd appreciate a quick comment in dotnet/arcade#5786 for when we discuss it at Thursday's triage.
I may not understand the comment properly, @MattGal. I don't think we care about the reporter, but we do use test results from Kusto: for the dashboard, and also for the process @alnikola put together to monitor and analyze test failures so we can stay on top of them. It is also handy for spotting trends and closing old test failures.
If you like Xunit Facts in your Kusto, you like xunit-reporter.py. Some folks on our team are less enthusiastic about it because it can be quite expensive to put that much data into Kusto. If you just care about work items and their exit codes, you don't need the reporter. Ping me on Teams or set up a quick call if you want to go deeper.
ping @Anipik |
Fixes #38945
There were two bugs:
Bug 1: The test service does not mutually synchronize the connect and start messages, but one of the tests expects them to arrive in order. This is the cause of the failure. I experimented with making the start-message write a continuation of the connect-message write, but this problem only occurs for these two messages; all other messages can be handled sequentially, because the test can wait on the matching SCM status. The simplest approach is to recognize that these two messages are unordered and allow either ordering in the test, as sketched below.
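A minimal sketch of that test-side change, assuming the test reads one status byte per message from the pipe (the helper, enum, and class names here are hypothetical, not the real test's API):

```csharp
using System;
using System.Collections.Generic;
using Xunit;

// Hypothetical message ids; the real test defines its own constants.
internal enum PipeMessage : byte { Connected = 1, Start = 2 }

internal static class OrderingSketch
{
    // readByteFromPipe stands in for however the test reads one status
    // byte from the named pipe; its exact shape is an assumption here.
    internal static void AssertConnectedAndStartInAnyOrder(Func<byte> readByteFromPipe)
    {
        // Read both messages first, then assert on the set, so that either
        // ordering -- (Connected, Start) or (Start, Connected) -- passes.
        var received = new HashSet<byte> { readByteFromPipe(), readByteFromPipe() };
        Assert.Contains((byte)PipeMessage.Connected, received);
        Assert.Contains((byte)PipeMessage.Start, received);
    }
}
```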
Bug 2: The test service did not wait for the client to connect before writing to the pipe. The reason: when test services are torn down, the test service installer issues the Stop command via the SCM, which makes the test service write a stop message to a pipe that no longer has a client planning to connect to it. To avoid hanging there, a previous change stopped waiting on the connection entirely, which introduced the flakiness. Fix: wait on the client connection unless the message is a stop message. (Tests that wait on a stop message all wait on some previous message before it, so the test service always has a connected pipe when it writes to such tests.) A sketch of this rule follows.
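A sketch of the service-side write under that rule, assuming the service talks to the test over a NamedPipeServerStream and identifies messages by a single byte (the class name and stop-message id are assumptions):

```csharp
using System.IO.Pipes;

internal static class PipeWriteSketch
{
    // Hypothetical stop id; the real service uses its own constant.
    private const byte StopMessage = 0;

    internal static void WriteMessage(NamedPipeServerStream server, byte message)
    {
        // Wait for the test client before writing, so ordinary messages
        // are never written into a pipe that has no reader yet...
        if (message != StopMessage && !server.IsConnected)
        {
            server.WaitForConnection();
        }

        // ...but never wait for a stop message written during teardown,
        // when the installer stops the service and no client will connect.
        if (server.IsConnected)
        {
            server.WriteByte(message);
        }
    }
}
```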
Using a debugger to figure out what is happening is painful because of the various threads and processes involved; tracing is far more convenient and useful, so I added a bunch of tracing to the test code. With luck this is the last timing problem in these tests, but we've had a series of such problems, so I've left the tracing in for next time, disabled by default.
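A minimal sketch of that kind of opt-in tracing, assuming a simple append-to-file helper (the class name, flag, and log location are all assumptions, not the PR's actual helper):

```csharp
using System;
using System.Diagnostics;
using System.IO;

internal static class TestTraceSketch
{
    // Disabled by default so normal runs pay no cost; flip this locally
    // when diagnosing cross-process timing issues in the service tests.
    private static readonly bool s_enabled = false;

    private static readonly string s_path =
        Path.Combine(Path.GetTempPath(), "servicecontroller-tests.log");

    public static void Trace(string source, string message)
    {
        if (!s_enabled)
            return;

        // Stamp each line with a timestamp and process id so interleaved
        // output from the test process and the test service can be correlated.
        int pid = Process.GetCurrentProcess().Id;
        File.AppendAllText(s_path,
            $"{DateTime.Now:HH:mm:ss.fff} [{pid}] {source}: {message}{Environment.NewLine}");
    }
}
```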
Also added a comment explaining Dispose, whose behavior was confusing.