Fix XUnit Log Fixer Irregularities #85183

ivdiazsa · 2023-04-21T21:21:55Z

Fixes #84940 and #85056. This PR addresses some irregularities with the XUnit Log Fixer that were causing some CI runs to not give the appropriate result.

84940: The log fixer was creating invalid XML because of a problem with the <stack-trace> tag: it did not know that tag names can contain hyphens. The regex responsible was adjusted accordingly, and it now recognizes such tags.
85056: Some CI failures were being reported as passed because of how app exit codes are handled in Helix. The last command run determines that specific test's status. Since the final command is now the log fixer, its exit status overshadowed the test's. In other words, if the test failed but the log fixer ran well, the work item was treated as passed. The log fixer was adjusted to return the test's exit code, rather than one of its own, in order to preserve the status of that specific CI test.
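
To illustrate the hyphen problem described for #84940, here is a minimal Bash sketch. Both patterns are hypothetical stand-ins: the actual regex in XUnitLogChecker.cs is not shown in this thread, and `tag_name` is a helper invented for the example.

```shell
#!/usr/bin/env bash
# Hypothetical patterns for illustration only; not the regex used by XUnitLogChecker.
old_pattern='^<([A-Za-z]+)'     # letters only: matching stops at the first hyphen
new_pattern='^<([A-Za-z-]+)'    # letters and hyphens

# Prints the tag name captured by pattern $1 from text $2 (nothing if no match).
tag_name() {
    local pattern=$1 text=$2
    if [[ $text =~ $pattern ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    fi
}

tag_name "$old_pattern" '<stack-trace>'   # prints "stack"
tag_name "$new_pattern" '<stack-trace>'   # prints "stack-trace"
```

With the letters-only pattern the tag is read as `stack`, so an opening tag and its closing tag can no longer be paired correctly, which is one way a fixer could end up emitting invalid XML.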

…as well for Helix to process it.

ghost · 2023-04-21T21:22:08Z

Tagging subscribers to this area: @hoyosjs
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	ivdiazsa
Assignees:	ivdiazsa
Labels:	`area-Infrastructure-coreclr`
Milestone:	8.0.0

ivdiazsa · 2023-04-21T21:24:54Z

/azp run runtime-coreclr outerloop

azure-pipelines · 2023-04-21T21:25:05Z

Azure Pipelines successfully started running 1 pipeline(s).

ivdiazsa · 2023-04-21T21:25:24Z

/azp run jit-cfg

azure-pipelines · 2023-04-21T21:25:33Z

Azure Pipelines successfully started running 1 pipeline(s).

trylek

LGTM, thanks Ivan for fixing this! I'm not sure about the HW intrinsics failures on arm, they don't seem related to your change but they might be worth retrying or checking other recent PR runs to see whether they're ambient or somehow indirectly caused by the slightly changed test execution logic.

BruceForstall

One comment/question

BruceForstall · 2023-04-21T23:00:22Z

src/tests/Common/XUnitLogChecker/XUnitLogChecker.cs

@@ -97,7 +104,7 @@ static int Main(string[] args)
        {
            Console.WriteLine("[XUnitLogChecker]: An error occurred. No stats csv"
                            + $" was found. The expected name would be '{statsCsvPath}'.");
-            return FAILURE;
+            return TestExitCode;


Is it possible for the test to return success, but the XUnitLogChecker to fail (i.e., places where it previously returned FAILURE before this PR)? In that case, should the CI consider that a failure? Or do you want various XUnitLogChecker failures to be ignored by the CI? If you want XUnitLogChecker failures to be shown as a CI failure, then the return FAILURE cases here should be preserved.

Thanks Bruce, that's a reasonable suggestion. In practice I think most of these special cases only occur when the test crashes halfway through, in which case it also returns an error exit code, but I believe your suggestion is generally good for the robustness of the infra.

This sounds like a reasonable suggestion to me, combined with Juan's in the next comment. I'll implement it.

hoyosjs

Suggestion that may decouple the fixer.

hoyosjs · 2023-04-23T19:16:42Z

src/tests/Common/helixpublishwitharcade.proj

@@ -405,8 +405,8 @@
      <XUnitLogCheckerHelixPath>$(XUnitLogCheckerHelixPath)XUnitLogChecker/</XUnitLogCheckerHelixPath>

      <XUnitLogCheckerArgs>$(_MergedWrapperRunScriptDirectoryRelative) $(_MergedWrapperName)</XUnitLogCheckerArgs>
-      <XUnitLogCheckerArgs Condition="'$(TestWrapperTargetsWindows)' != 'true'">$(XUnitLogCheckerArgs) %24HELIX_DUMP_FOLDER</XUnitLogCheckerArgs>
-      <XUnitLogCheckerArgs Condition="'$(TestWrapperTargetsWindows)' == 'true'">$(XUnitLogCheckerArgs) %25HELIX_DUMP_FOLDER%25</XUnitLogCheckerArgs>
+      <XUnitLogCheckerArgs Condition="'$(TestWrapperTargetsWindows)' != 'true'">$(XUnitLogCheckerArgs) %24%3F %24HELIX_DUMP_FOLDER</XUnitLogCheckerArgs>


Would this be better in the Helix command? As in, save the result rather than passing it in, and then exit with that if non-zero or with the fixer's exit code otherwise?

I agree with this. Something like:

// run test
set _saved_errorlevel=%errorlevel%
// run XunitLogChecker
if %errorlevel% NEQ 0 exit 1
if %_saved_errorlevel% NEQ 0 exit 1
exit 0

?

It's unfortunate that building a script in this helixpublishwitharcade.proj is comically complex.

I like this idea. I just have one question:
Wouldn't it be better to make it like this:

# Run test
test_exit_code=$?
# Run XUnitLogChecker
if [ $? -ne 0 ]; then exit 1; fi
if [ $test_exit_code -ne 0 ]; then exit $test_exit_code; fi
exit 0

I ask because, as far as I understand the test infra, we would rather return the test's exit code on failure than always '1'. This assumes the log fixer finishes successfully, of course.

Makes sense to me.

Another question is whether we should preferentially return the test failing code, which would require saving the XUnitLogChecker code:

# Run test
test_exit_code=$?
# Run XUnitLogChecker
xunitlogchecker_exit_code=$?
# If the test failed, return that exit code
if [ $test_exit_code -ne 0 ]; then exit $test_exit_code; fi
if [ $xunitlogchecker_exit_code -ne 0 ]; then exit $xunitlogchecker_exit_code; fi
exit 0

I think this is the best way to proceed. Thanks for your observations Bruce.
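
The ordering agreed on here (the test's exit code wins, then the checker's, then success) can be tried out with a small self-contained Bash sketch. `run_and_report` is a hypothetical helper written for this illustration; the real logic is generated into the Helix command by helixpublishwitharcade.proj, and the `exit N` strings merely stand in for the test and the XUnitLogChecker.

```shell
#!/usr/bin/env bash
# Hypothetical helper: runs a "test" command and a "checker" command, then
# returns the test's exit code if non-zero, otherwise the checker's.
run_and_report() {
    local test_cmd=$1 checker_cmd=$2
    local test_exit_code xunitlogchecker_exit_code

    bash -c "$test_cmd"
    test_exit_code=$?

    bash -c "$checker_cmd"
    xunitlogchecker_exit_code=$?

    # If the test failed, its exit code takes priority.
    if [ "$test_exit_code" -ne 0 ]; then return "$test_exit_code"; fi
    if [ "$xunitlogchecker_exit_code" -ne 0 ]; then return "$xunitlogchecker_exit_code"; fi
    return 0
}

# Try each combination of (test exit code, checker exit code).
for pair in '3 0' '0 2' '3 2' '0 0'; do
    set -- $pair
    rc=0
    run_and_report "exit $1" "exit $2" || rc=$?
    echo "test=$1 checker=$2 -> exit $rc"
done
# Output:
# test=3 checker=0 -> exit 3
# test=0 checker=2 -> exit 2
# test=3 checker=2 -> exit 3
# test=0 checker=0 -> exit 0
```

The first line of output is the scenario from #85056: a failing test followed by a successful checker run still yields a non-zero exit code.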

tannergooding · 2023-04-25T06:02:35Z

There are some failures due to missing logs from test assemblies that are skipped for a given platform.

There are also some x86/x64 hwintrinsic failures that should all be resolved in #85281.

BruceForstall

I like how it looks now.

One more question:

In #85056, I see:

1259/2363 tests run.
* 1259 tests passed.
* 0 tests failed.
* 0 tests skipped.

If I see those numbers, passed + failed + skipped doesn't add up to tests run. Should that be considered a failure, and if so, should the XUnitLogChecker return an error code in that case? (Perhaps the test already did or should have returned an error code in that case?)

ivdiazsa · 2023-04-25T17:40:28Z

> I like how it looks now.
>
> One more question:
>
> In #85056, I see:
>
> 1259/2363 tests run.
> * 1259 tests passed.
> * 0 tests failed.
> * 0 tests skipped.
>
> If I see those numbers, passed + failed + skipped doesn't add up to tests run. Should that be considered a failure, and if so, should the XUnitLogChecker return an error code in that case? (Perhaps the test already did or should have returned an error code in that case?)

Oh, this might need some addressing, now that you bring it up. The log checker here is just reporting the status up until the crash. When the test process was terminated, 1259 tests had passed, and no others had been skipped or had failed gracefully. Maybe in a subsequent PR we could add a fourth field called "tests missing"? We could have the log checker subtract passed+failed+skipped from the total expected for a more complete report.
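
As a rough sketch of the proposed "tests missing" field, the subtraction could look like this in Bash. `missing_tests` is a hypothetical helper written for this example and is not part of the XUnitLogChecker.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tests that never reported a result.
# Arguments: total expected, passed, failed, skipped.
missing_tests() {
    local total=$1 passed=$2 failed=$3 skipped=$4
    echo $(( total - passed - failed - skipped ))
}

# Numbers from the run quoted above: 1259 of 2363 tests ran before the crash.
missing_tests 2363 1259 0 0   # prints 1104
```

A non-zero result would give the report, and possibly the exit code, a direct signal that the run stopped early.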

tannergooding · 2023-04-25T18:13:48Z

If you pull in latest main, then the x64 hwIntrinsic tests should now pass.

You may still have to handle the other case I called out above. That is, the HardwareIntrinsics_Arm_r tests are skipped on x64 and, inversely, the HardwareIntrinsics_X86_r tests are skipped on Arm64. CI was failing saying logs are missing, when they are intentionally "missing" due to the project being skipped.

ivdiazsa · 2023-04-25T19:31:26Z

> If you pull in latest main, then the x64 hwIntrinsic tests should now pass.
>
> You may still have to handle the other case I called out above. That is, the HardwareIntrinsics_Arm_r tests are skipped on x64 and, inversely, the HardwareIntrinsics_X86_r tests are skipped on Arm64. CI was failing saying logs are missing, when they are intentionally "missing" due to the project being skipped.

Thanks Tanner. I will pull the latest main changes and be on the lookout for the other potential issue.

ivdiazsa · 2023-04-25T20:11:21Z

/azp run runtime-coreclr outerloop

azure-pipelines · 2023-04-25T20:11:34Z

Azure Pipelines successfully started running 1 pipeline(s).

BruceForstall · 2023-04-25T23:09:38Z

@tannergooding The logs now show, e.g.:

C:\h\w\9C720908\w\B4460A48\e>call JIT\HardwareIntrinsics\HardwareIntrinsics_Arm_r\HardwareIntrinsics_Arm_r.cmd -usewatcher 
'JIT\HardwareIntrinsics\HardwareIntrinsics_Arm_r\HardwareIntrinsics_Arm_r.cmd' is not recognized as an internal or external command,
operable program or batch file.

C:\h\w\9C720908\w\B4460A48\e>set test_exit_code=1 

C:\h\w\9C720908\w\B4460A48\e>dotnet C:\h\w\9C720908\p/XUnitLogChecker/XUnitLogChecker.dll JIT\HardwareIntrinsics\HardwareIntrinsics_Arm_r HardwareIntrinsics_Arm_r C:\cores 
[XUnitLogChecker]: No logs were found. This work item was skipped.
[XUnitLogChecker]: If this is a mistake, then something went very wrong. The expected temp log name would be: 'HardwareIntrinsics_Arm_r.tempLog.xml'

C:\h\w\9C720908\w\B4460A48\e>set xunitlogchecker_exit_code=0

So the XUnitLogChecker changes in this PR appear to be working correctly. However, presumably these failures were masked/hidden before, and now they are exposed. It doesn't appear to be the job of the XUnitLogChecker to fix this problem.

How should this get fixed? Someone is trying to execute a .cmd file that doesn't exist, and is actually expected to not exist (in this case).

BruceForstall · 2023-04-25T23:28:05Z

@trylek @markples In the example above, why do we create a Helix job for a merged test .cmd file that doesn't exist?

trylek · 2023-04-25T23:32:18Z

This might be due to a corner case where we don't know in advance that all tests in a particular merged wrapper will end up getting skipped, e.g. due to their incompatibility with GC stress or a given target architecture. Overall I would expect these cases to be rare and not particularly sensitive w.r.t. Helix performance. At the very least we should fix them so that they don't cause such spurious errors; ideally we should avoid sending them to Helix altogether (albeit I doubt it's generally doable in the presence of tests requiring process isolation, as the merged wrapper logic simply has no way to know what the execution script contains).

trylek · 2023-04-25T23:34:33Z

On second thought, the corner case I described shouldn't manifest as a missing merged test execution script (cmd / sh) so I guess there must be something else at work here.

markples · 2023-04-25T23:38:05Z

[edited to hopefully improve readability]

Conjecture - we produce the marker files (.MergedTestAssembly / .MergedTestAssembly.x.y.MergedTestAssemblyForStress) because the build (for managed components) is AnyCPU/etc, but then we produce all .cmd files in separate builds on the actual test targets. Since HardwareIntrinsics_Arm_r is architecture-specific (we have conditional GCStressIncompatible and CLRTestTargetUnsupported on the merged group itself), it does not generate a .cmd, but at this point we're already in the Helix job, which will have nothing more to do.

trylek · 2023-04-25T23:47:50Z

Thanks Mark, that sounds very reasonable. If that's indeed the case, we should be able to easily filter these out somewhere around

runtime/src/tests/Common/helixpublishwitharcade.proj

Line 360 in 2802c97

    
           <_MergedWrapperMarker Include="$(TestBinDir)**\*.MergedTestAssembly" Exclude="$(TestBinDir)**\supportFiles\*.MergedTestAssembly" />

by detecting that we don't have the equivalent execution scripts.

ivdiazsa · 2023-04-26T00:12:08Z

> Thanks Mark, that sounds very reasonable. If that's indeed the case, we should be able to easily filter these out somewhere around
>
> runtime/src/tests/Common/helixpublishwitharcade.proj
>
> Line 360 in 2802c97
>
> <_MergedWrapperMarker Include="$(TestBinDir)**\*.MergedTestAssembly" Exclude="$(TestBinDir)**\supportFiles\*.MergedTestAssembly" />
>
> by detecting that we don't have the equivalent execution scripts.

Am I right in assuming this gets executed at some point after the sh/cmd scripts have been generated? If yes, then it might not be too difficult to address in a follow-up PR.

BruceForstall · 2023-04-26T00:13:17Z

I think the failure of missing .cmd files needs to be fixed before this PR can merge, or else there will be lots of Pri-0/outerloop test failures.

ivdiazsa · 2023-04-26T00:13:33Z

On second thought, it might not be that good if merging this would break all those legs while we fix that other problem. What are your thoughts everyone?

ivdiazsa · 2023-04-26T00:14:33Z

> I think the failure of missing .cmd files needs to be fixed before this PR can merge, or else there will be lots of Pri-0/outerloop test failures.

Sounds reasonable to me. Let me know if you or someone in your team will do that Bruce. If not, then I can take on that work item.

trylek · 2023-04-26T00:21:19Z

My initial impression is that this should be a one-line change in the script helixpublishwitharcade.proj, basically going over the _MergedWrapperMarker item group and removing those items that don't have the matching cmd / sh in the same folder. It's getting late around here but I think I should be able to put up the PR for the fix tomorrow.
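
The filtering idea described here, dropping marker files that have no matching execution script, can be sketched in Bash for illustration. This is not the MSBuild change itself, which would live in helixpublishwitharcade.proj. The sketch assumes, without confirmation from this thread, that a marker named `<name>.MergedTestAssembly` sits in the same folder as `<name>.sh` or `<name>.cmd`; `markers_with_scripts` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lists marker files under $1 that have a matching
# .sh or .cmd execution script next to them (naming scheme is an assumption).
markers_with_scripts() {
    local root=$1 marker base
    find "$root" -name '*.MergedTestAssembly' ! -path '*/supportFiles/*' | sort |
    while IFS= read -r marker; do
        base=${marker%.MergedTestAssembly}
        if [ -f "$base.sh" ] || [ -f "$base.cmd" ]; then
            printf '%s\n' "$marker"
        fi
    done
}

# Demo in a throwaway directory: one wrapper with a script, one without.
demo=$(mktemp -d)
mkdir -p "$demo/Regression_1" "$demo/HardwareIntrinsics_Arm_r"
touch "$demo/Regression_1/Regression_1.MergedTestAssembly" \
      "$demo/Regression_1/Regression_1.sh" \
      "$demo/HardwareIntrinsics_Arm_r/HardwareIntrinsics_Arm_r.MergedTestAssembly"

markers_with_scripts "$demo"   # lists only the Regression_1 marker
rm -rf "$demo"
```

In the demo, the HardwareIntrinsics_Arm_r marker is filtered out because no script was generated for it, which corresponds to the skipped-on-this-architecture case discussed above.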

markples · 2023-04-26T00:25:17Z

I wrote "but then we produce all .cmd files in separate builds on the actual test targets", but I can't find when this actually happens. In run-test-job, there are 3 script: sections before sending to helix. "Generate test wrappers" is the obvious one, but I only see evidence of the old (not merged group) wrappers. I don't see (for example) the .cmd for JIT\Regression_1 in any of those logs or in the helix log. I also don't see the .cmd in the managed build (AnyCPU/AnyOS), as expected.

(Please don't let me distract you from an actual fix here, which I think you'll be able to do much faster than me, as I'm not completely following this part of the discussion. Thanks for your help!)

trylek · 2023-04-26T00:33:31Z

I believe that from the Helix viewpoint the merged test wrappers are just ordinary runtime tests so that the logic for generating their own cmd / sh execution scripts should mostly follow the normal logic around CLRTest.Execute.Bash/Batch.targets. In general we're indeed producing all test execution scripts in the test run job (IIRC it happens as part of copying the native test components to the test output tree) but that doesn't automatically mean that the merged test wrapper script is aware of all the details in the component scripts for out-of-process tests.

ivdiazsa · 2023-04-26T18:31:14Z

> My initial impression is that this should be a one-line change in the script helixpublishwitharcade.proj, basically going over the _MergedWrapperMarker item group and removing those items that don't have the matching cmd / sh in the same folder. It's getting late around here but I think I should be able to put up the PR for the fix tomorrow.

Thanks a lot for your help Tomas! Let me know if there's anything you need my help with.

BruceForstall · 2023-04-27T22:53:03Z

@ivdiazsa Can you merge in #85476 and re-test this?

ivdiazsa · 2023-04-27T22:58:24Z

/azp run runtime-coreclr outerloop

azure-pipelines · 2023-04-27T22:58:35Z

Azure Pipelines successfully started running 1 pipeline(s).

BruceForstall · 2023-04-28T17:22:57Z

Test run looks (basically) clean

ivdiazsa added 3 commits April 19, 2023 16:15

Fixed the log tags regular expressions.

5bcc000

Added test exit code to the XUnit Log Checker, and made it return it …

ff08687

…as well for Helix to process it.

Finished fixes.

dbd3fe5

ivdiazsa added the area-Infrastructure-coreclr label Apr 21, 2023

ivdiazsa added this to the 8.0.0 milestone Apr 21, 2023

ivdiazsa requested review from jkoritzinsky, BruceForstall and trylek April 21, 2023 21:21

ivdiazsa self-assigned this Apr 21, 2023

ivdiazsa linked an issue Apr 21, 2023 that may be closed by this pull request

Test failures not reported in test run #85056

Closed

dotnet deleted a comment from azure-pipelines bot Apr 21, 2023

jkoritzinsky approved these changes Apr 21, 2023

trylek approved these changes Apr 21, 2023

BruceForstall approved these changes Apr 21, 2023

hoyosjs reviewed Apr 23, 2023

Decoupled the test's exit code from the XUnit Log Checker.

172da68

build-analysis bot mentioned this pull request Apr 24, 2023

Checkout failure: "Git fetch failed with exit code 128" dotnet/arcade#9009

Open

2 tasks

BruceForstall approved these changes Apr 25, 2023

This was referenced Apr 25, 2023

Expose various Convert intrinsics for Avx512F, Avx512BW, and Avx512DQ #85281

Merged

Fix test execution jobs in the coreclr-release-outerloop-nightly pipeline #85278

Merged

dotnet deleted a comment Apr 25, 2023

Merge branch 'main' into xml-in-check

2419b04

trylek mentioned this pull request Apr 27, 2023

Ignore merged test markers for tests without execution script #85441

Closed

Merge branch 'main' into xml-in-check

ba7fb10

build-analysis bot mentioned this pull request Apr 28, 2023

Tracking issue for CI build timeouts #76454

Closed

ivdiazsa merged commit 02b8071 into dotnet:main Apr 28, 2023

BruceForstall mentioned this pull request May 2, 2023

CI tests not running all commands #85621

Closed

ghost locked as resolved and limited conversation to collaborators May 29, 2023
