Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix XUnit Log Fixer Irregularities #85183

Merged
merged 6 commits into from
Apr 28, 2023
Merged

Fix XUnit Log Fixer Irregularities #85183

merged 6 commits into from
Apr 28, 2023

Conversation

ivdiazsa
Copy link
Contributor

Fixes #84940 and #85056. This PR addresses some irregularities with the XUnit Log Fixer that were causing some CI runs to not give the appropriate result.

  • 84940: The log fixer was creating invalid XML's because of a problem with the <stack-trace> tag. It did not know tags could also have hyphens as part of their label. The regex in charge of this was adjusted accordingly, and now it recognizes all of them seamlessly.
  • 85056: Some CI failures were being reported as passed because of how app exit return codes are handled in Helix. The last run command is the one that determines that specific test's status. Since now the final command is the log fixer, then this overshadowed the exit status of the test. In other words, if the test failed but the log fixer ran well, then it was treated as passed. The log fixer was adjusted to return the test's exit code, rather than one of its own, in order to preserve the status of that specific CI test.

@ivdiazsa ivdiazsa added this to the 8.0.0 milestone Apr 21, 2023
@ivdiazsa ivdiazsa self-assigned this Apr 21, 2023
@ghost
Copy link

ghost commented Apr 21, 2023

Tagging subscribers to this area: @hoyosjs
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #84940 and #85056. This PR addresses some irregularities with the XUnit Log Fixer that were causing some CI runs to not give the appropriate result.

  • 84940: The log fixer was creating invalid XML's because of a problem with the <stack-trace> tag. It did not know tags could also have hyphens as part of their label. The regex in charge of this was adjusted accordingly, and now it recognizes all of them seamlessly.
  • 85056: Some CI failures were being reported as passed because of how app exit return codes are handled in Helix. The last run command is the one that determines that specific test's status. Since now the final command is the log fixer, then this overshadowed the exit status of the test. In other words, if the test failed but the log fixer ran well, then it was treated as passed. The log fixer was adjusted to return the test's exit code, rather than one of its own, in order to preserve the status of that specific CI test.
Author: ivdiazsa
Assignees: ivdiazsa
Labels:

area-Infrastructure-coreclr

Milestone: 8.0.0

@ivdiazsa ivdiazsa linked an issue Apr 21, 2023 that may be closed by this pull request
@dotnet dotnet deleted a comment from azure-pipelines bot Apr 21, 2023
@ivdiazsa
Copy link
Contributor Author

/azp run runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ivdiazsa
Copy link
Contributor Author

/azp run jit-cfg

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@trylek trylek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Ivan for fixing this! I'm not sure about the HW intrinsics failures on arm, they don't seem related to your change but they might be worth retrying or checking other recent PR runs to see whether they're ambient or somehow indirectly caused by the slightly changed test execution logic.

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment/question

@@ -97,7 +104,7 @@ static int Main(string[] args)
{
Console.WriteLine("[XUnitLogChecker]: An error occurred. No stats csv"
+ $" was found. The expected name would be '{statsCsvPath}'.");
return FAILURE;
return TestExitCode;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for the test to return success, but the XUnitLogChecker to fail (i.e., places where it previously returned FAILURE before this PR)? In that case, should the CI consider that a failure? Or do you want various XUnitLogChecker failures to be ignored by the CI? If you want XUnitLogChecker failures to be shown as a CI failure, then the return FAILURE cases here should be preserved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Bruce, that's a reasonable suggestion; in practice I think most of these special cases only occur when the test crashes halfway through and then it also returns an error exit code but I believe your suggestion is a general goodness for robustness of the infra.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like a reasonable suggestion to me, combined with Juan's in the next comment. I'll implement it.

Copy link
Member

@hoyosjs hoyosjs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion that may decouple the fixer.

@@ -405,8 +405,8 @@
<XUnitLogCheckerHelixPath>$(XUnitLogCheckerHelixPath)XUnitLogChecker/</XUnitLogCheckerHelixPath>

<XUnitLogCheckerArgs>$(_MergedWrapperRunScriptDirectoryRelative) $(_MergedWrapperName)</XUnitLogCheckerArgs>
<XUnitLogCheckerArgs Condition="'$(TestWrapperTargetsWindows)' != 'true'">$(XUnitLogCheckerArgs) %24HELIX_DUMP_FOLDER</XUnitLogCheckerArgs>
<XUnitLogCheckerArgs Condition="'$(TestWrapperTargetsWindows)' == 'true'">$(XUnitLogCheckerArgs) %25HELIX_DUMP_FOLDER%25</XUnitLogCheckerArgs>
<XUnitLogCheckerArgs Condition="'$(TestWrapperTargetsWindows)' != 'true'">$(XUnitLogCheckerArgs) %24%3F %24HELIX_DUMP_FOLDER</XUnitLogCheckerArgs>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this better be in the helix command? As in save the result, not pass it in, and then exit with that if non-zero or with the fixer's exit otherwise?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this. Something like:

// run test
set _saved_errorlevel=%errorlevel%
// run XunitLogChecker
if %errorlevel% NEQ 0 exit 1
if %_saved_errorlevel% NEQ 0 exit 1
exit 0

?

It's unfortunate that building a script in this helixpublishwitharcade.proj is comically complex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea. I just have one question:
Wouldn't it be better to make it like this:

# Run test
test_exit_code=$?
# Run XUnitLogChecker

if [ $? -ne 0 ]; then exit 1; fi
if [ $test_exit_code -ne 0 ]; then exit $test_exit_code; fi
exit 0

I ask because as far as I understood the test infra, we would rather return the test's exit code on failure, rather than always '1'. Well, this assuming the log fixer finishes successfully of course.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me.

Another question is whether we should preferentially return the test failing code, which would require saving the XUnitLogChecker code:

# Run test
test_exit_code=$?
# Run XUnitLogChecker
xunitlogchecker_exit_code=$?
# If the test failed, return that exit code
if [ $test_exit_code -ne 0 ]; then exit $test_exit_code; fi
if [ $xunitlogchecker_exit_code -ne 0 ]; then exit $xunitlogchecker_exit_code; fi
exit 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the best way to proceed. Thanks for your observations Bruce.

@tannergooding
Copy link
Member

There's some failures due to missing logs from test assemblies that are skipped for a given platform.

There's also some x86/x64 hwintrinsic failures that should all be resolved in #85281.

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how it looks now.

One more question:

In #85056, I see:

1259/2363 tests run.
* 1259 tests passed.
* 0 tests failed.
* 0 tests skipped.

If I see those numbers, passed + failed + skipped doesn't add up to tests run. Should that be considered a failure, and if so, should the XUnitLogChecker return an error code in that case? (Perhaps the test already did or should have returned an error code in that case?)

@ivdiazsa
Copy link
Contributor Author

I like how it looks now.

One more question:

In #85056, I see:

1259/2363 tests run.
* 1259 tests passed.
* 0 tests failed.
* 0 tests skipped.

If I see those numbers, passed + failed + skipped doesn't add up to tests run. Should that be considered a failure, and if so, should the XUnitLogChecker return an error code in that case? (Perhaps the test already did or should have returned an error code in that case?)

Oh this might need some addressing, considering you're bringing up the question. The log checker here is just reporting the status until the crash. When the test process was terminated, 1259 tests had passed, and no others had been skipped or failed gracefully. Maybe we could add in a subsequent PR a fourth field called "tests missing"? We could have the log checker subtract passed+failed+skipped from the total expected for a more complete report.

@tannergooding
Copy link
Member

If you pull in latest main, then the x64 hwIntrinsic tests should now pass.

You may still have to handle the other case I called out above of. That is, the HardwareIntrinsics_Arm_r tests are skipped on x64 and inversely the HardwareIntrinsics_X86_r tests are skipped on Arm64. CI was failing saying logs are missing, when they are intentionally "missing" due to the project being skipped.

@dotnet dotnet deleted a comment Apr 25, 2023
@ivdiazsa
Copy link
Contributor Author

If you pull in latest main, then the x64 hwIntrinsic tests should now pass.

You may still have to handle the other case I called out above of. That is, the HardwareIntrinsics_Arm_r tests are skipped on x64 and inversely the HardwareIntrinsics_X86_r tests are skipped on Arm64. CI was failing saying logs are missing, when they are intentionally "missing" due to the project being skipped.

Thanks Tanner. I will pull the latest main changes and be on the lookout for the other potential issue.

@ivdiazsa
Copy link
Contributor Author

/azp run runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@BruceForstall
Copy link
Member

BruceForstall commented Apr 25, 2023

@tannergooding The logs are now seeing, e.g.:

C:\h\w\9C720908\w\B4460A48\e>call JIT\HardwareIntrinsics\HardwareIntrinsics_Arm_r\HardwareIntrinsics_Arm_r.cmd -usewatcher 
'JIT\HardwareIntrinsics\HardwareIntrinsics_Arm_r\HardwareIntrinsics_Arm_r.cmd' is not recognized as an internal or external command,
operable program or batch file.

C:\h\w\9C720908\w\B4460A48\e>set test_exit_code=1 

C:\h\w\9C720908\w\B4460A48\e>dotnet C:\h\w\9C720908\p/XUnitLogChecker/XUnitLogChecker.dll JIT\HardwareIntrinsics\HardwareIntrinsics_Arm_r HardwareIntrinsics_Arm_r C:\cores 
[XUnitLogChecker]: No logs were found. This work item was skipped.
[XUnitLogChecker]: If this is a mistake, then something went very wrong. The expected temp log name would be: 'HardwareIntrinsics_Arm_r.tempLog.xml'

C:\h\w\9C720908\w\B4460A48\e>set xunitlogchecker_exit_code=0 

So the XUnitLogChecker changes in this PR appear to be working correctly. However, presumably these failures were masked/hidden before, and now they are exposed. It doesn't appear to be the job of the XUnitLogChecker to fix this problem.

How should this get fixed? Someone is trying to execute a .cmd file that doesn't exist, and is actually expected to not exist (in this case).

@BruceForstall
Copy link
Member

@trylek @markples In the example above, why do we create a Helix job for a merged test .cmd file that doesn't exist?

@trylek
Copy link
Member

trylek commented Apr 25, 2023

This might be due to a corner case where we don't know in advance that all tests in a particular merged wrapped will end up getting skipped e.g. due to their incompatiblity with GC stress or a given targeting architecture. Overall I would expect these cases to be rare and not particularly sensitive w.r.t. Helix performance. At the very least we should fix them so that they don't cause such spurious errors, ideally we should avoid sending them to Helix altogether (albeit I doubt it's generally doable in the presence of tests requiring process isolation as the merged wrapper logic simply has no way to know what the execution script contains).

@trylek
Copy link
Member

trylek commented Apr 25, 2023

On second thought, the corner case I described shouldn't manifest as a missing merged test execution script (cmd / sh) so I guess there must be something else at work here.

@markples
Copy link
Contributor

markples commented Apr 25, 2023

[edited to hopefully improve readability]

Conjecture - we produce the marker files (.MergedTestAssembly / .MergedTestAsembly.x.y.MergedTestAssemblyForStress) because the build (for managed components) is AnyCPU/etc, but then we produce all .cmd files in separate builds on the actual test targets. Since HardwareIntrinsics_Arm_r is architecture-specific (we have conditional GCStressIncompatible and CLRTestTargetUnsupported on the merged group itself) it does not generate a .cmd but at this point we're already in the helix job that will have nothing more to do.

@trylek
Copy link
Member

trylek commented Apr 25, 2023

Thanks Mark, that sounds very reasonable. If that's indeed the case, we should be able to easily filter these out somewhere around

<_MergedWrapperMarker Include="$(TestBinDir)**\*.MergedTestAssembly" Exclude="$(TestBinDir)**\supportFiles\*.MergedTestAssembly" />

by detecting that we don't have the equivalent execution scripts.

@ivdiazsa
Copy link
Contributor Author

Thanks Mark, that sounds very reasonable. If that's indeed the case, we should be able to easily filter these out somewhere around

<_MergedWrapperMarker Include="$(TestBinDir)**\*.MergedTestAssembly" Exclude="$(TestBinDir)**\supportFiles\*.MergedTestAssembly" />

by detecting that we don't have the equivalent execution scripts.

Am I right in assuming this gets executed at some point after the sh/cmd scripts have been generated? If yes, then it might not be too difficult to address in a follow-up PR.

@BruceForstall
Copy link
Member

I think the failure of missing .cmd files needs to be fixed before this PR can merge, or else there will be lots of Pri-0/outerloop test failures.

@ivdiazsa
Copy link
Contributor Author

On second thought, it might not be that good if merging this would break all those legs while we fix that other problem. What are your thoughts everyone?

@ivdiazsa
Copy link
Contributor Author

I think the failure of missing .cmd files needs to be fixed before this PR can merge, or else there will be lots of Pri-0/outerloop test failures.

Sounds reasonable to me. Let me know if you or someone in your team will do that Bruce. If not, then I can take on that work item.

@trylek
Copy link
Member

trylek commented Apr 26, 2023

My initial impression is that this should be a one-line change in the script helixpublishwitharcade.proj, basically going over the _MergedWrapperMarker item group and removing those items that don't have the matching cmd / sh in the same folder. It's getting late around here but I think I should be able to put up the PR for the fix tomorrow.

@markples
Copy link
Contributor

I wrote "but then we produce all .cmd files in separate builds on the actual test targets", but I can't find when this actually happens. In run-test-job, there are 3 script: sections before sending to helix. "Generate test wrappers" is the obvious one, but I only see evidence of the old (not merged group) wrappers. I don't see (for example) the .cmd for JIT\Regression_1 in any of those logs or in the helix log. I also don't see the .cmd in the managed build (AnyCPU/AnyOS), as expected.

(Please don't let me distract you from an actual fix here, which I think you'll be able to do much faster than me, as I'm not completely following this part of the discussion. Thanks for your help!)

@trylek
Copy link
Member

trylek commented Apr 26, 2023

I believe that from the Helix viewpoint the merged test wrappers are just ordinary runtime tests so that the logic for generating their own cmd / sh execution scripts should mostly follow the normal logic around CLRTest.Execute.Bash/Batch.targets. In general we're indeed producing all test execution scripts in the test run job (IIRC it happens as part of copying the native test components to the test output tree) but that doesn't automatically mean that the merged test wrapper script is aware of all the details in the component scripts for out-of-process tests.

@ivdiazsa
Copy link
Contributor Author

My initial impression is that this should be a one-line change in the script helixpublishwitharcade.proj, basically going over the _MergedWrapperMarker item group and removing those items that don't have the matching cmd / sh in the same folder. It's getting late around here but I think I should be able to put up the PR for the fix tomorrow.

Thanks a lot for your help Tomas! Let me know if there's anything you need my help with.

@BruceForstall
Copy link
Member

@ivdiazsa Can you merge in #85476 and re-test this?

@ivdiazsa
Copy link
Contributor Author

/azp run runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@BruceForstall
Copy link
Member

Test run looks (basically) clean

@ivdiazsa ivdiazsa merged commit 02b8071 into dotnet:main Apr 28, 2023
@ghost ghost locked as resolved and limited conversation to collaborators May 29, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
7 participants