Skip to content

8346836: C2: Verify CastII/CastLL bounds at runtime #22880

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 17 commits into from

Conversation

merykitty
Copy link
Member

@merykitty merykitty commented Dec 25, 2024

Hi,

This patch adds a develop flag VerifyConstraintCasts, which will verify the correctness of CastIINodes and CastLLNodes at runtime and crash the VM if the dynamic value lies outside the type value range.

Please take a look, thanks a lot.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8346836: C2: Verify CastII/CastLL bounds at runtime (Enhancement - P4)

Reviewers

Contributors

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/22880/head:pull/22880
$ git checkout pull/22880

Update a local copy of the PR:
$ git checkout pull/22880
$ git pull https://git.openjdk.org/jdk.git pull/22880/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 22880

View PR using the GUI difftool:
$ git pr show -t 22880

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/22880.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Dec 25, 2024

👋 Welcome back qamai! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Dec 25, 2024

@merykitty This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8346836: C2: Verify CastII/CastLL bounds at runtime

Co-authored-by: Vladimir Ivanov <[email protected]>
Reviewed-by: vlivanov, epeter

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 122 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 25, 2024
@openjdk
Copy link

openjdk bot commented Dec 25, 2024

@merykitty The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge
Copy link

mlbridge bot commented Dec 25, 2024

@merykitty
Copy link
Member Author

merykitty commented Dec 25, 2024

Running tier 1 tests with this flag and -XX:+StressGCM reveals several failures. One example is compiler.arraycopy.TestArrayCopyConjoint. In the method testByte(int, int, int), we have the final graph before matching:

image

228 CastII is a cast on P0 has its type being int[0, 32], it depends on 2 dominating ifs:

  • 100 If, which corresponds to (P0 < 0) == false
  • 200 If, which corresponds to (ConvI2L(P0) u<= 32L) == true

As a result, 228 CastII should have its control flow being the IfTrueNode projection of 200 If. However, it is incorrectly wired to the IfTrueNode projection of 223 If (P0 != 0 == true), which leads to the verification failure.

Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks interesting :)

Would we have caught any bug with this, do you have an example?

If I remember correctly, we relax/widen the Cast ranges somewhere later in optimizations, so that different CastII etc can common. Probably happens after loop-opts. So the ranges usually go from [1..10] -> [0, max] or [-1 .. 1] -> int. So this verification would then not be super effective, right? Things might have gone wrong much earlier with bad assumptions. I mean it could still catch issues, but I'm not sure how likely that is?

TLDR: I'd like some more context / motivation for this patch ;)

And: you should have at least one plain test where you enable the flag, and it compiles everything required to run an empty main function.

@eme64
Copy link
Contributor

eme64 commented Jan 8, 2025

Ah, I actually see that you have some examples. So you plan on introducing this flag first, and only then fixing the issues? But does it fail with a simple java --version? Or an empty main method, maybe with -Xcomp?

@merykitty
Copy link
Member Author

merykitty commented Jan 8, 2025

@eme64 Thanks for looking at this

The context is that while reviewing #22666 I came to the conclusion that our handling of depends_only_on_test is broken. I have added a comment explaining my understanding and concerns there. In principle, before the execution, a DivINode is the same as a CastIINode which limits the value range of the divisor to != 0. As a result, there should not be any difference in the way we handle the movements of these nodes. This leads me to the conclusion that CastIINodes may also be wired to the wrong control input, the reason we have not caught them is that unlike a division complaining loudly, a CastIINode will silently accept incorrect input values. This motivates me to make this patch.

If I remember correctly, we relax/widen the Cast ranges somewhere later in optimizations, so that different CastII etc can common. Probably happens after loop-opts. So the ranges usually go from [1..10] -> [0, max] or [-1 .. 1] -> int.

You are right, it is in ConstraintCastNode::widen_type for which I will disable that widening in the presence of VerifyConstraintCasts.

So you plan on introducing this flag first, and only then fixing the issues?

There are several failures in tier 1 alone, and this flag is not enabled by default or in the pipeline, so I think incorporating it first would be preferable, then after fixing all the issues we can add it to the stress options.

But does it fail with a simple java --version? Or an empty main method, maybe with -Xcomp?

No it does not fail with --version or with an empty main method with and without -Xcomp.

Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea a lot. I'm personally on the fence if it is ok to integrate a flag that does not yet pass in the tests. The risk is that nobody fixes it.

What do you think about that @TobiHartmann @vnkozlov ?

Did you run testing with and without the flag? Cannot see the link posted on JIRA.

@eme64
Copy link
Contributor

eme64 commented Jan 17, 2025

Talked with @TobiHartmann , he thinks it is ok to integrate the flags, even if it has failures. Let's go ahead with it then.

It would still be nice to have 2 modes: one where we do the widening, and one without. Both may catch different things, right? I imagine it like this: VerifyConstraintCasts=1 -> do widening. VerifyConstraintCasts=2 -> disable widening.

Plus, can you do this?

And: you should have at least one plain test where you enable the flag, and it compiles everything required to run an empty main function.

Or does that already trigger failures? It would be nice just so we have some basic testing, and see that the flag is not completely broken.

@eme64
Copy link
Contributor

eme64 commented Jan 17, 2025

What about implementing the same for aarch64? That would increase our coverage eventually.

@vnkozlov
Copy link
Contributor

We can add this flag to our stress testing sets of flags to make sure we run with it during our regular testing.

@merykitty
Copy link
Member Author

merykitty commented Jan 19, 2025

@eme64 Thanks for your reviews, I have added 2 test cases for TestIterativeGVN that set VerifyConstraintCasts, the name of the test may need to change but I have not been able to come up with anything preferable.

For aarch64, I don't have an aarch64 machine around so it would be not so trivial.

I like this idea a lot. I'm personally on the fence if it is ok to integrate a flag that does not yet pass in the tests. The risk is that nobody fixes it.

I am working on JDK-8347365 which can hopefully solve some of the issues we are having here.

Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more comments :)

Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay.

Looks great, especially with the better comments! 👏

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Feb 4, 2025
Copy link
Contributor

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add AArch64 implementation too to cover our platforms?

@merykitty
Copy link
Member Author

@vnkozlov I don't have an AArch64 machine so I feel less confident writing one. We can add an AArch64 implementation later, though. What do you think?

Copy link
Contributor

@iwanowww iwanowww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

movl(rax, dst);
movl(rcx, type->_lo);
movl(rdx, type->_hi);
hlt(); // hlt so we have the stack trace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting. Sounds like a problem in NativeStackPrinter::print_stack().

Speaking of debugging output, a call into a local helper function (encapsulating pretty printing logic) followed by a hlt call will do the job. But, considering the usages are in-line and quite common I suggest to make it conditional (guarded by a flag). It is possible to recover all 3 values from generated code if needed and turn on error reporting (specify the diagnostic flag) when reproducing failures.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made a helper function that will print the error message together with the parameters, the stack printing is still problematic, though.

@vnkozlov
Copy link
Contributor

vnkozlov commented Feb 7, 2025

@vnkozlov I don't have an AArch64 machine so I feel less confident writing one. We can add an AArch64 implementation later, though. What do you think?

Okay, later is fine.

Copy link
Contributor

@iwanowww iwanowww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@iwanowww
Copy link
Contributor

iwanowww commented Apr 8, 2025

Speaking of bug synopsis, can you make it a bit more concrete and succinct?

How about "C2: Verify CastII/CastLL bounds at runtime"?

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Apr 8, 2025
@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Apr 9, 2025
@merykitty merykitty changed the title 8346836: C2: Introduce a way to verify the correctness of ConstraintCastNodes at runtime 8346836: C2: Verify CastII/CastLL bounds at runtime Apr 9, 2025
@merykitty
Copy link
Member Author

@iwanowww Thanks a lot for the reviews, I have updated according to your suggestions.

@iwanowww
Copy link
Contributor

FTR here's AArch64 support:
7ed34d0

Feel free to incorporate it in this PR or I'll upstream it separately..

@merykitty
Copy link
Member Author

@iwanowww Thanks a lot for your help, I have incorporated your patches.

/contributor add vlivanov

@openjdk
Copy link

openjdk bot commented Apr 22, 2025

@merykitty
Contributor Vladimir Ivanov <[email protected]> successfully added.

Copy link
Contributor

@iwanowww iwanowww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks.

jint hi = t->_hi;

if (lo != min_jint && hi != max_jint) {
subsw(rtmp, rval, lo);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out it's equivalent to cmpw(rval, lo) which is clearer IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is, cmpw(rval, lo) is equivalent to subsw(zr, rval, lo). However, if lo does not fit into an immediate instruction, MacroAssembler::subsw, which calls into wrap_adds_subs_imm_insn, will use Rd as a temporary register to store lo, this is invalid if Rd is zr. Am I understanding it right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. Completely forgot that there's only 12 bits available for the immediate (Assembler::operand_valid_for_add_sub_immediate()).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely forgot what I was thinking about when writing that code :-)

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Apr 23, 2025
Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this looks even much better with the improved printing on failure now!

Just out of curiosity: Is the whole reconstruct_frame_pointer mechanism general enough so that we could use it in other places as well? It is not super important to me any more, but I've wanted to have something like this for VerifyAlignVector already :)

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Apr 24, 2025
@iwanowww
Copy link
Contributor

iwanowww commented Apr 24, 2025

Just out of curiosity: Is the whole reconstruct_frame_pointer mechanism general enough so that we could use it in other places as well?

IMO it is general enough, but I haven't found a good place to put it. Ideally, all runtime/debug calls should keep frame pointer valid for diagnostic purposes.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Apr 24, 2025
@merykitty
Copy link
Member Author

@iwanowww @eme64 Thanks a lot for your reviews!
/integrate

@openjdk
Copy link

openjdk bot commented Apr 25, 2025

Going to push as commit ed60403.
Since your change was applied there have been 123 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Apr 25, 2025
@openjdk openjdk bot closed this Apr 25, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Apr 25, 2025
@openjdk
Copy link

openjdk bot commented Apr 25, 2025

@merykitty Pushed as commit ed60403.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@merykitty
Copy link
Member Author

IMO it is general enough, but I haven't found a good place to put it. Ideally, all runtime/debug calls should keep frame pointer valid for diagnostic purposes.

Tbh I don't understand ... How does the VM normally walk the stack when we crash in a compiled method? I thought that it has to know the frame size of a compiled method? Why do we need to manually reconstruct the frame pointer?

@iwanowww
Copy link
Contributor

JVM knows how to unwind the stack when crash happens in compiled code (compiled frame on top). When native frame is on top, it relies on platform ABI, so fails to unwind the stack at the border of native and compiled frames because compiled code doesn't follow platform ABI conventions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-compiler [email protected] integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

4 participants