Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define hardware error exception cause code #1105

Merged
merged 2 commits into from
Oct 26, 2023
Merged

Conversation

aswaterman
Copy link
Member

This PR is incomplete, as @ved-rivos and @gfavor are supposed to provide some non-normative text that describes the intended function of this exception.

cc @jhauser-us

@ved-rivos
Copy link
Collaborator

This PR is incomplete, as @ved-rivos and @gfavor are supposed to provide some non-normative text that describes the intended function of this exception.

A Proposal for Non-Normative Description of Hardware Error:

A hardware error is a synchronous exception triggered when corrupted or uncorrectable data is accessed. This can occur during the fetch of an instruction, through a load instruction (whether it's an explicit or implicit access), a store/AMO instruction (again, either through explicit or implicit access), or by other instructions that attempt to access corrupted data for their operation. In this context, "data" encompasses all types of information used within a RISC-V hart. The *epc CSR on delivery of the hardware error exception holds the address of the instruction that attempted to access corrupted data, while the *tval CSR is either set to 0 or holds the virtual address of an instruction fetch, load, or store that attempted to access corrupted data. The priority of this exception isn't consistently set in comparison to other synchronous exceptions because it can arise at any stage of instruction fetching and processing.

--
Please confirm that we're finalizing the use of cause code 18, not 19 (and 18 for software error) as previously discussed on the mailing list.

@aswaterman
Copy link
Member Author

I'll let you, @gfavor, and @jhauser-us sort out which cause code is used for which exception, then I'll update the PR accordingly. I'm just the messenger.

@gfavor
Copy link
Collaborator

gfavor commented Sep 4, 2023

As far as cause code assignment, I believe there is no great reason for which exception gets which cause code. So the naturual thing to do is what Andrew is doing, i.e. Hardware Error is now being assigned, so it gets 18, and then when Software Integrity Error gets assigned, it can take the next sequential code 19.

Btw, Andrew, why the conscious skip over reserved cause codes 16 and 17?

@gfavor
Copy link
Collaborator

gfavor commented Sep 4, 2023

Here's a reduced version of Ved's proposal for the Hardware Error exception description. This drops the second sentence that tries to describe all the kinds of instructions that may do explicit and implicit access, but doesn't truly add anything on top of the existing arch specs, especially since it isn't a complete explicit list of such types of instructions. (To be fair, Ved copied this from an earlier stab I had suggested a month ago.) This also drops the last sentence about uncertain priority since I then propose additions to Table 3.7 that establish clear priorities for Hardware Errors.

A Hardware Error is a synchronous exception triggered when corrupted or uncorrectable data is accessed explicitly or implicitly by an instruction.   In this context, "data" encompasses all types of information used within a RISC-V hart. The *epc CSR on delivery of the hardware error exception holds the address of the instruction that attempted to access corrupted data, while the *tval CSR is either set to 0 or holds the virtual address of an instruction fetch, load, or store that attempted to access corrupted data.

I would argue that there is a clear assignment of Hardware Error priority among cause codes that is consistent with existing cause code prioritization, e.g. instruction fetch-related exceptions take priority over decode/execution-related exceptions.  Hence, in Table 3.7 in the Priv spec:

  • In the "During instruction address translation" priority category, expand the text to say "First encountered page fault, access fault, or hardware error".

  • In the "With physical address for instruction" priority category, expand the text to say "Access fault or hardware error".

  • And ditto for the two "explicit memory access" priority categories.

@ved-rivos
Copy link
Collaborator

The description sounds good to me. About the priority, I agree that there is a priority implied by fetch, decode, and execution. But with the expanded scope of hardware errors, we could expect a hardware error during the phase that detects instruction address breakpoint (e.g., parity error on trigger registers) or as part of execution of a integer/FP/vector instruction (e.g., a register file parity error detected by an SLTI instruction). That led to my observation that it could be detected in any phase. Adding "first encountered ..." to other boxes could also address this expanded scope.

@gfavor
Copy link
Collaborator

gfavor commented Sep 4, 2023

Makes sense. I think this results in just these two further additions:

  • Add "18 Hardware Error" as another row in the " Instruction address breakpoint" category

  • Add "18 Hardware Error" as another row in the large middle category of exceptions.

@jhauser-us
Copy link
Collaborator

Per discussions in yesterday's Architecture Review Committee call, please move the hardware error exception to cause code 19, and change code 18 to be a software check exception.

(Speaking only for myself, I hope a better name can be found than software check, but this is what the ARC settled on during the call.)

@aswaterman
Copy link
Member Author

Fixed.

Need some text for the software check exception, too.

@aswaterman aswaterman force-pushed the hardware-error-exception branch from 88fbee7 to 47fc021 Compare September 7, 2023 02:12
@gfavor
Copy link
Collaborator

gfavor commented Sep 7, 2023

Andrew, what are you thinking as far as the exception priority issue (and old Table 3.7, now Table 15).

Re Ved's last priority comment, I'll agree. In which case it seems overly wordy and messy to suitably incorporate Hardware Error into each and every category in the table. How about instead adding to the preceding paragraph that refers to the table (addition in bold):

If an instruction may raise multiple synchronous exceptions, the decreasing priority order of Table 15 indicates which exception is taken and reported in mcause. The priority of Hardware Error exception is implementation-defined, but any given occurrence is generally expected to be recognized at the point in the overall priority order at which the hardware error is discovered. The priority of any custom synchronous exceptions is implementation-defined.

@aswaterman
Copy link
Member Author

aswaterman commented Sep 7, 2023

I'm with y'all on the problem statement, but I'm unsure as to the best solution. My hot take is that it would simplify the broader exception-priority story to identify the phases of instruction execution and prioritize exceptions by execution phase. Then we'd define a priority of exceptions within a phase. This might seem like overkill, but in the long run, I think it will simplify the description when the same cause code can show up at multiple points in the pipeline. (And it wouldn't be a dense matrix of phases and causes, as most causes can't occur in most phases.)

Perhaps I should try my hand at seeing what this would look like. If it doesn't pan out, so be it; I imagine we all find Greg's suggestion to be unobjectionable.

@ved-rivos
Copy link
Collaborator

I feel Software Protection Check will more descriptive of the exception.

A Proposal for Non-Normative Description of Software Protection Check Error:

The Software Protection Check is a synchronous exception that is triggered when there are violations of checks and assertions defined by ISA extensions that aim to safeguard the integrity of software assets, encompassing control flow, memory among others. When the Software Protection Check exception is raised, the *tval register captures details about the failing check. The specifics of this information are outlined in the ISA extensions that can trigger this exception. If no information is provided then a value of 0 is written. The priority of this exception, in relation to other synchronous exceptions, depends on the value stored in the *tval register and is explicitly specified by the corresponding ISA extensions.

@gfavor
Copy link
Collaborator

gfavor commented Sep 9, 2023

There was substantial ARC discussion about naming this exception that covered a range of proposed names, resulting in a non-unanimous concensus (reflecting how arguable a number of reasonable possible names are).

Personally I would be concerned about the adjective "Protection" being a bit narrowing to what this exception can be used for over time (as more extensions make use of it), and also has connotations that to some may be different that what we actually have in mind. My own leaning was "Software Error Check", but that had its own objections from some others.

@gfavor
Copy link
Collaborator

gfavor commented Sep 9, 2023

How about this tweaked and trimmed down version of Ved's description:

The Software Check exception is a synchronous exception that is triggered when there are violations of checks and assertions defined by ISA extensions that aim to safeguard the integrity of software assets, encompassing control flow and memory access constraints among others. When this exception is raised, the *tval register captures details about the failing check. The specifics of this information are outlined in the ISA extensions that can trigger this exception. If no information is provided, then a value of 0 is captured. The priority of this exception, in relation to other synchronous exceptions, depends the particular cause of this exception and is explicitly specified by the corresponding ISA extensions.

@ved-rivos
Copy link
Collaborator

looks good to me!

@allenjbaum
Copy link

allenjbaum commented Sep 11, 2023 via email

@aswaterman
Copy link
Member Author

I'm also fine with this, but I'll let it marinate for a few days before updating the PR.

@ved-rivos
Copy link
Collaborator

I guess its marinated enough to bake now!

@aswaterman aswaterman force-pushed the hardware-error-exception branch from 47fc021 to c0f33bf Compare October 26, 2023 22:36
@aswaterman
Copy link
Member Author

Indeed, thanks for reminding me, @ved-rivos!

@aswaterman aswaterman merged commit a1edc2f into main Oct 26, 2023
2 checks passed
@aswaterman aswaterman deleted the hardware-error-exception branch October 26, 2023 22:50
@ved-rivos
Copy link
Collaborator

@aswaterman - I think this description for software check exception did not make it into the PR
#1105 (comment)

@aswaterman
Copy link
Member Author

Fixed (with minor tweaks).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants