-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add script to identify instructions type based on its arguments. #453
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Afonso Oliveira <[email protected]>
@AFOliveira, could you describe how you see this being used? One of my motivations in including instruction format information was to be able to automatically validate that the defined instructions actually comply with their specified formats. Arguably, this script does the converse: determines the format based on the instruction definition. Would we use this to ease the burden of seeding the YAML files with their respective formats, mostly as a one-off? I can see adding a new enum-type field to the instructions, with a set of defined formats, ideally limited by the bits in the encoding (or maybe just CI validated to match that). And, further CI validation that the fields match the format. Is that where we are heading? |
@ThinkOpenly Main point of this PR is to continue the discussion we had on the meeting. It can definitely be a one time writing "helper", but we need to agree on how to represent this. I suggest How we will use this in the future: first and probably easier use case is instruction documentation, i.e. when we generate a instruction, we can have "hints" for what each hardocoded field means. The example below is super specific to instruction, but even something as "opcode", "funct7", "funct3" would be great. Then sure, I can also create the "inverse" script and do the validation based on instruction type, that should be quite smooth after we have the types merged in the yaml. |
@ThinkOpenly @dhower-qc If you can please review the first extensions mock-up it would be great! The only odd thing is that yaml libraries seem to prefer to have |
description: Add an immediate to the value in rs1, and store the result in rd | ||
definedBy: I | ||
assembly: xd, xs1, imm | ||
encoding: | ||
match: -----------------000-----0010011 | ||
match: "-----------------000-----0010011" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the quotes are only needed in some places and not everywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The quotes serve to force the field to be a string, I don't think we need them, but there is also no reason not to have them. Aditionally, all python parsers seem to place them, so I would need to filter them out in some ad-hoc way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, the quotes are (mostly) optional here. YAML type deduction is a suggested part of the spec. Most parsers do it in a predictable way but occasionally you run into issues. There are cases where the quotes are always needed (e.g., if match doesn't contain any '-', so that it's not interpreted as a number).
This highlights the tradeoff between YAML and JSON as the storage format; JSON is unambiguous but has lots of extra quotes, doesn't allow comments, etc. YAML is easier to read and format, but is actually an ambiguous spec when it comes to types. Paring YAML with JSON schema was my attempt to get the best of both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that explanation (thanks, @dhower-qc!), the potential for ambiguity without quotes, and the known need for quotes in some cases, we should probably put quotes everywhere (at last for the "match" field, that is) for consistency. This is obviously not a high priority.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, a pre-commit hook is formatting the yaml files with a tool prettier
. The behavior of quoting could be changed if needed https://prettier.io/docs/options/#quote-props
My assumption with the existing default setting as-needed
is if the match
value was all digits, the author would put it in quotes and that would remain unchanged by prettier.
Presumably the check-jsonschema
pre-commit hook would fail if prettier
managed to format a string into a number of vice versa.
The case we don't want it prettier fighting with check-jsonchema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we then run prettier on all instructions to get them all sorted and we solve it entirely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pre-commit run --all-files prettier
but I just ran it on your branch and it left quotes as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But doesn't it apply the changes to all files? Just so we don't run in this type of changes in commits like mine and already have it all sorted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you run that command with -v
verbose you'll see it processing all the file but there are no changes:
[email protected]:~/rvi/repos/riscv-software-src/riscv-unified-db on main [!] via v22.13.1 via 🐍 v3.12.4 via 💎 v3.2.3
❯ pre-commit run -v --all-files prettier | head
prettier.................................................................Passed
- hook id: prettier
- duration: 1.4s
arch/inst/B/slli.uw.yaml 14ms (unchanged)
cfgs/qc_iu/arch_overlay/inst/Xqci/qc.insbprh.yaml 4ms (unchanged)
arch/inst/F/fmaxm.s.yaml 6ms (unchanged)
arch/inst/V/vmin.vv.yaml 4ms (unchanged)
arch/inst/V/vluxseg3ei32.v.yaml 3ms (unchanged)
arch/inst/V/vdivu.vx.yaml 5ms (unchanged)
I don't understand why prettier put the quotes in for addi.yaml
to begin with.
I also can't find anything in prettier docs that would lead me to believe it can figure out as-needed
for these files. I don't think is it capable of accessing the #yaml-language-server
comment at the top that references the schema file, but if it could, it would not "need" to add quotes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The quotes are actually being added by the python yaml package that I am using to parse the yamls, if we decide on removing them, I'll simply clean all the outputs once I'm done with adding all instruction formats.
arch/inst/I/addiw.yaml
Outdated
@@ -4,12 +4,15 @@ $schema: "inst_schema.json#" | |||
kind: instruction | |||
name: addiw | |||
long_name: Add immediate word | |||
description: Add an immediate to the 32-bit value in rs1, and store the sign extended result in rd | |||
type: I-Type | |||
description: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Showing my ignorance here, but does this need the |
at the end like other multiline field values? This occurs elsewhere as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the |
, this still gets interpreted as a string since there is no empty line in the description text. |
(or >
) is required if there were an empty line. Also, I believe that with no marker, the string is interpreted using something closer to the >
(folded) block style that keeps around extra whitespace.
In general, I think it's going to be best practice to consistently use |
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this was done by the pre-commit and I did not check it afterwards, hence the issue. I'll fix it on next commit and pay more attention
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's prettier cfg for setting that made this change https://prettier.io/docs/options/#print-width
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kbroch-rivosinc can we set prettier to have the | character when it does this specific behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the
|
, this still gets interpreted as a string since there is no empty line in the description text.|
(or>
) is required if there were an empty line. Also, I believe that with no marker, the string is interpreted using something closer to the>
(folded) block style that keeps around extra whitespace.
My reading of the YAML spec is that |
denotes a "literal block scalar" that preserves all linebreaks. >
denotes a "folded block scalar" (as you said), but this can fold lines where two non-space characters are separated by a single space.
Here, if we want that long line to be "preserved" when the YAML is processed, we'd want to use >
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's probably grounds here to open an issue, right? sS we can have a discussion with a slighly larger group, and maybe migrate all multi-line descriptions into >
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, it's not as straightforward as a global replacement. For example, I don't think it would format this exactly as was originally intended:
name: rev8
long_name: Byte-reverse register (RV64 encoding)
description: |
This instruction reverses the order of the bytes in rs1.
[NOTE]
The rev8 mnemonic corresponds to different instruction encodings in RV32 and RV64.
[NOTE]
The byte-reverse operation is only available for the full register width. To emulate word-sized
and halfword-sized byte-reversal, perform a `rev8 rd,rs` followed by a `srai rd,rd,K`, where K
is XLEN-32 and XLEN-16, respectively.
I think the "[NOTE]" line would be folded with the line(s) below it.
I've no idea what it would do with stuff like:
description: |
Can causes the processor to enter a low-power state until the next interrupt occurs.
<%- if ext?(:H) -%>
The behavior of `wfi` is affected by the `mstatus.TW`
and `hstatus.VTW` bits, as summarized below.
[%autowidth,%footer]
|===
.2+| [.rotate]#`mstatus.TW`# .2+| [.rotate]#`hstatus.VTW`# 4+^.>| `wfi` behavior
h| HS-mode h| U-mode h| VS-mode h| in VU-mode
| 0 | 0 | Wait | Trap (I) | Wait | Trap (V)
| 0 | 1 | Wait | Trap (I) | Trap (V) | Trap (V)
| 1 | - | Trap (I) | Trap (I) | Trap (I) | Trap (I)
6+| Trap (I) - Trap with `Illegal Instruction` code +
Trap (V) - Trap with `Virtual Instruction` code
|===
<%- else -%>
The `wfi` instruction is also affected by `mstatus.TW`, as shown below:
[%autowidth,%footer]
|===
.2+| [.rotate]#`mstatus.TW`# 2+^.>| `wfi` behavior
h| S-mode h| U-mode
| 0 | Wait | Trap (I)
| 1 | Trap (I) | Trap (I)
3+| Trap (I) - Trap with `Illegal Instruction` code
|===
<%- end -%>
arch/inst/I/ebreak.yaml
Outdated
@@ -4,6 +4,7 @@ $schema: "inst_schema.json#" | |||
kind: instruction | |||
name: ebreak | |||
long_name: Breakpoint exception | |||
type: Unknown |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we just manually set this to "I-type"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, is it? I found some cases like this one in which I could not find information on what type the instruction actually is. Moreover, the fences have the same problem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, this is an interesting one. It has the same general structure as an I-type, but the immediate is actually an opcode. I think this is a different (unnamed in the spec) instruction format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pulling out the spec... ;-)
Section 2.8:
SYSTEM instructions are used to access system functionality that might require privileged access and are encoded using the I-type instruction format.
EBREAK is a SYSTEM instruction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fences have the same problem
Fences are arguably I-type as well, given their 4-4-4-5-3-5-7 field layout only matches I-type's 12-5-3-5-7.
Although, I'm not saying that the application of formats in the spec is not sloppy at times. If we have a reasonably supportable case for reclassifying some instructions into new formats (or sub-formats?), we can take that forward for discussion to the spec owners. If we think that's a possibility, maybe we put a mark on the wall now, in the form of an issue here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any idea on how to define those as for type? Should I insert "form: System"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The snippet from section 2.8 is saying that these are "I-type" format.
arch/inst/I/ecall.yaml
Outdated
@@ -4,6 +4,7 @@ $schema: "inst_schema.json#" | |||
kind: instruction | |||
name: ecall | |||
long_name: Environment call | |||
type: Unknown |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I-type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ECALL is also a SYSTEM instruction.
schemas/inst_schema.json
Outdated
@@ -182,6 +180,19 @@ | |||
"type": "string", | |||
"description": "Short description of the instruction" | |||
}, | |||
"type": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we hew closely to the ISA spec, the field name should probably be "format", especially if we carry the "-type" suffix into every format name, below.
The spec calls these, generally, "instruction formats", but calls each instance "X-type". (I presume there is some logic to this.)
Should we make this field required, or does that just break everything until all of the instructions have a "format" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make it required, but then this PR will have to put a type or format into all of the instructions, which actually seems feasible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I suspected. I think we're headed that way, anyway. Do you need help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(If we choose not to do it now, let's create a reminder issue to "make the 'format' field required".)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can sort it all out in this PR! I'll try to see how much of it I can automate (my guess is most of it) and then I'll need help for reviewing the whole thing and handling weirder types such as system calls and fences, is that alright?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me. As for SYSTEM and FENCE instructions, they should all be I-type, I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a problem with this types being all I-types. How shall we do verification on them? Some of them will be all hardcoded and there is still a way to check if all bits still align; however, I had a vision of more profound verification. Moreover, if we want sub-fields appearing in instruction documentation, we would need to specify the sub-types of all instructions, otherwise we will never get the correct "caption" on every field. For the first approach, I think I won't delve in this, but what if we define all sub-type of each type in a next PR? The ISA already specifies them as specilization of the main type. so actually giving them names shouldn't be a big problem. We can see how to do this here in the UDB and if it makes sense, this would be a great discussion to have with the ISA commitee at it's time, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, having all the fixed bits hardcoded in a "match" string for each instruction doesn't allow for that sort of field-level verification where the field value (or part of the field value!) is fixed. I avoided this in my Sail-to-JSON backend and left everything as individual fields, for good or bad. It's certainly handy to have a "match" string, but it's also pretty easy to compute a match string based on the individual field representations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I look at this, the more I think we should probably just have all instructions as all fields described. This would only be possible with accurate instruction types and fields, that would precisely describe what every bit of every instruction is.
I'll try to get this this later today. I have some thoughts after a quick pass, but it will take me a bit to organize them. |
Signed-off-by: Afonso Oliveira <[email protected]>
3a5c55a
to
1ea910d
Compare
Signed-off-by: Afonso Oliveira <[email protected]>
83f357e
to
0451c11
Compare
Signed-off-by: Afonso Oliveira <[email protected]>
…al/ Zabha/ Zalasr/ Zawrs/ Zbkb/ Zbkx/ Zfbfmin/ Zfh/ Zicbom/ Zicboz/ Zicfilp/ Zicfiss/ Zicond/ Zicsr/ Zifencei/ Zimop/ Zk Zks/ Signed-off-by: Afonso Oliveira <[email protected]>
11ab941
to
beeecc1
Compare
@ThinkOpenly All but Compressed and Vector instructions should be now done. Also updated the schema. |
Signed-off-by: Afonso Oliveira <[email protected]>
Warning: I'm going to blow this PR up 💣 I'm wondering if we can align this with the goals of #285, which are more expansive than identifying the "format" of an instruction. As the discussion here highlights, "format" is too ambiguous to truly capture an instruction type (e.g., ecall is "I-type" even though it has no immediate). For each field in an encoding, we need to identify:
With this in mind, "format" actually maps to multiple combinations of the above. There are many ways to represent this. A two initial options for discussion: Option 1 - insts point to specific format# ld.yaml
name: ld
# ...
encoding:
match: -----------------011-----0000011
format: { $ref: iformats/itype_R_RISCV_PCREL_LO12_I.yaml# } # iformats/itype_R_RISCV_PCREL_LO12_I.yaml
common_name: I-type
fields:
- name: imm
location: 31-20
type: signed immediate
relocation: { $ref: relocations/R_RISCV_PCREL_LO12_I.yaml# }
- name: xs1
location: 19-15
type: xregister
- name: xd
location: 11-7
type: xregister
- name: opcode
location: 6-0
type: opcode
- name: funct3
location: 14-12
type: opcode Option 2 - insts pick their fields# ld.yaml
name: ld
# ...
encoding:
match: -----------------011-----0000011
format: I-type # still needed, unless we have a map of "common name" to field list elsewhere
fields:
- { $ref: ifields/itype_pc_rel_imm.yaml# }
- { $ref: ifields/xs1.yaml# }
- { $ref: ifields/xd.yaml# }
- { $ref: ifields/major_opcode.yaml# }
- { $ref: ifields/funct3.yaml# } ---
# ifields/itype_pc_rel_imm.yaml
- name: imm
location: 31-20
type: signed immediate
relocation: { $ref: relocations/R_RISCV_PCREL_LO12_I.yaml# }
---
# ifields/xs1.yaml
- name: xs1
location: 19-15
type: xregister
---
# ifields/xd.yaml#
- name: xd
location: 11-7
type: xregister
---
# ifields/major_opcode.yam
- name: opcode
location: 6-0
type: opcode
--
# ifields/funct3.yaml
- name: funct3
location: 14-12
type: opcode Option 1 winds up with more "iformats" since you need a new format for every variation. Option avoids that problem, but doesn't completely remove data duplication in the files since you will have a lot of instructions with the same field list. FWIW, Option 2 gives a little more information to the human reader. @drom might have opinions. |
I'm in the "Option 1" camp. If we think that formats are important (and the ISA spec does treat them as important), then every instruction must comply with a format, even with idiosyncrasies. We should not give instructions the freedom to "build their own" format with a list of fields that may or may not be part of the format but the layout works because "everything fits". (This is what Sail does today, and I think we have a nice opportunity to improve significantly on that.) I like the compartmentalization to be at a high enough level that the format definitions could be hardcoded in the database. So, instead of |
I also think we should still have the format as a required field for each instruction. This can be used to validate the instruction definition matches the layout of the fields, and it eases displaying the instruction format as part of the documentation/web reference. I didn't review all 415(!) files changed here, and it's an open question as to whether we need to retain |
I like that.
Agreed. |
I also prefer option 1; however, I'm still not sure how this would look. It seems to me that the task is going to be defining every subset of each instruction type and defining what fields it has. Moreover, how do we identify the fields hardcoded? Because sometimes it is the same type, but the field actually means different things, right? Take fence as an example, it is I-type but the Should we define a subtype for each meaning of the fields? I.e. this would be
and then for the fields, how do they get hardcoded? Can we specify them? And then which ones would we specify and which ones we wouldn't? Probably should do for all, even for opcodes
|
OK, let's go with Option 1. I was thinking that formats never have hardcoded values (though they might include excluded values with # inst/I/fence.yaml
name: fence
encoding: 0000-------------000-----0001111
format: { $ref: iformats/itype/fence.yaml# } # inst/I/fence.tso.yaml
name: fence
encoding: 0001-------------000-----0001111
format: { $ref: iformats/itype/fence.yaml# } # iformats/itype/fence.yaml
common_name: I-type # this is mostly informative, since it's too ambiguous to mean much
fields:
- name: fm
location: 31-28
type: opcode
- name: pi
location: 27
type: fence set
# ... |
Next question is: is this going to be a requirement to have this PR upstream? In theory, having instruction types on their own would already be a nice increment to what we have now and allow for some more verification; however, proposed changes aren't really minimal, and I don't think there is a way to automate, so we first need to figure how many different subtypes we have and describe them all - and I'll have a really hard/long time doing that on my own :) |
Yes, this is not a small task. And it's hard to see a "transition" period where we would support old and new schemas. If you can get things started, and get a better idea of the scope of work, then I can work on finding some help. |
If I understand the semantics of
Is exactly the same and can be implemented as:
Is that correct? If so, it might be an "easier" task to first convert everything from the current format to the second format, above, meaning change this: encoding:
match: -----------------000-----0001111
variables:
- name: fm
location: 31-28
- name: pred
location: 27-24
- name: succ
location: 23-20
- name: rs1
location: 19-15
- name: rd
location: 11-7 to this: encoding:
match: -----------------000-----0001111
format:
common_name: I-type
fields:
- name: fm
location: 31-28
- name: pred
location: 27-24
- name: succ
location: 23-20
- name: rs1
location: 19-15
- name: rd
location: 11-7 Then, we can coalesce common format representations (and add types and relocations) at will. |
@ThinkOpenly Yes, that was what I was asking. Can we first get all the formats upstream and only then open another PR on the next changes? |
I like this line of thinking, but I think there are a few pitfalls as proposed. For example, we'll need to know opcode vs. non-opcode so the generators know if the field is a variable for IDL code / wavedrom diagramming, etc. That said, we might be able to tolerate some missing info for a period if it helps. I'm wondering just how many formats we really have. Thinking aloud here, this should cover
That's 19. Say C adds another 10. A few for vector. And some misc here and there... @AFOliveira , how many of these are your scripts already looking for? |
I only have the base I-type, R-type etc... I'm afraid there are more than those 19, per example, I think V extension also adds a variety of unnamed types. There's also R-type FP in which RS2 is hardcoded. I think the hardest part here is having that definitive list, since the ISA is not super clear on them, after that I can see how much is actually automatable, but I would guess I could figure it out. |
This could benefit from iterations, I think. However, there is value in getting the basic types (6 for "I": R, I, S, B, U, J; and 9 for "C": CR, CI, CSS, CIW, CL, CS, CA, CB, CJ) added. These are the only types (AFAIK) defined in the ISA. Adding them will be non-controversial (if that is possible). I worry about running too far ahead of the ISA in these kinds of canonical areas. Other formats as described should probably be considered sub-types and non-canonical, and maybe wait for the next iteration. |
Based on last meeting, I created a script that can identify instructions type (e.g. I-type, R-type) based on what arguments each instruction has. I have a couple questions:
Right now, outputs for I extensions are: