Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add script to identify instructions type based on its arguments. #453

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

AFOliveira
Copy link
Collaborator

Based on last meeting, I created a script that can identify instructions type (e.g. I-type, R-type) based on what arguments each instruction has. I have a couple questions:

  • Where should this sit at? Probably at backends?
  • To move forward I would need to play a bit with the erb templates and ruby, but I can't seem to figure where the json.dump comes from.
RV64::
+
[wavedrom, ,svg,subs='attributes',width="100%"]
....
<%= JSON.dump inst.wavedrom_desc(64) %>
....
====

Right now, outputs for I extensions are:

arch/inst/I/slli.yaml: Instruction type: Unknown
arch/inst/I/ld.yaml: Instruction type: I-Type
arch/inst/I/sllw.yaml: Instruction type: R-Type
arch/inst/I/ecall.yaml: Instruction type: Unknown
arch/inst/I/lw.yaml: Instruction type: I-Type
arch/inst/I/ori.yaml: Instruction type: I-Type
arch/inst/I/addw.yaml: Instruction type: R-Type
arch/inst/I/auipc.yaml: Instruction type: U-Type
arch/inst/I/srliw.yaml: Instruction type: R-Type
arch/inst/I/sb.yaml: Instruction type: S-Type
arch/inst/I/sltiu.yaml: Instruction type: I-Type
arch/inst/I/sraw.yaml: Instruction type: R-Type
arch/inst/I/slt.yaml: Instruction type: R-Type
arch/inst/I/bgeu.yaml: Instruction type: B-Type
arch/inst/I/addiw.yaml: Instruction type: I-Type
arch/inst/I/srlw.yaml: Instruction type: R-Type
arch/inst/I/addi.yaml: Instruction type: I-Type
arch/inst/I/lbu.yaml: Instruction type: I-Type
arch/inst/I/blt.yaml: Instruction type: B-Type
arch/inst/I/xor.yaml: Instruction type: R-Type
arch/inst/I/mret.yaml: Instruction type: Unknown
arch/inst/I/srl.yaml: Instruction type: R-Type
arch/inst/I/lb.yaml: Instruction type: I-Type
arch/inst/I/add.yaml: Instruction type: R-Type
arch/inst/I/andi.yaml: Instruction type: I-Type
arch/inst/I/srli.yaml: Instruction type: Unknown
arch/inst/I/sd.yaml: Instruction type: S-Type
arch/inst/I/bltu.yaml: Instruction type: B-Type
arch/inst/I/lui.yaml: Instruction type: U-Type
arch/inst/I/sh.yaml: Instruction type: S-Type
arch/inst/I/slliw.yaml: Instruction type: R-Type
arch/inst/I/slti.yaml: Instruction type: I-Type
arch/inst/I/beq.yaml: Instruction type: B-Type
arch/inst/I/jalr.yaml: Instruction type: I-Type
arch/inst/I/sra.yaml: Instruction type: R-Type
arch/inst/I/sw.yaml: Instruction type: S-Type
arch/inst/I/xori.yaml: Instruction type: I-Type
arch/inst/I/subw.yaml: Instruction type: R-Type
arch/inst/I/lhu.yaml: Instruction type: I-Type
arch/inst/I/sltu.yaml: Instruction type: R-Type
arch/inst/I/fence.yaml: Instruction type: Unknown
arch/inst/I/bge.yaml: Instruction type: B-Type
arch/inst/I/srai.yaml: Instruction type: Unknown
arch/inst/I/wfi.yaml: Instruction type: Unknown
arch/inst/I/or.yaml: Instruction type: R-Type
arch/inst/I/ebreak.yaml: Instruction type: Unknown
arch/inst/I/jal.yaml: Instruction type: J-Type
arch/inst/I/lh.yaml: Instruction type: I-Type
arch/inst/I/sub.yaml: Instruction type: R-Type
arch/inst/I/sraiw.yaml: Instruction type: R-Type
arch/inst/I/bne.yaml: Instruction type: B-Type
arch/inst/I/lwu.yaml: Instruction type: I-Type
arch/inst/I/sll.yaml: Instruction type: R-Type
arch/inst/I/and.yaml: Instruction type: R-Type

@ThinkOpenly
Copy link
Collaborator

@AFOliveira, could you describe how you see this being used?

One of my motivations in including instruction format information was to be able to automatically validate that the defined instructions actually comply with their specified formats.

Arguably, this script does the converse: determines the format based on the instruction definition. Would we use this to ease the burden of seeding the YAML files with their respective formats, mostly as a one-off?

I can see adding a new enum-type field to the instructions, with a set of defined formats, ideally limited by the bits in the encoding (or maybe just CI validated to match that). And, further CI validation that the fields match the format.

Is that where we are heading?

@AFOliveira
Copy link
Collaborator Author

@AFOliveira, could you describe how you see this being used?

@ThinkOpenly Main point of this PR is to continue the discussion we had on the meeting. It can definitely be a one time writing "helper", but we need to agree on how to represent this. I suggest type : I , if that is fine I can put that in the schema and generate all the yaml replacement for that.

How we will use this in the future: first and probably easier use case is instruction documentation, i.e. when we generate a instruction, we can have "hints" for what each hardocoded field means. The example below is super specific to instruction, but even something as "opcode", "funct7", "funct3" would be great.
image

Then sure, I can also create the "inverse" script and do the validation based on instruction type, that should be quite smooth after we have the types merged in the yaml.

@AFOliveira
Copy link
Collaborator Author

@ThinkOpenly @dhower-qc If you can please review the first extensions mock-up it would be great! The only odd thing is that yaml libraries seem to prefer to have "" on the match strings. This does not affect the data, so I believe it won't be a problem. I can also filter them out if you prefer.

arch/inst/I/srliw.yaml Outdated Show resolved Hide resolved
description: Add an immediate to the value in rs1, and store the result in rd
definedBy: I
assembly: xd, xs1, imm
encoding:
match: -----------------000-----0010011
match: "-----------------000-----0010011"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the quotes are only needed in some places and not everywhere?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The quotes serve to force the field to be a string, I don't think we need them, but there is also no reason not to have them. Aditionally, all python parsers seem to place them, so I would need to filter them out in some ad-hoc way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, the quotes are (mostly) optional here. YAML type deduction is a suggested part of the spec. Most parsers do it in a predictable way but occasionally you run into issues. There are cases where the quotes are always needed (e.g., if match doesn't contain any '-', so that it's not interpreted as a number).

This highlights the tradeoff between YAML and JSON as the storage format; JSON is unambiguous but has lots of extra quotes, doesn't allow comments, etc. YAML is easier to read and format, but is actually an ambiguous spec when it comes to types. Paring YAML with JSON schema was my attempt to get the best of both.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that explanation (thanks, @dhower-qc!), the potential for ambiguity without quotes, and the known need for quotes in some cases, we should probably put quotes everywhere (at last for the "match" field, that is) for consistency. This is obviously not a high priority.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a pre-commit hook is formatting the yaml files with a tool prettier. The behavior of quoting could be changed if needed https://prettier.io/docs/options/#quote-props

My assumption with the existing default setting as-needed is if the match value was all digits, the author would put it in quotes and that would remain unchanged by prettier.

Presumably the check-jsonschema pre-commit hook would fail if prettier managed to format a string into a number of vice versa.

The case we don't want it prettier fighting with check-jsonchema.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we then run prettier on all instructions to get them all sorted and we solve it entirely?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pre-commit run --all-files prettier but I just ran it on your branch and it left quotes as is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But doesn't it apply the changes to all files? Just so we don't run in this type of changes in commits like mine and already have it all sorted.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you run that command with -v verbose you'll see it processing all the file but there are no changes:

[email protected]:~/rvi/repos/riscv-software-src/riscv-unified-db on  main [!] via  v22.13.1 via 🐍 v3.12.4 via 💎 v3.2.3
❯ pre-commit run -v --all-files prettier | head
prettier.................................................................Passed
- hook id: prettier
- duration: 1.4s

arch/inst/B/slli.uw.yaml 14ms (unchanged)
cfgs/qc_iu/arch_overlay/inst/Xqci/qc.insbprh.yaml 4ms (unchanged)
arch/inst/F/fmaxm.s.yaml 6ms (unchanged)
arch/inst/V/vmin.vv.yaml 4ms (unchanged)
arch/inst/V/vluxseg3ei32.v.yaml 3ms (unchanged)
arch/inst/V/vdivu.vx.yaml 5ms (unchanged)

I don't understand why prettier put the quotes in for addi.yaml to begin with.

I also can't find anything in prettier docs that would lead me to believe it can figure out as-needed for these files. I don't think is it capable of accessing the #yaml-language-server comment at the top that references the schema file, but if it could, it would not "need" to add quotes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The quotes are actually being added by the python yaml package that I am using to parse the yamls, if we decide on removing them, I'll simply clean all the outputs once I'm done with adding all instruction formats.

@@ -4,12 +4,15 @@ $schema: "inst_schema.json#"
kind: instruction
name: addiw
long_name: Add immediate word
description: Add an immediate to the 32-bit value in rs1, and store the sign extended result in rd
type: I-Type
description:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Showing my ignorance here, but does this need the | at the end like other multiline field values? This occurs elsewhere as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the |, this still gets interpreted as a string since there is no empty line in the description text. | (or >) is required if there were an empty line. Also, I believe that with no marker, the string is interpreted using something closer to the > (folded) block style that keeps around extra whitespace.

In general, I think it's going to be best practice to consistently use | here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was done by the pre-commit and I did not check it afterwards, hence the issue. I'll fix it on next commit and pay more attention

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's prettier cfg for setting that made this change https://prettier.io/docs/options/#print-width

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kbroch-rivosinc can we set prettier to have the | character when it does this specific behaviour?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the |, this still gets interpreted as a string since there is no empty line in the description text. | (or >) is required if there were an empty line. Also, I believe that with no marker, the string is interpreted using something closer to the > (folded) block style that keeps around extra whitespace.

My reading of the YAML spec is that | denotes a "literal block scalar" that preserves all linebreaks. > denotes a "folded block scalar" (as you said), but this can fold lines where two non-space characters are separated by a single space.

Here, if we want that long line to be "preserved" when the YAML is processed, we'd want to use >.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's probably grounds here to open an issue, right? sS we can have a discussion with a slighly larger group, and maybe migrate all multi-line descriptions into >

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, it's not as straightforward as a global replacement. For example, I don't think it would format this exactly as was originally intended:

name: rev8
long_name: Byte-reverse register (RV64 encoding)
description: |
  This instruction reverses the order of the bytes in rs1.

  [NOTE] 
  The rev8 mnemonic corresponds to different instruction encodings in RV32 and RV64.

  [NOTE] 
  The byte-reverse operation is only available for the full register width. To emulate word-sized
  and halfword-sized byte-reversal, perform a `rev8 rd,rs` followed by a `srai rd,rd,K`, where K
  is XLEN-32 and XLEN-16, respectively.

I think the "[NOTE]" line would be folded with the line(s) below it.

I've no idea what it would do with stuff like:

description: |
  Can causes the processor to enter a low-power state until the next interrupt occurs.

  <%- if ext?(:H) -%>
  The behavior of `wfi` is affected by the `mstatus.TW`
  and `hstatus.VTW` bits, as summarized below.

  [%autowidth,%footer]
  |===
  .2+| [.rotate]#`mstatus.TW`# .2+| [.rotate]#`hstatus.VTW`# 4+^.>| `wfi` behavior
  h| HS-mode h| U-mode h| VS-mode h| in VU-mode

  | 0 | 0 | Wait | Trap (I) | Wait | Trap (V)
  | 0 | 1 | Wait | Trap (I) | Trap (V) | Trap (V)
  | 1 | - | Trap (I) | Trap (I) | Trap (I) | Trap (I)

  6+| Trap (I) - Trap with `Illegal Instruction` code +
  Trap (V) - Trap with `Virtual Instruction` code
  |===

  <%- else -%>
  The `wfi` instruction is also affected by `mstatus.TW`, as shown below:

  [%autowidth,%footer]
  |===
  .2+| [.rotate]#`mstatus.TW`# 2+^.>| `wfi` behavior
  h| S-mode h| U-mode

  | 0 | Wait | Trap (I)
  | 1 | Trap (I) | Trap (I)

  3+| Trap (I) - Trap with `Illegal Instruction` code
  |===

  <%- end -%>

@@ -4,6 +4,7 @@ $schema: "inst_schema.json#"
kind: instruction
name: ebreak
long_name: Breakpoint exception
type: Unknown
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we just manually set this to "I-type"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, is it? I found some cases like this one in which I could not find information on what type the instruction actually is. Moreover, the fences have the same problem

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, this is an interesting one. It has the same general structure as an I-type, but the immediate is actually an opcode. I think this is a different (unnamed in the spec) instruction format.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulling out the spec... ;-)

Section 2.8:

SYSTEM instructions are used to access system functionality that might require privileged access and are encoded using the I-type instruction format.

EBREAK is a SYSTEM instruction.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fences have the same problem

Fences are arguably I-type as well, given their 4-4-4-5-3-5-7 field layout only matches I-type's 12-5-3-5-7.

Although, I'm not saying that the application of formats in the spec is not sloppy at times. If we have a reasonably supportable case for reclassifying some instructions into new formats (or sub-formats?), we can take that forward for discussion to the spec owners. If we think that's a possibility, maybe we put a mark on the wall now, in the form of an issue here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea on how to define those as for type? Should I insert "form: System"?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The snippet from section 2.8 is saying that these are "I-type" format.

@@ -4,6 +4,7 @@ $schema: "inst_schema.json#"
kind: instruction
name: ecall
long_name: Environment call
type: Unknown
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I-type

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ECALL is also a SYSTEM instruction.

schemas/inst_schema.json Outdated Show resolved Hide resolved
schemas/inst_schema.json Outdated Show resolved Hide resolved
schemas/inst_schema.json Outdated Show resolved Hide resolved
@@ -182,6 +180,19 @@
"type": "string",
"description": "Short description of the instruction"
},
"type": {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we hew closely to the ISA spec, the field name should probably be "format", especially if we carry the "-type" suffix into every format name, below.

The spec calls these, generally, "instruction formats", but calls each instance "X-type". (I presume there is some logic to this.)

Should we make this field required, or does that just break everything until all of the instructions have a "format" ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make it required, but then this PR will have to put a type or format into all of the instructions, which actually seems feasible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I suspected. I think we're headed that way, anyway. Do you need help?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(If we choose not to do it now, let's create a reminder issue to "make the 'format' field required".)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can sort it all out in this PR! I'll try to see how much of it I can automate (my guess is most of it) and then I'll need help for reviewing the whole thing and handling weirder types such as system calls and fences, is that alright?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me. As for SYSTEM and FENCE instructions, they should all be I-type, I believe.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a problem with this types being all I-types. How shall we do verification on them? Some of them will be all hardcoded and there is still a way to check if all bits still align; however, I had a vision of more profound verification. Moreover, if we want sub-fields appearing in instruction documentation, we would need to specify the sub-types of all instructions, otherwise we will never get the correct "caption" on every field. For the first approach, I think I won't delve in this, but what if we define all sub-type of each type in a next PR? The ISA already specifies them as specilization of the main type. so actually giving them names shouldn't be a big problem. We can see how to do this here in the UDB and if it makes sense, this would be a great discussion to have with the ISA commitee at it's time, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, having all the fixed bits hardcoded in a "match" string for each instruction doesn't allow for that sort of field-level verification where the field value (or part of the field value!) is fixed. I avoided this in my Sail-to-JSON backend and left everything as individual fields, for good or bad. It's certainly handy to have a "match" string, but it's also pretty easy to compute a match string based on the individual field representations.

Copy link
Collaborator Author

@AFOliveira AFOliveira Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I look at this, the more I think we should probably just have all instructions as all fields described. This would only be possible with accurate instruction types and fields, that would precisely describe what every bit of every instruction is.

schemas/inst_schema.json Outdated Show resolved Hide resolved
@dhower-qc
Copy link
Collaborator

@ThinkOpenly @dhower-qc If you can please review the first extensions mock-up it would be great! The only odd thing is that yaml libraries seem to prefer to have "" on the match strings. This does not affect the data, so I believe it won't be a problem. I can also filter them out if you prefer.

I'll try to get this this later today. I have some thoughts after a quick pass, but it will take me a bit to organize them.

@AFOliveira AFOliveira force-pushed the AFOliveira/instructionType branch from 3a5c55a to 1ea910d Compare February 10, 2025 10:05
Signed-off-by: Afonso Oliveira <[email protected]>
@AFOliveira AFOliveira force-pushed the AFOliveira/instructionType branch from 83f357e to 0451c11 Compare February 10, 2025 11:04
…al/ Zabha/ Zalasr/ Zawrs/ Zbkb/ Zbkx/ Zfbfmin/ Zfh/ Zicbom/ Zicboz/ Zicfilp/ Zicfiss/ Zicond/ Zicsr/ Zifencei/ Zimop/ Zk Zks/

Signed-off-by: Afonso Oliveira <[email protected]>
@AFOliveira AFOliveira force-pushed the AFOliveira/instructionType branch from 11ab941 to beeecc1 Compare February 10, 2025 14:54
@AFOliveira
Copy link
Collaborator Author

@ThinkOpenly All but Compressed and Vector instructions should be now done. Also updated the schema.

@dhower-qc
Copy link
Collaborator

Warning: I'm going to blow this PR up 💣

I'm wondering if we can align this with the goals of #285, which are more expansive than identifying the "format" of an instruction.

As the discussion here highlights, "format" is too ambiguous to truly capture an instruction type (e.g., ecall is "I-type" even though it has no immediate). For each field in an encoding, we need to identify:

  • Location
  • Type (opcode/variable)
    • Recall that for opcodes, we want to capture the opcode field name (e.g., 'opcode', 'funct3', etc.)
    • For variables, we want to capture the name and traits (e.g., 'signed', 'shift', 'not')
  • Relocation, if applicable

With this in mind, "format" actually maps to multiple combinations of the above.

There are many ways to represent this. A two initial options for discussion:

Option 1 - insts point to specific format

# ld.yaml
name: ld
# ...
encoding:
  match: -----------------011-----0000011
  format: { $ref: iformats/itype_R_RISCV_PCREL_LO12_I.yaml# }
# iformats/itype_R_RISCV_PCREL_LO12_I.yaml
common_name: I-type
fields:
  - name: imm
    location: 31-20
    type: signed immediate
    relocation: { $ref: relocations/R_RISCV_PCREL_LO12_I.yaml# }
  - name: xs1
    location: 19-15
    type: xregister
  - name: xd
    location: 11-7
    type: xregister
  - name: opcode
    location: 6-0
    type: opcode
  - name: funct3
    location: 14-12
    type: opcode

Option 2 - insts pick their fields

# ld.yaml
name: ld
# ...
encoding:
  match: -----------------011-----0000011
  format: I-type # still needed, unless we have a map of "common name" to field list elsewhere
  fields:
    - { $ref: ifields/itype_pc_rel_imm.yaml# }
    - { $ref: ifields/xs1.yaml# }
    - { $ref: ifields/xd.yaml# }
    - { $ref: ifields/major_opcode.yaml# }
    - { $ref: ifields/funct3.yaml# }
---
# ifields/itype_pc_rel_imm.yaml
- name: imm
  location: 31-20
  type: signed immediate
  relocation: { $ref: relocations/R_RISCV_PCREL_LO12_I.yaml# }

---
# ifields/xs1.yaml
- name: xs1
  location: 19-15
  type: xregister

---
# ifields/xd.yaml#
- name: xd
  location: 11-7
  type: xregister

---
# ifields/major_opcode.yam
- name: opcode
  location: 6-0
  type: opcode

--
# ifields/funct3.yaml
- name: funct3
  location: 14-12
  type: opcode

Option 1 winds up with more "iformats" since you need a new format for every variation. Option avoids that problem, but doesn't completely remove data duplication in the files since you will have a lot of instructions with the same field list. FWIW, Option 2 gives a little more information to the human reader.

@drom might have opinions.

@ThinkOpenly
Copy link
Collaborator

I'm in the "Option 1" camp. If we think that formats are important (and the ISA spec does treat them as important), then every instruction must comply with a format, even with idiosyncrasies. We should not give instructions the freedom to "build their own" format with a list of fields that may or may not be part of the format but the layout works because "everything fits". (This is what Sail does today, and I think we have a nice opportunity to improve significantly on that.)

I like the compartmentalization to be at a high enough level that the format definitions could be hardcoded in the database. So, instead of iformats/itype_R_RISCV_PCREL_LO12_I.yaml, the "itype" part would be its own directory: iformats/itype/R_RISCV_PCREL_LO12_I.yaml.

@ThinkOpenly
Copy link
Collaborator

I also think we should still have the format as a required field for each instruction. This can be used to validate the instruction definition matches the layout of the fields, and it eases displaying the instruction format as part of the documentation/web reference.

I didn't review all 415(!) files changed here, and it's an open question as to whether we need to retain type.py (maybe not), but I like the changes to the YAML files and would like to see those changes merged, at least after we also get the vector instructions covered.

@dhower-qc
Copy link
Collaborator

I like the compartmentalization to be at a high enough level that the format definitions could be hardcoded in the database. So, instead of iformats/itype_R_RISCV_PCREL_LO12_I.yaml, the "itype" part would be its own directory: iformats/itype/R_RISCV_PCREL_LO12_I.yaml.

I like that.

I also think we should still have the format as a required field for each instruction.

Agreed.

@AFOliveira
Copy link
Collaborator Author

I also prefer option 1; however, I'm still not sure how this would look. It seems to me that the task is going to be defining every subset of each instruction type and defining what fields it has. Moreover, how do we identify the fields hardcoded? Because sometimes it is the same type, but the field actually means different things, right? Take fence as an example, it is I-type but the imm section is not an immediate at all.

image

Should we define a subtype for each meaning of the fields? I.e. this would be

  # fence.yaml
    - format: I-type
      subformat: { $ref: iformats/itype_fence.yaml# }

and then for the fields, how do they get hardcoded? Can we specify them? And then which ones would we specify and which ones we wouldn't? Probably should do for all, even for opcodes

#iformats/itype_fence.yaml

  - name: imm
    location: 31-20
    type: signed immediate
    value: 0x0

@dhower-qc
Copy link
Collaborator

OK, let's go with Option 1.

I was thinking that formats never have hardcoded values (though they might include excluded values with not:, as you see frequently in the C extension). You can use the mask for values. For your fence example, I think it would like:

# inst/I/fence.yaml
name: fence
encoding: 0000-------------000-----0001111
format: { $ref: iformats/itype/fence.yaml# }
# inst/I/fence.tso.yaml
name: fence
encoding: 0001-------------000-----0001111
format: { $ref: iformats/itype/fence.yaml# }
# iformats/itype/fence.yaml
common_name: I-type  # this is mostly informative, since it's too ambiguous to mean much
fields:
  - name: fm
    location: 31-28
    type: opcode
  - name: pi
    location: 27
    type: fence set
  # ...

@AFOliveira
Copy link
Collaborator Author

Next question is: is this going to be a requirement to have this PR upstream? In theory, having instruction types on their own would already be a nice increment to what we have now and allow for some more verification; however, proposed changes aren't really minimal, and I don't think there is a way to automate, so we first need to figure how many different subtypes we have and describe them all - and I'll have a really hard/long time doing that on my own :)

@dhower-qc
Copy link
Collaborator

Next question is: is this going to be a requirement to have this PR upstream? In theory, having instruction types on their own would already be a nice increment to what we have now and allow for some more verification; however, proposed changes aren't really minimal, and I don't think there is a way to automate, so we first need to figure how many different subtypes we have and describe them all - and I'll have a really hard/long time doing that on my own :)

Yes, this is not a small task. And it's hard to see a "transition" period where we would support old and new schemas. If you can get things started, and get a better idea of the scope of work, then I can work on finding some help.

@ThinkOpenly
Copy link
Collaborator

Next question is: is this going to be a requirement to have this PR upstream? In theory, having instruction types on their own would already be a nice increment to what we have now and allow for some more verification; however, proposed changes aren't really minimal, and I don't think there is a way to automate, so we first need to figure how many different subtypes we have and describe them all - and I'll have a really hard/long time doing that on my own :)

Yes, this is not a small task. And it's hard to see a "transition" period where we would support old and new schemas. If you can get things started, and get a better idea of the scope of work, then I can work on finding some help.

If I understand the semantics of $ref correctly, it just gets replaced with the target contents. So, the example above:

# ld.yaml
name: ld
# ...
encoding:
  match: -----------------011-----0000011
  format: { $ref: iformats/itype_R_RISCV_PCREL_LO12_I.yaml# }

Is exactly the same and can be implemented as:

# ld.yaml
name: ld
# ...
encoding:
  match: -----------------011-----0000011
  format:
    common_name: I-type
    fields:
      - name: imm
        location: 31-20
...

Is that correct?

If so, it might be an "easier" task to first convert everything from the current format to the second format, above, meaning change this:

encoding:
  match: -----------------000-----0001111
  variables:
    - name: fm
      location: 31-28
    - name: pred
      location: 27-24
    - name: succ
      location: 23-20
    - name: rs1
      location: 19-15
    - name: rd
      location: 11-7

to this:

encoding:
  match: -----------------000-----0001111
  format:
    common_name: I-type
    fields:
      - name: fm
        location: 31-28
      - name: pred
        location: 27-24
      - name: succ
        location: 23-20
      - name: rs1
        location: 19-15
      - name: rd
        location: 11-7

Then, we can coalesce common format representations (and add types and relocations) at will.

@AFOliveira
Copy link
Collaborator Author

@ThinkOpenly Yes, that was what I was asking. Can we first get all the formats upstream and only then open another PR on the next changes?

@dhower-qc
Copy link
Collaborator

I like this line of thinking, but I think there are a few pitfalls as proposed. For example, we'll need to know opcode vs. non-opcode so the generators know if the field is a variable for IDL code / wavedrom diagramming, etc. That said, we might be able to tolerate some missing info for a period if it helps.

I'm wondering just how many formats we really have. Thinking aloud here, this should cover G:

  • I-type, pc-relative offset
  • I-type, sign-extended immediate (no relocation)
  • I-type, CSR (no relocation) (arguably same as prior, but want a different name for the field)
  • I-type, CSRI (no relocation)
  • I-type, fence
  • I-type, opcode (ecall/ebreak)
  • R-type
  • R-type, FP rm (arguably same as prior, but want different name for 'funct3')
  • R-type, FP cvt/sqrt/class/mv
  • R-type, RV64 shifts
  • R-type, RV32 shifts
  • R-type, LR
  • R-type, SC / AMO
  • R4-type
  • U-type, pc-relative offset
  • U-type, immediate (no relocation)
  • J-type
  • S-type
  • B-type

That's 19. Say C adds another 10. A few for vector. And some misc here and there...

@AFOliveira , how many of these are your scripts already looking for?

@AFOliveira
Copy link
Collaborator Author

@AFOliveira , how many of these are your scripts already looking for?

I only have the base I-type, R-type etc... I'm afraid there are more than those 19, per example, I think V extension also adds a variety of unnamed types. There's also R-type FP in which RS2 is hardcoded. I think the hardest part here is having that definitive list, since the ISA is not super clear on them, after that I can see how much is actually automatable, but I would guess I could figure it out.

@ThinkOpenly
Copy link
Collaborator

This could benefit from iterations, I think. However, there is value in getting the basic types (6 for "I": R, I, S, B, U, J; and 9 for "C": CR, CI, CSS, CIW, CL, CS, CA, CB, CJ) added. These are the only types (AFAIK) defined in the ISA. Adding them will be non-controversial (if that is possible).

I worry about running too far ahead of the ISA in these kinds of canonical areas.

Other formats as described should probably be considered sub-types and non-canonical, and maybe wait for the next iteration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants