-
Notifications
You must be signed in to change notification settings - Fork 34
c.mulh instruction #208
Comments
Here's a little additional colour: I develop the minimax RISC-V implementation (https://github.com/gsmecher/minimax), which directly executes only 16-bit instructions and emulates 32-bit instructions in microcode. Because of this architecture, there is an ~100x performance difference between compressed and uncompressed instructions. The ZC* extensions are critical because more code in the 16-bit space is a massive performance improvement for minimax. More specifically: Minimax is currently useless for DSP because all multiplies must be emulated. A c.mul instruction brings high-performance 16-bit convolutions within reach. A c.mulh instruction would make performant 32-bit DSP possible in this core. This would open the core up to an entirely new type of application. I understand that a ~100x performance differential in a weird implementation is a Me problem, not a You problem. I also understand it's late in the cycle for a new instruction. However: there is a perfect encoding candidate right below c.mul (see Table 16.6, RISC-V Unprivileged ISA). Since c.mul implies M, which mandates a 32x32 -> 64-bit multiplier anyways, the implementation cost is basically the multiplexer to select between low and high words of the 64-bit product. From this perspective: is the lack of c.mulh a deliberate decision, or just something nobody has requested yet? Again, thanks for all your efforts - I am excited about this WG's work, with or without c.mulh. |
The |
@jnk0le - that is an excellent point. A synthetic c.mulh using Zcb instructions would be pretty short, and it looks like Zcb has all the right primitives already. Thank you - closing. |
On reflection, @jnk0le, I closed this issue too soon. Your comment is predicated on paired c.mul/c.mulh, which is true for wider multiplies. It's not true for DSP algorithms (FFTs, FIRs, correlations, etc.). Because these are not general-purpose algorithms, they don't show up on general-purpose benchmarks. I am, unfortunately, flatly wrong about the "obvious" encoding space for c.mulh. The slot below c.mulh in Table 16-6 is already consumed in Zc* and it does not look like there's an easy place to put c.mulh even if it were justified. |
If you have any suggested benchmarks (preferably open source but IIRC ELF files is all you need) then I guess it should be possible/easy to add support for analyzing the impact of a c.mulh encoding to the hca script in this repository. |
Thanks, @tovine: for DSP applications, ARM's CMSIS DSP library 0 seems like a good reference point. CMSIS is Apache licensed and should be directly portable to RISC-V (see e.g. 1). The following is an example convolution kernel adapted from 2. The loop is unrolled by a factor of 4, but here's the compact version first:
I get:
There are 8 instructions in the non-unrolled loop, of which mulh is the only uncompressed one. Pretty tough to make a case that c.mulh would improve code density here. I am not sure if there's a strong microarchitectural argument, and/or whether that's relevant to this WG anyway. If I use the 4-way unrolled version of the kernel instead, I get something the following:
Here, there are 24 instructions in the loop (13 uncompressed, of which 4 are mulh.) There are RVC equivalents to everything except c.mulh - though it looks like it would take a different compiler or handwritten kernel to make a 16-bit c.mulh stand out. An FFT implementation would be nearly as general-purpose as a convolution, but CMSIS doesn't have one. I'll poke around and see if I can put together something similar. Ultimately, though, this is somewhat application-specific and I don't know if a benchmark belongs in your tree or not. |
Here's an FFT, taken brutally and somewhat arbitrarily from cmsis-dsp.
I get the following:
This implementation looks like a dead end for c.mulh: most of the loop consists of 32-bit instructions (mul/mulh pairs not least among them, per @jnk0le's point above). |
Just to be 100% clear: I think this idea is DOA for non-technical reasons (schedule, scope) and technical reasons outside the scope of all my analysis above (lack of available encoding space for c.mulh). It is probably not worth discussing the motivational benchmarks alone if it's moot. |
In the initial analysis phase, we generated frequency count of all instructions used in all benchmarks we have used, and before trying to fit to any compressed encoding, the saving from c.mulh were way too low to consider. Like you said this would be more applicable in DSP type application , but would need to prove some how that the code size saving per encoding bit spent is worth while, and I think that would be fairly hard. Zc* extension is frozen at its current state, so its very unlikely that any new instructions would be added. But if you would like to make a better case for such instruction, I would encourage you to find opensource benchmarks that you think would benefit from it, and PR their compilation instructions for RISC-V and link to their source here thus we can evaluate using them when we work on the next iteration of the extension ! |
Thanks - this context is helpful, and I understand and agree with what you say. I will keep my eyes open for a justifiable benchmark that supports c.mulh. |
A c.mulh instruction to mirror c.mul in zcb would allow the following to be expressed in purely compressed instructions:
There is room in the instruction encodings directly below c.mul (see Table 16-6, RISC-V Unprivileged ISA) - there is a natural spot for c.mulh and it seems a shame not to take advantage of the symmetry here.
Related: #9. The MAC proposal was (reasonably) excluded as out-of-scope here. A c.mulh bookend to c.mul would partly address the same need but fits better within the mandate of this WG.
The text was updated successfully, but these errors were encountered: