add option for .w suffix boilerplating of all compressible instructions #83

Open
jnk0le opened this issue Aug 11, 2024 · 4 comments

Comments

@jnk0le

jnk0le commented Aug 11, 2024

Related to #61: I have already spotted some inputs that use the .w suffix on compressible instructions, seemingly "for no reason".

Certain microarchitectures may suffer performance degradation due to the use of compressed instructions.
To avoid this, and the resulting false positives/negatives in benchmarking, all instructions need to be forced into uncompressed form (i.e. boilerplated with the .w suffix).
So as not to bother "naive" writers, this needs to be handled by slothy via a config option.
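
A minimal sketch of what such a pass could look like at the text level. `force_wide` is a hypothetical helper, not part of slothy's API, and it assumes `//` comments and one instruction per line. It leaves explicit `.n`/`.w` qualifiers and typed suffixes alone, and it does not know which mnemonics have no 32-bit encoding at all (e.g. `it`, `cbz`), so a real implementation would need an exclusion list:

```python
import re

# Hypothetical helper, NOT slothy's API: append ".w" to every instruction
# mnemonic that carries no suffix yet.
_INSN = re.compile(r"^(\s*)([A-Za-z][A-Za-z0-9]*)(\.[A-Za-z0-9.]+)?(\s.*)?$")

def force_wide(line: str) -> str:
    # Split off a trailing "//" comment so it is never rewritten.
    code, sep, comment = line.partition("//")
    body = code.rstrip()
    trailing = code[len(body):]
    stripped = body.strip()
    # Leave blank lines, labels, and assembler directives untouched.
    if not stripped or stripped.endswith(":") or stripped.startswith("."):
        return line
    m = _INSN.match(body)
    if m is None:
        return line
    indent, mnemonic, suffix, operands = m.groups()
    if suffix is not None:
        # Already qualified (.w/.n) or typed (e.g. .f32): keep as written.
        return line
    return f"{indent}{mnemonic}.w{operands or ''}{trailing}{sep}{comment}"
```

Applied line by line, this turns `eors r0, r1` into `eors.w r0, r1` while keeping labels, directives and already-suffixed instructions unchanged.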

cortex-m7:
For maximum IPC, all instructions need to be uncompressed and one needs to forget about load/store double/multiple.
(no further penalties after "normal" stalls)
Of course it is possible to compress (and use CISCy instructions) without penalties, but I couldn't figure out the exact pattern,
and trial-and-error probing on HW is too much for a superoptimizer.

cortex-m3/4:
.w loads need to be aligned (instruction bits) at word boundaries, or they will fail to pipeline.
(The Schwabe & Stoffelen AES work went the "all uncompressed" way.)
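
To illustrate the alignment constraint, here is a small sketch that walks a sequence of instructions and reports 32-bit loads whose encoding does not start on a word boundary. The function name, the `(mnemonic, size)` input format, and the "starts with `ldr`" test are assumptions made for this example only:

```python
def misaligned_wide_loads(insns, start=0):
    """Return (index, address) for each 4-byte load not at a 4-byte boundary.

    insns: list of (mnemonic, size_in_bytes) with size 2 or 4.
    start: address of the first instruction (assumed word-aligned by default).
    """
    addr = start
    bad = []
    for idx, (mnemonic, size) in enumerate(insns):
        # Hypothetical load test: anything whose mnemonic starts with "ldr".
        if size == 4 and mnemonic.startswith("ldr") and addr % 4 != 0:
            bad.append((idx, addr))
        addr += size
    return bad
```

A single 16-bit instruction in front of a run of `.w` loads shifts all of them off the word boundary, which is why going "all uncompressed" sidesteps the problem entirely.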

@dop-amin
Collaborator

Hey @jnk0le, thanks for bringing this to our attention! While developing and tuning the model, I also figured that we -- probably -- want .w everywhere. I think this should be available as an option.

However, this would require more changes than just in the printing, as the scheduling properties of the expanded instruction may be different, and the additional possibilities for register renaming should be taken into account, i.e., what's currently modeled as, e.g., eor_short in the architectural model should get transformed into the 3-operand eor.
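
A rough sketch of that transformation at the text level, assuming eor_short corresponds to the two-operand, flag-setting form `eors rd, rm`. `widen_two_operand` and its regex are hypothetical and not how slothy models instructions internally:

```python
import re

# Hypothetical text-level rewrite, NOT slothy's instruction model:
# expand two-operand, flag-setting logical ops into the 3-operand .w form.
_TWO_OP = re.compile(r"^(\s*)(eors|ands|orrs)\s+(r\d+)\s*,\s*(r\d+)\s*$")

def widen_two_operand(line: str) -> str:
    m = _TWO_OP.match(line)
    if m is None:
        return line  # not a two-operand form we know about
    indent, mnemonic, rdn, rm = m.groups()
    # The 16-bit form uses rdn as both destination and first source;
    # spelling that out gives a renamer a separate destination to change.
    return f"{indent}{mnemonic}.w {rdn}, {rdn}, {rm}"
```

Once the instruction reads `eors.w r0, r0, r1`, the destination is no longer tied to the first source, so a register renamer could pick a different output register.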

@jnk0le
Author

jnk0le commented Aug 15, 2024

> as the scheduling properties of the expanded instruction may be different,

Didn't spot such behaviour on M4/M7. Only things like the compiler preferring a "shifted constant" over encoding T4 (better issuing), or having to choose between uxtb.n r0, r1 and and.w r0, r1, #0xff (better issuing).

That should be a thing on CM33 or CM55 though. (M85 can triple-issue nops and branches, but that's independent of the offending instruction size.)

@dop-amin
Collaborator

dop-amin commented Aug 15, 2024

Thanks for your input on that matter!

> Didn't spot such behaviour on M4/M7

I agree; I just did not want to rule out that this case could come up.
However, just adding the .w without switching to the 3-operand form of the instruction in slothy is still "a waste", as it limits the register renaming.

On, e.g., M85 this could matter though. From the Software Optimization Guide: "The latency from the shifter source operand is 2, regardless of whether the shift immediate value is non-zero or not." This means, using .w on an instruction where the shift is 0 and could be encoded in 16 bits will be promoted to the 32-bit form where the shift immediate is set to 0, incurring a latency penalty on the shifted argument.
I see you have been running experiments on M85, too; have you been able to observe this?

@jnk0le
Author

jnk0le commented Aug 15, 2024

> "The latency from the shifter source operand is 2, regardless of whether the shift immediate value is non-zero or not." This means, using .w on an instruction where the shift is 0 and could be encoded in 16 bits will be promoted to the 32-bit form where the shift immediate is set to 0, incurring a latency penalty on the shifted argument.

Seems to be the case on chained (3-4+) dependencies; otherwise the stall is somehow folded by the early/late ALU.
