"Any" code model with link-time instruction selection? #322

Open
rui314 opened this issue Aug 5, 2022 · 10 comments

Comments

@rui314
Collaborator

rui314 commented Aug 5, 2022

I've been wondering since I started working on a linker for RISC-V why its psABI needed the notion of "code model" in the first place.

I understand that a code model is required if a compiler emits a sequence of machine code and the linker can only patch it without changing its size. For example, if a compiler for x86-64 emits an immediate load instruction to put a symbol address into a register, we can't change it to a GOT-indirect load, because we can't replace a single instruction with two instructions in place. There's simply not enough space to rewrite the instructions.

However, it's not an issue for the RISC-V psABI since it allows the linker to add or remove instructions in the middle of a section. Then, why did we want a compiler to emit a specialized instruction sequence instead of a generic one?

Let me show you an example. Imagine that we have a hypothetical relocation R_RISCV_SYMBOL_ADDRESS that materializes a symbol address in a register. A compiler emits code like this:

  addi a0, a0, 0 # R_RISCV_SYMBOL_ADDRESS foo

where addi is a placeholder instruction (I chose this instruction arbitrarily). The linker is free to expand it to

  auipc a0, %pcrel_hi20(foo)
  addi a0, a0, %pcrel_lo12(foo)

if foo is defined locally, or

  auipc a0, %pcrel_hi20(foo@GOT)
  ld a0, %pcrel_lo12(foo@GOT)(a0)

if foo needs to be accessed via GOT, or even

  addi a0, x0, foo

if the linker knows foo's address is small enough to fit in addi's 12-bit signed immediate.

We can do the same thing for calls. The point is that we can let the linker choose the instructions that work best for the final binary.
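Here is a minimal Python sketch of that link-time selection. It assumes a static (non-PIE) RV64 link in which the final addresses of `foo`, the placeholder instruction (`pc`), and foo's GOT slot (when needed) are already known. `select_sequence`, `pcrel_split`, `hi_lo`, and `fits_signed` are hypothetical names. The sketch prints raw immediates rather than the `%pcrel_*` notation above, and it takes "needs GOT" as an input instead of deciding it.

```python
from typing import List, Optional, Tuple


def fits_signed(value: int, bits: int) -> bool:
    """True if value fits in a `bits`-wide two's-complement field."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def hi_lo(value: int) -> Tuple[int, int]:
    """Split value so that (hi << 12) + lo == value, with lo in [-2048, 2047]."""
    hi = (value + 0x800) >> 12
    lo = value - (hi << 12)
    return hi, lo


def pcrel_split(offset: int) -> Tuple[int, int]:
    """Immediates for an auipc + 12-bit pair covering a PC-relative offset."""
    hi, lo = hi_lo(offset)
    if not fits_signed(hi, 20):
        raise ValueError("offset out of auipc + 12-bit range")
    return hi & 0xFFFFF, lo  # auipc takes the upper part as a raw 20-bit field


def select_sequence(rd: str, pc: int, sym_addr: int, needs_got: bool,
                    got_slot: Optional[int] = None) -> List[str]:
    """Hypothetical expansion of one R_RISCV_SYMBOL_ADDRESS placeholder at pc."""
    if needs_got:
        # GOT-indirect: PC-relative address of the GOT slot, then a load.
        if got_slot is None:
            raise ValueError("GOT access requested but no GOT slot given")
        hi, lo = pcrel_split(got_slot - pc)
        return [f"auipc {rd}, {hi:#x}", f"ld {rd}, {lo}({rd})"]
    if fits_signed(sym_addr, 12):
        # Tiny absolute address: one addi from x0 is enough (static link only).
        return [f"addi {rd}, x0, {sym_addr}"]
    # Locally defined symbol: PC-relative address computation.
    hi, lo = pcrel_split(sym_addr - pc)
    return [f"auipc {rd}, {hi:#x}", f"addi {rd}, {rd}, {lo}"]
```

For example, `select_sequence("a0", 0x10000, 0x12345, False)` yields `auipc a0, 0x2` followed by `addi a0, a0, 837`.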

I'm not sure if this has been discussed before, but I don't think the above scheme has a fundamental flaw. Am I missing something?

@aswaterman
Contributor

aswaterman commented Aug 5, 2022

It would indeed be possible to construct such a universe. But unless the linker is also an optimizing compiler, you'd lose out on some important optimization opportunities. Splitting addressing sequences into independently schedulable instructions avails the compiler of more code-motion opportunities. For example, if a global variable is referenced in a loop body, the lui can be hoisted out of the loop, reducing dynamic instruction count.
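To make the hoisting point concrete, here is a small Python sketch that emits a loop loading a global word in two ways and counts the dynamic instructions of each. It assumes the medlow-style `lui`/`%lo` pair (GNU as `%hi`/`%lo` operators), that `a1` and `a2` already hold the accumulator and trip count, and that the loop body is labelled `1:`. `emit_load_loop` and `dynamic_count` are illustrative names.

```python
from typing import List, Tuple


def emit_load_loop(sym: str, hoisted: bool) -> Tuple[List[str], List[str]]:
    """Return (preheader, body) for `do { a1 += sym; } while (--a2);`.

    The body is assumed to start at local label 1:, and a1/a2 are assumed
    to be initialized by surrounding code.
    """
    lui = f"lui t0, %hi({sym})"        # depends only on sym: loop-invariant
    body = [
        f"lw t1, %lo({sym})(t0)",      # low 12 bits folded into the load
        "add a1, a1, t1",
        "addi a2, a2, -1",
        "bnez a2, 1b",
    ]
    if hoisted:
        return [lui], body             # lui executes once, before the loop
    return [], [lui] + body            # lui re-executes on every iteration


def dynamic_count(preheader: List[str], body: List[str], iterations: int) -> int:
    """Number of instructions executed for the given trip count."""
    return len(preheader) + len(body) * iterations
```

For 100 iterations, the hoisted form executes 401 instructions and the unhoisted form executes 500. The compiler can only move the `lui` out of the loop because it sees the `lui` as a separate instruction.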

If, instead of emitting the best-case sequence and expanding it in the linker, you emitted the worst-case sequence and contracted it, the situation would improve. (My previous example could then be optimized as we'd hope.) But it still would give the compiler less-accurate information in some cases, misguiding the optimizer. The effect would be most pronounced for a fully general large code model for RV64, where addressing sequences have several instructions, so that scheduling them will materially affect register-allocation decisions.
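The Python sketch below gives a sense of how long a fully general RV64 sequence can get. It uses a naive recursive algorithm to put an arbitrary signed 64-bit constant into one register, plus a tiny interpreter that checks the result. This is an illustration under assumptions, not how GCC or LLVM do it, since production compilers use smarter algorithms. A real large code model might also load the address from a constant pool instead. `materialize` and `run` are hypothetical names.

```python
from typing import List

MASK64 = (1 << 64) - 1


def sext(value: int, bits: int) -> int:
    """Sign-extend the low `bits` bits of value."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def materialize(rd: str, value: int) -> List[str]:
    """Naive RV64 sequence that sets rd to the signed 64-bit `value`."""
    if -(1 << 31) <= value < (1 << 31):
        lo = sext(value, 12)
        hi = (value - lo) >> 12
        if hi == 0:
            return [f"addi {rd}, zero, {lo}"]
        seq = [f"lui {rd}, {hi & 0xFFFFF:#x}"]
        if lo != 0:
            # addiw wraps to 32 bits and sign-extends, which also covers
            # values just below 2**31 where hi == 0x80000.
            seq.append(f"addiw {rd}, {rd}, {lo}")
        return seq
    # Peel off the low 12 bits, build the rest recursively, then shift back.
    lo = sext(value, 12)
    seq = materialize(rd, (value - lo) >> 12)
    seq.append(f"slli {rd}, {rd}, 12")
    if lo != 0:
        seq.append(f"addi {rd}, {rd}, {lo}")
    return seq


def run(seq: List[str]) -> int:
    """Interpret the four opcodes above; return rd as an unsigned 64-bit value."""
    reg = 0
    for insn in seq:
        op, args = insn.split(" ", 1)
        ops = [a.strip() for a in args.split(",")]
        imm = int(ops[-1], 0)
        if op == "lui":
            reg = sext(imm << 12, 32) & MASK64
        elif op == "addiw":
            reg = sext(reg + imm, 32) & MASK64
        elif op == "addi":
            base = 0 if ops[1] == "zero" else reg
            reg = (base + imm) & MASK64
        elif op == "slli":
            reg = (reg << imm) & MASK64
        else:
            raise ValueError(f"unsupported instruction: {insn}")
    return reg
```

With this scheme, `materialize("a0", 5)` is a single instruction. `materialize("a0", 0x123456789ABCDEF0)` needs eight: `lui`, `addiw`, and three `slli`/`addi` pairs. That worst case is the kind of sequence whose scheduling affects register allocation.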

@rui314
Collaborator Author

rui314 commented Aug 5, 2022

I don't think it prevents compilers from emitting optimized code. Here is an example: imagine a function that accesses thread-local variables A and B. If the compiler knows that both A and B will end up in the same ELF module (e.g. they are hidden symbols), it can compute the TLS base address of the current module and just add offsets to it to access A and B. But such an optimization is orthogonal to the access model; the code sequence emitted for this kind of optimized access is not affected by the current code model.

If, for example, the compiler knows that the upper 42 bits are the same for variables A and B, it can emit code that materializes those upper 42 bits in the code-model-less scheme and then reuse that value in the following code. We don't have to live with only the hypothetical R_RISCV_SYMBOL_ADDRESS relocation; we can define more relocations if compiler optimizations need them.
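Here is a Python sketch of that reuse, using a `lui` base for simplicity. It assumes the addresses of A and B are known (in practice, that the compiler knows their relative offset, e.g. because both live in the same section). One `lui` then serves both accesses whenever B lies within a signed 12-bit offset of A's rounded upper part. `shared_base_loads` is a hypothetical name, and nothing here is defined by the psABI.

```python
from typing import List, Optional


def shared_base_loads(addr_a: int, addr_b: int) -> Optional[List[str]]:
    """Loads of the words at A and B from one materialized upper part,
    or None if B is too far from A's base.

    Addresses are plain integers for simplicity; a compiler would work
    with symbols plus a known offset between them.
    """
    hi = (addr_a + 0x800) >> 12        # %hi(A), rounded for a signed %lo
    base = hi << 12
    off_a = addr_a - base              # always within [-2048, 2047]
    off_b = addr_b - base
    if not -2048 <= off_b <= 2047:
        return None                    # B needs its own upper part
    return [
        f"lui t0, {hi & 0xFFFFF:#x}",  # materialized once...
        f"lw a0, {off_a}(t0)",         # ...used for A
        f"lw a1, {off_b}(t0)",         # ...and reused for B
    ]
```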

@jnk0le

jnk0le commented Aug 15, 2022

I've been wondering since I started working on a linker for RISC-V why its psABI needed the notion of "code model" in the first place.

I also think that the division into "medlow" and "medany" seems to be quite unnecessary.
Because of this division we already have issues like: https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/xq0H3zlzz2Y

There only need to be two de facto "code models":

  • PIC/PIE code, obviously auipc only
  • static code where auipc is emitted only when LUI+ADDI can't reach the offset (as the LUI is easier and more likely to be macro fused)

auipc can be used as the relocation placeholder, so that non-optimizing linkers don't break at large base offsets.
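Here is a minimal Python sketch of that two-model policy. It assumes RV64 and that final addresses are known, so it describes a link-time decision. `choose_first_insn`, `pair_reaches`, and the range constants are hypothetical names. The constants are the reach of a 20-bit upper immediate (sign-extended, shifted left by 12) combined with a signed 12-bit low immediate.

```python
# Reach of a sign-extended 20-bit upper immediate shifted left by 12,
# plus a signed 12-bit low immediate (RV64).
PAIR_MIN = -(1 << 31) - (1 << 11)
PAIR_MAX = (1 << 31) - (1 << 11) - 1


def pair_reaches(value: int) -> bool:
    return PAIR_MIN <= value <= PAIR_MAX


def choose_first_insn(addr: int, pc: int, pic: bool) -> str:
    """Pick the instruction that starts the address-materializing pair."""
    if pic:
        return "auipc"                 # PIC/PIE: PC-relative only
    if pair_reaches(addr):
        return "lui"                   # static: absolute lui+addi reaches
    if pair_reaches(addr - pc):
        return "auipc"                 # static fallback: PC-relative
    raise ValueError("neither lui+addi nor auipc+addi reaches the target")
```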

Also, considering things like riscvarchive/riscv-code-size-reduction/issues/158, in some cases it could be beneficial to emit c.lui and try c.auipc when the value is out of c.lui range.

@kito-cheng
Collaborator

kito-cheng commented Aug 16, 2022

However, it's not an issue for the RISC-V psABI since it allows the linker to add or remove instructions in the middle of a section. Then, why did we want a compiler to emit a specialized instruction sequence instead of a generic one?

The linker can only remove instructions during the relaxation process; it can't insert new ones. More specifically, we can't grow the code size, since that might keep linker relaxation from converging and would make relaxation more complicated. Inserting new instructions might push a branch or jump out of range, and once a jump is out of range we would need an indirect jump to resolve it. That is almost impossible at the linker stage because it requires an extra temporary register. The RISC-V ISA has relatively short branch and jump ranges compared to other architectures, so this would happen more frequently than we expect.
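To put numbers on the range problem: B-type conditional branches encode a signed 13-bit byte offset (about ±4 KiB), and JAL encodes a signed 21-bit byte offset (about ±1 MiB). Both offsets are multiples of 2. The Python sketch below checks whether a branch stays encodable after a given number of bytes is inserted between it and its target. `in_range` and `survives_growth` are hypothetical names.

```python
# Byte-offset ranges of RISC-V PC-relative control transfers.
BRANCH_RANGE = (-(1 << 12), (1 << 12) - 2)   # B-type: -4096 .. +4094
JAL_RANGE = (-(1 << 20), (1 << 20) - 2)      # J-type: -1 MiB .. +1 MiB - 2


def in_range(offset: int, rng: tuple) -> bool:
    lo, hi = rng
    return lo <= offset <= hi and offset % 2 == 0


def survives_growth(offset: int, inserted: int, rng: tuple) -> bool:
    """Does a branch stay encodable after `inserted` bytes are added
    between it and its target, moving the target further away?"""
    grown = offset + inserted if offset >= 0 else offset - inserted
    return in_range(grown, rng)
```

A conditional branch 4000 bytes before its target stops fitting after only 128 bytes of growth, while the same distance is nowhere near JAL's limit.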

@kito-cheng
Collaborator

static code where auipc is emitted only when LUI+ADDI can't reach the offset (as the LUI is easier and more likely to be macro fused)

Ideally we should select LUI rather than AUIPC when possible, but current toolchain implementations (true for both LLVM and GCC) have separate compilation and linking phases. We don't know whether we can use LUI or AUIPC, because we don't have the symbol address during the compilation stage. That's why we still need the -mcmodel option and let the user decide.

@rui314
Collaborator Author

rui314 commented Aug 16, 2022

The linker can only remove instructions during the relaxation process; it can't insert new ones. More specifically, we can't grow the code size, since that might keep linker relaxation from converging and would make relaxation more complicated. Inserting new instructions might push a branch or jump out of range, and once a jump is out of range we would need an indirect jump to resolve it. That is almost impossible at the linker stage because it requires an extra temporary register. The RISC-V ISA has relatively short branch and jump ranges compared to other architectures, so this would happen more frequently than we expect.

I think all the points you mentioned are technical problems we can simply solve. Linkers do not currently expect sections to grow in size, but that's not a fundamental limitation. After all, someone has to select instructions. We do it in the compiler at the moment with an algorithm that's guaranteed to converge, and we can run the same or a similar algorithm in the linker.
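One such algorithm is sketched below in Python under simplifying assumptions. Every branch starts in its short form, is widened when its target is out of reach, and is never shrunk again. Because each branch can grow only once, the loop stops after at most one pass per branch plus one. This follows the grow-only scheme commonly used for assembler branch relaxation and is not a description of any existing linker. `Insn`, `layout`, `relax`, the 4- and 8-byte sizes, and the reach value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Insn:
    size: int                      # current encoded size in bytes
    target: Optional[int] = None   # index of the branch target, if a branch
    long_size: int = 8             # hypothetical size of the long form
    is_long: bool = False


def layout(insns: List[Insn]) -> List[int]:
    """Assign addresses from the current instruction sizes."""
    addrs, pc = [], 0
    for insn in insns:
        addrs.append(pc)
        pc += insn.size
    return addrs


def relax(insns: List[Insn], reach: int) -> int:
    """Widen out-of-range short branches until nothing changes.

    Returns the number of layout passes. Sizes only grow and each branch
    widens at most once, so the loop terminates.
    """
    passes = 0
    while True:
        passes += 1
        addrs = layout(insns)
        changed = False
        for idx, insn in enumerate(insns):
            if insn.target is None or insn.is_long:
                continue
            offset = addrs[insn.target] - addrs[idx]
            if not -reach <= offset < reach:
                insn.size, insn.is_long = insn.long_size, True
                changed = True
        if not changed:
            return passes
```

Widening a branch that sits inside another branch's span can push that outer branch out of range. The next pass then catches the outer branch, which is why the loop runs until a pass changes nothing.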

As to the issue that a branch can go out of range if we add bytes between the instruction and its destination, there are many solutions. One obvious way is to reserve a register for linker-synthesized long-branch sequences. Another would be to have the compiler emit short jump instructions only with some "safety margin" (say, only if the destination is within 95% of the instruction's reach), allowing the linker to optimize within that margin. There may be more solutions to this problem.
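The safety-margin idea fits in a few lines of Python. The compiler emits the short form only when its estimated offset is within a fraction of the reach, and the remaining distance is the growth budget left for the linker. The 95% figure comes from the comment above. The function names and the use of the B-type range are assumptions for illustration.

```python
BRANCH_MIN, BRANCH_MAX = -4096, 4094   # B-type byte-offset range


def emit_short(offset: int, margin: float = 0.95) -> bool:
    """Compiler side: use the short branch only inside the margin."""
    limit = BRANCH_MAX if offset >= 0 else -BRANCH_MIN
    return abs(offset) <= margin * limit


def growth_budget(offset: int) -> int:
    """Linker side: bytes that may still be inserted between the branch
    and its target before the short form stops reaching."""
    return BRANCH_MAX - offset if offset >= 0 else offset - BRANCH_MIN
```

With the default margin, every short branch the compiler emits leaves the linker roughly 200 bytes of slack (5% of the reach).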

@kito-cheng
Collaborator

As to the issue that a branch can go out of range if we add bytes between the instruction and its destination, there are many solutions. One obvious way is to reserve a register for linker-synthesized long-branch sequences. Another would be to have the compiler emit short jump instructions only with some "safety margin" (say, only if the destination is within 95% of the instruction's reach), allowing the linker to optimize within that margin. There may be more solutions to this problem.

I don't like the idea of reserving a register for the linker, but I admit it's one possible solution, and it would make this a new ABI rather than just a new code model.

But I'm very happy to discuss this further; it's a kind of brainstorming. It might not be applicable to the UABI/Unix ABI, but it would be worth considering for an embedded ABI.

@jnk0le

jnk0le commented Sep 9, 2022

Because of this division we already have issues like: https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/xq0H3zlzz2Y

and here is another one:
https://xpack.github.io/blog/2022/08/30/riscv-none-elf-gcc-v12-2-0-1-released/

-mcmodel=medany
The libraries are compiled with -O2 -mcmodel=medany. The nano version is compiled with -Os -mcmodel=medany.
Important: It is mandatory for the applications to be compiled with -mcmodel=medany, otherwise the link might fail.

We're not able to use medlow because the standard library, which is tightly coupled to the compiler, was compiled using medany.

Ideally we should select LUI rather than AUIPC when possible, but current toolchain implementations (true for both LLVM and GCC) have separate compilation and linking phases. We don't know whether we can use LUI or AUIPC, because we don't have the symbol address during the compilation stage. That's why we still need the -mcmodel option and let the user decide.

But can't the linker optimize auipc into lui, as it's already doing a lot of similar optimization steps?

@jrtc27
Collaborator

jrtc27 commented Sep 9, 2022

Because of this division we already have issues like: https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/xq0H3zlzz2Y

and here is another one: https://xpack.github.io/blog/2022/08/30/riscv-none-elf-gcc-v12-2-0-1-released/

-mcmodel=medany
The libraries are compiled with -O2 -mcmodel=medany. The nano version is compiled with -Os -mcmodel=medany.
Important: It is mandatory for the applications to be compiled with -mcmodel=medany, otherwise the link might fail.

We're not able to use medlow because the standard library, which is tightly coupled to the compiler, was compiled using medany.

You can mix medlow and medany code just fine. The only problem is if you try to link medlow code at an address that isn't in (-2 GiB, +2 GiB), which is true irrespective of whether you're mixing medany code with it (medany code can support that just fine). I don't understand this note; it only makes sense if they have a linker script that sets the base address on RV64 outside that range, but that's impossible on RV32. The whole point of compiling library code as medany is to make it usable in more cases.
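For reference, the (-2 GiB, +2 GiB) window follows from the immediate widths. On RV64, `lui` sign-extends its 20-bit immediate after shifting it left by 12, and `addi` adds a sign-extended 12-bit immediate. An absolute lui+addi (medlow) pair therefore reaches exactly:

$$
\begin{aligned}
\mathit{hi} \in [-2^{19},\, 2^{19}-1] &\;\Rightarrow\; 2^{12}\,\mathit{hi} \in [-2^{31},\, 2^{31}-2^{12}] \\
\mathit{lo} \in [-2^{11},\, 2^{11}-1] &\;\Rightarrow\; 2^{12}\,\mathit{hi} + \mathit{lo} \in [-2^{31}-2^{11},\, 2^{31}-2^{11}-1]
\end{aligned}
$$

An auipc+addi (medany) pair covers the same window relative to the PC. On RV32 the sum wraps modulo 2^32, so lui+addi reaches every address, and no base address is out of range there.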

@jnk0le

jnk0le commented Sep 9, 2022

You can mix medlow and medany code just fine. The only problem is if you try to link medlow code at an address that isn't in (-2 GiB, +2 GiB), which is true irrespective of whether you're mixing medany code with it (medany code can support that just fine). I don't understand this note; it only makes sense if they have a linker script that sets the base address on RV64 outside that range, but that's impossible on RV32. The whole point of compiling library code as medany is to make it usable in more cases.

So on RV32 this turns out to be just a minor missed optimization in library code, if those auipcs don't get turned into luis by the linker.
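Here is a Python sketch of that rewrite for RV32. It assumes a static (non-PIE) link in which the linker knows the PC of each `auipc`, and it handles only an adjacent `auipc`/`addi` pair whose immediates are already resolved. Because RV32 address arithmetic wraps modulo 2^32, an absolute lui+addi pair can reach any address, so this same-size rewrite never needs a range check. `auipc_addi_to_lui_addi` is a hypothetical helper.

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF


def sext(value: int, bits: int) -> int:
    """Sign-extend the low `bits` bits of value."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def auipc_addi_to_lui_addi(pc: int, auipc_imm: int, addi_imm: int) -> Tuple[int, int]:
    """Turn `auipc rd, auipc_imm; addi rd, rd, addi_imm` located at `pc`
    into the immediates of an equivalent `lui rd, hi; addi rd, rd, lo`.

    auipc_imm is the raw 20-bit field; addi_imm is the signed 12-bit value.
    """
    target = (pc + (sext(auipc_imm, 20) << 12) + addi_imm) & MASK32
    lo = sext(target, 12)                  # low 12 bits, sign-extended
    hi = ((target - lo) >> 12) & 0xFFFFF   # rounds up when lo is negative
    return hi, lo
```

For example, the pair at `pc = 0x80001000` with `auipc_imm = 0x1` and `addi_imm = 0x10` computes `0x80002010`, which becomes `lui rd, 0x80002` followed by `addi rd, rd, 16`.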

5 participants