-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft for (complex) hash labeling scheme #151
Comments
I have two questions about using function signatures as landing label on function prologue.
|
A branch within a function is not expose to other module, so IMO that could be implementation defined.
Generally those function are using |
I would prefer not to add more uses of MD5, even if in this application there aren't any actual security concerns. Since xxHash is also readily available, why not use that instead?
This name mangling scheme matches LLVM's existing forward-edge CFI implementations and I agree that it's a reasonable choice.
If you mean using the same label for the function entry and all the indirect branch targets in the function, I don't think that's acceptable. We should ensure that indirect calls are allowed only to function entry points.
I would suggest at least providing a recommendation for the indirect branch labeling scheme as well, so we won't end up with implementations that accidentally make CFI weaker. |
Honestly I am not security guy, so that's my first time to heard xxHash, and I am open mind for other algorithm :)
Yeah, sounds good idea. |
xxHash is not a cryptographic hash. xxHash is faster to compute. MD5 is also used by LLVM for CFI. For this purpose I would consider MD5 and xxHash to be equivalent. For a branch within a function we could pick the convention of using the :<source_line_number>. That would restrict its scope. |
The original CFI implementation in Clang uses MD5, but the newer ones ( https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CodeGenFunction.cpp#L572
Agreed, that's better. Ideally each branch target should have a unique label, so I would recommend including the file path in addition to the line number, for example. |
I suspect encoding line number may not work for all situation, one example is GNU C extension support "Labels as Values", but I do agree that should have different label than function, so I think we may add some note about that "NOTE: Targets of local indirect jumps that are not used for jump tables should have a landing pad. Additionally, the labeling value for these targets is recommended to be different from the function label value." void *ptr;
if(...)
ptr = &&foo;
else
ptr = &&bar;
/* ... */
goto *ptr;
foo:
/* ... */
bar:
/* ... */ [1] https://stackoverflow.com/questions/56316820/what-is-an-indirect-goto-statement |
Good point. Basically, all address-taken labels in a function must have the same landing pad label. We don't have to tie the label to the line number if that's confusing. It's enough that the label is unique within the program. E.g.,
Unless the compiler uses a software-guarded branch, the targets must have a landing pad. Do we want to include a recommendation for when it's safe to use a software-guarded branch?
This doesn't sound strong enough, because setting all indirect branch target labels to the same value would be enough in this case. How about recommending that each indirect branch target must have a landing pad label that's unique within the program, with the exception that all address-taken labels in the same function must have the same landing pad label? |
what about the catch block which has a landing pad used by the runtime? |
Please correct me if I'm wrong, but shouldn't C++ exceptions behave similarly to a function return, which means the exception landing pad won't need an |
My impression is they use forward indirect jump to restore context instead of ret. This might be wrong though.. this vary by the implementation.. and for GCC and llvm it's just a ret. |
It depends on the architecture, but the RISC-V implementation in LLVM's libunwind seems to use https://github.com/llvm/llvm-project/blob/main/libunwind/src/UnwindRegistersRestore.S#L1159 Are you aware of other implementations that behave differently? |
You are right, it depends on the implementation. nongnu libunwind seems to use jr with non ra reg and GCC uses __builtin_eh_return which should be a ret. With that said, it should be possible to assign a landingpad for it? Because that ret cannot be enforced by sspop since the address is not in stack anyway.. |
Thanks for looking into this. It sounds like we should either require compatible implementations to use a return, or prepare a recommendation for exception handler landing pad labeling as well since there are implementations that use indirect jumps with a non-
That's no different from |
I think on x86 CET compiler emits an |
I have a question about the |
The disassembly should only show the number. Since the label is generated by hashing the string and then truncating the hash to 20 bits it will not be possible to revere the process and regenerate the string from the truncated hash. |
It's technical possible
|
I am not sure its necessary. If we do add it we may want it to be strippable. |
I think if we're considering the necessity, this disassembling display rule can be considered "unnecessary", since it does not affect runtime behavior. If we leave this part unspecified in the spec, I believe the mainstream toolchains would choose their favorable formats when the needs arise, and other late comers may just follow suit. This may lead to some interoperabilty issue in disassemblers, but it should not be a big problem, because they can/should always fallback to number when the string is unavailable. |
Question: It looks like that |
From my understanding it seems that recursive encoding will lead to finer label distribution. |
We should mangle the entire possibly qualified function type, not just the arguments and returns. Providing the unhashed type in strippable debug information for disassembly sounds like a useful feature. I don't know enough about debug information to say whether that should be a symbol or something in DWARF.
The landing pad approach defines a label (singular, since there is no meaningful type information) to use for The shadow stack approach requires adding additional return addresses to the shadow stack, and checking in
|
Also, we may have to consider:
|
Signature is inferred at the call site on the basis of the promoted actual arguments. This is a general phenomenon; LLVM doesn't even have a concept of calling a function without a signature.
Represented in the type system and handled fine with a mangled function type, although not with the original proposal.
These can only be indirectly called through a pointer to member function. Use the mangling rules for those. |
--> I later found out this is mostly like an undefined behavior in the C standard, so I guess we may be able to rule this case out, but an interesting exception in the (C11 draft) standard caught my attention:
Does this mean that if we distinguish signed and unsigned integer types in label calculation, some weird cases can be made? |
Zicfilp has provided two labeling schemes: simple and complex (also known as function signature-based). The simple scheme uses an lpad with a constant 0, which does not require any hashing mechanism. In contrast, the complex labeling scheme computes the MD5 hash from the signature string. Filling up an MD5 hash value is straightforward for compilers, but it is non-trivial work for humans to maintain. Therefore, we have added new assembler modifiers to compute this value. See also: riscv/riscv-cfi#151
Spend time on reading C++ Itanium ABI again, and I tend to agree with you that not make our own list syntax if possible...we don't really want to deal with those edge/complex casees in C++. I am gonna to moving this forward recently since the simple labeling scheme PoC is pass most of GCC testsuite :) |
psABI: riscv-non-isa/riscv-elf-psabi-doc#434 |
Closing this issue as discussion and development is continuing in the psabi and asm manual PR. |
Dump some content from my offlist discussion with @ved-rivos :P
Introduce one more function in asm syntax:
%lpad_label("<signature-string>")
.e.g.
lpad %lpad_label("v(i,i)")
,lui t2, %lpad_label("v(i,i)")
forvoid foo(int, int)
The signature string sytax:
TYPE-NAME is using the mangling name in itanium c++ abi[1] and RISC-V psABI[2].
[1] https://itanium-cxx-abi.github.io/cxx-abi/abi.html#mangling-builtin
[2] https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-elf.adoc#c-name-mangling
The value is the hash result of the signature string, we may start from md5.
void foo(int, int)
=%lpad_label("v(i,i)")
=low20(md5("v(i,i)"))
=low20(3a7a68c073dcdbd6ee282a598c2dc44e)
= 0xdc44eWhy
md5
: no strong reason for using md5, but one reason is it's available on binutils and llvm already.Why using mangling name defined in itanium c++ abi: Prevent us to define complicate rule for mangling name.
More example:
void bar()
:v()
void *memset(void *, int, size_t)
:Pv(Pv,i,m)
The text was updated successfully, but these errors were encountered: