-
Notifications
You must be signed in to change notification settings - Fork 768
[SYCL][Docs] Add SYCLBIN feature and format design document #16872
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL][Docs] Add SYCLBIN feature and format design document #16872
Conversation
This commit adds a design document detailing the SYCLBIN binary format for representing SYCL device kernel binaries to be loaded dynamically at runtime. Additionally, the design document details how this is to be handled by the SYCL runtime, driver and clang tooling. Signed-off-by: Larsen, Steffen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SYCL design documentation predominantly uses Markdown format. Please, convert the document to Markdown format.
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Apologies! Old habits die hard. It should be good now. |
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
sycl/doc/design/SYCLBINDesign.md
Outdated
*TODO:* Do we need a target-specific blob inside this structure? E.g. for CUDA | ||
we may want to embed the SM version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are talking about a specific CUDA property, is it still IR module? I suppose that I don't understand well enough what is PTX and what is its place in a toolchain.
Target-specific for me means "native", i.e. as if PTX is incorrectly assumed as IR module. Also, all IR modules are expected to share the same properties within an abstract module, right? If so, then maybe we should propagate that property up to the abstract module level and have PTX modules compiled for different SM versions as separate abstract modules?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are talking about a specific CUDA property, is it still IR module?
Because of the forward-compatibility of SM archs, my understanding is that we want PTX to be considered an IR type.
Target-specific for me means "native", i.e. as if PTX is incorrectly assumed as IR module. Also, all IR modules are expected to share the same properties within an abstract module, right? If so, then maybe we should propagate that property up to the abstract module level and have PTX modules compiled for different SM versions as separate abstract modules?
If we were to put the SM architecture information at abstract module level, I don't see how an abstract module would ever have more than one IR module and more than one native device code image. Granted, having the exact same properties is somewhat rare, but I would expect it to be the case if the user was to compile for multiple SM versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it seems like we would probably want to annotate the IR module with the CUDA virtual architecture in the case when the IR is PTX. I was thinking that we would use the IR-level metadata for stuff like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a "target" field to the IR module metadata which contains the value of -fsycl-targets
used for the given module, similar to "arch" in native device code images, with the exception that this key may be missing from the metadata in the case that the option wasn't specified, in which case the IR type should be enough to infer from.
sycl/doc/design/SYCLBINDesign.md
Outdated
module AOT compiled for a specific device, identified by the architecture | ||
string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we need to specify what is the architecture string here.
Is it target triple? Is it value passed to -fsycl-targets
? Is it value from architecture
enum from our device architecture extension?
It is not clear how RT can use this field without such specifiation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is actually a good question. I am not sure what the architecture string would be for cases like SASS binaries. For example, lets say we've compiled to PTX through our compiler, then load that to a kernel-bundle, compile that kernel bundle to native device code and then serialize that to SYCLBIN. The -fsycl-targets
would not be enough to express the architecture here, I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In your example, the application has compiled the PTX to native code. Wouldn't you know the native architecture when this happens? It seems like the set of possible native CUDA architectures is a fixed set which would each map to one of the -fsycl-targets
values.
I wonder if there is a reason to use a string for the architecture names. Why couldn't this be an enumeration? We use an enumeration for the device architectures in sycl_ext_oneapi_device_architecture.
In any case, I agree with @AlexeySachkov. I think the set of possible architectures should be specified in the file format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In your example, the application has compiled the PTX to native code. Wouldn't you know the native architecture when this happens? It seems like the set of possible native CUDA architectures is a fixed set which would each map to one of the
-fsycl-targets
values.
I will have to do some research here. I know PTX can be associated with SM architectures, but I don't know if the same applies to the native device code produced from PTX. It may be device-specific and as such more strict than the SM version.
I wonder if there is a reason to use a string for the architecture names. Why couldn't this be an enumeration? We use an enumeration for the device architectures in sycl_ext_oneapi_device_architecture.
Since the compiler will need to know about these architectures too, I am reluctant to try and match enum values between the runtime and library for this purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've specified that the "arch" value is the string that corresponds to the -fsycl-targets
value used for the binary. The runtime should generally be able to convert that to the enums in sycl_ext_oneapi_device_architecture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've specified that the "arch" value is the string that corresponds to the
-fsycl-targets
value used for the binary. The runtime should generally be able to convert that to the enums in sycl_ext_oneapi_device_architecture.
The approach works for me, but I feel like the clarification is actually missing from the doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I've added the reference to the SYCL extension with a note about the runtime attempting to make the appropriate mapping.
directly, instead of extracting it from a host binary. This should be done when | ||
a new flag, `--syclbin`, is passed. In this case, the clang-linker-wrapper is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, .syclbin
files cannot be used if the output is not .syclbin
, right?
I'm not sure if I have a use case for that, just wanted to double-check the intent.
A potential use-case, though, is ability to embed .syclbin
into an application as if that device code was originally compiled as part of the application. I.e. you had your dynamically loadable .syclbin
, but at some point decided to embed it and stop shipping it separately. But that will have some implications on the API, I assume: we need to design then how to use such embedded SYCLBIN.
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
sycl/doc/design/SYCLBINDesign.md
Outdated
*TODO:* Do we need a target-specific blob inside this structure? E.g. for CUDA | ||
we may want to embed the SM version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it seems like we would probably want to annotate the IR module with the CUDA virtual architecture in the case when the IR is PTX. I was thinking that we would use the IR-level metadata for stuff like this.
sycl/doc/design/SYCLBINDesign.md
Outdated
module AOT compiled for a specific device, identified by the architecture | ||
string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In your example, the application has compiled the PTX to native code. Wouldn't you know the native architecture when this happens? It seems like the set of possible native CUDA architectures is a fixed set which would each map to one of the -fsycl-targets
values.
I wonder if there is a reason to use a string for the architecture names. Why couldn't this be an enumeration? We use an enumeration for the device architectures in sycl_ext_oneapi_device_architecture.
In any case, I agree with @AlexeySachkov. I think the set of possible architectures should be specified in the file format.
Co-authored-by: Michael Toguchi <[email protected]>
Co-authored-by: Greg Lueck <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
@AlexeySachkov @asudarsa @mdtoguchi - I believe I've addressed the open comments. A re-review would be much appreciated. 😄 |
Signed-off-by: Larsen, Steffen <[email protected]>
@intel/dpcpp-doc-reviewers @AlexeySachkov @asudarsa - Friendly ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few final comments here and there, but nothing major, I believe
An abstract module metadata entry contains any number of property sets, as | ||
described in [PropertySets.md](PropertySets.md), excluding: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is a problem, but what if we add a new property there which cannot be applied to an abstract module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think adding new cases should be a problem. It is mostly to say that these property sets have no meaning for the abstract module metadata block. I don't think it's worth checking whether they are there or not, over just skipping them.
sycl/doc/design/SYCLBINDesign.md
Outdated
module AOT compiled for a specific device, identified by the architecture | ||
string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've specified that the "arch" value is the string that corresponds to the
-fsycl-targets
value used for the binary. The runtime should generally be able to convert that to the enums in sycl_ext_oneapi_device_architecture.
The approach works for me, but I feel like the clarification is actually missing from the doc.
sycl/doc/design/SYCLBINDesign.md
Outdated
`-fsycl` pipeline, instead passing the output of the clang-offload-packager | ||
invocation to clang-linker-wrapper together with the new `--syclbin` flag. | ||
|
||
Setting this option will imply `-fsycl` and override `-fsycl-device-only`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please clarify what do you mean by "override -fsycl-device-only
"? Isn't skipping host compilation implies -fsycl-device-only
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose so. To some extend I think the terminology overlaps a little. They are not really mutually exclusive, it is just "-fsycl-device-only
with more". I'll change it to just "implies". @mdtoguchi - Is there a precedence for the terminology used here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the intention of the wording here is that use of -fsyclbin
will only produce the SYCLBIN file. Expectation of using -fsycl-device-only
is produce just the device binary. If they are used together, -fsycl-device-only
is overridden from a driver function perspective.
Co-authored-by: Alexey Sachkov <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
sycl/doc/design/SYCLBINDesign.md
Outdated
`-fsycl` pipeline, instead passing the output of the clang-offload-packager | ||
invocation to clang-linker-wrapper together with the new `--syclbin` flag. | ||
|
||
Setting this option implies `-fsycl` and `-fsycl-device-only`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I see this written, what is the expectation of behavior when -fsyclbin -fsycl-device-only
is used on the command line? The natural behavior here is for -fsycl-device-only
to be seen and taken advantage of, which will create the device only file and stop. -fsyclbin
does additional work after the device compilation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. Maybe it would be better to issue a diagnostic if they are used together? Seems like they may not be totally related, despite there being some conceptual overlap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the standard 'unused argument' diagnostic should be sufficient when used together:
> clang++ --offload-new-driver -fsycl-device-only -fsyclbin ~/a.cpp
clang++: warning: argument unused during compilation: '-fsyclbin' [-Wunused-command-line-argument]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I've reworded this part to explicitly say that the -fsyclbin
will be unused.
sycl/doc/design/SYCLBINDesign.md
Outdated
|
||
## clang-linker-wrapper changes | ||
|
||
The clang-linker-wrapper is responsible for doing post-processing and linking of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is meant by post-processing here? Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Module-splitting and metadata analysis/extraction mainly. I've expanded it a bit.
SYCLBIN files are linked together is yet to be specified. | ||
|
||
|
||
## clang-linker-wrapper changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are in the process of moving most of the SYCL specific functionality from clang-linker-wrapper into a new tool called clang-sycl-linker. So, this documentation will need to be updated based on that. For the purposes of this PR, we can use clang-linker-wrapper.
Just heads up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the heads up! Would it make sense to change it now? From a documentation POV, is it as simple as a search-and-replace or is there an important semantic difference between the tools?
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Co-authored-by: Michael Toguchi <[email protected]>
the SYCL runtime library handles files of this format. | ||
|
||
|
||
## SYCLBIN binary format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing and debugging would require some capabilities of searching/extracting information from files of this format.
It would require us to provide such tools/utilities.
I think that making this format based on ELF would allows us to reuse some available utilities like standard gnu packages (readelf, objdump and i.e) and llvm utilities (llvm-objdump and rich LLVM library). LLVM library's support of ELF could be reused in the development as well.
Custom binary format requires custom support in many ways that significantly burdens the development and maintaining. I think we should strive to the generic Offloading Format from LLVM as much as we can.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do agree with the general sentiment and I am open to the idea of reusing ELF as the format. However, I don't see how this format fits with ELF, as we would just be fitting bogus into a lot of the pre-defined ELF headers and sections. SYCLBIN is not an executable format per-se and to do appropriate linking the linker will have to consider the binary metadata too, which we would have to retrofit into some text section of the ELF file.
For tooling, I could see it, but are there any other tools than llvm-objdump
(and readelf
and objdump
) that we would get "for free" if we used ELF? Even if we do, would users be able to get much out of the metadata without us adding additional functionality to these tool or entirely new tooling? When you mention "rich LLVM library" could you be more specific?
I've previously tried and failed to fit the SYCLBIN format into the existing ELF format in a way that isn't just what the current design is but separated into best-effort chunks in the ELF format, so please, if you have a suggestion of how to structure the format based off ELF, please do explain your thoughts.
@@ -0,0 +1,296 @@ | |||
# SYCL binary property sets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is our fault of not properly communicating the status of upstreaming of PropertySets.
There has been an attempt of doing that: llvm/llvm-project#110771
And there is comment/request to use text format (JSON or YAML):
llvm/llvm-project#110771 (comment)
In overall, we doubt that we able to get PropertySets in llvm-project as it is. At the moment, Justin is experimenting with text formats to add this into intel/llvm. Therefore, it is not provident to make the old binary format of PropertySet the part of a new ABI.
In these document we could describe properties in an abstract way as a collection of string keys and int/string values without specifying the exact binary format.
In the syclbin
format we could add some field specifying the actual format like:
struct Properties {
std::byte format[8]; // "json", "yaml", "prop" (old binary properties) and i.e.
uint32_t size;
std::byte encoded_data[size];
}
I don't insist on this concrete structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think switching the format would be a problem. SYCLBIN just embeds the same properties as the compiler generates. This document is not intended to document a constituent of SYCLBIN but rather the current design of the property sets, which were not well documented. If we change to some other properties format, it should be straight forward to just switch over to embedding that instead, as they are simply encoded into the byte tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I mean that a documentation format similar to AMD's (link) table format wouldn't require an additional work in the future.
The switch to ELF-based format was discussed offline and we decided to go with the current design for now to enable the development of the feature and then consider the upstreamability in parallel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
PropertySets could be transformed to table later.
This commit adds a design document detailing the SYCLBIN binary format for representing SYCL device kernel binaries to be loaded dynamically at runtime. Additionally, the design document details how this is to be handled by the SYCL runtime, driver and clang tooling.
As the design of SYCLBIN files relies heavily on the property sets, this PR also adds documentation to the existing property set functionality.