Skip to content

[SYCL][Docs] Add SYCLBIN feature and format design document #16872

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 30 commits into from
Apr 9, 2025

Conversation

steffenlarsen
Copy link
Contributor

@steffenlarsen steffenlarsen commented Feb 3, 2025

This commit adds a design document detailing the SYCLBIN binary format for representing SYCL device kernel binaries to be loaded dynamically at runtime. Additionally, the design document details how this is to be handled by the SYCL runtime, driver and clang tooling.

As the design of SYCLBIN files relies heavily on the property sets, this PR also adds documentation to the existing property set functionality.

This commit adds a design document detailing the SYCLBIN binary format
for representing SYCL device kernel binaries to be loaded dynamically
at runtime. Additionally, the design document details how this is to be
handled by the SYCL runtime, driver and clang tooling.

Signed-off-by: Larsen, Steffen <[email protected]>
Copy link
Contributor

@bader bader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SYCL design documentation predominantly uses Markdown format. Please, convert the document to Markdown format.

Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
@steffenlarsen
Copy link
Contributor Author

SYCL design documentation predominantly uses Markdown format. Please, convert the document to Markdown format.

Apologies! Old habits die hard. It should be good now.

Fix xrefs
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Comment on lines 117 to 118
*TODO:* Do we need a target-specific blob inside this structure? E.g. for CUDA
we may want to embed the SM version.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are talking about a specific CUDA property, is it still IR module? I suppose that I don't understand well enough what is PTX and what is its place in a toolchain.

Target-specific for me means "native", i.e. as if PTX is incorrectly assumed as IR module. Also, all IR modules are expected to share the same properties within an abstract module, right? If so, then maybe we should propagate that property up to the abstract module level and have PTX modules compiled for different SM versions as separate abstract modules?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are talking about a specific CUDA property, is it still IR module?

Because of the forward-compatibility of SM archs, my understanding is that we want PTX to be considered an IR type.

Target-specific for me means "native", i.e. as if PTX is incorrectly assumed as IR module. Also, all IR modules are expected to share the same properties within an abstract module, right? If so, then maybe we should propagate that property up to the abstract module level and have PTX modules compiled for different SM versions as separate abstract modules?

If we were to put the SM architecture information at abstract module level, I don't see how an abstract module would ever have more than one IR module and more than one native device code image. Granted, having the exact same properties is somewhat rare, but I would expect it to be the case if the user was to compile for multiple SM versions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it seems like we would probably want to annotate the IR module with the CUDA virtual architecture in the case when the IR is PTX. I was thinking that we would use the IR-level metadata for stuff like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a "target" field to the IR module metadata which contains the value of -fsycl-targets used for the given module, similar to "arch" in native device code images, with the exception that this key may be missing from the metadata in the case that the option wasn't specified, in which case the IR type should be enough to infer from.

Comment on lines 135 to 136
module AOT compiled for a specific device, identified by the architecture
string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we need to specify what is the architecture string here.

Is it target triple? Is it value passed to -fsycl-targets? Is it value from architecture enum from our device architecture extension?

It is not clear how RT can use this field without such specifiation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is actually a good question. I am not sure what the architecture string would be for cases like SASS binaries. For example, lets say we've compiled to PTX through our compiler, then load that to a kernel-bundle, compile that kernel bundle to native device code and then serialize that to SYCLBIN. The -fsycl-targets would not be enough to express the architecture here, I believe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your example, the application has compiled the PTX to native code. Wouldn't you know the native architecture when this happens? It seems like the set of possible native CUDA architectures is a fixed set which would each map to one of the -fsycl-targets values.

I wonder if there is a reason to use a string for the architecture names. Why couldn't this be an enumeration? We use an enumeration for the device architectures in sycl_ext_oneapi_device_architecture.

In any case, I agree with @AlexeySachkov. I think the set of possible architectures should be specified in the file format.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your example, the application has compiled the PTX to native code. Wouldn't you know the native architecture when this happens? It seems like the set of possible native CUDA architectures is a fixed set which would each map to one of the -fsycl-targets values.

I will have to do some research here. I know PTX can be associated with SM architectures, but I don't know if the same applies to the native device code produced from PTX. It may be device-specific and as such more strict than the SM version.

I wonder if there is a reason to use a string for the architecture names. Why couldn't this be an enumeration? We use an enumeration for the device architectures in sycl_ext_oneapi_device_architecture.

Since the compiler will need to know about these architectures too, I am reluctant to try and match enum values between the runtime and library for this purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've specified that the "arch" value is the string that corresponds to the -fsycl-targets value used for the binary. The runtime should generally be able to convert that to the enums in sycl_ext_oneapi_device_architecture.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've specified that the "arch" value is the string that corresponds to the -fsycl-targets value used for the binary. The runtime should generally be able to convert that to the enums in sycl_ext_oneapi_device_architecture.

The approach works for me, but I feel like the clarification is actually missing from the doc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I've added the reference to the SYCL extension with a note about the runtime attempting to make the appropriate mapping.

Comment on lines +200 to +201
directly, instead of extracting it from a host binary. This should be done when
a new flag, `--syclbin`, is passed. In this case, the clang-linker-wrapper is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, .syclbin files cannot be used if the output is not .syclbin, right?

I'm not sure if I have a use case for that, just wanted to double-check the intent.

A potential use-case, though, is ability to embed .syclbin into an application as if that device code was originally compiled as part of the application. I.e. you had your dynamically loadable .syclbin, but at some point decided to embed it and stop shipping it separately. But that will have some implications on the API, I assume: we need to design then how to use such embedded SYCLBIN.

Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Comment on lines 117 to 118
*TODO:* Do we need a target-specific blob inside this structure? E.g. for CUDA
we may want to embed the SM version.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it seems like we would probably want to annotate the IR module with the CUDA virtual architecture in the case when the IR is PTX. I was thinking that we would use the IR-level metadata for stuff like this.

Comment on lines 135 to 136
module AOT compiled for a specific device, identified by the architecture
string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your example, the application has compiled the PTX to native code. Wouldn't you know the native architecture when this happens? It seems like the set of possible native CUDA architectures is a fixed set which would each map to one of the -fsycl-targets values.

I wonder if there is a reason to use a string for the architecture names. Why couldn't this be an enumeration? We use an enumeration for the device architectures in sycl_ext_oneapi_device_architecture.

In any case, I agree with @AlexeySachkov. I think the set of possible architectures should be specified in the file format.

steffenlarsen and others added 4 commits February 6, 2025 07:54

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Co-authored-by: Michael Toguchi <[email protected]>

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Co-authored-by: Greg Lueck <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
@steffenlarsen
Copy link
Contributor Author

@AlexeySachkov @asudarsa @mdtoguchi - I believe I've addressed the open comments. A re-review would be much appreciated. 😄

Signed-off-by: Larsen, Steffen <[email protected]>
@steffenlarsen
Copy link
Contributor Author

@intel/dpcpp-doc-reviewers @AlexeySachkov @asudarsa - Friendly ping.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Copy link
Contributor

@AlexeySachkov AlexeySachkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few final comments here and there, but nothing major, I believe

Comment on lines +116 to +117
An abstract module metadata entry contains any number of property sets, as
described in [PropertySets.md](PropertySets.md), excluding:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is a problem, but what if we add a new property there which cannot be applied to an abstract module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think adding new cases should be a problem. It is mostly to say that these property sets have no meaning for the abstract module metadata block. I don't think it's worth checking whether they are there or not, over just skipping them.

Comment on lines 135 to 136
module AOT compiled for a specific device, identified by the architecture
string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've specified that the "arch" value is the string that corresponds to the -fsycl-targets value used for the binary. The runtime should generally be able to convert that to the enums in sycl_ext_oneapi_device_architecture.

The approach works for me, but I feel like the clarification is actually missing from the doc.

`-fsycl` pipeline, instead passing the output of the clang-offload-packager
invocation to clang-linker-wrapper together with the new `--syclbin` flag.

Setting this option will imply `-fsycl` and override `-fsycl-device-only`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please clarify what do you mean by "override -fsycl-device-only"? Isn't skipping host compilation implies -fsycl-device-only?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose so. To some extend I think the terminology overlaps a little. They are not really mutually exclusive, it is just "-fsycl-device-only with more". I'll change it to just "implies". @mdtoguchi - Is there a precedence for the terminology used here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the intention of the wording here is that use of -fsyclbin will only produce the SYCLBIN file. Expectation of using -fsycl-device-only is produce just the device binary. If they are used together, -fsycl-device-only is overridden from a driver function perspective.

steffenlarsen and others added 2 commits March 27, 2025 15:39

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Co-authored-by: Alexey Sachkov <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
`-fsycl` pipeline, instead passing the output of the clang-offload-packager
invocation to clang-linker-wrapper together with the new `--syclbin` flag.

Setting this option implies `-fsycl` and `-fsycl-device-only`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I see this written, what is the expectation of behavior when -fsyclbin -fsycl-device-only is used on the command line? The natural behavior here is for -fsycl-device-only to be seen and taken advantage of, which will create the device only file and stop. -fsyclbin does additional work after the device compilation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. Maybe it would be better to issue a diagnostic if they are used together? Seems like they may not be totally related, despite there being some conceptual overlap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the standard 'unused argument' diagnostic should be sufficient when used together:

> clang++ --offload-new-driver -fsycl-device-only -fsyclbin ~/a.cpp
clang++: warning: argument unused during compilation: '-fsyclbin' [-Wunused-command-line-argument]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I've reworded this part to explicitly say that the -fsyclbin will be unused.


## clang-linker-wrapper changes

The clang-linker-wrapper is responsible for doing post-processing and linking of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is meant by post-processing here? Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Module-splitting and metadata analysis/extraction mainly. I've expanded it a bit.

SYCLBIN files are linked together is yet to be specified.


## clang-linker-wrapper changes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are in the process of moving most of the SYCL specific functionality from clang-linker-wrapper into a new tool called clang-sycl-linker. So, this documentation will need to be updated based on that. For the purposes of this PR, we can use clang-linker-wrapper.
Just heads up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads up! Would it make sense to change it now? From a documentation POV, is it as simple as a search-and-replace or is there an important semantic difference between the tools?

Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Co-authored-by: Michael Toguchi <[email protected]>
the SYCL runtime library handles files of this format.


## SYCLBIN binary format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing and debugging would require some capabilities of searching/extracting information from files of this format.
It would require us to provide such tools/utilities.
I think that making this format based on ELF would allows us to reuse some available utilities like standard gnu packages (readelf, objdump and i.e) and llvm utilities (llvm-objdump and rich LLVM library). LLVM library's support of ELF could be reused in the development as well.

Custom binary format requires custom support in many ways that significantly burdens the development and maintaining. I think we should strive to the generic Offloading Format from LLVM as much as we can.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree with the general sentiment and I am open to the idea of reusing ELF as the format. However, I don't see how this format fits with ELF, as we would just be fitting bogus into a lot of the pre-defined ELF headers and sections. SYCLBIN is not an executable format per-se and to do appropriate linking the linker will have to consider the binary metadata too, which we would have to retrofit into some text section of the ELF file.

For tooling, I could see it, but are there any other tools than llvm-objdump (and readelf and objdump) that we would get "for free" if we used ELF? Even if we do, would users be able to get much out of the metadata without us adding additional functionality to these tool or entirely new tooling? When you mention "rich LLVM library" could you be more specific?

I've previously tried and failed to fit the SYCLBIN format into the existing ELF format in a way that isn't just what the current design is but separated into best-effort chunks in the ELF format, so please, if you have a suggestion of how to structure the format based off ELF, please do explain your thoughts.

@@ -0,0 +1,296 @@
# SYCL binary property sets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is our fault of not properly communicating the status of upstreaming of PropertySets.

There has been an attempt of doing that: llvm/llvm-project#110771
And there is comment/request to use text format (JSON or YAML):
llvm/llvm-project#110771 (comment)

In overall, we doubt that we able to get PropertySets in llvm-project as it is. At the moment, Justin is experimenting with text formats to add this into intel/llvm. Therefore, it is not provident to make the old binary format of PropertySet the part of a new ABI.

In these document we could describe properties in an abstract way as a collection of string keys and int/string values without specifying the exact binary format.

In the syclbin format we could add some field specifying the actual format like:

struct Properties {
  std::byte format[8];  // "json", "yaml", "prop" (old binary properties) and i.e.
  uint32_t size;
  std::byte encoded_data[size];
}

I don't insist on this concrete structure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think switching the format would be a problem. SYCLBIN just embeds the same properties as the compiler generates. This document is not intended to document a constituent of SYCLBIN but rather the current design of the property sets, which were not well documented. If we change to some other properties format, it should be straight forward to just switch over to embedding that instead, as they are simply encoded into the byte tables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I mean that a documentation format similar to AMD's (link) table format wouldn't require an additional work in the future.

@steffenlarsen
Copy link
Contributor Author

The switch to ELF-based format was discussed offline and we decided to go with the current design for now to enable the development of the feature and then consider the upstreamability in parallel.

@steffenlarsen steffenlarsen requested a review from maksimsab April 8, 2025 10:13
Copy link
Contributor

@maksimsab maksimsab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

PropertySets could be transformed to table later.

@steffenlarsen steffenlarsen merged commit c3601c2 into intel:sycl Apr 9, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants