c.parallel should reuse CUB policies directly #3494

Open
griwes opened this issue Jan 22, 2025 · 5 comments
@griwes
Contributor

griwes commented Jan 22, 2025

Currently, c.parallel contains copies of some of the tuning parameters of CUB algorithms, and uses those copies. This is unsustainable; for simpler algorithms it doesn't take much code, but for, say, RadixSort, this would be a massive block of logic. Additionally, maintaining it in two places means we can create accidental differences between the two.

The code from CUB should be directly (albeit through nvrtc) reused in c.parallel.

@griwes griwes added the feature request New feature or request. label Jan 22, 2025
@griwes griwes self-assigned this Jan 22, 2025
@github-project-automation github-project-automation bot moved this to Todo in CCCL Jan 22, 2025
@griwes griwes moved this from Todo to In Progress in CCCL Jan 22, 2025
@leofang
Member

leofang commented Mar 12, 2025

Just thinking out loud... if the tuning infra already generates JSON files for the tuning output, could we, instead of hard-coding the tuning params as C++ code, consume the JSON files at compile time? Then the same JSON files could also be reused in Python?

@bernhardmgruber
Contributor

> if the tuning infra already generates json files for the tuning output

It creates an sqlite database, but we can convert the results to json.

> can we consume the json files at compile time?

While ingesting JSON into C++ during compilation is its own challenge, it would not work, because tuning analysis is still a manual process. Some judgement and experience are needed to select the "best" tuning. Also, the result of a tuning may apply to more compile-time types than the benchmark tested. For example, sorting int16 probably has the same performance characteristics as sorting uint16, bfloat16 and half, because the memory access is the same and the comparison cost is likely similar or not dominant. Conversely, reducing int16 may take a different code path than reducing half on different architectures, so tunings for int16 don't translate to other 16-bit types.
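The distinction above can be sketched in Python as two lookups: a sort tuning keyed only by the key size in bytes, and a reduce tuning keyed by the concrete type. This is a minimal sketch; every name and number in it (`SORT_TUNINGS_BY_SIZE`, `REDUCE_TUNINGS_BY_TYPE`, `REDUCE_FALLBACK`, and all policy values) is hypothetical and not taken from CUB.

```python
# Hypothetical illustration; none of these values are real CUB tunings.
TYPE_SIZES = {"int16": 2, "uint16": 2, "bfloat16": 2, "half": 2, "int32": 4}

# Sort: assume performance depends mostly on key size, so one benchmarked
# tuning (say, measured on int16) is shared by every 2-byte key type.
SORT_TUNINGS_BY_SIZE = {
    2: {"block_threads": 256, "items_per_thread": 16},
    4: {"block_threads": 256, "items_per_thread": 8},
}

# Reduce: assume int16 and half take different code paths, so a tuning
# measured on one type is not reused for the other.
REDUCE_TUNINGS_BY_TYPE = {
    "int16": {"block_threads": 256, "items_per_thread": 16},
}
REDUCE_FALLBACK = {"block_threads": 256, "items_per_thread": 4}


def sort_tuning(type_name):
    """Pick a sort tuning from the key size alone."""
    return SORT_TUNINGS_BY_SIZE[TYPE_SIZES[type_name]]


def reduce_tuning(type_name):
    """Pick a reduce tuning for the exact type, else a generic fallback."""
    return REDUCE_TUNINGS_BY_TYPE.get(type_name, REDUCE_FALLBACK)
```

Choosing which key a given algorithm should use (size, exact type, or something else) is the manual judgement that raw benchmark output does not encode.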

@leofang
Member

leofang commented Mar 12, 2025

> can we consume the json files at compile time?

> While ingesting JSON into C++ during compilation is its own challenge,

Yes, my assumption was that this is technically doable. (Perhaps we turn all tuning policy .cuh files into .cuh.in and use CMake's JSON/string processing capabilities.)

> it would not work, because tuning analysis is still a manual process. Some judgement and experience are needed to select the "best" tuning. Also, the result of a tuning may apply to more compile-time types than the benchmark tested. For example, sorting int16 probably has the same performance characteristics as sorting uint16, bfloat16 and half, because the memory access is the same and the comparison cost is likely similar or not dominant. Conversely, reducing int16 may take a different code path than reducing half on different architectures, so tunings for int16 don't translate to other 16-bit types.

I don't think we want to ship the whole database in either sqlite or json! We definitely need a highly pruned version of the JSON files (this is where the manual process enters), containing exactly the same information as the policy headers do today, no more, no less.

@jrhemstad
Collaborator

> Just thinking out loud... if the tuning infra already generates JSON files for the tuning output, could we, instead of hard-coding the tuning params as C++ code, consume the JSON files at compile time? Then the same JSON files could also be reused in Python?

Funny you mention this. @shwina and I were just talking about the same thing yesterday.

Basically, the question boils down to "What is the source of truth for tuning policies?"

Today, the source of truth is the C++ headers that implement the various structs that encode the tuning policies for each algorithm/arch. This historically has been fine, but now, with our more advanced tuning infrastructure and the need to use these tuning policies at runtime in cuda.parallel, it's worth considering alternative approaches for encoding the source of truth for the tuning policies.

One idea would be to define the source of truth directly as JSON files.

I think this could have a number of advantages:

  1. It could simplify generating new tuning policies from the tuning infrastructure because it just needs to output JSON strings. Whereas today, we have to manually take the output from the tuning infrastructure and turn that into the corresponding C++ code.
  2. It would significantly simplify the cuda.parallel use case as the tuning policies will be readily available in a form that is easy to consume at runtime.

CUB would still need the statically defined tuning policies as C++ structs, but for that we could write a simple codegen script that consumes the JSON strings and emits the C++ tuning policy structs that we have today.
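A codegen step of this kind could look roughly like the following Python sketch. The JSON layout, the field names (`algorithm`, `arch`, `block_threads`, `items_per_thread`), and the emitted struct name are all assumptions made for illustration; they do not reflect the actual output of the tuning infrastructure or the layout of CUB's policy headers.

```python
import json

# Hypothetical JSON layout; the field names are assumptions, not CUB's.
TUNING_JSON = """
{
  "algorithm": "reduce",
  "arch": 900,
  "block_threads": 256,
  "items_per_thread": 16
}
"""


def emit_policy_struct(json_text):
    """Turn one JSON tuning record into the text of a C++ policy struct."""
    tuning = json.loads(json_text)
    name = f"{tuning['algorithm']}_policy_sm{tuning['arch']}"
    lines = [
        f"struct {name}",
        "{",
        f"  static constexpr int block_threads = {tuning['block_threads']};",
        f"  static constexpr int items_per_thread = {tuning['items_per_thread']};",
        "};",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(emit_policy_struct(TUNING_JSON))
```

A flat record like this only captures one fixed tuning per algorithm and architecture; it has no way to express a policy that is computed from the input type.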

@jrhemstad
Collaborator

Thinking about this some more after discussing internally, I didn’t appreciate the degree to which our tuning policies aren’t just a finite lookup table.

A simple example is just to think about the items per thread that is used.

Even if we pretend this is just a function of the size of the input type, we’d have to define a lookup table for all possible input type sizes. But there’s an infinite number of input type sizes.

The more I think about it, the more I realize this is fundamentally the same problem as trying to build a library that preinstantiates CUB algorithms for all possible types. You can’t because there’s an infinite number of possible types.

Which is precisely why we need to use JIT for cuda.parallel. It also tells me that @griwes's approach of “JIT compiling” the tuning policies is probably the right approach.
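The difference between a finite lookup table and a policy computed from the type can be sketched as follows. The nominal value and the scaling rule are hypothetical, chosen only to illustrate the argument; they are not claimed to be what CUB uses for any algorithm.

```python
# Hypothetical values and scaling rule, for illustration only.
NOMINAL_4B_ITEMS_PER_THREAD = 16

# A finite table can only cover the sizes someone thought to list.
ITEMS_PER_THREAD_TABLE = {1: 16, 2: 16, 4: 16, 8: 8, 16: 4}


def items_per_thread(type_size_bytes):
    """Scale a nominal 4-byte tuning to any element size."""
    scaled = NOMINAL_4B_ITEMS_PER_THREAD * 4 // type_size_bytes
    return min(NOMINAL_4B_ITEMS_PER_THREAD, max(1, scaled))
```

The function agrees with the table on every listed size, but it also answers for a 24-byte user-defined struct, which the table cannot. In C++ such logic is evaluated at compile time for whatever type the user instantiates, so evaluating it for a type that is only known at runtime requires compiling it at that point.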
