c.parallel should reuse CUB policies directly #3494

griwes · 2025-01-22T21:19:54Z

Currently, c.parallel contains copies of some of the tuning parameters of CUB algorithms, and uses those copies. This is unsustainable; for simpler algorithms it doesn't take much code, but for, say, RadixSort, this would be a massive block of logic. Additionally, maintaining it in two places means we can create accidental differences between the two.

The code from CUB should be directly (albeit through nvrtc) reused in c.parallel.

leofang · 2025-03-12T01:15:53Z

Just thinking out loud... if the tuning infra already generates JSON files for the tuning output, could we stop hard-coding the tuning params as C++ code and instead consume the JSON files at compile time? Then the same JSON files could also be reused in Python.

bernhardmgruber · 2025-03-12T08:39:29Z

> if the tuning infra already generates json files for the tuning output

It creates an SQLite database, but we can convert the results to JSON.

> can we consume the json files at compile time?

While ingesting JSON into C++ during compilation is its own challenge, it would not work, because tuning analysis is still a manual process. Some judgement and experience are needed to select the "best" tuning. Also, the result of a tuning may apply to more compile-time types than the benchmark tested. For example, sorting int16 probably has the same performance characteristics as sorting uint16, bfloat16 and half, because the memory access is the same and the comparison cost is likely similar or not dominant. Conversely, reducing int16 may take a different code path than reducing half on different architectures, so tunings for int16 don't translate to other 16-bit types.
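To make the sort-versus-reduce distinction concrete, here is a minimal sketch. Every name and number in it is a hypothetical placeholder (`tuning`, `half_like`, `bfloat16_like`, `sort_tuning`, `reduce_tuning`, and the values 256/512/8/16); none of it is CUB's actual policy code. It only shows how one measured result can be generalized by key size for sort, while reduce stays keyed on the exact type.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical tuning record; field names and values are placeholders.
struct tuning {
  int block_threads;
  int items_per_thread;
};

// Hypothetical stand-ins for 16-bit floating-point key types.
struct half_like     { std::uint16_t bits; };
struct bfloat16_like { std::uint16_t bits; };

// Sort: keyed on key size only, so a tuning measured for int16
// is reused for every 2-byte key type.
constexpr tuning sort_tuning_for_size(std::size_t key_size) {
  return key_size == 2 ? tuning{512, 16} : tuning{256, 8};
}

template <class Key>
constexpr tuning sort_tuning() {
  return sort_tuning_for_size(sizeof(Key));
}

// Reduce: keyed on the exact type. Types that were not benchmarked
// fall back to a default instead of borrowing the int16 result.
template <class T>
struct reduce_tuning {
  static constexpr tuning get() { return {256, 8}; }
};

template <>
struct reduce_tuning<std::int16_t> {
  static constexpr tuning get() { return {512, 16}; }
};
```

Deciding which of the two shapes applies to a given algorithm is the judgement call that a raw dump of benchmark results cannot encode.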

leofang · 2025-03-12T15:40:02Z

> can we consume the json files at compile time?

> While ingesting JSON into C++ during compilation is its own challenge,

Yes, my assumption was that this is technically doable. (Perhaps we turn all tuning policy .cuh files into .cuh.in and use CMake's JSON/string processing capabilities.)

> it would not work, because tuning analysis is still a manual process. Some judgement and experience are needed to select the "best" tuning. Also, the result of a tuning may apply to more compile-time types than the benchmark tested. For example, sorting int16 probably has the same performance characteristics as sorting uint16, bfloat16 and half, because the memory access is the same and the comparison cost is likely similar or not dominant. Conversely, reducing int16 may take a different code path than reducing half on different architectures, so tunings for int16 don't translate to other 16-bit types.

I don't think we want to ship the whole database in either SQLite or JSON! We definitely need a highly pruned set of JSON files (this is where the manual process enters), containing exactly the same information as the policy headers do today, no more, no less.

jrhemstad · 2025-03-12T17:00:46Z

> Just thinking out loud... if the tuning infra already generates json files for the tuning output, can we not hard-code the tuning params as C++ code, but instead consuming the json files at compile time? Then, the same json files can be also reused in Python?

Funny you mention this. @shwina and I were just talking about the same thing yesterday.

Basically, the question boils down to "What is the source of truth for tuning policies?"

Today, the source of truth is the C++ headers that implement the various structs encoding the tuning policies for each algorithm/arch. Historically this has been fine, but now, with our more advanced tuning infrastructure and the need to use these tuning policies at runtime in cuda.parallel, it's worth considering alternative approaches for encoding the source of truth for the tuning policies.

One idea would be to define the source of truth directly as JSON files.

I think this could have a number of advantages:

- It could simplify generating new tuning policies from the tuning infrastructure, because it would just need to output JSON strings, whereas today we have to manually take the output from the tuning infrastructure and turn it into the corresponding C++ code.
- It would significantly simplify the cuda.parallel use case, as the tuning policies would be readily available in a form that is easy to consume at runtime.

CUB would still need the statically defined tuning policies as C++ structs, but for that we could write a simple codegen script that consumes the JSON strings and emits the C++ tuning policy structs that we have today.
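As a rough sketch of the codegen step, assume the JSON has already been parsed into a plain record. The `tuning_entry` type, its fields, the `emit_policy_struct` function, and the shape of the emitted struct are all hypothetical illustrations, not CUB's schema or naming.

```cpp
#include <sstream>
#include <string>

// Hypothetical record for one tuning entry, as it might look after the
// JSON has been parsed. Field names are illustrative only.
struct tuning_entry {
  std::string algorithm;  // e.g. "reduce"
  int sm_arch;            // e.g. 900 for SM 9.0
  int block_threads;
  int items_per_thread;
};

// Emit the text of a C++ struct that holds the entry's values as
// compile-time constants.
std::string emit_policy_struct(const tuning_entry& e) {
  std::ostringstream out;
  out << "struct " << e.algorithm << "_policy_sm" << e.sm_arch << " {\n"
      << "  static constexpr int block_threads = " << e.block_threads << ";\n"
      << "  static constexpr int items_per_thread = " << e.items_per_thread << ";\n"
      << "};\n";
  return out.str();
}
```

Calling `emit_policy_struct({"reduce", 900, 256, 16})` would produce the text of a struct named `reduce_policy_sm900` with the two constants. A generator of this kind only covers tunings that are plain tables of values.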

jrhemstad · 2025-03-13T15:30:22Z

Thinking about this some more after discussing internally, I didn’t appreciate the degree to which our tuning policies aren’t just a finite lookup table.

A simple example is just to think about the items per thread that is used.

Even if we pretend this is just a function of the size of the input type, we’d have to define a lookup table for all possible input type sizes. But there’s an infinite number of input type sizes.
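A minimal sketch of why this is a formula and not a table: items per thread can be computed from `sizeof(T)` for any type, including ones nobody enumerated in advance. The function name `scaled_items_per_thread`, the type `big_element`, and the exact scaling rule are assumptions for illustration, loosely in the spirit of size-based scaling and not a copy of CUB's implementation.

```cpp
#include <algorithm>
#include <cstddef>

// Scale a nominal items-per-thread value (chosen for 4-byte elements) by the
// actual element size: at least 1, at most twice the nominal value.
template <int Nominal4ByteItemsPerThread, class T>
constexpr int scaled_items_per_thread() {
  return std::max(1,
                  std::min(Nominal4ByteItemsPerThread * 4 / static_cast<int>(sizeof(T)),
                           Nominal4ByteItemsPerThread * 2));
}

// Hypothetical user-defined element type of an arbitrary size.
struct big_element {
  char data[100];
};

static_assert(scaled_items_per_thread<16, int>() == 16, "4-byte type keeps the nominal value");
static_assert(scaled_items_per_thread<16, big_element>() == 1, "large types clamp to 1");
```

A JSON lookup table would need a row for every element size that could ever appear, while the template is evaluated only for the types actually instantiated.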

The more I think about it, the more I realize this is fundamentally the same problem as trying to build a library that preinstantiates CUB algorithms for all possible types. You can’t, because there’s an infinite number of possible types.

Which is precisely why we need to use JIT for cuda.parallel. It also tells me that @griwes’s approach of “JIT compiling” the tuning policies is probably the right one.

griwes added the feature request New feature or request. label Jan 22, 2025

griwes self-assigned this Jan 22, 2025

github-project-automation bot added this to CCCL Jan 22, 2025

github-project-automation bot moved this to Todo in CCCL Jan 22, 2025

griwes moved this from Todo to In Progress in CCCL Jan 22, 2025

griwes mentioned this issue Feb 19, 2025 in "c.parallel: device wrappers as code, not format strings" #3439 (open, 2 tasks)
