c.parallel should reuse CUB policies directly #3494
Just thinking out loud... if the tuning infrastructure already generates JSON files for the tuning output, could we avoid hard-coding the tuning params as C++ code and instead consume the JSON files at compile time? Then the same JSON files could also be reused in Python.
It creates an SQLite database, but we can convert the results to JSON.
While ingesting JSON into C++ during compilation is its own challenge, it would not work on its own, because tuning analysis is still a manual process. Some judgement and experience is needed to select the "best" tuning. Also, the result of a tuning may apply to more compile-time types than were tested by the benchmark. For example, sorting …
Yes, my assumption was that this is technically doable. (Perhaps we turn all tuning policy …
I don't think we want to ship the whole database in either SQLite or JSON! We definitely need a highly pruned set of JSON files (this is where the manual process enters), containing exactly the same information as the policy headers do today, no more and no less.
Funny you mention this. @shwina and I were just talking about the same thing yesterday. Basically, the question boils down to: "What is the source of truth for tuning policies?" Today, the source of truth is the C++ headers that implement the various structs encoding the tuning policies for each algorithm/arch. Historically this has been fine, but now we have a more advanced tuning infrastructure and a need to use these tuning policies at runtime in …

One idea would be to define the source of truth directly as JSON files. I think this could have a number of advantages:
CUB would still need the statically defined tuning policies as C++ structs, but to support that we could write a simple codegen script that consumes the JSON files and emits the C++ tuning-policy structs we have today.
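The codegen idea above can be sketched in a few lines. This is a hypothetical illustration only: the JSON schema (`algorithm`, `arch`, `policy` keys) and the emitted struct names are assumptions for the sketch, not the actual CUB policy format or tuning-infra output.

```python
# Hypothetical sketch: render a pruned JSON tuning entry as a C++
# tuning-policy struct. Schema and struct naming are invented for
# illustration; the real CUB headers are organized differently.
import json

SAMPLE = """
{
  "algorithm": "reduce",
  "arch": 860,
  "policy": {"block_threads": 256, "items_per_thread": 16}
}
"""

TEMPLATE = """\
template <typename AccumT>
struct policy_hub_sm{arch} {{
  struct ReducePolicy {{
    static constexpr int BLOCK_THREADS    = {block_threads};
    static constexpr int ITEMS_PER_THREAD = {items_per_thread};
  }};
}};"""

def emit_policy(doc: dict) -> str:
    """Render one tuning entry as a C++ struct definition string."""
    return TEMPLATE.format(arch=doc["arch"], **doc["policy"])

print(emit_policy(json.loads(SAMPLE)))
```

The same JSON could then be loaded directly by the Python side at runtime, so both consumers read one source of truth.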
Thinking about this some more after discussing internally, I hadn't appreciated the degree to which our tuning policies aren't just a finite lookup table. A simple example is the items per thread that is used. Even if we pretend this is just a function of the size of the input type, we'd have to define a lookup table for all possible input type sizes, and there is an infinite number of those. The more I think about it, the more I realize this is fundamentally the same problem as trying to build a library that preinstantiates CUB algorithms for all possible types: you can't, because there is an infinite number of possible types, which is precisely why we need JIT for cuda.parallel. This tells me that @griwes's approach of "JIT compiling" the tuning policies is probably the right one.
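The "function, not table" point can be made concrete. The sketch below is modeled on CUB's nominal-4-byte scaling heuristic (its `Nominal4BItemsToItems` helper), which computes items per thread from an arbitrary element size; the exact formula in current CUB may differ, so treat this as an assumption-laden illustration rather than CUB's real implementation.

```python
# Why a finite lookup table over type sizes cannot work: items per thread
# is computed from the element size, which is unbounded. Modeled on CUB's
# Nominal4BItemsToItems-style scaling; not guaranteed to match CUB exactly.
def items_per_thread(nominal_4b_items: int, type_size: int) -> int:
    """Scale a per-thread workload tuned for 4-byte elements to another
    element size, clamped to [1, nominal]."""
    return min(nominal_4b_items, max(1, nominal_4b_items * 4 // type_size))

# Any element size is handled, including ones no benchmark ever tested:
for size in (1, 4, 8, 16, 100):
    print(size, items_per_thread(16, size))
```

Because the policy is a formula over `sizeof(T)`, JIT-evaluating the C++ policy code at runtime reproduces it exactly, whereas any pre-baked table would have to enumerate an infinite domain.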
Currently, c.parallel contains copies of some of the tuning parameters of CUB algorithms and uses those copies. This is unsustainable; for simpler algorithms it doesn't take much code, but for, say, RadixSort, it would require a massive block of logic. Additionally, maintaining the parameters in two places means we can introduce accidental differences between the two.
The code from CUB should be reused directly (albeit through NVRTC) in c.parallel.