-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring Fusion Executor, pulling out compiled kernel #3468
Conversation
…d in executor::compile.
…need to run compileRTC and runRTC calls in tests with CompiledKernel instances directly. Also fix profiling calls in KernelExecutor.
…owering but is now checked after lowering.
!test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite a few TODO's in this PR, I might not take on all of them in this PR.
buffer << cuda_src.rdbuf(); | ||
return buffer.str(); | ||
} | ||
} // namespace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything above this is only code motion.
@@ -58,20 +47,11 @@ struct CompiledKernel : public NonCopyable { | |||
int register_spills = -1; | |||
}; | |||
|
|||
// Returns executable function and the ptxas log from compilation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to compiled_kernel.cpp
@@ -253,12 +233,5 @@ void validateCircularBuffering( | |||
kir::Kernel* kernel, | |||
ExpressionEvaluator& expr_eval); | |||
|
|||
//! Query the target GPU version number NVRTC compiles CUDA kernels for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to compiled_kernel.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually these are likely just removed as they should now be contained in runtime/compiled_kernel.cpp
@@ -32,117 +30,6 @@ | |||
#include <cstdlib> | |||
|
|||
namespace nvfuser { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to compiled_kernel.cpp
@@ -194,14 +81,6 @@ bool detectComputeSanitizer() { | |||
|
|||
namespace nvfuser { | |||
|
|||
namespace executor_utils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to compiled_kernel.cpp
TODO list:
I'm leaving these to consider in the future.
|
!test |
47 successful checks |
…untime_id, group_id, and device constant in CompiledKernel.
All checks have passed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just took a pass. Left several comments/questions.
As far as I can see, there's no change in the underlying logic, but it's just code movement. Am I missing anything?
!test |
!test |
Getting a bunch of thunder failures: https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/133353224 I was able to reproduce one of them on main, so uncertain what's going on. Clang build was the only other test to fail. Follow up: Thunder failures reproduced on main at this point: #3698 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. The Thunder failures are unlikely to have anything to do with this PR.
!test |
!test |
PR Reviewer Guide 🔍(Review updated until commit 107123a)Here are some key observations to aid the review process:
|
csrc/runtime/compiled_kernel.h
Outdated
return lowered_->kernel(); | ||
} | ||
|
||
Fusion* fusion() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this isn't any more useful than kernel() so I removed this in #3725.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing.
csrc/runtime/compiled_kernel.h
Outdated
|
||
// function to query whether a `CompiledKernel` has a compiled kernel to | ||
// execute | ||
bool hasCompiledKernel() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/NVIDIA/Fuser/pull/3725/files#diff-31a5ef26405804394f573b42c15512e1fb87f930fe7e5bfd95ad4034d867c30fR147 has merged isCompiled and hasCompiledKernel. Having both used to be important for a FusionExecutor which may or may not be a kernel executor -- no longer after your ExecutorAbstract work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to reflect these changes. (Will push in a minute)
} | ||
// Lowered is needed to compute launch parameters as it uses the CA map. We | ||
// could modify that, but simply generating that part first. | ||
compiled_kernel_ = std::make_unique<CompiledKernel>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to understand the supposed responsibility between KernelExecutor::compile and CompiledKernel::compile. With this PR, KernelExecutor::compile appears to become a thin wrapper of CompiledKernel::compile that merely does profiling and overrides some compilation/launch parameters. Do you plan to get rid of KernelExecutor::compile so FusionKernelRuntime doesn't need to create a KernelExecutor to compile?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering the same thing, and I think ideally yes. If I recall correctly the challenge comes down to requiring runtime information to compile. It makes it a bit awkward in my opinion as it'd be nice to have a clearer boundary between compilation and runtime, but we need runtime information for compilation. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a clear boundary between compilation and execution is useful for host IR. Currently, we have the LaunchKernel
host IR instruction that runs a compiled KernelExecutor
. It should really be CompiledKernel
instead because KernelExecutor
comes with lots of code and states for compilation that are useless for execution.
Ideally, I'd like the following
- a
KernelCompiler
that takes a fusion segment and compilation parameters (which can be derived from runtime information), schedules the segment, device-lowers it, and generates aCompiledKernel
. In this PR, the fact thatCompiledKernel
also "compiles" is kind of weird, but I'm sure that's due to some practical challenges as you just said. - a
LaunchKernel
host IR instruction that takes aCompiledKernel
and execution parameters and runs it. KernelExecutor
can be the non-host-IR orchestrator that drives the process fromKernelCompiler
toCompiledKernel
to run that kernel. After host IR becomes the only path, we can and probably should get rid ofKernelExecutor
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable, how should we handle needed recompilation under dynamic workloads when there's a heuristic match?
Should this also be represented in HostIR? Should we have a "maybe recompile" node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL: I didn't know about KernelExecutor::recompileKernel. AFAICT, it recompiles CUDA with a new block size and/or a new max number of registers, but doesn't re-schedule, re-lower or re-codegen.
I'd keep this "mini-caching" in CompiledKernel. We could represent this recompilation in host IR, but I don't see a practical benefit -- having a "maybe recompile" node doesn't enable more host IR optimization or make the result host program faster. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your plan is good. I think this PR would help as it's a step towards isolating compilation from execution. I think 1 is doable with your plan. I think you will still need a maybe recompile in the execution somewhere if launch parameters change. Right now it's being called in FusionExecutor::runFusion: https://github.com/NVIDIA/Fuser/blob/main/csrc/runtime/executor.cpp#L1269
The reason we do this is the same reason why someone might use launch_bounds on a cuda kernel. The first time we get a problem we might compile with one register limit in mind based on the number of threads we first compute are needed. If that number goes up because it's input dependent then we may need to recompile otherwise the kernel won't fit on the SM.
!test |
Pull out kernel compilation from the
KernelExecutor
, trying to separate out the two concepts as we will move towards a world where the execution of a kernel is done throughHostIr
.CompiledKernel
class to hold compilation information incompiled_kernel.h/cpp
runtime/executor.h
toruntime/compiled_kernel.h
runtime/executor.cpp
toruntime/compiled_kernel.cpp
runtime/executor_utils.cpp
toruntime/compiled_kernel.cpp
(these are functions only used incompiled_kernel
)sys/utils.cpp
toruntime/compiled_kernel.cpp
(these are functions only used incompiled_kernel
)executor::compileRTC
andexecutor::runRTC
into its own class (RtcKernel
). It shares compilation logic withCompiledKernel
and is incompiled_kernel.h/cpp
Some improvements left for another time:
executor_utils::CudaExecutable
andCompiledKernel
. I'm not sure if this is the right thing to do, partially just because of RTCKernel and CompiledKernel both own aexecutor_utils::CudaExecutable