Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring Fusion Executor, pulling out compiled kernel #3468

Merged
merged 43 commits into from
Jan 26, 2025

Conversation

csarofeen
Copy link
Collaborator

@csarofeen csarofeen commented Nov 24, 2024

Pull out kernel compilation from the KernelExecutor, trying to separate out the two concepts as we will move towards a world where the execution of a kernel is done through HostIr.

  • Made CompiledKernel class to hold compilation information in compiled_kernel.h/cpp
  • Moved code:
    • from runtime/executor.h to runtime/compiled_kernel.h
    • from runtime/executor.cpp to runtime/compiled_kernel.cpp
    • from runtime/executor_utils.cpp to runtime/compiled_kernel.cpp (these are functions only used in compiled_kernel)
    • from sys/utils.cpp to runtime/compiled_kernel.cpp (these are functions only used in compiled_kernel)
  • Moved executor::compileRTC and executor::runRTC into its own class (RtcKernel). It shares compilation logic with CompiledKernel and is in compiled_kernel.h/cpp

Some improvements left for another time:

  • Don't disable the parameter cache completely when the size of a tensor is a function of an input scalar. I don't think this is particularly common as Thunder is mostly static shapes, but it might be good to support for resize ops.
  • Merge executor_utils::CudaExecutable and CompiledKernel. I'm not sure if this is the right thing to do, partially just because of RTCKernel and CompiledKernel both own a executor_utils::CudaExecutable

@csarofeen
Copy link
Collaborator Author

!test

Copy link
Collaborator Author

@csarofeen csarofeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite a few TODO's in this PR, I might not take on all of them in this PR.

csrc/fusion.cpp Outdated Show resolved Hide resolved
buffer << cuda_src.rdbuf();
return buffer.str();
}
} // namespace
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything above this is only code motion.

csrc/runtime/compiled_kernel.cpp Show resolved Hide resolved
csrc/runtime/compiled_kernel.cpp Outdated Show resolved Hide resolved
csrc/runtime/compiled_kernel.cpp Outdated Show resolved Hide resolved
csrc/runtime/executor_utils.h Show resolved Hide resolved
@@ -58,20 +47,11 @@ struct CompiledKernel : public NonCopyable {
int register_spills = -1;
};

// Returns executable function and the ptxas log from compilation
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to compiled_kernel.cpp

@@ -253,12 +233,5 @@ void validateCircularBuffering(
kir::Kernel* kernel,
ExpressionEvaluator& expr_eval);

//! Query the target GPU version number NVRTC compiles CUDA kernels for
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to compiled_kernel.cpp

Copy link
Collaborator Author

@csarofeen csarofeen Dec 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually these are likely just removed as they should now be contained in runtime/compiled_kernel.cpp

@@ -32,117 +30,6 @@
#include <cstdlib>

namespace nvfuser {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to compiled_kernel.cpp

@@ -194,14 +81,6 @@ bool detectComputeSanitizer() {

namespace nvfuser {

namespace executor_utils {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to compiled_kernel.cpp

@csarofeen
Copy link
Collaborator Author

csarofeen commented Dec 22, 2024

TODO list:

  • Move pre lowering hooks to constructor of compiled kernel (cleanup executor.cpp 279)
  • Move post lowering hooks to the constructor of compiled kernel as well
  • rename compileFusion to compile
  • When compiled kernel checks when to disable the parameter cache change the check to make sure when an extent depends on a TensorView input it goes through metadata op, or it throws an error.
  • Not sure what this TODO means it's on compiled_kernel.cpp 1131 ("TODO: These properties should be set as part of the constructor so that it can be const")
  • Check that compiled_kernel.cpp 1233 ("TODO: high water mark should be computed via occupancy API after") is an old todo. This could be done using the occupancy calculator but when I tried to do it the linker failed. It's also a minor optimization so simply removed the todo.
  • Remove compiled_kernel.h::CompileOptions it only holds device and should likely be in compile params
  • Remove the TODO "TODO: Consider split out compileRtc and runRtc to a different" it will need to be evaluated in the future if that makes sense when compilation and execution are completely separate concepts
  • Check if the default constructor of CompiledKernel can be removed
  • executor.cpp 241 "TODO: Is this necessary?"
  • Check if executor.h still needs the function disableLaunchParamCache since CompiledKernel has one
  • executor.h 384 "TODO: Should this be removed?" SchedulerType scheduler_type_ = SchedulerType::None;

I'm leaving these to consider in the future.

  • Don't disable parameter cache completely when a scalar input is used in a computation of an ID extent
  • Check if executor_utils::CudaExecutable can be merged with CompiledKernel

@csarofeen
Copy link
Collaborator Author

!test

@csarofeen
Copy link
Collaborator Author

47 successful checks
https://nv/e2E/130642066

…untime_id, group_id, and device constant in CompiledKernel.
@csarofeen
Copy link
Collaborator Author

All checks have passed
48 successful checks
https://nv/e2E/131360124

@csarofeen csarofeen marked this pull request as ready for review January 1, 2025 22:20
@csarofeen csarofeen requested a review from naoyam January 1, 2025 22:20
Copy link
Collaborator

@naoyam naoyam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just took a pass. Left several comments/questions.

As far as I can see, there's no change in the underlying logic, but it's just code movement. Am I missing anything?

csrc/runtime/executor_params.h Show resolved Hide resolved
csrc/runtime/compiled_kernel.h Outdated Show resolved Hide resolved
csrc/runtime/compiled_kernel.h Show resolved Hide resolved
tests/cpp/utils.h Show resolved Hide resolved
csrc/runtime/executor.h Outdated Show resolved Hide resolved
@csarofeen
Copy link
Collaborator Author

!test

@csarofeen
Copy link
Collaborator Author

!test

@csarofeen
Copy link
Collaborator Author

csarofeen commented Jan 12, 2025

Getting a bunch of thunder failures: https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/133353224 I was able to reproduce one of them on main, so uncertain what's going on.

Clang build was the only other test to fail.

Follow up: Thunder failures reproduced on main at this point: #3698

Copy link
Collaborator

@naoyam naoyam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The Thunder failures are unlikely to have anything to do with this PR.

@csarofeen
Copy link
Collaborator Author

!test

@csarofeen
Copy link
Collaborator Author

!test

Copy link

github-actions bot commented Jan 20, 2025

PR Reviewer Guide 🔍

(Review updated until commit 107123a)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
⚡ Recommended focus areas for review

Potential Bug

The code has a potential bug in the KernelExecutor class. The compiled_kernel_ member variable is not checked for null before being used in the compile method. This could lead to a segmentation fault if compiled_kernel_ is null.

const LaunchParams& launch_constraints,
Code Duplication

The code has duplicated logic in the KernelExecutor class. The compile method has duplicated code for checking the compile_params.index_type and setting the index_type member variable. This duplicated code can be extracted into a separate method to improve maintainability.

} else if (has_cp_async_bulk) {
Potential Performance Issue

The code has a potential performance issue in the KernelExecutor class. The run method uses a recursive function call to execute the kernel, which can lead to stack overflow for large inputs. Consider using an iterative approach instead.

group_id_);

return lowered_->kernel();
}

Fusion* fusion() const {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this isn't any more useful than kernel() so I removed this in #3725.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing.


// function to query whether a `CompiledKernel` has a compiled kernel to
// execute
bool hasCompiledKernel() const {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/NVIDIA/Fuser/pull/3725/files#diff-31a5ef26405804394f573b42c15512e1fb87f930fe7e5bfd95ad4034d867c30fR147 has merged isCompiled and hasCompiledKernel. Having both used to be important for a FusionExecutor which may or may not be a kernel executor -- no longer after your ExecutorAbstract work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to reflect these changes. (Will push in a minute)

}
// Lowered is needed to compute launch parameters as it uses the CA map. We
// could modify that, but simply generating that part first.
compiled_kernel_ = std::make_unique<CompiledKernel>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to understand the supposed responsibility between KernelExecutor::compile and CompiledKernel::compile. With this PR, KernelExecutor::compile appears to become a thin wrapper of CompiledKernel::compile that merely does profiling and overrides some compilation/launch parameters. Do you plan to get rid of KernelExecutor::compile so FusionKernelRuntime doesn't need to create a KernelExecutor to compile?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering the same thing, and I think ideally yes. If I recall correctly the challenge comes down to requiring runtime information to compile. It makes it a bit awkward in my opinion as it'd be nice to have a clearer boundary between compilation and runtime, but we need runtime information for compilation. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a clear boundary between compilation and execution is useful for host IR. Currently, we have the LaunchKernel host IR instruction that runs a compiled KernelExecutor. It should really be CompiledKernel instead because KernelExecutor comes with lots of code and states for compilation that are useless for execution.

Ideally, I'd like the following

  1. a KernelCompiler that takes a fusion segment and compilation parameters (which can be derived from runtime information), schedules the segment, device-lowers it, and generates a CompiledKernel. In this PR, the fact that CompiledKernel also "compiles" is kind of weird, but I'm sure that's due to some practical challenges as you just said.
  2. a LaunchKernel host IR instruction that takes a CompiledKernel and execution parameters and runs it.
  3. KernelExecutor can be the non-host-IR orchestrator that drives the process from KernelCompiler to CompiledKernel to run that kernel. After host IR becomes the only path, we can and probably should get rid of KernelExecutor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable, how should we handle needed recompilation under dynamic workloads when there's a heuristic match?

https://github.com/NVIDIA/Fuser/pull/3468/files#diff-4e98fb6577fe11f6854ee230d2ed5952b441fa277061180b00fa1265cf98173eR1527-R1552

Should this also be represented in HostIR? Should we have a "maybe recompile" node?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL: I didn't know about KernelExecutor::recompileKernel. AFAICT, it recompiles CUDA with a new block size and/or a new max number of registers, but doesn't re-schedule, re-lower or re-codegen.

I'd keep this "mini-caching" in CompiledKernel. We could represent this recompilation in host IR, but I don't see a practical benefit -- having a "maybe recompile" node doesn't enable more host IR optimization or make the result host program faster. What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your plan is good. I think this PR would help as it's a step towards isolating compilation from execution. I think 1 is doable with your plan. I think you will still need a maybe recompile in the execution somewhere if launch parameters change. Right now it's being called in FusionExecutor::runFusion: https://github.com/NVIDIA/Fuser/blob/main/csrc/runtime/executor.cpp#L1269

The reason we do this is the same reason why someone might use launch_bounds on a cuda kernel. The first time we get a problem we might compile with one register limit in mind based on the number of threads we first compute are needed. If that number goes up because it's input dependent then we may need to recompile otherwise the kernel won't fit on the SM.

@csarofeen
Copy link
Collaborator Author

!test

@csarofeen csarofeen merged commit ba5c878 into main Jan 26, 2025
51 of 52 checks passed
@csarofeen csarofeen deleted the compiled_kernel_2 branch January 26, 2025 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants