
Commit 6c70cbe

zasdfgbnm authored and facebook-github-bot committed
step 0 of cuDNN v8 convolution API integration (pytorch#51390)
Summary:
This PR is step 0 of adding PyTorch convolution bindings that use the cuDNN frontend. The cuDNN frontend is the recommended way of using the cuDNN v8 API. It is supposed to have faster release cycles, so that, for example, if people find a bug in a specific kernel, they can report it, that kernel will be blocked in the cuDNN frontend, and frameworks can simply update the submodule without waiting for a whole cuDNN release. The work is not complete, and this PR is only step 0.

**What this PR does:**
- Add cudnn-frontend as a submodule.
- Modify cmake to build that submodule.
- Add bindings for convolution forward in `Conv_v8.cpp`, which is disabled by a macro by default.
- Tested manually by enabling the macro and running `test_nn.py`. All tests pass except those mentioned below.

**What this PR doesn't do:**
- Only convolution forward, no backward. The backward will use the v7 API.
- No 64-bit indexing support for some configurations. This is a known issue of cuDNN and will be fixed in a later cuDNN version. PyTorch will not implement any workaround for this issue; instead, the v8 API should be disabled on problematic cuDNN versions.
- No testing beyond PyTorch's unit tests.
- Not tested for correctness on real models.
- Not benchmarked for performance.
- The benchmark cache is not thread-safe. (This is marked as `FIXME` in the code and will be fixed in a follow-up PR.)
- cuDNN benchmark is not supported.
- There are failing tests, which will be resolved later:

```
FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float16 - AssertionError: False is not true : Tensors failed to compare as equal! With rtol=0.001 and atol=1e-05, found 32 element(s) (out of 32) whose difference(s) exceeded the margin of error (in...
FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float32 - AssertionError: False is not true : Tensors failed to compare as equal! With rtol=1.3e-06 and atol=1e-05, found 32 element(s) (out of 32) whose difference(s) exceeded the margin of error (...
FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_large_cuda - RuntimeError: CUDNN_BACKEND_OPERATION: cudnnFinalize Failed cudnn_status: 9
FAILED test/test_nn.py::TestNN::test_Conv2d_depthwise_naive_groups_cuda - AssertionError: False is not true : Tensors failed to compare as equal! With rtol=0 and atol=1e-05, found 64 element(s) (out of 64) whose difference(s) exceeded the margin of error (including 0 an...
FAILED test/test_nn.py::TestNN::test_Conv2d_deterministic_cudnn - RuntimeError: not supported yet
FAILED test/test_nn.py::TestNN::test_ConvTranspose2d_groups_cuda_fp32 - RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
FAILED test/test_nn.py::TestNN::test_ConvTranspose2d_groups_cuda_tf32 - RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
```

Although this is not a complete implementation of the cuDNN v8 API binding, I still want to merge this first. This allows me to do small and incremental work, for ease of development and review.

Pull Request resolved: pytorch#51390

Reviewed By: malfet

Differential Revision: D28513167

Pulled By: ngimel

fbshipit-source-id: 9cc20c9dec5bbbcb1f94ac9e0f59b10c34f62740
1 parent 954d39b commit 6c70cbe

File tree

10 files changed: +217 -4 lines changed
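Before the per-file diffs, the following condensed sketch of the cuDNN v8 frontend call sequence may help orientation. It is distilled from the `Conv_v8.cpp` diff below, has not been compiled against any particular cudnn-frontend version, and omits the plan caching, engine filtering, and try-next-config fallback that the real binding adds. The function name `conv_forward_v8_sketch`, its parameter list, the preallocated workspace, and the assumption that the descriptors were built with ids `'x'`, `'y'`, `'w'` are illustrative choices, not part of the commit.

```cpp
// Sketch only: condensed from the Conv_v8.cpp diff in this commit.
#include <array>
#include <cstdint>
#include <cudnn_frontend.h>

void conv_forward_v8_sketch(cudnnHandle_t handle,
                            const cudnn_frontend::Tensor& x_desc,      // input descriptor
                            const cudnn_frontend::Tensor& y_desc,      // output descriptor
                            const cudnn_frontend::Tensor& w_desc,      // weight descriptor
                            const cudnn_frontend::ConvDesc_v8& c_desc, // convolution descriptor
                            void* x, void* y, void* w,
                            void* workspace /* assumed >= plan.getWorkspaceSize() bytes */) {
  // 1. Wrap the convolution in a single-operation graph.
  auto op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR)
      .setxDesc(x_desc)
      .setyDesc(y_desc)
      .setwDesc(w_desc)
      .setcDesc(c_desc)
      .build();
  std::array<cudnn_frontend::Operation const*, 1> ops = {&op};
  auto op_graph = cudnn_frontend::OperationGraphBuilder()
      .setHandle(handle)
      .setOperationGraph(1, ops.data())
      .build();

  // 2. Ask the heuristics for candidate engine configurations.
  auto heuristics = cudnn_frontend::EngineHeuristicsBuilder()
      .setOperationGraph(op_graph)
      .setHeurMode(CUDNN_HEUR_MODE_INSTANT)
      .build();
  auto& configs = heuristics.getEngineConfig(heuristics.getEngineConfigCount());

  // 3. Build an execution plan from one config and run it with a variant pack
  //    that binds the data pointers to the tensor ids assigned when the
  //    descriptors were built (assumed 'x', 'y', 'w' here, as in the diff).
  auto plan = cudnn_frontend::ExecutionPlanBuilder()
      .setHandle(handle)
      .setEngineConfig(configs.front())
      .build();
  void* data_ptrs[] = {x, y, w};
  int64_t uids[] = {'x', 'y', 'w'};
  auto variant_pack = cudnn_frontend::VariantPackBuilder()
      .setWorkspacePointer(workspace)
      .setDataPointers(3, data_ptrs)
      .setUids(3, uids)
      .build();
  auto status = cudnnBackendExecute(handle, plan.get_raw_desc(), variant_pack.get_raw_desc());
  (void)status;  // the real code wraps this call in AT_CUDNN_CHECK
}
```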

.gitmodules

Lines changed: 3 additions & 0 deletions

```diff
@@ -130,6 +130,9 @@
 ignore = dirty
 path = third_party/tensorpipe
 url = https://github.com/pytorch/tensorpipe.git
+[submodule "third_party/cudnn_frontend"]
+path = third_party/cudnn_frontend
+url = https://github.com/NVIDIA/cudnn-frontend.git
 [submodule "third_party/kineto"]
 path = third_party/kineto
 url = https://github.com/pytorch/kineto
```

CMakeLists.txt

Lines changed: 3 additions & 0 deletions

```diff
@@ -194,6 +194,9 @@ cmake_dependent_option(
 cmake_dependent_option(
     USE_STATIC_CUDNN "Use cuDNN static libraries" OFF
     "USE_CUDNN" OFF)
+cmake_dependent_option(
+    USE_EXPERIMENTAL_CUDNN_V8_API "Use experimental cuDNN v8 API" OFF
+    "USE_CUDNN" OFF)
 option(USE_FBGEMM "Use FBGEMM (quantized 8-bit server operators)" ON)
 option(USE_KINETO "Use Kineto profiling library" ON)
 option(USE_CUPTI_SO "Use CUPTI as a shared library" OFF)
```

aten/src/ATen/native/cudnn/Conv_v7.cpp

Lines changed: 6 additions & 0 deletions

```diff
@@ -2,6 +2,8 @@
 
 #if AT_CUDNN_ENABLED()
 
+#include <ATen/native/cudnn/Macros.h>
+
 #include <limits>
 #include <vector>
 #include <sstream>
@@ -614,6 +616,8 @@ if (args.params.dataType == CUDNN_DATA_FLOAT) {
 //
 // ---------------------------------------------------------------------
 
+#if !HAS_CUDNN_V8()
+
 void raw_cudnn_convolution_forward_out_32bit(
     const Tensor& output, const Tensor& input, const Tensor& weight,
     IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups,
@@ -665,6 +669,8 @@ void raw_cudnn_convolution_forward_out(
   split_batch_dim_to_32bit_out(output, input, weight, padding, stride, dilation, groups, benchmark, deterministic, allow_tf32, 1024 * 1024 * 256, raw_cudnn_convolution_forward_out_32bit);
 }
 
+#endif // !HAS_CUDNN_V8()
+
 // ---------------------------------------------------------------------
 //
 // Convolution backward / Transposed convolution forward
```
aten/src/ATen/native/cudnn/Conv_v8.cpp

Lines changed: 175 additions & 3 deletions

```diff
@@ -1,5 +1,177 @@
 #include <ATen/cuda/CUDAConfig.h>  // for the definition of AT_CUDNN_ENABLED
 
-#if AT_CUDNN_ENABLED() && defined(CUDNN_VERSION) && CUDNN_VERSION >= 8000
-// Coming soon
-#endif // AT_CUDNN_ENABLED and CUDNN_VERSION
+#if AT_CUDNN_ENABLED()
+
+#include <ATen/native/cudnn/Macros.h>
+
+#if HAS_CUDNN_V8()
+
+#include <ATen/cudnn/cudnn-wrapper.h>
+#include <cudnn_frontend.h>
+#include <ATen/ATen.h>
+#include <ATen/TensorUtils.h>
+#include <ATen/cuda/Exceptions.h>
+#include <ATen/native/ConvUtils.h>
+#include <ATen/native/cudnn/ConvShared.h>
+#include <ATen/native/utils/ParamsHash.h>
+#include <ATen/cudnn/Handle.h>
+#include <ATen/TensorUtils.h>
+
+#include <unordered_map>
+
+namespace at { namespace native{
+
+namespace {
+
+uint8_t getAlignment(const Tensor &t) {
+  // alignment are in bytes
+  uint8_t alignment = 1;
+  uint64_t address = reinterpret_cast<uint64_t>(t.data_ptr());
+  while (address % alignment == 0 && alignment < 16) alignment *= 2;
+  return alignment;
+}
+
+cudnn_frontend::Tensor getTensorDescriptor(const Tensor &t, int64_t id, uint8_t alignment) {
+  auto shape = t.sizes();
+  auto strides = t.strides();
+  return cudnn_frontend::TensorBuilder()
+    .setDim(shape.size(), shape.data())
+    .setStrides(strides.size(), strides.data())
+    .setId(id)
+    .setAlignment(alignment)
+    .setDataType(getCudnnDataType(t))
+    .build();
+}
+
+cudnn_frontend::ConvDesc_v8 getConvDescriptor(cudnnDataType_t dataType, IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation) {
+  uint64_t convDim = stride.size();
+  return cudnn_frontend::ConvDescBuilder()
+    .setDataType(dataType)
+    .setMathMode(CUDNN_CROSS_CORRELATION)
+    .setNDims(convDim)
+    .setStrides(convDim, stride.data())
+    .setPrePadding(convDim, padding.data())
+    .setPostPadding(convDim, padding.data())
+    .setDilation(convDim, dilation.data())
+    .build();
+}
+
+void filterEngineConfigs(
+  cudnn_frontend::EngineConfigList &from,
+  cudnn_frontend::EngineConfigList &to,
+  bool deterministic, bool allow_tf32, c10::ScalarType scalar_type)
+{
+  auto filter = [=](cudnnBackendDescriptor_t c) {
+    if (deterministic) {
+      if (cudnn_frontend::hasNumericalNote<CUDNN_NUMERICAL_NOTE_NONDETERMINISTIC>(c)) return true;
+    }
+    if (scalar_type == kFloat || !allow_tf32) {
+      if (cudnn_frontend::hasNumericalNote<CUDNN_NUMERICAL_NOTE_DOWN_CONVERT_INPUTS>(c)) return true;
+      if (cudnn_frontend::hasNumericalNote<CUDNN_NUMERICAL_NOTE_TENSOR_CORE>(c)) return true;
+    }
+    return false;
+  };
+  cudnn_frontend::filter(from, to, filter);
+}
+
+struct CacheKey {
+  ConvolutionParams params;
+  uint8_t input_alignment;
+  uint8_t weight_alignment;
+  uint8_t output_alignment;
+};
+
+// FIXME: make this thread-safe by reusing the benchmark cache in Conv_v7.cpp
+std::unordered_map<CacheKey, cudnn_frontend::ManagedOpaqueDescriptor, ParamsHash<CacheKey>, ParamsEqual<CacheKey>> engine_cache;
+
+}
+
+void raw_cudnn_convolution_forward_out(
+    const Tensor& output, const Tensor& input, const Tensor& weight,
+    IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups,
+    bool benchmark, bool deterministic, bool allow_tf32)
+{
+  TORCH_CHECK(!benchmark, "not supported yet");
+  if (output.numel() == 0) {
+    return;
+  }
+
+  cudnnHandle_t handle = getCudnnHandle();
+
+  CacheKey key;
+  setConvolutionParams(&key.params, input, weight, padding, stride, dilation, groups, deterministic, allow_tf32);
+  key.input_alignment = getAlignment(input);
+  key.output_alignment = getAlignment(output);
+  key.weight_alignment = getAlignment(weight);
+
+  auto run = [&](cudnn_frontend::ManagedOpaqueDescriptor cfg) {
+    auto plan = cudnn_frontend::ExecutionPlanBuilder()
+        .setHandle(handle)
+        .setEngineConfig(cfg)
+        .build();
+
+    auto workspace_size = plan.getWorkspaceSize();
+    auto workspace = at::empty({workspace_size}, input.options().dtype(kByte));
+    void *data_ptrs[] = {input.data_ptr(), output.data_ptr(), weight.data_ptr()};
+    // std::cout << plan.describe() << " requires workspace " << workspace_size << std::endl;
+    int64_t uids[] = {'x', 'y', 'w'};
+    auto variantPack = cudnn_frontend::VariantPackBuilder()
+        .setWorkspacePointer(workspace.data_ptr())
+        .setDataPointers(3, data_ptrs)
+        .setUids(3, uids)
+        .build();
+    AT_CUDNN_CHECK(cudnnBackendExecute(handle, plan.get_raw_desc(), variantPack.get_raw_desc()));
+  };
+
+  auto search = engine_cache.find(key);
+  if (search != engine_cache.end()) {
+    run(search->second);
+    return;
+  }
+
+  auto op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR)
+      .setxDesc(getTensorDescriptor(input, 'x', key.input_alignment))
+      .setyDesc(getTensorDescriptor(output, 'y', key.output_alignment))
+      .setwDesc(getTensorDescriptor(weight, 'w', key.weight_alignment))
+      .setcDesc(getConvDescriptor(key.params.dataType, padding, stride, dilation))
+      .build();
+  // std::cout << op.describe() << std::endl;
+
+  std::array<cudnn_frontend::Operation const *, 1> ops = {&op};
+
+  auto opGraph = cudnn_frontend::OperationGraphBuilder()
+      .setHandle(handle)
+      .setOperationGraph(1, ops.data())
+      .build();
+  // std::cout << opGraph.describe() << std::endl;
+
+  auto heuristics = cudnn_frontend::EngineHeuristicsBuilder()
+      .setOperationGraph(opGraph)
+      .setHeurMode(CUDNN_HEUR_MODE_INSTANT)
+      .build();
+  auto fallback = cudnn_frontend::EngineFallbackListBuilder()
+      .setOperationGraph(opGraph)
+      .setOperation(CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR)
+      .build();
+
+  auto& engine_configs = heuristics.getEngineConfig(heuristics.getEngineConfigCount());
+  auto& fallback_list = fallback.getFallbackList();
+
+  cudnn_frontend::EngineConfigList filtered_configs;
+  filterEngineConfigs(engine_configs, filtered_configs, deterministic, allow_tf32, input.scalar_type());
+  filterEngineConfigs(fallback_list, filtered_configs, deterministic, allow_tf32, input.scalar_type());
+
+  for (auto &cfg : filtered_configs) {
+    try {
+      run(cfg);
+      engine_cache[key] = cfg;
+      return;
+    } catch (cudnn_frontend::cudnnException &e) {} catch(CuDNNError &e) {}
+  }
+  TORCH_CHECK(false, "Unable to find an engine to execute this computation");
+}
+
+}} // at::native
+
+#endif  // HAS_CUDNN_V8
+#endif  // AT_CUDNN_ENABLED
```
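Note that the new `raw_cudnn_convolution_forward_out` keeps the same signature as the v7 version that the `#if !HAS_CUDNN_V8()` guard in `Conv_v7.cpp` now hides, so existing callers are unaffected; which definition gets compiled is decided entirely by the macro. For reference, the shared signature is reproduced below (the declaring header is not part of this diff, so its exact location is an assumption):

```cpp
// Shared signature of the forward entry point, as it appears in both the
// Conv_v7.cpp and Conv_v8.cpp hunks; presumably declared in ConvShared.h
// (assumption: the declaration itself is not shown in this diff).
void raw_cudnn_convolution_forward_out(
    const Tensor& output, const Tensor& input, const Tensor& weight,
    IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups,
    bool benchmark, bool deterministic, bool allow_tf32);
```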

aten/src/ATen/native/cudnn/Macros.h

Lines changed: 12 additions & 0 deletions

```diff
@@ -0,0 +1,12 @@
+#pragma once
+
+#include <ATen/cudnn/cudnn-wrapper.h>
+
+// Note: The version below should not actually be 8000. Instead, it should
+// be whatever version of cuDNN that v8 API work with PyTorch correctly.
+// The version is set to 8000 today for convenience of debugging.
+#if defined(USE_EXPERIMENTAL_CUDNN_V8_API) && defined(CUDNN_VERSION) && CUDNN_VERSION >= 8000
+#define HAS_CUDNN_V8() true
+#else
+#define HAS_CUDNN_V8() false
+#endif
```
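Since `HAS_CUDNN_V8()` expands to a plain `true`/`false`, it can gate entire sections of a translation unit; that is exactly how `Conv_v7.cpp` and `Conv_v8.cpp` use it in this commit. A minimal sketch of the guard pattern (the comments are explanatory, not part of the commit):

```cpp
#include <ATen/native/cudnn/Macros.h>

#if HAS_CUDNN_V8()
// Compiled only when USE_EXPERIMENTAL_CUDNN_V8_API was defined at configure
// time and the headers report CUDNN_VERSION >= 8000: the frontend-based
// forward path in Conv_v8.cpp is built.
#else
// Otherwise (including when the experimental flag is off), the existing v7
// implementation in Conv_v7.cpp provides raw_cudnn_convolution_forward_out.
#endif
```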

caffe2/CMakeLists.txt

Lines changed: 9 additions & 0 deletions

```diff
@@ -1304,6 +1304,15 @@ elseif(USE_ROCM)
   target_compile_definitions(torch_hip PRIVATE "-DTORCH_HIP_BUILD_MAIN_LIB")
 endif()
 
+if(USE_EXPERIMENTAL_CUDNN_V8_API)
+  if(BUILD_SPLIT_CUDA)
+    target_compile_definitions(torch_cuda_cu PRIVATE "-DUSE_EXPERIMENTAL_CUDNN_V8_API")
+    target_compile_definitions(torch_cuda_cpp PRIVATE "-DUSE_EXPERIMENTAL_CUDNN_V8_API")
+  elseif(USE_CUDA)
+    target_compile_definitions(torch_cuda PRIVATE "-DUSE_EXPERIMENTAL_CUDNN_V8_API")
+  endif()
+endif()
+
 set(EXPERIMENTAL_SINGLE_THREAD_POOL "0" CACHE STRING
   "Experimental option to use a single thread pool for inter- and intra-op parallelism")
 if("${EXPERIMENTAL_SINGLE_THREAD_POOL}")
```

cmake/Dependencies.cmake

Lines changed: 6 additions & 0 deletions

```diff
@@ -1189,6 +1189,12 @@ if(USE_CUDA)
   endif()
 endif()
 
+# ---[ cuDNN
+if(USE_CUDNN)
+  set(CUDNN_FRONTEND_INCLUDE_DIR ${CMAKE_CURRENT_LIST_DIR}/../third_party/cudnn_frontend/include)
+  include_directories(${CUDNN_FRONTEND_INCLUDE_DIR})
+endif()
+
 # ---[ HIP
 if(USE_ROCM)
   include(${CMAKE_CURRENT_LIST_DIR}/public/LoadHIP.cmake)
```

cmake/Summary.cmake

Lines changed: 1 addition & 0 deletions

```diff
@@ -74,6 +74,7 @@ function(caffe2_print_configuration_summary)
   message(STATUS " Split CUDA : ${BUILD_SPLIT_CUDA}")
   message(STATUS " CUDA static link : ${CAFFE2_STATIC_LINK_CUDA}")
   message(STATUS " USE_CUDNN : ${USE_CUDNN}")
+  message(STATUS " USE_EXPERIMENTAL_CUDNN_V8_API: ${USE_EXPERIMENTAL_CUDNN_V8_API}")
   message(STATUS " CUDA version : ${CUDA_VERSION}")
   if(${USE_CUDNN})
     message(STATUS " cuDNN version : ${CUDNN_VERSION}")
```

setup.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -332,7 +332,7 @@ def not_exists_or_empty(folder):
         print('Please run:\n\tgit submodule update --init --recursive')
         sys.exit(1)
     for folder in folders:
-        check_for_files(folder, ["CMakeLists.txt", "Makefile", "setup.py", "LICENSE"])
+        check_for_files(folder, ["CMakeLists.txt", "Makefile", "setup.py", "LICENSE", "LICENSE.txt"])
     check_for_files(os.path.join(third_party_path, 'fbgemm', 'third_party',
                                  'asmjit'), ['CMakeLists.txt'])
     check_for_files(os.path.join(third_party_path, 'onnx', 'third_party',
```

third_party/cudnn_frontend

Submodule cudnn_frontend added at 51e60d8
