Refactor the C code template in third_party/nvidia/backend/driver.py
#4722
Conversation
LGTM
def gen_c_def_macro(macro_name, macro_value):
    return f"#define {macro_name} {macro_value}\n"

# macros to define:
"""
#define EXTRA_INNER_LAUNCH_PARAM_DECLS
#define INNER_LAUNCH_CUDA_CHECK_ARGS
#define LAUNCH_PY_ARGS
#define PY_ARG_FORMAT_STR
#define EXTRA_LAUNCH_PARSE_PY_ARGS
#define DEVICE_PTR_INFO_VARS
#define TMA_DESC_VARS
#define EXTRA_INNER_LAUNCH_CALL_ARGS
"""
macro_defs = gen_c_def_macro("EXTRA_INNER_LAUNCH_PARAM_DECLS", ", " + arg_decls if arg_decls else "")
macro_defs += gen_c_def_macro("INNER_LAUNCH_CUDA_CHECK_ARGS", ', '.join(f"&arg{i}" for i in params))
macro_defs += gen_c_def_macro("LAUNCH_PY_ARGS",
                              ';'.join([f"{_extracted_type(ty)} _arg{i}" for i, ty in signature.items()]))
macro_defs += gen_c_def_macro("PY_ARG_FORMAT_STR", f'"{format}"')
macro_defs += gen_c_def_macro("EXTRA_LAUNCH_PARSE_PY_ARGS", ", " + args_list if args_list else "")
device_ptr_info_var_list = []
tma_desc_var_list = []
for i, ty in signature.items():
    if ty[0] == "*":
        device_ptr_info_var_list.append(
            f"DevicePtrInfo ptr_info{i} = getPointer(_arg{i}, {i}); if (!ptr_info{i}.valid) return NULL;")
    elif ty == "nvTmaDesc":
        tma_desc_var_list.append(f"CUtensorMap* tma_ptr{i} = getTmaDesc(_arg{i}); if (!tma_ptr{i}) return NULL;")

macro_defs += gen_c_def_macro("DEVICE_PTR_INFO_VARS", " \\\n".join(device_ptr_info_var_list))
macro_defs += gen_c_def_macro("TMA_DESC_VARS", " \\\n".join(tma_desc_var_list))
extra_inner_launch_call_args = ', '.join(internal_args_list)
macro_defs += gen_c_def_macro("EXTRA_INNER_LAUNCH_CALL_ARGS",
                              ', ' + extra_inner_launch_call_args if extra_inner_launch_call_args else "")
src = macro_defs + Path(os.path.join(dirname, "cuda_launcher.c")).read_text()
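For orientation: the macro prelude built above is prepended to the fixed cuda_launcher.c before compilation. The sketch below is illustrative only, not the actual contents of cuda_launcher.c in this PR. It hard-codes a stand-in prelude for a hypothetical kernel taking one int32 and one device pointer, and shows how the template side could consume two of the macros.

```c
/* Illustrative only -- not the actual cuda_launcher.c from this PR. */
#include <cuda.h>
#include <stdint.h>

/* Stand-in for the prelude driver.py would generate, here for a hypothetical
 * kernel taking one int32 and one device pointer argument: */
#define EXTRA_INNER_LAUNCH_PARAM_DECLS , int32_t arg0, CUdeviceptr arg1
#define INNER_LAUNCH_CUDA_CHECK_ARGS &arg0, &arg1

/* Fixed template code: the per-signature pieces are spliced in by the macros. */
static CUresult inner_launch(CUfunction function, CUstream stream, int gridX,
                             int gridY, int gridZ, int num_warps,
                             int shared_memory EXTRA_INNER_LAUNCH_PARAM_DECLS) {
  void *params[] = {INNER_LAUNCH_CUDA_CHECK_ARGS};
  return cuLaunchKernel(function, gridX, gridY, gridZ, 32 * num_warps, 1, 1,
                        shared_memory, stream, params, NULL);
}
```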
Personally I find this harder to read/modify than the previous version. Can we move only the functions that don't need patching into a separate file and keep the rest in a template kind of way (and maybe break up functions if it helps move more code out of Python)?
Fly-by comment here, since this seems quite related to what we did with the launcher internally at Google: I don't think we need Python string interpolation at all; we could completely precompile a generic version of the launcher. @ThomasRaoux, would this be something interesting for you? If so, I could work on that (or if @sfzhu93 wants to take that on, I'm super happy to have that support!).

I wanted to try upstreaming this, but didn't get to it yet. The best I have right now is this patch I can share (I don't have a machine with command-line git at the moment to make a proper PR out of it 😅): https://gist.github.com/gflegar/6fea5e50023a69e17c64e73eb0037cab. It should hopefully
(Sorry for not making this more polished; lots of stuff is going on right now. I can definitely polish it more in a few weeks' time, but I just wanted to bring this option up since things seem to be moving on this.)
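For readers who want a concrete picture of the precompiled-launcher idea described in the comment above, here is a rough sketch. It is not the linked gist and not Google's internal implementation; all names, types, and structure are assumptions. The point is that a single compiled launcher can pack type-tagged argument slots into the void*[] that cuLaunchKernel expects, so no per-signature C code has to be generated or compiled at runtime.

```c
/* Rough sketch of a precompiled, signature-generic launcher (hypothetical). */
#include <cuda.h>
#include <stdint.h>

/* Type tags would drive the Python-object -> ArgSlot conversion (not shown);
 * that conversion table can be built once per kernel at compile time. */
typedef enum { ARG_I32, ARG_I64, ARG_F32, ARG_DEVPTR } ArgTag;

typedef union {
  int32_t i32;
  int64_t i64;
  float f32;
  CUdeviceptr devptr;
} ArgSlot;

/* Launch a kernel whose arguments were already converted into ArgSlot values.
 * The hot path does no string formatting and no heap allocation. */
static CUresult launch_generic(CUfunction fn, CUstream stream,
                               unsigned gridX, unsigned gridY, unsigned gridZ,
                               unsigned blockX, unsigned sharedBytes,
                               int nargs, ArgSlot *slots) {
  void *params[64]; /* assumed upper bound on kernel arity for this sketch */
  for (int i = 0; i < nargs; ++i) {
    /* Each kernel parameter is passed by address; cuLaunchKernel reads the
     * right number of bytes based on the kernel's own signature. */
    params[i] = &slots[i];
  }
  return cuLaunchKernel(fn, gridX, gridY, gridZ, blockX, 1, 1, sharedBytes,
                        stream, params, NULL);
}
```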
Interesting, did you measure the dispatch cost of doing that?
@gflegar @ThomasRaoux are you happy to temporarily move forward with my approach? I will remove the macros and wrap the code that will be generated by Python into separate functions.
Closing this for now; if we have a strong motivation, we could discuss going with the pre-compiled launcher.
Oh, I completely missed this, since I was out for a while and got a huge email backlog.

I didn't measure it yet, since I can't run the upstream version in our internal environment (we don't have a C compiler available at runtime, hence we patched Triton to precompile it), and I can't run our patched version in OSS since I don't have CMake files for it yet. So getting an apples-to-apples comparison is somewhat difficult at this point. I will definitely run those measurements if I try to upstream it. But I don't think it should be that bad, since I was specifically trying to implement it in a way that minimizes launch overhead: there are no memory allocations, and hopefully only inexpensive Python API calls are needed to launch the kernel. The expensive part of figuring out and caching the signature is done at kernel compile time.

Do you have a specific set of utilities / methodology you use to evaluate the dispatch cost that you could point me to, so I use the same thing as you do when I attempt to evaluate it, @ThomasRaoux?
Refactor the C code template in driver.py

Previously, third_party/nvidia/backend/driver.py contained a long format string that defined the C source code of the launcher. In this commit, I moved that long format string from the Python code into a separate C file and use macros on the driver.py side to fill in the missing parts of the C code. This improves readability and makes future extensions to driver.py easier. (A hypothetical example of the generated macro prelude is sketched after the checklist below.)

- I am not making a trivial change, such as fixing a typo in a comment.
- I have written a PR description following these rules.
- I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- This PR does not need a test because it doesn't introduce any new features or bug fixes.
- I have not added any lit tests.
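To make the description concrete, here is a hypothetical example of the prelude that the gen_c_def_macro calls in the diff would emit for a kernel whose signature is roughly (int32 _arg0, float* _arg1, nvTmaDesc _arg2). The expansions are derived from the Python snippet above; the real output also depends on arg_decls, params, and internal_args_list, which are not shown in the hunk, so the exact text in the PR may differ.

```c
/* Hypothetical generated prelude, prepended to cuda_launcher.c by driver.py.
 * Expansions derived from the gen_c_def_macro calls in the diff; the real
 * output depends on values not visible in the hunk. */
#define INNER_LAUNCH_CUDA_CHECK_ARGS &arg0, &arg1, &arg2
#define DEVICE_PTR_INFO_VARS DevicePtrInfo ptr_info1 = getPointer(_arg1, 1); if (!ptr_info1.valid) return NULL;
#define TMA_DESC_VARS CUtensorMap* tma_ptr2 = getTmaDesc(_arg2); if (!tma_ptr2) return NULL;
```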