Skip to content

Commit

Permalink
[Feature] Fix guided decoding blocking bitmask memcpy (vllm-project#1…
Browse files Browse the repository at this point in the history
…2563)

**[Guided decoding performance optimization]** Sending the guided
decoding bitmask in xgrammar to the GPU
(`self.token_bitmask.to(scores.device)`) is a blocking operation that
prevents the CPU from pre-launching the sampler kernels. The CPU waits
until decode is complete, then copies the bitmask over. This PR changes
the operation to async via setting `non-blocking=True`.

(Current) The CPU is blocked on a `cudaStreamSynchronize` and only
pre-empts the sampling kernels after bitmask application. Below is the
Nsys profile for one decode phase from Llama 3.1 8B.

![image](https://github.com/user-attachments/assets/8997eae1-b822-4f52-beb8-ef19a7c6b824)

With the optimization, this is no longer the case:

![image](https://github.com/user-attachments/assets/6d5ea83f-f169-4f98-a8c1-41c719b3e1e7)

---------

Signed-off-by: Ryan N <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
  • Loading branch information
xpbowler authored and Isotr0py committed Feb 2, 2025
1 parent 514d439 commit df7ab19
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions vllm/model_executor/guided_decoding/xgrammar_decoding.py
Original file line number Diff line number Diff line change
Expand Up @@ -307,8 +307,8 @@ def __call__(self, input_ids: list[int],
# Note: In this method, if the tensors have different dimensions
# on CPU device fails, but on GPU it runs without error. Hence the
# unsqueeze above for scores, to match the token bitmask shape
xgr.apply_token_bitmask_inplace(scores,
self.token_bitmask.to(scores.device))
xgr.apply_token_bitmask_inplace(
scores, self.token_bitmask.to(scores.device, non_blocking=True))
if device_type != "cuda":
scores = scores.to(dtype).to(device_type).squeeze()

Expand Down

0 comments on commit df7ab19

Please sign in to comment.