Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[Feature] Fix guided decoding blocking bitmask memcpy (vllm-project#1…
…2563) **[Guided decoding performance optimization]** Sending the guided decoding bitmask in xgrammar to the GPU (`self.token_bitmask.to(scores.device)`) is a blocking operation that prevents the CPU from pre-launching the sampler kernels. The CPU waits until decode is complete, then copies the bitmask over. This PR changes the operation to async via setting `non-blocking=True`. (Current) The CPU is blocked on a `cudaStreamSynchronize` and only pre-empts the sampling kernels after bitmask application. Below is the Nsys profile for one decode phase from Llama 3.1 8B. ![image](https://github.com/user-attachments/assets/8997eae1-b822-4f52-beb8-ef19a7c6b824) With the optimization, this is no longer the case: ![image](https://github.com/user-attachments/assets/6d5ea83f-f169-4f98-a8c1-41c719b3e1e7) --------- Signed-off-by: Ryan N <[email protected]> Signed-off-by: Isotr0py <[email protected]>
- Loading branch information