
[Feature] Update apply mask kernels #128

Merged (4 commits) on Dec 12, 2024

Conversation

Ubospica (Collaborator) commented Dec 12, 2024

Previously, the apply-mask kernel computed:

for i in range(len(indices)):
    for j in range(vocab_size):
        if get_bitmask_value(bitmask, i, j) == 0:
            logits[indices[i], j] = -inf

This means the bitmask has shape (len(indices), bitmask_size), where len(indices) is the number of structured-generation requests in a batch. Because this number varies from batch to batch, the bitmask shape varies as well.

This PR makes the bitmask shape consistent across batches, which simplifies bitmask allocation. It fixes the shape to (batch_size, bitmask_size) and changes the apply-mask kernel to:

for batch_id in indices:
    for j in range(vocab_size):
        if get_bitmask_value(bitmask, batch_id, j) == 0:
            logits[batch_id, j] = -inf
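For reference, the new kernel's semantics can be sketched in plain Python. The 32-tokens-per-int32 bit packing (so `bitmask_size = ceil(vocab_size / 32)`) and the `apply_token_bitmask_reference` name are illustrative assumptions, not taken from the PR; the actual kernels may pack and dispatch differently.

```python
def apply_token_bitmask_reference(logits, bitmask, indices, vocab_size):
    """Mask logits in place: set logits[batch_id][j] = -inf for every
    token j whose bitmask bit is 0, for each batch row in `indices`.

    Assumes (for illustration) that the bitmask packs 32 tokens per
    int32 word, so word j // 32, bit j % 32 holds token j's flag.
    """
    for batch_id in indices:
        for j in range(vocab_size):
            bit = (bitmask[batch_id][j // 32] >> (j % 32)) & 1
            if bit == 0:
                logits[batch_id][j] = float("-inf")
```

Note that rows of `logits` not listed in `indices` are left untouched, which is what makes the fixed (batch_size, bitmask_size) layout safe: non-structured requests in the batch simply skip masking.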

GPU kernel benchmarks (`test_apply_token_bitmask_inplace_large`, parameters as shown in the pytest IDs):

| Test parameters | `apply_token_bitmask_inplace_cuda` time (us) |
| --- | --- |
| True-1-128000-1024-1 | 5.82 |
| True-1-128000-120000-1 | 6.12 |
| True-1-128001-120000-1 | 5.85 |
| True-1-128010-120000-1 | 5.76 |
| True-64-128000-1024-1 | 19.75 |
| True-64-128000-120000-1 | 62.41 |
| True-64-128000-1024-4 | 20.55 |
| True-64-128000-120000-4 | 33.13 |

It also removes the CUDA kernel implementation.

@Ubospica Ubospica merged commit dd7feea into mlc-ai:main Dec 12, 2024
1 check passed