
[Feature] Update apply mask kernels #128

Merged (4 commits) on Dec 12, 2024

Conversation

Ubospica (Collaborator) commented Dec 12, 2024

Previously, the apply-mask kernel computed:

for i in range(len(indices)):
    for j in range(vocab_size):
        if get_bitmask_value(bitmask, i, j) == 0:
            logits[indices[i], j] = -inf

This means the bitmask has shape (len(indices), bitmask_size), where len(indices) is the number of structured-generation requests in a batch. Because this number varies from batch to batch, the bitmask shape varies as well.

This PR makes the bitmask shape consistent across batches, which simplifies bitmask allocation. It fixes the shape to (batch_size, bitmask_size) and changes the apply-mask kernel to:

for batch_id in indices:
    for j in range(vocab_size):
        if get_bitmask_value(bitmask, batch_id, j) == 0:
            logits[batch_id, j] = -inf
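For reference, the new kernel's semantics can be sketched in plain Python. The 32-tokens-per-int32 bit packing (so `bitmask_size = ceil(vocab_size / 32)`) and the `apply_token_bitmask_reference` name are illustrative assumptions, not taken from the PR; the actual kernels may pack and dispatch differently.

```python
def apply_token_bitmask_reference(logits, bitmask, indices, vocab_size):
    """Mask logits in place: set logits[batch_id][j] = -inf for every
    token j whose bitmask bit is 0, for each batch row in `indices`.

    Assumes (for illustration) that the bitmask packs 32 tokens per
    int32 word, so word j // 32, bit j % 32 holds token j's flag.
    """
    for batch_id in indices:
        for j in range(vocab_size):
            bit = (bitmask[batch_id][j // 32] >> (j % 32)) & 1
            if bit == 0:
                logits[batch_id][j] = float("-inf")
```

Note that rows of `logits` not listed in `indices` are left untouched, which is what makes the fixed (batch_size, bitmask_size) layout safe: non-structured requests in the batch simply skip masking.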

GPU kernel benchmarks (`test_apply_token_bitmask_inplace_large`, parameters as shown in the pytest IDs):

| Test parameters | `apply_token_bitmask_inplace_cuda` time (us) |
| --- | --- |
| True-1-128000-1024-1 | 5.82 |
| True-1-128000-120000-1 | 6.12 |
| True-1-128001-120000-1 | 5.85 |
| True-1-128010-120000-1 | 5.76 |
| True-64-128000-1024-1 | 19.75 |
| True-64-128000-120000-1 | 62.41 |
| True-64-128000-1024-4 | 20.55 |
| True-64-128000-120000-4 | 33.13 |

It also removes the CUDA kernel implementation.

@Ubospica Ubospica merged commit dd7feea into mlc-ai:main Dec 12, 2024
1 check passed