support mac metal #93

Open
zuowenjian opened this issue Nov 13, 2024 · 5 comments

Comments

@zuowenjian

The candle project has already released version 0.8, which supports Metal acceleration on macOS. I would like candle-vllm to support Metal acceleration on macOS as well.

@guoqingbao
Collaborator

Thank you for the suggestion! I recently acquired an M4 device, which I believe will help facilitate further development on macOS. However, the primary goal of this project is to port vLLM (Paged Attention) to the Rust platform. At this time, I am unsure whether Paged Attention is supported or can be implemented using Metal. If you're specifically interested in LLM inference on macOS within the Rust ecosystem, you might want to explore another project by Eric @EricLBuehler that focuses on this area (mistral.rs).

guoqingbao added the enhancement (New feature or request) label on Dec 11, 2024
guoqingbao self-assigned this on Dec 11, 2024
EricLBuehler self-assigned this on Dec 22, 2024
@EricLBuehler
Owner

Hi @zuowenjian! I added CUDA PagedAttention to mistral.rs some time ago, but I am now actively working on a Metal implementation.

@guoqingbao congrats on the M4! The PR implementing PagedAttention on Metal is here: EricLBuehler/mistral.rs#1001. It already compiles and runs (I've fully integrated the kernels, backend interop, scheduling, etc.), but it doesn't produce correct output. I'm not sure how much experience you have with Metal, but perhaps you have time to take a look? I can add you as a collaborator so you can push to the branch if you find anything!

@guoqingbao
Collaborator

Great work, Eric! Supporting Paged Attention on Apple Silicon is a promising direction. I've reviewed your PR, but I'm not sure the Metal kernel for Paged Attention that you ported from CUDA works as intended. Have you tested the paged_attention Metal kernel in isolation before integrating it into mistral.rs? That might help narrow down the cause of the incorrect output.
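
One way to test the kernel in isolation is to compare its output against a plain CPU reference on a small, fixed input. The sketch below is only an illustration of that idea and is not code from mistral.rs or candle-vllm: the function names, the single-head single-query setup, and the flat `[num_blocks][block_size][head_dim]` cache layout are all assumptions chosen for brevity (real kernels typically use a different, vectorized cache layout). It checks that attention read through a block table matches attention over a contiguous KV sequence.

```rust
// Hypothetical CPU reference for checking a paged attention kernel.
// Assumed layout: caches are flat [num_blocks][block_size][head_dim] arrays.

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn softmax(scores: &[f32]) -> Vec<f32> {
    // Subtract the maximum for numerical stability.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Single-head attention for one query over contiguous K/V of shape [context_len][d].
fn attention_contiguous(q: &[f32], k: &[f32], v: &[f32], context_len: usize) -> Vec<f32> {
    let d = q.len();
    let scale = 1.0 / (d as f32).sqrt();
    let scores: Vec<f32> = (0..context_len)
        .map(|t| dot(q, &k[t * d..(t + 1) * d]) * scale)
        .collect();
    let w = softmax(&scores);
    let mut out = vec![0.0f32; d];
    for t in 0..context_len {
        for i in 0..d {
            out[i] += w[t] * v[t * d + i];
        }
    }
    out
}

/// Same computation, but token t is looked up through the block table.
fn attention_paged(
    q: &[f32],
    k_cache: &[f32],
    v_cache: &[f32],
    block_table: &[usize],
    block_size: usize,
    context_len: usize,
) -> Vec<f32> {
    let d = q.len();
    let scale = 1.0 / (d as f32).sqrt();
    // Offset of token t: physical block from the table, then the slot inside it.
    let slot = |t: usize| (block_table[t / block_size] * block_size + t % block_size) * d;
    let scores: Vec<f32> = (0..context_len)
        .map(|t| dot(q, &k_cache[slot(t)..slot(t) + d]) * scale)
        .collect();
    let w = softmax(&scores);
    let mut out = vec![0.0f32; d];
    for t in 0..context_len {
        let s = slot(t);
        for i in 0..d {
            out[i] += w[t] * v_cache[s + i];
        }
    }
    out
}

/// Builds a small deterministic case and returns the maximum absolute difference.
fn check_paged_matches_contiguous() -> f32 {
    let (d, block_size, context_len, num_blocks) = (8usize, 4usize, 10usize, 4usize);
    let block_table = [2usize, 0, 3]; // logical block -> physical block
    let q: Vec<f32> = (0..d).map(|i| (i as f32 * 0.37).sin()).collect();
    let k: Vec<f32> = (0..context_len * d).map(|i| (i as f32 * 0.11).cos()).collect();
    let v: Vec<f32> = (0..context_len * d).map(|i| (i as f32 * 0.23).sin()).collect();
    // Scatter the contiguous K/V into the paged caches.
    let mut k_cache = vec![0.0f32; num_blocks * block_size * d];
    let mut v_cache = vec![0.0f32; num_blocks * block_size * d];
    for t in 0..context_len {
        let s = (block_table[t / block_size] * block_size + t % block_size) * d;
        k_cache[s..s + d].copy_from_slice(&k[t * d..(t + 1) * d]);
        v_cache[s..s + d].copy_from_slice(&v[t * d..(t + 1) * d]);
    }
    let expected = attention_contiguous(&q, &k, &v, context_len);
    let got = attention_paged(&q, &k_cache, &v_cache, &block_table, block_size, context_len);
    expected.iter().zip(&got).map(|(a, b)| (a - b).abs()).fold(0.0, f32::max)
}
```

In an actual kernel test, `got` would be the buffer read back from the Metal kernel rather than `attention_paged`, compared against `expected` with a small tolerance. A non-trivial block table such as `[2, 0, 3]` helps expose indexing mistakes that an identity mapping would hide.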

@guoqingbao
Collaborator

Given that Eric has successfully ported paged attention kernels to the Metal platform (EricLBuehler/mistral.rs#1001), we can now support Mac devices in the candle-vllm project.

@EricLBuehler
Owner

@guoqingbao thanks for the great work in #97!

@zuowenjian I think that Mac Metal is now supported!
