Add Baseline for SGLang Benchmark Test #602
Add sgl server benchmark to workflow file, Restructure `app_tests/benchmark_tests`
Temporarily comment out shortfin job to verify sglang benchmark job
Update benchmark tests to download model on demand
…om shortfin/sharktank
Add disable-cuda-graph option to allow server to properly run
…stbaione/sgl-benchmark-add-baseline
…but differing answers to be accepted
Link to successful run: With SRT (SGLang Runtime): With Shortfin:
Looks good to me! Would be nice to get @ScottTodd's look too if he's got time.
Add back step to clean up docker image
Always use python3.11 for merging reports, Make merging reports one step, Temporarily enable PR trigger for validation
… but are dependent on each other's success
Make `merge_and_upload_reports` run conditionally on either succeeding
…ithub.com/nod-ai/shark-ai into users/stbaione/sgl-benchmark-add-baseline
# Description

The SGLang Benchmark Test has been running for a while, but it only benchmarks the shortfin server itself. To get a baseline metric and enable long-term convergence in terms of performance, we need to be able to track metrics of the SGLang server using the same benchmark method.

This adds an `sglang_benchmark_test` to complement the `shortfin_benchmark_test`. It also restructures `app_tests/benchmark_tests/llm` -> `app_tests/benchmark_tests/llm/sglang_benchmarks`. This keeps the benchmark tests organized and allows the folder to be extended with other types of benchmarks in the future.

# Why are we using docker to start the SGLang server?

Currently, the `pyproject.toml` file inside of SGLang requires `vllm==0.6.3.dev13` to run on ROCm. I looked into building vLLM from source for this test, but couldn't find a branch, tag, or release that matched that version. From their own comments inside of `pyproject.toml`, it appears to be available only inside of a `ROCm` base image:

```toml
# HIP (Heterogeneous-computing Interface for Portability) for AMD
# => base docker rocm/vllm-dev:20241022, not from public vllm whl
srt_hip = ["sglang[runtime_common]", "torch", "vllm==0.6.3.dev13"]
```

Their [instructions](https://sgl-project.github.io/start/install.html#method-3-using-docker) on installing and running SGLang for ROCm also appear to suggest the docker method:

## Instructions from their docs for running with ROCm

```
docker build --build-arg SGL_BRANCH=v0.3.5.post2 -t v0.3.5.post2-rocm620 -f Dockerfile.rocm .

alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --ipc=host \
    --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v $HOME/dockerx:/dockerx -v /data:/data'

drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    v0.3.5.post2-rocm620 \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```

The workflow file handles starting the container and cleaning up once the workflow is done. I set the timeout for waiting for the server to start to `10 minutes` to give the SGLang server enough time to load the necessary model weights and start up.
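For illustration, the 10-minute startup wait described above could be sketched roughly as follows. This is not the code from this PR: `http_ready`, `wait_for_server`, and the 5-second poll interval are hypothetical names and values chosen for the example, and the readiness URL is an assumption. Only the 10-minute budget and port 30000 come from the description.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ready(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200 (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx reply: not ready yet.
        return False


def wait_for_server(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,  # 10 minutes, matching the description above
    interval_s: float = 5.0,   # assumed poll interval, not taken from the PR
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns True or `timeout_s` has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller might use it as `wait_for_server(lambda: http_ready("http://localhost:30000/health"))`, where the `/health` path is an assumption about the server's readiness endpoint. Passing the clock and sleep functions as parameters keeps the loop testable without real waiting.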