add MT_BENCH to pipeline #34

Merged: 4 commits into opendatahub-io:main on Sep 25, 2024

Conversation

@sallyom (Collaborator) commented Sep 20, 2024

This adds MT_Bench evaluation. It uses the ilab image and runs `ilab model evaluate --benchmark mt_bench`.
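
(Aside: a minimal sketch, not code from this PR, of how a pipeline step could shell out to that command from Python; any extra flags are omitted because the description does not spell them out.)

import subprocess

def run_mt_bench() -> None:
    # Run the evaluation exactly as the description states; check=True turns a
    # failed evaluation into a failed pipeline step.
    subprocess.run(
        ["ilab", "model", "evaluate", "--benchmark", "mt_bench"],
        check=True,
    )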

@sallyom changed the title from "WIP - add MT_BENCH to pipeline" to "add MT_BENCH to pipeline" on Sep 20, 2024
@sallyom force-pushed the eval-mtbench branch 2 times, most recently from aabd1bf to 08cfd37, on September 20, 2024 14:04
Resolved review threads (outdated):
training/deployment.yaml
training/ilab-multi-node-multi-gpu-pytorch-job.yaml
training/ilab-pytorch-job.yaml
training/components.py
utils/components.py
eval/mt_bench/components.py (4 threads)
@sallyom force-pushed the eval-mtbench branch 17 times, most recently from 50c3fe4 to 9c21301, on September 25, 2024 01:12
Comment on lines +48 to +83
# This seems like excessive effort to stop the vllm process, but merely saving & killing the pid doesn't work
# Also, the base image does not include `pkill` cmd, so can't pkill -f vllm.entrypoints.openai.api_server either
def stop_vllm_server_by_name():
    import psutil

    for process in psutil.process_iter(attrs=["pid", "name", "cmdline"]):
        cmdline = process.info.get("cmdline")
        if cmdline and "vllm.entrypoints.openai.api_server" in cmdline:
            print(f"Found vLLM server process with PID: {process.info['pid']}, terminating...")
            try:
                process.terminate()  # Try graceful termination
                process.wait(timeout=5)  # Wait a bit for it to terminate
                if process.is_running():
                    print(f"Forcefully killing vLLM server process with PID: {process.info['pid']}")
                    process.kill()  # Force kill if it's still running
                print(f"Successfully stopped vLLM server with PID: {process.info['pid']}")
            except psutil.NoSuchProcess:
                print(f"Process with PID {process.info['pid']} no longer exists.")
            except psutil.AccessDenied:
                print(f"Access denied when trying to terminate process with PID {process.info['pid']}.")
            except Exception as e:
                print(f"Failed to terminate process with PID {process.info['pid']}. Error: {e}")
Collaborator commented:
It was not working in the first place because vLLM does not handle signals properly. It expects to be run from the CLI and thus stopped via SIGINT.

If you use a process group in the instantiation above, you can get rid of the current PID search logic and simply do:

process_group_id = os.getpgid(process.pid)
process.send_signal(signal.SIGINT)
try:
    print("Waiting for vLLM server to shut down gracefully")
    process.wait(timeout)
except subprocess.TimeoutExpired:
    print(
        f"Sending SIGKILL to vLLM server since timeout ({timeout}s) expired"
    )
    process.kill()

# Attempt to cleanup any remaining child processes
# Make sure process_group is legit (> 1) before trying
if process_group_id and process_group_id > 1:
    try:
        os.killpg(process_group_id, signal.SIGKILL)
        print("Sent SIGKILL to vLLM process group")
    except ProcessLookupError:
        print("Nothing left to clean up with the vLLM process group")
else:
    print("vLLM process group id not found")

Collaborator Author replied:

Thanks! Please add in a follow-up, I need to be done with this PR 🤣

Resolved review thread (outdated): eval/mt_bench/components.py
@sallyom (Collaborator, author) commented Sep 25, 2024

@leseb how about if we merge this, then you add the PID/Kill changes in a follow-up?

@leseb (Collaborator) commented Sep 25, 2024

> @leseb how about if we merge this, then you add the PID/Kill changes in a follow-up?

Sounds good :).

import requests

command = f"nohup python3.11 -m vllm.entrypoints.openai.api_server --model {model_path} &"
subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
Collaborator commented:

Suggested change:
- subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+ subprocess.Popen(args=command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

Why subprocess.PIPE? Neither stdout nor stderr is used anywhere.

Collaborator Author replied:

relic from another try, another time - I'll remove those
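
(Aside: a minimal sketch, not code from this PR. Since nothing reads stdout/stderr, the child can simply inherit the parent's streams, and readiness can be detected by polling the OpenAI-compatible /v1/models route instead of reading output. The default port 8000 and the retry/sleep values are assumptions.)

import time

import requests

def wait_for_vllm(base_url: str = "http://localhost:8000", retries: int = 60) -> None:
    # Poll the server until it answers instead of inspecting its output.
    for _ in range(retries):
        try:
            if requests.get(f"{base_url}/v1/models", timeout=5).ok:
                print("vLLM server is ready")
                return
        except requests.exceptions.ConnectionError:
            pass
        time.sleep(5)
    raise RuntimeError("vLLM server did not become ready in time")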

Resolved review thread (outdated): eval/mt_bench/components.py

EVAL_IMAGE = "quay.io/sallyom/instructlab-ocp:eval"

@component(base_image=EVAL_IMAGE, packages_to_install=["vllm"])
Collaborator commented:

can we enforce a particular version? @git+https://github.com/opendatahub-io/[email protected]

Member commented:

Or maybe bake it into the image? It's our custom image anyway... Installing at runtime without a lockfile is not safe.

Collaborator Author replied:

That was always my plan, but it wasn't convenient while iterating on the 1000 different ways I tried to get vllm working properly - I'll add that to the Containerfile

Collaborator Author replied:

@tumido I'll update the Containerfile in a follow-up & remove the runtime install

Collaborator Author replied:

Yes - will do that in a follow-up and move the pip install to the Containerfile.
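
(Aside: a minimal sketch, not code from this PR, of the two options discussed in this thread, assuming the kfp dsl.component decorator already used above. The git URL, tag, and function names are placeholders, not real references.)

from kfp.dsl import component

EVAL_IMAGE = "quay.io/sallyom/instructlab-ocp:eval"

# Option 1: keep the runtime install but pin it to an exact ref
# (placeholder org/repo/tag below).
@component(
    base_image=EVAL_IMAGE,
    packages_to_install=["vllm @ git+https://github.com/<org>/<repo>@<pinned-tag>"],
)
def run_mt_bench_pinned():
    ...

# Option 2: bake the dependency into the eval image's Containerfile and drop
# packages_to_install entirely, so nothing is installed at runtime.
@component(base_image=EVAL_IMAGE)
def run_mt_bench_baked():
    ...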

Resolved review thread: eval/mt_bench/components.py
Resolved review thread (outdated): eval/mt_bench_pipeline.py
@sallyom force-pushed the eval-mtbench branch 2 times, most recently from 5c53958 to be01551, on September 25, 2024 15:39
Resolved review thread (outdated): eval/mt_bench/components.py
@sallyom force-pushed the eval-mtbench branch 2 times, most recently from 10d77bb to 948971d, on September 25, 2024 17:45
@MichaelClifford (Collaborator) left a comment:

LGTM

@cooktheryan merged commit 05ff202 into opendatahub-io:main on Sep 25, 2024