add MT_BENCH to pipeline #34

Merged: 4 commits into opendatahub-io:main on Sep 25, 2024

Conversation

@sallyom (Collaborator) commented Sep 20, 2024

This adds MT_Bench evaluation. It uses the ilab image and runs `ilab model evaluate --benchmark mt_bench`.
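
(Aside: a minimal sketch, not code from this PR, of how a pipeline step could shell out to that command from Python; any extra flags are omitted because the description does not spell them out.)

import subprocess

def run_mt_bench() -> None:
    # Run the evaluation exactly as the description states; check=True turns a
    # failed evaluation into a failed pipeline step.
    subprocess.run(
        ["ilab", "model", "evaluate", "--benchmark", "mt_bench"],
        check=True,
    )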

@sallyom changed the title from "WIP - add MT_BENCH to pipeline" to "add MT_BENCH to pipeline" on Sep 20, 2024
@sallyom force-pushed the eval-mtbench branch 2 times, most recently from aabd1bf to 08cfd37, on September 20, 2024 14:04
Resolved review threads (outdated):
training/deployment.yaml
training/ilab-multi-node-multi-gpu-pytorch-job.yaml
training/ilab-pytorch-job.yaml
training/components.py
utils/components.py
eval/mt_bench/components.py (4 threads)
@sallyom force-pushed the eval-mtbench branch 17 times, most recently from 50c3fe4 to 9c21301, on September 25, 2024 01:12
Comment on lines +48 to +83
# This seems like excessive effort to stop the vllm process, but merely saving & killing the pid doesn't work
# Also, the base image does not include `pkill` cmd, so can't pkill -f vllm.entrypoints.openai.api_server either
def stop_vllm_server_by_name():
    import psutil

    for process in psutil.process_iter(attrs=["pid", "name", "cmdline"]):
        cmdline = process.info.get("cmdline")
        if cmdline and "vllm.entrypoints.openai.api_server" in cmdline:
            print(f"Found vLLM server process with PID: {process.info['pid']}, terminating...")
            try:
                process.terminate()  # Try graceful termination
                process.wait(timeout=5)  # Wait a bit for it to terminate
                if process.is_running():
                    print(f"Forcefully killing vLLM server process with PID: {process.info['pid']}")
                    process.kill()  # Force kill if it's still running
                print(f"Successfully stopped vLLM server with PID: {process.info['pid']}")
            except psutil.NoSuchProcess:
                print(f"Process with PID {process.info['pid']} no longer exists.")
            except psutil.AccessDenied:
                print(f"Access denied when trying to terminate process with PID {process.info['pid']}.")
            except Exception as e:
                print(f"Failed to terminate process with PID {process.info['pid']}. Error: {e}")
Collaborator commented:
It was not working in the first place because vLLM does not handle signals properly. It expects to be run from the CLI and thus stopped via SIGINT.

If you use a process group in the instantiation above, you can get rid of the current PID search logic and simply do:

process_group_id = os.getpgid(process.pid)
process.send_signal(signal.SIGINT)
try:
    print("Waiting for vLLM server to shut down gracefully")
    process.wait(timeout)
except subprocess.TimeoutExpired:
    print(
        f"Sending SIGKILL to vLLM server since timeout ({timeout}s) expired"
    )
    process.kill()

# Attempt to cleanup any remaining child processes
# Make sure process_group is legit (> 1) before trying
if process_group_id and process_group_id > 1:
    try:
        os.killpg(process_group_id, signal.SIGKILL)
        print("Sent SIGKILL to vLLM process group")
    except ProcessLookupError:
        print("Nothing left to clean up with the vLLM process group")
else:
    print("vLLM process group id not found")

Collaborator Author replied:

Thanks! Please add in a follow-up, I need to be done with this PR 🤣

Resolved review thread (outdated): eval/mt_bench/components.py
@sallyom (Collaborator, author) commented Sep 25, 2024

@leseb how about if we merge this, then you add the PID/Kill changes in a follow-up?

@leseb (Collaborator) commented Sep 25, 2024

> @leseb how about if we merge this, then you add the PID/Kill changes in a follow-up?

Sounds good :).

import requests

command = f"nohup python3.11 -m vllm.entrypoints.openai.api_server --model {model_path} &"
subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
Collaborator commented:

Suggested change:
- subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+ subprocess.Popen(args=command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

Why subprocess.PIPE? Neither stdout nor stderr is used anywhere.

Collaborator Author replied:

relic from another try, another time - I'll remove those
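
(Aside: a minimal sketch, not code from this PR. Since nothing reads stdout/stderr, the child can simply inherit the parent's streams, and readiness can be detected by polling the OpenAI-compatible /v1/models route instead of reading output. The default port 8000 and the retry/sleep values are assumptions.)

import time

import requests

def wait_for_vllm(base_url: str = "http://localhost:8000", retries: int = 60) -> None:
    # Poll the server until it answers instead of inspecting its output.
    for _ in range(retries):
        try:
            if requests.get(f"{base_url}/v1/models", timeout=5).ok:
                print("vLLM server is ready")
                return
        except requests.exceptions.ConnectionError:
            pass
        time.sleep(5)
    raise RuntimeError("vLLM server did not become ready in time")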

Resolved review thread (outdated): eval/mt_bench/components.py

EVAL_IMAGE = "quay.io/sallyom/instructlab-ocp:eval"

@component(base_image=EVAL_IMAGE, packages_to_install=["vllm"])
Collaborator commented:

can we enforce a particular version? @git+https://github.com/opendatahub-io/[email protected]

Member commented:

Or maybe bake it into the image? It's our custom image anyway... Installing at runtime without a lockfile is not safe.

Collaborator Author replied:

That was always my plan, but it wasn't convenient while iterating on the 1000 different ways I tried to get vllm working properly - I'll add that to the Containerfile

Collaborator Author replied:

@tumido I'll update the Containerfile in a follow-up & remove the runtime install

Collaborator Author replied:

Yes - will do that in a follow-up and move the pip install to the Containerfile.
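
(Aside: a minimal sketch, not code from this PR, of the two options discussed in this thread, assuming the kfp dsl.component decorator already used above. The git URL, tag, and function names are placeholders, not real references.)

from kfp.dsl import component

EVAL_IMAGE = "quay.io/sallyom/instructlab-ocp:eval"

# Option 1: keep the runtime install but pin it to an exact ref
# (placeholder org/repo/tag below).
@component(
    base_image=EVAL_IMAGE,
    packages_to_install=["vllm @ git+https://github.com/<org>/<repo>@<pinned-tag>"],
)
def run_mt_bench_pinned():
    ...

# Option 2: bake the dependency into the eval image's Containerfile and drop
# packages_to_install entirely, so nothing is installed at runtime.
@component(base_image=EVAL_IMAGE)
def run_mt_bench_baked():
    ...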

Resolved review thread: eval/mt_bench/components.py
Resolved review thread (outdated): eval/mt_bench_pipeline.py
@sallyom force-pushed the eval-mtbench branch 2 times, most recently from 5c53958 to be01551, on September 25, 2024 15:39
Resolved review thread (outdated): eval/mt_bench/components.py
@sallyom force-pushed the eval-mtbench branch 2 times, most recently from 10d77bb to 948971d, on September 25, 2024 17:45
@MichaelClifford (Collaborator) left a comment:

LGTM

@cooktheryan merged commit 05ff202 into opendatahub-io:main on Sep 25, 2024