[BFCL] Support for pre-existing completion endpoint (#864)
# URL Endpoint Support for BFCL
This PR is a product of the discussion in #850.

## Description
This PR adds support for using pre-existing OpenAI-compatible endpoints
in BFCL, allowing users to bypass the built-in vLLM/sglang server setup.
This is particularly useful for distributed environments like SLURM
clusters where model serving and benchmarking need to be handled as
separate jobs.
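
For example, on a SLURM cluster the serving and benchmarking steps can be submitted as separate jobs. The sketch below is illustrative only; the vLLM launch command, SBATCH options, and hostname are assumptions and will vary by cluster:

```bash
# Job 1: serve the model with vLLM's OpenAI-compatible server (options are cluster-specific).
sbatch --job-name=serve-model --gres=gpu:1 --wrap \
  "python -m vllm.entrypoints.openai.api_server --model MODEL_NAME --port 8000"

# Job 2: point BFCL at the node serving the model and skip the built-in server setup.
export VLLM_ENDPOINT="node042.cluster"   # hypothetical hostname of the node running Job 1
export VLLM_PORT="8000"
python -m bfcl generate --model MODEL_NAME --backend vllm --skip-server-setup
```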

## Changes
- Added `--skip-server-setup` flag to CLI
- Added environment variable support for endpoint configuration:
  - `VLLM_ENDPOINT` (defaults to 'localhost')
  - `VLLM_PORT` (defaults to the existing `VLLM_PORT` constant)
- Modified OSSHandler to support external endpoints
- Updated documentation for new configuration options

## Usage
Users can now specify custom endpoints in two ways:

1. Using environment variables:
```bash
export VLLM_ENDPOINT="custom.host.com"
export VLLM_PORT="8000"
```

2. Using a `.env` file:
```bash
VLLM_ENDPOINT=custom.host.com
VLLM_PORT=8000
```

Then run BFCL with the `--skip-server-setup` flag:
```bash
python -m bfcl generate --model MODEL_NAME --backend vllm --skip-server-setup
```
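
Before starting a run, you can optionally confirm that the endpoint is reachable and speaks the OpenAI-compatible API. The quick check below is a sketch that queries the standard `/v1/models` route exposed by vLLM, using the same defaults as above:

```bash
curl "http://${VLLM_ENDPOINT:-localhost}:${VLLM_PORT:-1053}/v1/models"
```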

## Related Issue
Closes #850

---------

Co-authored-by: Huanzhi (Hans) Mao <[email protected]>
ThomasRochefortB and HuanzhiMao authored Jan 3, 2025
1 parent 5fe4a87 commit 1729c9b
Showing 6 changed files with 160 additions and 106 deletions.
5 changes: 5 additions & 0 deletions berkeley-function-call-leaderboard/.env.example
@@ -28,5 +28,10 @@ EXCHANGERATE_API_KEY=
OMDB_API_KEY=
GEOCODE_API_KEY=

# [OPTIONAL] For local vllm/sglang server configuration
# Defaults to localhost port 1053 if not provided
VLLM_ENDPOINT=localhost
VLLM_PORT=1053

# [OPTIONAL] Required for WandB to log the generated .csv in the format 'entity:project
WANDB_BFCL_PROJECT=ENTITY:PROJECT
2 changes: 2 additions & 0 deletions berkeley-function-call-leaderboard/CHANGELOG.md
@@ -2,6 +2,8 @@

All notable changes to the Berkeley Function Calling Leaderboard will be documented in this file.

- [Jan 3, 2025] [#864](https://github.com/ShishirPatil/gorilla/pull/864): Add support for pre-existing completion endpoints, allowing users to skip the local vLLM/SGLang server setup (using the `--skip-server-setup` flag) and point the generation pipeline to an existing OpenAI-compatible endpoint via `VLLM_ENDPOINT` and `VLLM_PORT`.
- [Jan 3, 2025] [#859](https://github.com/ShishirPatil/gorilla/pull/859): Rename directories: `proprietary_model` -> `api_inference`, `oss_model` -> `local_inference` for better clarity.
- [Dec 29, 2024] [#857](https://github.com/ShishirPatil/gorilla/pull/857): Add new model `DeepSeek-V3` to the leaderboard.
- [Dec 29, 2024] [#855](https://github.com/ShishirPatil/gorilla/pull/855): Add new model `mistralai/Ministral-8B-Instruct-2410` to the leaderboard.
- [Dec 22, 2024] [#838](https://github.com/ShishirPatil/gorilla/pull/838): Fix parameter type mismatch error in possible answers.
16 changes: 16 additions & 0 deletions berkeley-function-call-leaderboard/README.md
@@ -16,6 +16,7 @@
- [Output and Logging](#output-and-logging)
- [For API-based Models](#for-api-based-models)
- [For Locally-hosted OSS Models](#for-locally-hosted-oss-models)
- [For Pre-existing OpenAI-compatible Endpoints](#for-pre-existing-openai-compatible-endpoints)
- [(Alternate) Script Execution for Generation](#alternate-script-execution-for-generation)
- [Evaluating Generated Responses](#evaluating-generated-responses)
- [(Optional) API Sanity Check](#optional-api-sanity-check)
@@ -155,6 +156,21 @@ bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --backend {vllm|s
- Choose your backend using `--backend vllm` or `--backend sglang`. The default backend is `vllm`.
- Control GPU usage by adjusting `--num-gpus` (default `1`, relevant for multi-GPU tensor parallelism) and `--gpu-memory-utilization` (default `0.9`), which can help avoid out-of-memory errors.
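
For instance, a run that shards the model across two GPUs and lowers the memory utilization cap might look like the following (the values shown are illustrative):

```bash
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --backend vllm --num-gpus 2 --gpu-memory-utilization 0.85
```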

##### For Pre-existing OpenAI-compatible Endpoints

If you have a server already running (e.g., vLLM in a SLURM cluster), you can bypass the vLLM/sglang setup phase and directly generate responses by using the `--skip-server-setup` flag:

```bash
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --skip-server-setup
```

In addition, you should specify the endpoint and port used by the server. By default, the endpoint is `localhost` and the port is `1053`. These can be overridden by the `VLLM_ENDPOINT` and `VLLM_PORT` environment variables in the `.env` file:

```bash
VLLM_ENDPOINT=localhost
VLLM_PORT=1053
```
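
Because these values are read from the environment, they can also be set inline for a single run instead of editing `.env` (the host and port below are illustrative, and this assumes the usual precedence where variables already set in the environment take effect for that run):

```bash
VLLM_ENDPOINT=custom.host.com VLLM_PORT=8000 bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --skip-server-setup
```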

#### (Alternate) Script Execution for Generation

For those who prefer using script execution instead of the CLI, you can run the following command:
6 changes: 6 additions & 0 deletions berkeley-function-call-leaderboard/bfcl/__main__.py
@@ -113,6 +113,11 @@ def generate(
num_threads: int = typer.Option(1, help="The number of threads to use."),
gpu_memory_utilization: float = typer.Option(0.9, help="The GPU memory utilization."),
backend: str = typer.Option("vllm", help="The backend to use for the model."),
skip_server_setup: bool = typer.Option(
False,
"--skip-server-setup",
help="Skip vLLM/SGLang server setup and use existing endpoint specified by the VLLM_ENDPOINT and VLLM_PORT environment variables.",
),
result_dir: str = typer.Option(
RESULT_PATH,
"--result-dir",
@@ -144,6 +149,7 @@
num_threads=num_threads,
gpu_memory_utilization=gpu_memory_utilization,
backend=backend,
skip_server_setup=skip_server_setup,
result_dir=result_dir,
allow_overwrite=allow_overwrite,
run_ids=run_ids,
@@ -50,6 +50,13 @@ def get_args():
parser.add_argument("--result-dir", default=None, type=str)
parser.add_argument("--run-ids", action="store_true", default=False)
parser.add_argument("--allow-overwrite", "-o", action="store_true", default=False)
# Add the new --skip-server-setup argument
parser.add_argument(
"--skip-server-setup",
action="store_true",
default=False,
help="Skip vLLM/SGLang server setup and use existing endpoint specified by the VLLM_ENDPOINT and VLLM_PORT environment variables."
)
args = parser.parse_args()
return args

@@ -232,6 +239,7 @@ def generate_results(args, model_name, test_cases_total):
num_gpus=args.num_gpus,
gpu_memory_utilization=args.gpu_memory_utilization,
backend=args.backend,
skip_server_setup=args.skip_server_setup,
include_input_log=args.include_input_log,
exclude_state_log=args.exclude_state_log,
result_dir=args.result_dir,