[BFCL] Support for pre-existing completion endpoint (#864)
# URL Endpoint Support for BFCL
This PR is a product of the discussion in #850.

## Description
This PR adds support for using pre-existing OpenAI-compatible endpoints
in BFCL, allowing users to bypass the built-in vLLM/sglang server setup.
This is particularly useful for distributed environments like SLURM
clusters where model serving and benchmarking need to be handled as
separate jobs.
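
For example, on a SLURM cluster the serving and benchmarking steps can be submitted as separate jobs. The sketch below is illustrative only; the vLLM launch command, SBATCH options, and hostname are assumptions and will vary by cluster:

```bash
# Job 1: serve the model with vLLM's OpenAI-compatible server (options are cluster-specific).
sbatch --job-name=serve-model --gres=gpu:1 --wrap \
  "python -m vllm.entrypoints.openai.api_server --model MODEL_NAME --port 8000"

# Job 2: point BFCL at the node serving the model and skip the built-in server setup.
export VLLM_ENDPOINT="node042.cluster"   # hypothetical hostname of the node running Job 1
export VLLM_PORT="8000"
python -m bfcl generate --model MODEL_NAME --backend vllm --skip-server-setup
```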

## Changes
- Added `--skip-server-setup` flag to CLI
- Added environment variable support for endpoint configuration:
  - `VLLM_ENDPOINT` (defaults to 'localhost')
  - `VLLM_PORT` (defaults to the existing `VLLM_PORT` constant)
- Modified OSSHandler to support external endpoints
- Updated documentation for new configuration options

## Usage
Users can now specify custom endpoints in two ways:

1. Using environment variables:
```bash
export VLLM_ENDPOINT="custom.host.com"
export VLLM_PORT="8000"
```

2. Using a `.env` file:
```bash
VLLM_ENDPOINT=custom.host.com
VLLM_PORT=8000
```

Then run BFCL with the `--skip-server-setup` flag:
```bash
python -m bfcl generate --model MODEL_NAME --backend vllm --skip-server-setup
```
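
Before starting a run, you can optionally confirm that the endpoint is reachable and speaks the OpenAI-compatible API. The quick check below is a sketch that queries the standard `/v1/models` route exposed by vLLM, using the same defaults as above:

```bash
curl "http://${VLLM_ENDPOINT:-localhost}:${VLLM_PORT:-1053}/v1/models"
```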

## Related Issue
Closes #850

---------

Co-authored-by: Huanzhi (Hans) Mao <[email protected]>
ThomasRochefortB and HuanzhiMao authored Jan 3, 2025
1 parent 5fe4a87 commit 1729c9b
Showing 6 changed files with 160 additions and 106 deletions.
5 changes: 5 additions & 0 deletions berkeley-function-call-leaderboard/.env.example
@@ -28,5 +28,10 @@ EXCHANGERATE_API_KEY=
OMDB_API_KEY=
GEOCODE_API_KEY=

# [OPTIONAL] For local vllm/sglang server configuration
# Defaults to localhost port 1053 if not provided
VLLM_ENDPOINT=localhost
VLLM_PORT=1053

# [OPTIONAL] Required for WandB to log the generated .csv in the format 'entity:project
WANDB_BFCL_PROJECT=ENTITY:PROJECT
2 changes: 2 additions & 0 deletions berkeley-function-call-leaderboard/CHANGELOG.md
@@ -2,6 +2,8 @@

All notable changes to the Berkeley Function Calling Leaderboard will be documented in this file.

- [Jan 3, 2025] [#864](https://github.com/ShishirPatil/gorilla/pull/864): Add support for pre-existing completion endpoints, allowing users to skip the local vLLM/SGLang server setup (using the `--skip-server-setup` flag) and point the generation pipeline to an existing OpenAI-compatible endpoint via `VLLM_ENDPOINT` and `VLLM_PORT`.
- [Jan 3, 2025] [#859](https://github.com/ShishirPatil/gorilla/pull/859): Rename directories: `proprietary_model` -> `api_inference`, `oss_model` -> `local_inference` for better clarity.
- [Dec 29, 2024] [#857](https://github.com/ShishirPatil/gorilla/pull/857): Add new model `DeepSeek-V3` to the leaderboard.
- [Dec 29, 2024] [#855](https://github.com/ShishirPatil/gorilla/pull/855): Add new model `mistralai/Ministral-8B-Instruct-2410` to the leaderboard.
- [Dec 22, 2024] [#838](https://github.com/ShishirPatil/gorilla/pull/838): Fix parameter type mismatch error in possible answers.
16 changes: 16 additions & 0 deletions berkeley-function-call-leaderboard/README.md
@@ -16,6 +16,7 @@
- [Output and Logging](#output-and-logging)
- [For API-based Models](#for-api-based-models)
- [For Locally-hosted OSS Models](#for-locally-hosted-oss-models)
- [For Pre-existing OpenAI-compatible Endpoints](#for-pre-existing-openai-compatible-endpoints)
- [(Alternate) Script Execution for Generation](#alternate-script-execution-for-generation)
- [Evaluating Generated Responses](#evaluating-generated-responses)
- [(Optional) API Sanity Check](#optional-api-sanity-check)
@@ -155,6 +156,21 @@ bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --backend {vllm|s
- Choose your backend using `--backend vllm` or `--backend sglang`. The default backend is `vllm`.
- Control GPU usage by adjusting `--num-gpus` (default `1`, relevant for multi-GPU tensor parallelism) and `--gpu-memory-utilization` (default `0.9`), which can help avoid out-of-memory errors.
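
For instance, a run that shards the model across two GPUs and lowers the memory utilization cap might look like the following (the values shown are illustrative):

```bash
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --backend vllm --num-gpus 2 --gpu-memory-utilization 0.85
```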

##### For Pre-existing OpenAI-compatible Endpoints

If you have a server already running (e.g., vLLM in a SLURM cluster), you can bypass the vLLM/sglang setup phase and directly generate responses by using the `--skip-server-setup` flag:

```bash
bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --skip-server-setup
```

In addition, you should specify the endpoint and port used by the server. By default, the endpoint is `localhost` and the port is `1053`. These can be overridden by the `VLLM_ENDPOINT` and `VLLM_PORT` environment variables in the `.env` file:

```bash
VLLM_ENDPOINT=localhost
VLLM_PORT=1053
```
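
Because these values are read from the environment, they can also be set inline for a single run instead of editing `.env` (the host and port below are illustrative, and this assumes the usual precedence where variables already set in the environment take effect for that run):

```bash
VLLM_ENDPOINT=custom.host.com VLLM_PORT=8000 bfcl generate --model MODEL_NAME --test-category TEST_CATEGORY --skip-server-setup
```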

#### (Alternate) Script Execution for Generation

For those who prefer using script execution instead of the CLI, you can run the following command:
6 changes: 6 additions & 0 deletions berkeley-function-call-leaderboard/bfcl/__main__.py
@@ -113,6 +113,11 @@ def generate(
num_threads: int = typer.Option(1, help="The number of threads to use."),
gpu_memory_utilization: float = typer.Option(0.9, help="The GPU memory utilization."),
backend: str = typer.Option("vllm", help="The backend to use for the model."),
skip_server_setup: bool = typer.Option(
False,
"--skip-server-setup",
help="Skip vLLM/SGLang server setup and use existing endpoint specified by the VLLM_ENDPOINT and VLLM_PORT environment variables.",
),
result_dir: str = typer.Option(
RESULT_PATH,
"--result-dir",
@@ -144,6 +149,7 @@
num_threads=num_threads,
gpu_memory_utilization=gpu_memory_utilization,
backend=backend,
skip_server_setup=skip_server_setup,
result_dir=result_dir,
allow_overwrite=allow_overwrite,
run_ids=run_ids,
@@ -50,6 +50,13 @@ def get_args():
parser.add_argument("--result-dir", default=None, type=str)
parser.add_argument("--run-ids", action="store_true", default=False)
parser.add_argument("--allow-overwrite", "-o", action="store_true", default=False)
# Add the new --skip-server-setup argument
parser.add_argument(
"--skip-server-setup",
action="store_true",
default=False,
help="Skip vLLM/SGLang server setup and use existing endpoint specified by the VLLM_ENDPOINT and VLLM_PORT environment variables."
)
args = parser.parse_args()
return args

@@ -232,6 +239,7 @@ def generate_results(args, model_name, test_cases_total):
num_gpus=args.num_gpus,
gpu_memory_utilization=args.gpu_memory_utilization,
backend=args.backend,
skip_server_setup=args.skip_server_setup,
include_input_log=args.include_input_log,
exclude_state_log=args.exclude_state_log,
result_dir=args.result_dir,