
[BFCL] URL endpoint support discussion #850

Closed
ThomasRochefortB opened this issue Dec 23, 2024 · 3 comments · Fixed by #864

Comments

@ThomasRochefortB
Contributor

Describe the issue

This is to open a discussion on supporting models that are already served via an OpenAI-compatible endpoint, thereby bypassing the vLLM serve method in BFCL.

At Valence Labs we are often in the position where model serving and model benchmarking need to run as two separate jobs (on a SLURM cluster, for example). This means that we cannot use the vLLM serving implemented in the BFCL library and instead need to point BFCL directly at an OpenAI-compatible endpoint URL.

We have tested something like this.

I had in mind a PR to BFCL to support these use cases, but there are many potential ways of implementing this and I wanted to get your opinion first. So, two questions:

  1. Do you understand the desired use case here?
  2. Any guidelines or pointers as to how best to implement this? (See my linked branch above for our current naïve implementation.)
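
For concreteness, here is a minimal sketch of what talking to an externally served, OpenAI-compatible endpoint looks like from the client side; the base URL, model name, and API key are placeholders for whatever the separate serving job exposes, not the code in the linked branch:

        from openai import OpenAI

        # Placeholder endpoint for a server launched by a separate job (e.g. on a SLURM node);
        # BFCL would need to target this instead of spawning its own vLLM process.
        client = OpenAI(
            base_url='http://some-compute-node:8000/v1',
            api_key='EMPTY',  # placeholder; self-hosted servers often don't check the key
        )

        response = client.chat.completions.create(
            model='my-served-model',  # whatever model the external server hosts
            messages=[{'role': 'user', 'content': 'Hello!'}],
        )
        print(response.choices[0].message.content)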
@HuanzhiMao
Collaborator

Hey @ThomasRochefortB,
If I understand correctly, you want to skip the part where the BFCL generation pipeline spins up the OpenAI-compatible server, and proceed as if the server has already been set up, right?

My question is:
Currently, we assume that the server is using the URL http://localhost:{VLLM_PORT}/v1, where the vLLM port number is defined in constant.py (with a default value of 1053). Do you need to change the endpoint and port settings?
If not, then it's fairly straightforward to make the change. We could add an optional CLI flag to indicate that the server has already been set up, and in the code we would skip these lines when the flag is set (section 1, section 2, section 3, section 4). What do you think?
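
For illustration, a rough sketch of that flag-gated setup; the flag name and the setup helper below are hypothetical placeholders, not BFCL's actual CLI or code:

        import argparse

        def spin_up_local_vllm_server():
            # Placeholder for the linked setup sections that launch vLLM; not BFCL's real code.
            print('Launching local vLLM server ...')

        parser = argparse.ArgumentParser()
        parser.add_argument(
            '--skip-server-setup',  # hypothetical flag name
            action='store_true',
            help='Assume an OpenAI-compatible server is already running; do not spawn vLLM.',
        )
        args = parser.parse_args()

        if not args.skip_server_setup:
            # Only in this branch would the pipeline launch (and later tear down) its own vLLM server.
            spin_up_local_vllm_server()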

@ThomasRochefortB
Contributor Author

Hello @HuanzhiMao !

That's exactly what I had in mind!
For a SLURM cluster application, it gets trickier to force the endpoint and port to constant values, as these can depend on which node gets allocated.

  • What would be a good way to specify the IP address and the port number? In my branch I am using environment variables and reading them like this:
        import os

        # Read from env vars with fallbacks
        vllm_host = os.getenv('VLLM_ENDPOINT', 'localhost')
        vllm_port = os.getenv('VLLM_PORT', '8000')

Could we make VLLM_PORT and VLLM_ENDPOINT optional entries in .env?

  • The CLI could have a --skip-vllm flag that bypasses all the vLLM setup steps and directly points to the env vars for the completions request (rough sketch below).
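
To make that concrete, a rough sketch of how the env vars and the --skip-vllm flag could fit together; the client construction and the 'EMPTY' key are illustrative assumptions, not settled BFCL code:

        import os
        from openai import OpenAI

        # Env vars proposed above, with local fallbacks (the names are the suggestion, not final).
        vllm_host = os.getenv('VLLM_ENDPOINT', 'localhost')
        vllm_port = os.getenv('VLLM_PORT', '8000')

        # With --skip-vllm set, BFCL would skip spawning vLLM and build the client
        # directly against the externally served endpoint.
        client = OpenAI(
            base_url=f'http://{vllm_host}:{vllm_port}/v1',
            api_key='EMPTY',  # placeholder; many self-hosted servers don't require a real key
        )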

@HuanzhiMao
Collaborator


That sounds good!
