Adding a GET Interface to Inference Would Allow for Better Performance #20

Open

fstakem opened this issue Feb 19, 2024 · 3 comments

fstakem commented Feb 19, 2024

The current specification does not allow for effective use of HTTP Cache-Control, i.e., client-side caching, which is inefficient in production environments. The specification should add a GET request for inference to enable better use of client-side caching via Cache-Control. Let me explain.

If a user is querying a deterministic model, the response from the endpoint should be the same each time until the model is retrained, at which point the model should get a new version. (For non-deterministic models, such as simulations, the current interface is fine.) The current specification only has an HTTP POST for querying the model for inference. If an HTTP GET were used with proper Cache-Control settings, the load on the server could be decreased. Cache-Control allows the client to cache responses while the server controls the cache settings. Because the server controls the cache, other systems such as experimentation can be used on the server side without worrying that the client will get the wrong response. The RFC on HTTP caching is probably better at explaining this than I am and is linked below.

RFC on HTTP caching: here
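To make the proposal concrete, here is a minimal sketch of what such a GET route could look like, assuming a FastAPI-style server; the path shape, query parameter, and TTL are my own illustrative assumptions, not part of the spec:

```python
# Hypothetical sketch only: endpoint path, query parameter, and TTL are
# illustrative assumptions, not part of the Open Inference Protocol.
from fastapi import FastAPI, Response

app = FastAPI()
MODEL_VERSION = "3"  # bumped whenever the model is retrained

@app.get(f"/v2/models/demo/versions/{MODEL_VERSION}/infer")
def infer(x: float, response: Response):
    y = 2.0 * x  # stand-in for a real (deterministic) model call
    # The server controls how long clients may cache this response; since
    # the model version is part of the URL, retraining naturally avoids
    # serving stale entries.
    response.headers["Cache-Control"] = "public, max-age=86400"
    return {"model_version": MODEL_VERSION, "output": y}
```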

Current implementations of this specification rely on less efficient server-side caching. Although server-side caching can reduce load on the server, it does not eliminate the network bandwidth or round-trip delay of the POST request. A good production system should use both client-side and server-side caching for optimal results (see the client-side sketch below).

Here is an example of an implementation that uses server side caching: here
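For the client side, here is a hedged sketch using the third-party requests-cache package (the URL is made up) showing what server-controlled client caching buys you:

```python
# Client-side sketch (assumes the third-party `requests-cache` package).
# When the client honors Cache-Control, a fresh cache entry is served
# locally with no network traffic at all, unlike server-side caching,
# which still pays the round trip and request bandwidth every time.
from requests_cache import CachedSession

session = CachedSession("inference-cache", cache_control=True)
url = "http://models.internal/v2/models/demo/versions/3/infer"  # illustrative

r1 = session.get(url, params={"x": 1.5})  # goes to the server, gets cached
r2 = session.get(url, params={"x": 1.5})  # served locally within max-age
assert r2.from_cache                      # no round trip for the second call
```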

@zevisert

Hi @fstakem, it's been months, I know, but can you give us some use cases of models that are deterministic? The way I read what you're describing, it sounds like you should invoke the model after retraining, write the result to a file, and store that file on a CDN (see the sketch below). If your model is deterministic, there's no need to have an inference server running at all. I don't think this request is in scope for the open-inference-protocol.

NB: I'm not on the steering committee here; I'm just an interested user
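A minimal sketch of the precompute-and-publish approach described above, where the model, input grid, and filename are stand-ins:

```python
# Sketch: after retraining, enumerate the known (finite) input space once,
# write predictions to a static file, and serve that file from a CDN.
import json
import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(np.array([[0.0], [1.0]]), np.array([0.0, 2.0]))

grid = np.linspace(0.0, 10.0, num=101)  # known, finite input space
table = {f"{x:.1f}": float(model.predict([[x]])[0]) for x in grid}

with open("predictions-v3.json", "w") as f:  # this file goes to the CDN
    json.dump(table, f)
```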

@fstakem (Author)

fstakem commented Oct 25, 2024

Deterministic models are neither uncommon nor rare; they make up a large share of the models used in machine learning. Some teams lean more on stochastic models; some lean more on deterministic models. A regression is an engineer's bread-and-butter model, and it is deterministic. So are many others.
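A minimal illustration with scikit-learn (the data is made up): a fitted regression returns identical outputs for identical inputs until it is retrained, which is exactly what makes the response cacheable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(np.array([[0.0], [1.0], [2.0]]),
                               np.array([0.0, 2.0, 4.0]))

a = model.predict(np.array([[1.5]]))
b = model.predict(np.array([[1.5]]))
assert np.array_equal(a, b)  # same input, same output, until retraining
```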

Using a CDN or other edge-based caching is only relevant in the few cases where a model is served directly to a client. Many models are served internally within organizations, where that scenario does not apply. Likewise, precomputation is one strategy, but it is expensive and time-consuming for larger state spaces.

I am not sure how following an RFC is out of scope for an internet-based system. The IETF is well regarded, even when the web development world ignores them.

@zevisert

> I am not sure how following an RFC is out of scope for an internet-based system. The IETF is well regarded, even when the web development world ignores them.

Apologies, I wasn't pointing at the IETF - I know their importance very well. I was attempting to say that adding a GET endpoint is what I think is out of scope.

I may have misunderstood what you meant by determinism there, but with regression models (or decision trees, classifiers, or the like), while the model itself is deterministic, you still have to provide some form of input. If we were to add a GET endpoint to the protocol definition, we'd still need a way to provide input to the model. While technically you can send a request body with any HTTP method (including GET), it should not have any meaning for a GET request. If we were to handle a request body on a GET request, and incorporate it into processing by parsing it on the server and changing the response based on its contents, then we would be ignoring this recommendation from the HTTP/1.1 spec, section 4.3:

> ...if the request method does not include defined semantics for an entity-body, then the message-body SHOULD be ignored when handling the request.
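For illustration, one way a GET-based inference call could carry inputs without a request body is to serialize them into the query string, so standard HTTP caches can key on the full URL; a hedged sketch with made-up endpoint path and feature names:

```python
# Hedged sketch: encode inference inputs in the query string instead of a
# body, so the full URL identifies the request for standard HTTP caches.
from urllib.parse import urlencode

def inference_url(base: str, inputs: dict) -> str:
    # Sort keys so semantically identical requests yield identical URLs,
    # which is what makes their responses cacheable.
    return f"{base}?{urlencode(sorted(inputs.items()))}"

print(inference_url("http://models.internal/v2/models/demo/infer",
                    {"age": 42, "income": 55000}))
# http://models.internal/v2/models/demo/infer?age=42&income=55000
```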
