[Roadmap] Support Anthropic's prompt caching feature #623
Comments
Thanks @tfriedel, this will be possible to support with Big-AGI 2. It's a good technology, and the savings are very significant.
So we typically want this enabled when a large file is added, e.g. source code, an article, a book, and so on, and also for large system prompts. In theory this feature lets you put a huge amount of material into the system prompt and thereby rival fine-tuning. When should it not apply? When we don't need to ask follow-up questions!
Since there are these use cases, having this easily toggleable would be nice, maybe with states off/auto, where auto uses some heuristics. In auto mode I'd check whether the diff since the last cache breakpoint is large: we wouldn't automatically move the breakpoint to each new message, because that would be expensive, only when a large number of tokens has been added. I haven't yet fully grasped how this works in multi-turn conversations; if the cache has to be refreshed as the conversation grows, then I think it makes sense to update it only at large increments. I'd probably start with something like the proposed idea, and later simulations could be run on some saved conversations to optimize the heuristic rules.
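A minimal sketch of what that "auto" heuristic could look like; the message shape, the `estimateTokens` helper, and the threshold are all illustrative assumptions, not anything from the codebase:

```ts
// Sketch of an 'auto' cache-breakpoint heuristic (all names and numbers are illustrative).
// Idea: only advance the breakpoint once enough new tokens have accumulated since the
// last one, so we don't pay the cache-write surcharge on every small message.

interface ChatMessage {
  role: 'user' | 'assistant' | 'system';
  text: string;
}

// hypothetical token estimator (chars / 4 as a rough proxy)
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

const MIN_NEW_TOKENS_FOR_BREAKPOINT = 2048; // illustrative threshold

function shouldMoveBreakpoint(messages: ChatMessage[], lastBreakpointIndex: number): boolean {
  const newTokens = messages
    .slice(lastBreakpointIndex + 1)
    .reduce((sum, m) => sum + estimateTokens(m.text), 0);
  return newTokens >= MIN_NEW_TOKENS_FOR_BREAKPOINT;
}
```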
I'm just thinking about this: 5 minutes is short!
The keep-alive is such a good idea :)
I see what you mean about the complexity. I want to have the perfect automatic policy for users, so that they get the same experience but pay less money, but it's not easy to reach optimal planning. For starters, we don't know whether the user intends to continue the conversation, although we could parse for it (or treat the length of the user message as a signal). For system prompts the chance of reuse is high, so we can auto-breakpoint those, and tools too. For the chat messages, I like the strategy of two breakpoints on the last two user messages (so both adding one message and regenerating the last one will be cheap), but I also can't get over the fact that many times the user will pay +25% more and then never hit the cache again.
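A rough sketch of that two-breakpoint placement, written against a simplified approximation of the Anthropic Messages request shape (content blocks carrying `cache_control: { type: 'ephemeral' }`); the types and function name are illustrative, not the actual implementation:

```ts
// Simplified, approximate shapes of an Anthropic Messages API request (for illustration only).
type CacheControl = { type: 'ephemeral' };

interface ContentBlock {
  type: 'text';
  text: string;
  cache_control?: CacheControl;
}

interface ApiMessage {
  role: 'user' | 'assistant';
  content: ContentBlock[];
}

// Place cache breakpoints on the last two user messages, so that both
// "append one more message" and "regenerate the last answer" can hit the cache.
function placeBreakpointsOnLastTwoUserMessages(messages: ApiMessage[]): ApiMessage[] {
  let remaining = 2;
  return messages
    .slice()
    .reverse()
    .map((m) => {
      if (m.role === 'user' && remaining > 0 && m.content.length > 0) {
        remaining--;
        const content = m.content.slice();
        const last = content[content.length - 1];
        content[content.length - 1] = { ...last, cache_control: { type: 'ephemeral' } };
        return { ...m, content };
      }
      return m;
    })
    .reverse();
}
```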
Note: if there aren't enough tokens in the chat, the Anthropic API will throw an error. Nothing to do there for now; we will wait for Anthropic to step in and fix that before patching something on our end that's clearly not our issue.
Note that given the API is not really malleable, for now we should give the user control over placing a breakpoint on a message (though how would sending a message that places a breakpoint work? we'd need extra send functionality for that). In the meantime, it's a per-provider option (i.e. all Anthropic models), but easy to toggle.
@tfriedel @enricoros Keep-alive implementation proposal: LastRequest: the request that was used to get the newest AI answer shown in the chat. Edit: when the user has already sent a new message, of course no request at 4min48s for the old one is needed, as the newly sent prompt already keeps the cache alive. Comment on the existing implementation:
That's also what Anthropic proposes: second-to-last user message for cache read, last user message for cache write (because everything up to the breakpoint is cached; that's what they mean by "prefix").
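A sketch of the keep-alive idea proposed above, firing a refresh shortly before the 5-minute TTL and cancelled when a new user message arrives; the `resendLastRequest` hook and the timings are assumptions for illustration:

```ts
// Illustrative keep-alive timer: re-ping the cached prefix shortly before the
// 5-minute TTL expires, unless the user has already sent a new message
// (which refreshes the cache on its own).

const CACHE_TTL_MS = 5 * 60 * 1000;
const SAFETY_MARGIN_MS = 12 * 1000; // fire at ~4min48s

class CacheKeepAlive {
  private timer: ReturnType<typeof setTimeout> | null = null;

  // hypothetical hook that replays the last request cheaply (e.g. same prefix, max_tokens: 1)
  constructor(private resendLastRequest: () => Promise<void>) {}

  // call after every completed request that wrote or read the cache
  arm() {
    this.cancel();
    this.timer = setTimeout(() => {
      void this.resendLastRequest();
      this.arm(); // re-arm while the user is idle (could be capped to a few refreshes)
    }, CACHE_TTL_MS - SAFETY_MARGIN_MS);
  }

  // call when the user sends a new message, or the chat is closed
  cancel() {
    if (this.timer) clearTimeout(this.timer);
    this.timer = null;
  }
}
```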
Why
By using Anthropic's prompt-caching feature, API input costs can be reduced by up to 90% and latency by up to 80%.
For an explanation see:
https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
and
https://x.com/alexalbert__/status/1823751966893465630
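For a feel of what the feature involves, here is a minimal fetch-based sketch along the lines of the linked docs, marking a large system prompt as cacheable; the beta header, field names, and model id reflect the documentation at the time and should be checked against the current Anthropic docs:

```ts
// Minimal sketch of a prompt-caching request (verify header/field names in the current docs).
async function askWithCachedSystemPrompt(apiKey: string, bigSystemPrompt: string, question: string) {
  const response = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'x-api-key': apiKey,
      'anthropic-version': '2023-06-01',
      'anthropic-beta': 'prompt-caching-2024-07-31', // beta header at the time of writing
    },
    body: JSON.stringify({
      model: 'claude-3-5-sonnet-20240620',
      max_tokens: 1024,
      system: [
        // the large, stable prefix gets a cache breakpoint
        { type: 'text', text: bigSystemPrompt, cache_control: { type: 'ephemeral' } },
      ],
      messages: [{ role: 'user', content: question }],
    }),
  });
  return response.json();
}
```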
Description
A switch to enable or disable this feature. Because initial costs are higher and the feature is in beta, it may make sense to allow disabling it.
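If such a switch is added, it could live in the per-service (Anthropic) settings; a hypothetical shape, with names purely illustrative:

```ts
// Hypothetical per-service setting for the feature (names are illustrative).
type PromptCachingMode = 'off' | 'auto';

interface AnthropicServiceSettings {
  apiKey: string;
  promptCaching: PromptCachingMode; // 'off' disables the beta entirely; 'auto' applies heuristics
}
```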