Semantic caching reduces LLM latency and cost by returning cached model responses for semantically similar prompts (not just exact matches). vCache is the first verified semantic cache that guarantees user-defined error rate bounds. vCache replaces static thresholds with online-learned, embedding-specific decision boundaries—no manual fine-tuning required. This enables reliable cached response reuse across any embedding model or workload.
💳 Cost & Latency Optimization
Reduce LLM API Costs by up to 10x. Decrease latency by up to 100x.
💡 Verified Semantic Prompt Caching
Set an error rate bound—vCache enforces it while maximizing cache hits.
🏢 System Agnostic Infrastructure
vCache uses OpenAI by default for both LLM inference and embedding generation, but you can configure any other inference setup.
Install vCache in editable mode:
pip install -e .
Then, set your OpenAI key:
export OPENAI_API_KEY="your_api_key_here"
Finally, use vCache in your Python code:
from vcache import VCache, VCachePolicy, VerifiedDecisionPolicy
error_rate_bound: float = 0.01
policy: VCachePolicy = VerifiedDecisionPolicy(delta=error_rate_bound)
vcache: VCache = VCache(policy=policy)
response: str = vcache.infer("Is the sky blue?")
print(response)
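Once a response is cached, a semantically similar follow-up prompt can be answered from the cache instead of triggering a new LLM request. The snippet below is a minimal sketch continuing the quick-start example; whether the second call is actually a cache hit depends on the decision boundary vCache has learned so far.
# A paraphrased prompt may now be served from the cache rather than the LLM,
# subject to the learned, embedding-specific decision boundary.
similar_response: str = vcache.infer("Does the sky look blue?")
print(similar_response)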
vCache intelligently detects when a new prompt is semantically equivalent to a cached one, and adapts its decision boundaries based on your accuracy requirements. This lets it return cached model responses for semantically similar prompts—not just exact matches—reducing both inference latency and cost without sacrificing correctness.
Please refer to the vCache paper for further details.
Semantic caches sit between the application server and the LLM inference backend.
Applications can range from agentic systems and RAG pipelines to database systems issuing LLM-based SQL queries. The inference backend can be a closed-source API (e.g., OpenAI, Anthropic) or a proprietary model hosted on-prem or in the cloud (e.g., LLaMA on AWS).
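In such a deployment, the application's existing LLM call site is simply routed through the cache. The sketch below assumes a hypothetical request handler; only VCache, VerifiedDecisionPolicy, and infer are taken from the examples in this README.
from vcache import VCache, VCachePolicy, VerifiedDecisionPolicy

policy: VCachePolicy = VerifiedDecisionPolicy(delta=0.01)
vcache: VCache = VCache(policy=policy)

def answer_user_question(question: str) -> str:
    # The application calls vCache instead of the LLM API directly;
    # on a cache miss, vCache forwards the prompt to the configured backend.
    return vcache.infer(question)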
[NOTE] vCache is currently in active development. Features and APIs may change as we continue to improve the system.
vCache is modular and highly configurable. Below is an example showing how to customize key components:
from vcache import (
    HNSWLibVectorDB,
    InMemoryEmbeddingMetadataStorage,
    LLMComparisonSimilarityEvaluator,
    OpenAIEmbeddingEngine,
    OpenAIInferenceEngine,
    VCache,
    VCacheConfig,
    VCachePolicy,
    VerifiedDecisionPolicy,
    MRUEvictionPolicy,
)
# 1. Configure the components for vCache
config: VCacheConfig = VCacheConfig(
    inference_engine=OpenAIInferenceEngine(model_name="gpt-4.1-2025-04-14"),
    embedding_engine=OpenAIEmbeddingEngine(model_name="text-embedding-3-small"),
    vector_db=HNSWLibVectorDB(),
    embedding_metadata_storage=InMemoryEmbeddingMetadataStorage(),
    similarity_evaluator=LLMComparisonSimilarityEvaluator(
        inference_engine=OpenAIInferenceEngine(model_name="gpt-4.1-nano-2025-04-14")
    ),
    eviction_policy=MRUEvictionPolicy(max_size=4096),
)
# 2. Choose a caching policy
policy: VCachePolicy = VerifiedDecisionPolicy(delta=0.03)
# 3. Initialize vCache with the configuration and policy
vcache: VCache = VCache(config, policy)
response: str = vcache.infer("Is the sky blue?")
You can swap out any component—such as the eviction policy or vector database—for your specific use case.
By default, vCache uses:
- OpenAIInferenceEngine
- OpenAIEmbeddingEngine
- HNSWLibVectorDB
- InMemoryEmbeddingMetadataStorage
- NoEvictionPolicy
- StringComparisonSimilarityEvaluator
- VerifiedDecisionPolicy with a maximum failure rate of 2%
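The default 2% bound corresponds to VerifiedDecisionPolicy(delta=0.02). If the defaults otherwise fit your workload, you only need to pass a policy with the error rate bound you want (a minimal sketch based on the quick-start usage above):
from vcache import VCache, VCachePolicy, VerifiedDecisionPolicy

# Same bound as the default policy; lower delta for a stricter guarantee.
policy: VCachePolicy = VerifiedDecisionPolicy(delta=0.02)
vcache: VCache = VCache(policy=policy)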
You can find complete working examples in the playground directory:
- example_1.py - Basic usage with sample data processing
- example_2.py - Advanced usage with cache hit tracking and timing
vCache supports FIFO, LRU, MRU, and a custom SCU eviction policy. See the Eviction Policy Documentation for further details.
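For example, capping the cache at 1,024 entries with the MRU policy from the configuration example looks like this (a minimal sketch; it assumes VCacheConfig fields that are not passed explicitly fall back to the defaults listed above):
from vcache import MRUEvictionPolicy, VCache, VCacheConfig, VerifiedDecisionPolicy

# Evict the most recently used entry once the cache holds 1,024 embeddings.
config: VCacheConfig = VCacheConfig(eviction_policy=MRUEvictionPolicy(max_size=1024))
vcache: VCache = VCache(config, VerifiedDecisionPolicy(delta=0.03))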
For development setup and contribution guidelines, see CONTRIBUTING.md.
vCache includes a benchmarking framework to evaluate:
- Cache hit rate
- Error rate
- Latency improvement
- ...
We provide three open benchmarks:
- SemCacheLmArena (chat-style prompts) - Dataset ↗
- SemCacheClassification (classification queries) - Dataset ↗
- SemCacheSearchQueries (real-world search logs) - Dataset ↗
See the Benchmarking Documentation for instructions.
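As a rough guide, the headline metrics are simple ratios over a labeled benchmark run. The helper below is a generic sketch showing one common way to compute them; it is not part of the benchmarking framework's API.
def summarize_run(
    num_requests: int,
    num_cache_hits: int,
    num_incorrect_hits: int,
    avg_llm_latency_s: float,
    avg_cached_latency_s: float,
) -> dict:
    # Cache hit rate: fraction of requests answered from the cache.
    # Error rate: fraction of requests whose cached answer was not equivalent
    # to what the LLM would have returned (here measured over all requests).
    # Latency improvement: average speedup versus always calling the LLM.
    return {
        "cache_hit_rate": num_cache_hits / num_requests,
        "error_rate": num_incorrect_hits / num_requests,
        "latency_improvement": avg_llm_latency_s / avg_cached_latency_s,
    }

print(summarize_run(1000, 600, 8, 1.2, 0.15))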
If you use vCache for your research, please cite our paper.
@article{schroeder2025adaptive,
  title={vCache: Verified Semantic Prompt Caching},
  author={Schroeder, Luis Gaspar and Desai, Aditya and Cuadron, Alejandro and Chu, Kyle and Liu, Shu and Zhao, Mark and Krusche, Stephan and Kemper, Alfons and Zaharia, Matei and Gonzalez, Joseph E},
  journal={arXiv preprint arXiv:2502.03771},
  year={2025}
}