
Prompt Caching

The practice of storing and reusing LLM responses for identical or semantically similar prompt inputs, reducing latency and cost by avoiding redundant model calls.

Prompt caching is a performance optimization technique that stores LLM responses and serves them from cache when the same or a sufficiently similar input is seen again. Given that LLM API calls are expensive (billed per token) and slow (often taking seconds), caching can dramatically reduce both cost and latency for applications with repetitive query patterns.

Exact-match caching is the simplest form. If the identical prompt — including system prompt, user message, and all parameters — has been seen before, the cached response is returned immediately. This works well for applications where the same questions recur frequently, such as FAQ bots, documentation assistants, and classification pipelines processing items from a limited set of categories.
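A minimal sketch of exact-match caching, assuming an in-memory dictionary and a placeholder `call_llm` function standing in for a real API call. The key hashes everything that influences the output: both prompts, the model name, and the sampling parameters.

```python
import hashlib
import json

# Hypothetical in-memory exact-match cache; call_llm stands in for a real API call.
_cache: dict[str, str] = {}

def cache_key(model: str, system: str, user: str, params: dict) -> str:
    # Serialize deterministically so identical inputs always hash identically.
    payload = json.dumps(
        {"model": model, "system": system, "user": user, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(model, system, user, params, call_llm):
    key = cache_key(model, system, user, params)
    if key in _cache:
        return _cache[key]  # cache hit: no API call made
    response = call_llm(model, system, user, params)
    _cache[key] = response
    return response
```

Note that `sort_keys=True` matters: two parameter dicts with the same contents in a different insertion order must produce the same key.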

Semantic caching extends the concept to handle inputs that differ in text but are equivalent in meaning. Using embedding similarity, the cache can recognize that "What are your business hours?" and "When are you open?" should return the same response. This requires setting a similarity threshold: too loose and the cache returns irrelevant responses; too strict and it rarely triggers a cache hit.

Cache invalidation is the primary challenge. When a prompt is updated — new instructions, different guardrails, updated context — all cached responses for that prompt must be invalidated because the model would now produce different outputs. Similarly, if the underlying model version changes, cached responses may no longer be representative. Tying cache invalidation to prompt versioning ensures that cache entries are automatically purged when the prompt changes.
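One common way to tie invalidation to versioning is to bake the prompt version and model identifier directly into the cache key, so a new version simply never resolves to stale entries (which then age out via TTL). A sketch, with illustrative parameter names:

```python
import hashlib

def versioned_key(prompt_version: str, model: str, user_input: str) -> str:
    # Bumping prompt_version or model changes every key, so old cached
    # responses become unreachable without an explicit purge step.
    raw = f"{prompt_version}|{model}|{user_input}"
    return hashlib.sha256(raw.encode()).hexdigest()
```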

Several architectural considerations affect caching strategy. Cache storage can be in-memory (fastest but limited capacity), Redis or similar key-value stores (fast and persistent), or database-backed (durable but slower). Time-to-live (TTL) settings balance freshness against hit rate. Cache warming — pre-populating the cache with responses to common queries — can ensure low latency from the moment a new prompt version is deployed.

Caching interacts with prompt variables in important ways. A prompt template with variables produces different compiled prompts for different variable values, and each unique combination must be cached separately. High-cardinality variables like user IDs effectively defeat caching, while low-cardinality variables like language or region still allow meaningful cache hit rates.
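The cardinality point falls out of how keys are built: each unique variable combination compiles to a distinct prompt and therefore a distinct key. A sketch, with an illustrative template:

```python
import hashlib

# Illustrative template: "language" is low-cardinality, "question" varies freely.
TEMPLATE = "Answer in {language}. Question: {question}"

def compiled_key(template: str, **variables) -> str:
    # The compiled prompt, not the template, is what gets cached.
    compiled = template.format(**variables)
    return hashlib.sha256(compiled.encode()).hexdigest()
```

With three supported languages and a fixed FAQ set, the key space stays small and hit rates stay high; interpolating something like a user ID into the template would instead produce one key per user and effectively defeat the cache.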

