
Prompt Caching

The practice of storing and reusing LLM responses for identical or semantically similar prompt inputs, reducing latency and cost by avoiding redundant model calls.

Prompt caching is a performance optimization technique that stores LLM responses and serves them from cache when the same or a sufficiently similar input is seen again. Given that LLM API calls are expensive (billed per token) and slow (often taking seconds), caching can dramatically reduce both cost and latency for applications with repetitive query patterns.

Exact-match caching is the simplest form. If the identical prompt — including system prompt, user message, and all parameters — has been seen before, the cached response is returned immediately. This works well for applications where the same questions recur frequently, such as FAQ bots, documentation assistants, and classification pipelines processing items from a limited set of categories.
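A minimal sketch of exact-match caching, assuming an in-memory dictionary and a placeholder `call_llm` function standing in for a real API call. The key hashes everything that influences the output: both prompts, the model name, and the sampling parameters.

```python
import hashlib
import json

# Hypothetical in-memory exact-match cache; call_llm stands in for a real API call.
_cache: dict[str, str] = {}

def cache_key(model: str, system: str, user: str, params: dict) -> str:
    # Serialize deterministically so identical inputs always hash identically.
    payload = json.dumps(
        {"model": model, "system": system, "user": user, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(model, system, user, params, call_llm):
    key = cache_key(model, system, user, params)
    if key in _cache:
        return _cache[key]  # cache hit: no API call made
    response = call_llm(model, system, user, params)
    _cache[key] = response
    return response
```

Note that `sort_keys=True` matters: two parameter dicts with the same contents in a different insertion order must produce the same key.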

Semantic caching extends the concept to handle inputs that differ in text but are equivalent in meaning. Using embedding similarity, the cache can recognize that "What are your business hours?" and "When are you open?" should return the same response. This requires setting a similarity threshold: too loose and the cache returns irrelevant responses; too strict and it rarely triggers a cache hit.

Cache invalidation is the primary challenge. When a prompt is updated — new instructions, different guardrails, updated context — all cached responses for that prompt must be invalidated because the model would now produce different outputs. Similarly, if the underlying model version changes, cached responses may no longer be representative. Tying cache invalidation to prompt versioning ensures that cache entries are automatically purged when the prompt changes.
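One common way to tie invalidation to versioning is to bake the prompt version and model identifier directly into the cache key, so a new version simply never resolves to stale entries (which then age out via TTL). A sketch, with illustrative parameter names:

```python
import hashlib

def versioned_key(prompt_version: str, model: str, user_input: str) -> str:
    # Bumping prompt_version or model changes every key, so old cached
    # responses become unreachable without an explicit purge step.
    raw = f"{prompt_version}|{model}|{user_input}"
    return hashlib.sha256(raw.encode()).hexdigest()
```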

Several architectural considerations affect caching strategy. Cache storage can be in-memory (fastest but limited capacity), Redis or similar key-value stores (fast and persistent), or database-backed (durable but slower). Time-to-live (TTL) settings balance freshness against hit rate. Cache warming — pre-populating the cache with responses to common queries — can ensure low latency from the moment a new prompt version is deployed.

Caching interacts with prompt variables in important ways. A prompt template with variables produces different compiled prompts for different variable values, and each unique combination must be cached separately. High-cardinality variables like user IDs effectively defeat caching, while low-cardinality variables like language or region still allow meaningful cache hit rates.
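The cardinality point falls out of how keys are built: each unique variable combination compiles to a distinct prompt and therefore a distinct key. A sketch, with an illustrative template:

```python
import hashlib

# Illustrative template: "language" is low-cardinality, "question" varies freely.
TEMPLATE = "Answer in {language}. Question: {question}"

def compiled_key(template: str, **variables) -> str:
    # The compiled prompt, not the template, is what gets cached.
    compiled = template.format(**variables)
    return hashlib.sha256(compiled.encode()).hexdigest()
```

With three supported languages and a fixed FAQ set, the key space stays small and hit rates stay high; interpolating something like a user ID into the template would instead produce one key per user and effectively defeat the cache.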

