Semantic Caching for LLM Applications and AI Agents

Caching is one of the easiest ways to speed up applications and control cost. But LLM-based systems don’t work well with traditional caching because users phrase the same idea in many different ways, so most queries become cache misses.

Still, caching matters: LLM agents can take a long time to run, and inference is expensive. A smarter caching method is needed.

Why Traditional Caching Fails

  • Works only on exact key matches.

  • Natural language rarely matches exactly.

  • Same intent written differently becomes a cache miss.

  • Result: almost no benefit for LLM workloads.

What Semantic Caching Does

  • Focus on meaning instead of exact text.

  • Convert each query into an embedding.

  • Store embeddings in a vector database like Redis, Qdrant, or Milvus.

  • Add TTL to control freshness.

  • For each new query:

    • Convert to embedding.

    • Run similarity search.

    • If similar enough, return cached output.

    • Otherwise, run the agent.

Redis Code Sample for Semantic Caching

Below is a clean example using Redis and OpenAI embeddings.

Store an embedding in Redis

Search for similar embeddings

How To Make Semantic Caching Better

Decide When the Cache Should Be Used

  • Some questions rely on time-sensitive data.

  • Use an LLM classifier to detect time-based intent.

  • If time-based, skip the cache.

  • If not, the cache is fine.


Validate Before Storing


  • Use an LLM judge to check answer quality.

  • Only store valid responses.

  • Prevents polluted or wrong cache entries.

Improve Input Quality

Improving input before embedding increases cache hit rate.

  • Use fuzzy matching to fix spelling.

  • Normalize text before embedding.

  • Leads to better similarity matches.


Tune the Similarity Threshold

  • Similarity score controls when a cached answer is considered valid.

  • A higher threshold reduces wrong cache hits but may lower hit rate.

  • A lower threshold increases hit rate but risks incorrect matches.

  • Find the right balance by measuring precision and recall.
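One way to tune the threshold is to sweep candidate values over a small labeled set of (similarity score, same-intent) pairs taken from logs; the data shape and the 0.95 precision floor below are illustrative:

```python
def precision_recall(pairs, threshold):
    # pairs: (similarity_score, is_truly_same_intent) tuples from labeled logs.
    tp = sum(1 for s, same in pairs if s >= threshold and same)
    fp = sum(1 for s, same in pairs if s >= threshold and not same)
    fn = sum(1 for s, same in pairs if s < threshold and same)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pick_threshold(pairs, min_precision=0.95):
    # Lowest threshold (i.e. highest hit rate) whose precision is acceptable.
    for t in [x / 100 for x in range(70, 100)]:
        p, _ = precision_recall(pairs, t)
        if p >= min_precision:
            return t
    return 0.99  # fall back to a very strict threshold
```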

Rerank Multiple Matches

  • Vector search may return several hits.

  • Use an LLM reranker to pick the best one.

  • Reranking is cheap because only short text is compared, and it gives a fast, reliable ranking of candidates.


Cache More Than the Final Output

Multi-step agents run multiple internal sub-agents. Each sub-agent produces intermediate results that can be cached. Instead of caching only the final answer, cache each step separately.

How to structure caching for multi-step agents:

  • Treat each sub-agent call as its own cacheable function.

  • Create a unique cache key for each step using:

    • Step name

    • Normalized user input for that step

    • (Optional) metadata like region, model version, or context

  • Embed the sub-agent's input and run similarity search before executing that step.

  • If a hit is found, skip running that sub-agent and reuse the cached output.

  • Only run the expensive sub-agent calls when there is a cache miss.

Short example workflow:

  • Shipment lookup → Cache lookup result using shipment ID or embedded query.

  • Weather check → Cache the weather response with a short TTL.

  • Route calculation → Cache common source-destination routes.

  • Cost estimation → Cache the computed cost for reused parameter sets.

This structure cuts latency across the entire pipeline because several steps get short-circuited by cache hits.

Cost and Latency Impact

  • In real LLM agent workloads, semantic caching can reduce latency by 40 to 80 percent for repeated or similar queries.

  • Token and compute costs can drop by 30 to 70 percent depending on hit rate.

  • Heavy multi-step agents benefit the most because intermediate steps are cached.

Other Important Points

  • Cache invalidation: Use TTL or event triggers when data changes.

  • Avoid sensitive data: Don’t cache PII or confidential responses.

  • Embedding versioning: Tag cached vectors with model version.

  • Hit rate analytics: Track cache hits, misses, and accuracy to tune thresholds.

Drawbacks

Semantic caching can return the wrong answer if the similarity threshold is too loose. Tuning and validation reduce the risk but cannot remove it entirely.

Additional tradeoffs:

  • Extra LLM calls for classification, validation, and reranking add some overhead.

  • Embedding generation introduces latency for every new query.

  • Vector search also adds a small lookup cost.

Final Takeaway

Semantic caching speeds up LLM and agent workflows, cuts cost, and keeps answers consistent by reusing results that match intent. It’s simple, effective, and a big win for any production AI system.

If you need help setting this up, feel free to comment.
