embedcache vs Redis vs GPTCache: Caching for Embedding Computations

The problem

RAG pipelines recompute embeddings for the same text over and over. Each chunk is embedded at ingest time, every incoming query is embedded (and popular queries repeat), and every re-ingest (after chunks are updated) embeds chunks again, including the ones whose text did not change. At production scale, this recomputation is often one of the largest avoidable compute costs in a RAG system.

The fix is embedding caching: cache the embedding for a given input text, look it up on subsequent requests, and only compute the embedding on a miss. At high enough volume, the hit rate can be the difference between a system that costs $1K/month and one that costs $10K/month.

embedcache is Skelf’s purpose-built embedding cache. This post compares it with Redis, GPTCache, and a custom in-memory dict.

What embedcache is

embedcache is a Rust service for caching embedding computations. The key design choices:

Content-addressed. The cache key is a hash of the input text + the model name. Identical inputs for the same model always map to the same key, so they hit as long as the entry is still stored.
Pluggable backend. The cache can live in memory, on disk (Sled), or in an external store (Redis, S3).
Multi-model. Cache entries are tagged with the embedding model, so a switch to a new model doesn’t return stale embeddings.
First-class metrics. Hit rate, miss rate, time saved, cost saved — all per model, per input class.
Differential privacy option. For high-stakes caches, embedcache can add noise to the embeddings to bound information leakage.
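
To make the content-addressed and multi-model points concrete, here is a minimal sketch of how such a key could be derived. The function name `cache_key`, the choice of SHA-256, and the length-prefix layout are assumptions for illustration; the post does not specify embedcache's actual hashing scheme.

```python
import hashlib


def cache_key(text: str, model: str) -> str:
    """Derive a cache key from the model name and the exact input text.

    Hypothetical layout: 4-byte length of the model name, the model name,
    then the text. The length prefix keeps (model, text) pairs unambiguous.
    """
    h = hashlib.sha256()
    model_bytes = model.encode("utf-8")
    h.update(len(model_bytes).to_bytes(4, "big"))
    h.update(model_bytes)
    h.update(text.encode("utf-8"))
    return h.hexdigest()


same = cache_key("hello world", "text-embedding-3-small")
again = cache_key("hello world", "text-embedding-3-small")
other_model = cache_key("hello world", "text-embedding-3-large")

print(same == again)        # True: identical text and model give the same key
print(same == other_model)  # False: switching the model changes the key
```

Because the model name is part of the hashed payload, a model switch produces a different key space, so old vectors are never returned for the new model.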

What each option is

embedcache is the purpose-built embedding cache in Rust. Content-addressed, multi-model, pluggable backend.

Redis is the general-purpose cache. You can use Redis as a key-value store and store embeddings as binary blobs. Works, but no model tagging, no content-addressing.
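
As a rough sketch of the "binary blob" approach, the helpers below pack a vector into float32 bytes and build a key that carries the model name, which is the DIY part Redis leaves to you. The helper names and the `emb:<model>:<hash>` key layout are assumptions, not a Redis convention. With the redis-py client, the resulting bytes would typically be written with `set(key, blob)` and read back with `get(key)`.

```python
import struct


def pack_embedding(vec: list[float]) -> bytes:
    """Serialize a vector as little-endian float32 values (4 bytes each)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_embedding(blob: bytes) -> list[float]:
    """Inverse of pack_embedding."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))


def redis_key(model: str, content_hash: str) -> str:
    """Hypothetical key layout that does the model tagging by hand."""
    return f"emb:{model}:{content_hash}"


blob = pack_embedding([0.5, -1.25, 2.0])
print(len(blob))               # 12 bytes for three float32 values
print(unpack_embedding(blob))  # [0.5, -1.25, 2.0]
```

Note that float32 storage halves the size relative to float64 but rounds values that are not exactly representable, so a round trip is only bit-exact for values like the ones in this example.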

GPTCache is a Python library for caching LLM responses, including embeddings. Hosted and self-hosted, with semantic similarity lookup (not just exact-match).

Custom in-memory dict is what most teams start with. Works for prototypes, dies at production scale.
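
For reference, the dict approach usually looks something like the sketch below. The class name and the `embed_fn(text, model)` callback are hypothetical, and the fake embedder stands in for a real API call. It also shows why this dies at scale: the store is unbounded, per-process, and lost on restart.

```python
from typing import Callable


class DictEmbeddingCache:
    """Per-process cache keyed by (model, text), with hit/miss counters."""

    def __init__(self, embed_fn: Callable[[str, str], list[float]]):
        self._embed_fn = embed_fn
        self._store: dict[tuple[str, str], list[float]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, text: str, model: str) -> list[float]:
        key = (model, text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Miss: compute once and remember the result.
        self.misses += 1
        vec = self._embed_fn(text, model)
        self._store[key] = vec
        return vec

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_embed(text: str, model: str) -> list[float]:
    # Stand-in for a real embedding call; returns a toy 2-d vector.
    return [float(len(text)), float(len(model))]


cache = DictEmbeddingCache(fake_embed)
cache.get_or_compute("what is RAG?", "demo-model")
cache.get_or_compute("what is RAG?", "demo-model")  # served from the dict
print(cache.hits, cache.misses)  # 1 1
```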

The nine dimensions

Dimension	embedcache	Redis	GPTCache	Custom dict
Architecture	Purpose-built Rust service	General-purpose cache	Python library	In-memory
Key strategy	Content hash + model	User-defined	Text + similarity threshold	User-defined
Multi-model	Yes (tagged)	DIY	Yes	DIY
Pluggable backend	Memory, Sled, Redis, S3	(it is the backend)	Memory, SQLite, MySQL	Memory only
Hit rate optimisation	Exact-match + content hash	Exact-match (your hash)	Exact + semantic	Exact (your hash)
Metrics	First-class	DIY	Yes	DIY
Differential privacy	Yes (configurable)	No	No	No
License	GPL-3.0	(Redis Source Available)	MIT	n/a
Production users	(early)	Many	Many	n/a

When to use which

Use embedcache when:

You have a high-volume RAG pipeline and embedding recomputation is a measurable cost.
You want first-class metrics on hit rate, cost saved, etc.
You need content-addressed, model-tagged caching (no stale entries on a model switch, because the model name is part of the key).
You need differential privacy for the cache.

Use Redis when:

You already have Redis.
Your cache keys are well-defined (e.g. by content hash that you compute).
You don’t need the multi-model tagging or differential privacy.

Use GPTCache when:

You are in Python and want a library-style integration.
You want semantic similarity lookup (cache hit when the input is similar to a cached input, not just exact).

Use a custom dict when:

You are prototyping.
The cache is per-process and short-lived.

A concrete example: RAG at production scale

Say you have 10M documents, 100K unique chunks, and 50K queries per day.

Without caching:

50K queries/day * 1 embedding/query = 50K embeddings/day
100K chunks * 0.1 re-embed per chunk per day = 10K embeddings/day
Total: 60K embeddings/day = 1.8M embeddings/month
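At an assumed $0.02 per 1K embeddings: $36/month (illustrative; text-embedding-3-small is billed per token, so the real figure depends on chunk and query length)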

With embedcache:

50K queries/day, 95% hit rate (query text repeats a lot)
50K * 0.05 = 2.5K query embeddings/day
100K chunks * 0.1 * 0.01 (99% hit rate on re-embed, so 1% misses) = 100 chunk embeddings/day
Total: 2.6K embeddings/day = 78K embeddings/month
Cost: $1.56/month

Savings: 95%+ reduction in embedding cost.
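
The arithmetic above can be reproduced with a few lines of Python, which also makes it easy to plug in your own volumes and hit rates. The function name and parameter names are made up for this sketch; the defaults (30-day month, $0.02 per 1K embeddings) are the figures assumed in the example.

```python
def monthly_embedding_cost(
    queries_per_day: float,
    chunks: float,
    reembed_rate: float,      # re-embeds per chunk per day
    query_hit_rate: float,    # fraction of query lookups served from cache
    chunk_hit_rate: float,    # fraction of re-embed lookups served from cache
    price_per_1k: float = 0.02,
    days: int = 30,
) -> float:
    """Monthly cost in dollars of the embeddings that still have to be computed."""
    query_embeds = queries_per_day * (1 - query_hit_rate)
    chunk_embeds = chunks * reembed_rate * (1 - chunk_hit_rate)
    return (query_embeds + chunk_embeds) * days / 1000 * price_per_1k


baseline = monthly_embedding_cost(50_000, 100_000, 0.1, 0.0, 0.0)
cached = monthly_embedding_cost(50_000, 100_000, 0.1, 0.95, 0.99)

print(f"without cache: ${baseline:.2f}/month")   # $36.00
print(f"with cache:    ${cached:.2f}/month")     # $1.56
print(f"reduction:     {1 - cached / baseline:.1%}")  # 95.7%
```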

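At this volume the absolute saving is small, but the ratio holds as volume grows, and at high enough volume it is the difference between a RAG system that’s commercially viable and one that’s not.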

The semantic-similarity trade-off

embedcache uses exact-match (content hash) for cache keys. This guarantees no false positives (you can never get a wrong embedding back), but it means that slightly-different inputs (whitespace, punctuation, case) are misses.

GPTCache uses semantic similarity (vector similarity): if the input is close to a cached input, return the cached embedding. This can increase the hit rate, at the risk of false positives.

The right choice depends on the use case:

For query embeddings: semantic similarity can give a 5-10% hit rate boost, but the false positive risk is high (a similar but different query gets the wrong embedding).
For chunk embeddings: exact-match is the right choice. The chunk text doesn’t drift; if the text is different, the embedding should be different.
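
To illustrate where the false positives come from, here is a toy sketch of threshold-based similarity lookup. It is not GPTCache's implementation: the function names are hypothetical, the vectors are hand-picked 2-d examples, and it assumes some cheap vector for the incoming text is already available to compare against.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_lookup(
    query_vec: list[float],
    cached: dict[str, list[float]],
    threshold: float,
) -> Optional[str]:
    """Return the cached text closest to query_vec, or None if below threshold."""
    best_text, best_sim = None, -1.0
    for text, vec in cached.items():
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_text, best_sim = text, sim
    return best_text if best_sim >= threshold else None


cached = {"reset password": [1.0, 0.0], "delete account": [0.0, 1.0]}

# Close to "reset password": a plausible hit at a strict threshold.
print(semantic_lookup([0.9, 0.1], cached, threshold=0.95))  # reset password
# Equally far from both entries: a strict threshold correctly misses...
print(semantic_lookup([0.7, 0.7], cached, threshold=0.95))  # None
# ...but a loose threshold returns an entry anyway, i.e. a false positive.
print(semantic_lookup([0.7, 0.7], cached, threshold=0.70))  # reset password
```

The threshold is the whole trade-off: lower it and the hit rate goes up along with the chance of returning the embedding of a different query.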

embedcache uses exact-match by default but supports configurable hash strategies (whitespace-insensitive, case-insensitive, etc.) for the cases where they’re safe.
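
A hash strategy of this kind amounts to normalizing the text before hashing it. The sketch below shows the idea; the option names `collapse_whitespace` and `ignore_case` and the function `normalized_key` are hypothetical and are not embedcache's actual configuration keys.

```python
import hashlib


def normalize(text: str, *, collapse_whitespace: bool = False,
              ignore_case: bool = False) -> str:
    """Apply the selected normalizations; with no options the text is unchanged."""
    if collapse_whitespace:
        # Strip the ends and squeeze internal whitespace runs to one space.
        text = " ".join(text.split())
    if ignore_case:
        text = text.casefold()
    return text


def normalized_key(text: str, model: str, **strategy: bool) -> str:
    """Hash the model name plus the normalized text."""
    payload = f"{model}\n{normalize(text, **strategy)}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


a = "How do I reset my password?"
b = "  how do I  reset my password? "

# Exact-match: any difference in whitespace or case is a miss.
print(normalized_key(a, "m") == normalized_key(b, "m"))  # False
# Normalized: both inputs share a key, so the second one hits.
print(normalized_key(a, "m", collapse_whitespace=True, ignore_case=True)
      == normalized_key(b, "m", collapse_whitespace=True, ignore_case=True))  # True
```

The embedding that comes back is the one computed for whichever variant was seen first, so this is only safe when the model's output for the variants is close enough not to matter for retrieval.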

A 5-minute embedcache eval

# Install
cargo install embedcache

# Start embedcache
embedcache serve \
    --backend sled \
    --path /var/lib/embedcache \
    --metrics-port 9090

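# Use it from your RAG pipeline
# (Python example; assumes the embedcache API listens on localhost:8080
# and OPENAI_API_KEY is set in the environment)
from embedcache import EmbedCache
from openai import OpenAI

MODEL = "text-embedding-3-small"
cache = EmbedCache("http://localhost:8080")
client = OpenAI()

def embed(text: str) -> list[float]:
    # Return the cached vector on a hit (assumes get() returns None on a miss)
    cached = cache.get(text, model=MODEL)
    if cached is not None:
        return cached
    # Miss: call the embeddings API and pull the vector out of the response
    response = client.embeddings.create(input=text, model=MODEL)
    embedding = response.data[0].embedding
    cache.put(text, model=MODEL, embedding=embedding)
    return embedding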

That’s it. The cache handles hashing, storage, retrieval, and metrics. You focus on the RAG pipeline.