Infrastructure · March 27, 2026

Vector-native infrastructure: designing databases for an AI-first decade

Vector DBs · RAG · Architecture · Data · Cloud

Two years ago, "the vector database" was a side project. Today it is on the critical path of every production AI application we operate. The shift from row stores to embedding stores is not a flavour-of-the-month architectural fad — it is a durable infrastructure change, and most teams still under-invest in it.

Why this matters now

When the unit of querying changes from "rows that equal X" to "documents that mean something like X", the physics of your database changes. Every decision — indexing, replication, consistency, cost — needs to be reconsidered.

The database used to be the source of truth. In an AI-first world, the database is the source of context. That is a harder problem.

Four patterns we see in production

1. Hybrid search is the default

Pure vector search is rarely the right answer. The systems that work blend:

  • Keyword / BM25 matching for precision
  • Dense vector similarity for recall and semantics
  • Structured filters (tenant, permission, date, type)
  • Re-ranking with a small cross-encoder for top-K

Teams that skip the re-ranker discover, the hard way, that their "it works on my laptop" demo breaks down under real customer queries.
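One common way to blend keyword and dense results before the cross-encoder step is reciprocal rank fusion (RRF). A minimal sketch, assuming each backend has already returned a best-first list of doc ids:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several best-first ranked lists (e.g. BM25 + dense vector
    results) into one list; k=60 is the commonly used damping constant."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: ids and rankings are illustrative
bm25 = ["doc_a", "doc_b", "doc_c"]
dense = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25, dense])
```

The fused top-K then goes to the cross-encoder re-ranker; RRF only needs ranks, not comparable scores, which is why it works across heterogeneous backends.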

2. Tenant isolation is non-trivial

"Put a tenant ID in the metadata" works for ten customers. At a thousand, you will fight noisy neighbours, embedding drift across tenants, and accidental cross-tenant leakage in RAG pipelines. Build for this on day one.

3. Freshness is a first-class concern

Vector indexes are not free to update. Production systems need a tiered approach:

  • Hot tier — in-memory, sub-second updates, small working set
  • Warm tier — disk-backed, minute-level freshness, full recent corpus
  • Cold tier — object storage, daily re-indexed, historical corpus
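A compaction or routing job needs a rule for which tier a document belongs to. A minimal sketch by age since last update (the thresholds here are illustrative assumptions, not prescriptions):

```python
# (tier name, max age in seconds) — example thresholds only
TIERS = [
    ("hot", 60),             # in-memory, sub-second updates
    ("warm", 24 * 3600),     # disk-backed, minute-level freshness
    ("cold", float("inf")),  # object storage, re-indexed daily
]

def tier_for(age_seconds):
    """Return the tier a document should live in, given seconds
    since its last update."""
    for name, max_age in TIERS:
        if age_seconds <= max_age:
            return name
```

Queries then fan out hot + warm by default and touch cold only when the query explicitly needs historical recall.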

4. Embedding version is a schema

When you upgrade your embedding model, every vector in your index becomes stale. Treat the embedding model version as part of your schema. Version it. Migrate it. Test retrieval quality before and after.
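Concretely, that means storing the model version next to every vector and treating a mismatch as "stale", never as "comparable". A minimal sketch, with a hypothetical model name:

```python
from dataclasses import dataclass

EMBEDDING_MODEL = "text-embed-v3"  # hypothetical current model version

@dataclass
class VectorRecord:
    doc_id: str
    vector: list
    model: str  # embedding model version is part of the record's schema

def needs_reembedding(record, current_model=EMBEDDING_MODEL):
    # A vector from a different model lives in a different embedding
    # space; comparing it to fresh query vectors is meaningless, so
    # flag it for migration instead of silently mixing spaces.
    return record.model != current_model

stale = needs_reembedding(VectorRecord("d1", [0.1, 0.2], "text-embed-v2"))
fresh = needs_reembedding(VectorRecord("d2", [0.3, 0.4], "text-embed-v3"))
```

A migration then re-embeds flagged records in the background, and retrieval-quality evals run against both versions before the cutover.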

A reference architecture

Here is the shape of the platform we run for clients doing 10M+ queries/day:

  1. Ingestion layer: streaming pipeline that chunks, embeds, and writes to the hot tier with 2–5s freshness.
  2. Storage layer: Postgres + pgvector for transactional data, a dedicated vector store for scale, object storage for archive.
  3. Retrieval layer: an internal service that fans out hybrid queries, merges results, re-ranks, and returns a cited answer bundle.
  4. Eval layer: offline + online evals on retrieval quality, not just model quality.
  5. Observability layer: per-query logs with latency, cost, retrieved chunks, model answer, user feedback.
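The retrieval layer in step 3 can be sketched as a fan-out/merge/re-rank loop. The backend and re-ranker interfaces below are assumptions for illustration: each backend is a callable returning (doc_id, text) pairs, and the re-ranker scores a (query, text) pair:

```python
import concurrent.futures

def retrieve(query, backends, reranker, top_k=5):
    """Fan a query out to several search backends in parallel,
    de-duplicate the merged candidates, then re-rank them."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        results = pool.map(lambda b: b(query), backends)
    seen, candidates = set(), []
    for hits in results:
        for doc_id, text in hits:
            if doc_id not in seen:  # first backend to return a doc wins
                seen.add(doc_id)
                candidates.append((doc_id, text))
    candidates.sort(key=lambda d: reranker(query, d[1]), reverse=True)
    return candidates[:top_k]

# Toy backends and re-ranker standing in for BM25 / vector / cross-encoder
def backend_a(q): return [("d1", "alpha beta"), ("d2", "gamma")]
def backend_b(q): return [("d2", "gamma"), ("d3", "alpha alpha")]
def toy_rerank(q, text): return text.count(q)

top = retrieve("alpha", [backend_a, backend_b], toy_rerank, top_k=2)
```

In production the same service also attaches citations and the per-query log record the observability layer consumes.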

Five hard-won lessons

  • Do not optimise embeddings in a vacuum. A 2% NDCG improvement that costs 5x latency is not a win.
  • Instrument your chunker. Bad retrieval is almost always bad chunking, not bad embedding.
  • Measure the model's uncertainty. A confident wrong answer is worse than a hedged right one.
  • Cache aggressively. Most production queries are near-duplicates. Cache the retrieved context, not just the final answer.
  • Budget retrieval explicitly. Decide in advance what latency and cost you can spend per query. Design backwards from that.
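The caching lesson above hinges on a normalised cache key so that near-duplicate queries collapse to one entry. A minimal sketch that caches the retrieved context (production systems often go further and bucket by embedding similarity):

```python
import hashlib

def cache_key(query: str) -> str:
    # Collapse case and whitespace so trivial variants share a key.
    normalised = " ".join(query.lower().split())
    return hashlib.sha256(normalised.encode()).hexdigest()

_context_cache = {}

def get_context(query, retrieve_fn):
    """Return cached retrieved context, calling retrieve_fn on a miss."""
    key = cache_key(query)
    if key not in _context_cache:
        _context_cache[key] = retrieve_fn(query)
    return _context_cache[key]

calls = []
def fake_retrieve(q):
    calls.append(q)
    return ["chunk-1", "chunk-2"]

get_context("What is RAG?", fake_retrieve)
get_context("  what is  RAG?  ", fake_retrieve)  # hits the cache
```

Caching at the context level rather than the answer level also lets you re-generate answers with a newer model without re-running retrieval.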

Where this is going

The next wave is learned retrieval — where the retrieval layer itself is a small model trained on your traffic. It is still early, but the results are striking: for a narrow, well-instrumented domain, a 500M-parameter learned retriever can beat an off-the-shelf pipeline at a fraction of the cost.

TL;DR

  • Hybrid search > pure vector search. Always.
  • Tenant isolation, freshness tiers and embedding versioning are production concerns from day one.
  • Retrieval quality dominates model quality for most real-world apps.
  • Treat your vector layer as critical infra, not a sidecar.