Chapter 4 — Selecting the Right Vector Database

Fourth post of the chapter-by-chapter walkthrough of LLM Primer III: Enhancing Enterprise AI with RAG. The part of a RAG system that grows fastest, costs most at scale, and locks the team in hardest — chosen on technical merits and decided on operational ones.

Why this chapter exists

The vector database is the storage layer of the retrieval system, and the choice is a multi-axis decision — performance, data residency, the team’s operational shape, the total cost of ownership over the lifetime of the system. A wrong choice locks in for years, because moving a billion embeddings between systems is a project no team takes on lightly. The technical comparisons are useful, but the choice is usually decided on the three axes the technical comparisons elide: where the data is allowed to live, what the team can operate, and what the system costs over its lifetime.

One line: the right vector database is the one whose operational shape matches the team’s shape and whose residency story matches the regulatory perimeter — pure benchmark performance is almost never the binding constraint.

4.1 Architectures: purpose-built versus extensions

The first decision, often unnamed, is between adopting a new system purpose-built for vector workloads and extending a system the team already operates. A purpose-built database — Pinecone, Qdrant, Milvus, Weaviate, Vespa — is engineered from the index outward. Query planner, storage layout, replication model, and API are all designed for approximate nearest-neighbor queries over high-dimensional vectors. Higher performance ceilings, especially on hybrid filter-plus-vector queries; another system to operate.
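To make "hybrid filter-plus-vector query" concrete, here is a minimal sketch of the query shape in plain Python. It is an exact brute-force scan, not the approximate index a purpose-built database would use, and the record layout and the `filtered_search` helper are hypothetical names chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(records, query, where, k=3):
    """Keep records whose metadata matches `where`, then rank by similarity."""
    candidates = [
        r for r in records
        if all(r["meta"].get(field) == value for field, value in where.items())
    ]
    ranked = sorted(candidates, key=lambda r: cosine(r["vector"], query), reverse=True)
    return [r["id"] for r in ranked[:k]]


# Hypothetical toy corpus: two-dimensional vectors with one metadata field.
records = [
    {"id": "a", "vector": [1.0, 0.0], "meta": {"lang": "en"}},
    {"id": "b", "vector": [0.9, 0.1], "meta": {"lang": "de"}},
    {"id": "c", "vector": [0.0, 1.0], "meta": {"lang": "en"}},
]

print(filtered_search(records, [1.0, 0.0], {"lang": "en"}, k=2))
```

The scan touches every record, which is fine for a toy corpus. Doing the same thing efficiently at scale, where the filter and the approximate index have to cooperate, is the part the purpose-built systems are engineered around.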

An extension approach — pgvector, Elasticsearch dense_vector, MongoDB Atlas Vector Search, Redis with RediSearch — adds vector search to a database the team already runs. No new authentication, backup procedure, on-call rotation, or patching cadence. The performance ceiling is set by the host’s architecture and is usually well above what applications below the tens-of-millions-of-vectors range require. The decision is rarely a pure performance question. It is more often a question of where the team is willing to spend its operational budget — a team running one Postgres database does not want to operate a second; a team running ten data services does not blink at an eleventh.
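A back-of-envelope calculation helps put the "tens of millions of vectors" range in perspective. The sketch below estimates raw vector storage only. The 768-dimension figure and float32 storage are assumptions, and real systems add index and metadata overhead on top.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for dense vectors, excluding index and metadata overhead."""
    return num_vectors * dimensions * bytes_per_value


def to_gib(num_bytes):
    """Convert bytes to gibibytes."""
    return num_bytes / 1024 ** 3


# Assumed embedding width; substitute the output size of the model in use.
DIMENSIONS = 768

for count in (100_000, 10_000_000, 1_000_000_000):
    size = to_gib(raw_vector_bytes(count, DIMENSIONS))
    print(f"{count:>13,} vectors: {size:,.1f} GiB raw")
```

Under these assumptions, ten million vectors come to about 30.7 GB of raw float32 data. That is a workload a single well-provisioned host can plausibly hold, which is why the operational question tends to matter more than the performance ceiling in this range.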

4.2 Managed leaders: Pinecone and Vertex

Pinecone is the canonical managed vector database: operational simplicity, predictable latency, a mature SDK, and a serverless tier that decouples storage from compute and prices on actual usage rather than provisioned capacity. Serverless is the right default for new deployments unless the workload specifically benefits from reserved capacity. The price is the architectural lock-in any managed proprietary system carries: embeddings are portable in principle, but the export-and-reindex cost is real.

Vertex AI Vector Search is Google’s offering, built on the ScaNN library that powers Google’s own large-scale similarity search. It brings a higher scale ceiling, tight integration with the rest of GCP (embedding models, IAM, Cloud Monitoring), and the matching strategic commitment to a single cloud. Azure AI Search occupies the same position for teams committed to Microsoft. The choice between the platform-native options usually follows the existing cloud commitment, which is reasonable as long as the team has verified scale and residency.

4.3 Open-source: Qdrant, Milvus, Weaviate

The open-source category is for teams that want to control their own infrastructure, whether for cost, residency, or strategic reasons, and that have the operational capacity to run distributed systems.

Qdrant is the smallest and most focused: written in Rust, single-binary deployable, engineered for low-latency vector search with strong filtering and quantization support. It is approachable enough to be running in minutes, and it is the right choice for the smallest possible operational footprint.

Milvus is the largest and most enterprise-oriented: a cloud-native architecture separating compute, storage, and metadata, with the highest scale ceiling in the category (billion-scale corpora, GPU-accelerated indexes, tiered storage). The operational complexity is significant, mitigated by the managed Zilliz Cloud offering.

Weaviate sits between them: more feature-rich than Qdrant, less complex than Milvus, with built-in modules for embedding, reranking, and multi-tenancy. It is a search platform rather than just a search index.

All three are permissively licensed at their cores (Qdrant and Milvus under Apache 2.0, Weaviate under BSD 3-Clause) with paid managed offerings, and benchmarks are within a small constant factor of each other. The decision is fit, not raw performance.

4.4 Embedded and Postgres: Chroma and pgvector

Most real RAG deployments serve dozens of queries per second over a few hundred thousand vectors, not millions per second over billions. For these workloads the right tool runs in the application process or alongside it.

Chroma is the embedded option: it runs in-process by default and persists to local disk, and the simplest case requires no configuration. It is right for prototypes, tools that ship with their own data, and production deployments that fit on one machine.

pgvector adds a vector type, distance operators, and HNSW/IVFFlat indexing to Postgres. For corpora up to roughly ten million vectors on a well-provisioned host, pgvector is a credible production choice and the operationally simplest option for teams already running Postgres. Vector search becomes a SQL query against a table the existing ORM understands; joins with structured metadata are first-class.

The hidden virtue of these options is that they lower the cost of changing your mind: the migration to a distributed system, if it happens, is bounded.
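As an illustration of vector search as an ordinary SQL query with a metadata join, here is a hedged sketch. The `documents` and `chunks` tables, their columns, the 768-dimension vector width, and the `build_search` helper are all hypothetical. The `%s` placeholders assume a psycopg-style driver, and nothing here connects to a database.

```python
# Hypothetical schema. The vector type, hnsw index method, and vector_cosine_ops
# operator class come from pgvector; table and column names are made up.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (id bigserial PRIMARY KEY, title text, department text);
CREATE TABLE chunks (
    id bigserial PRIMARY KEY,
    document_id bigint REFERENCES documents (id),
    body text,
    embedding vector(768)
);
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine-distance operator; smaller means more similar.
SEARCH_SQL = """
SELECT c.id, c.body, d.title, c.embedding <=> %s::vector AS cosine_distance
FROM chunks AS c
JOIN documents AS d ON d.id = c.document_id
WHERE d.department = %s
ORDER BY c.embedding <=> %s::vector
LIMIT %s
"""


def to_vector_literal(embedding):
    """Format a sequence of numbers in pgvector's text form, e.g. [0.1,0.2]."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_search(embedding, department, k=5):
    """Return the SQL text and parameter tuple for a filtered similarity search."""
    literal = to_vector_literal(embedding)
    return SEARCH_SQL, (literal, department, literal, k)


sql, params = build_search([0.1, 2, 3.5], "legal", k=3)
print(params)
```

With a live connection, the pair would be passed to something like `cursor.execute(sql, params)`. The point of the sketch is that the metadata filter and the similarity ordering live in one statement, next to the rest of the application's SQL.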

4.5 Residency, operations, and cost — the axes that actually decide

The three operational axes deserve naming because they are where production decisions are actually made. Data residency narrows the candidate list before any technical comparison is meaningful. EU data protection, financial-services regulation, and sovereign-cloud commitments are not negotiable, and the first question to ask of any vendor is which regions they support and what their contractual data-handling commitments are. A particular pitfall: embeddings are derived data, but when they are computed from personal data they can still qualify as personal data under many regulatory frameworks, because they can be partially inverted or used to retrieve the original through similarity search. A contract that covers raw documents but is vague about embeddings is incomplete.
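The idea that residency narrows the list before any technical comparison can be sketched as a simple hard filter. Everything in this example is hypothetical: the option names, the region labels, and the `contract_covers_embeddings` flag stand in for facts a team would have to gather from vendor contracts.

```python
def residency_shortlist(candidates, required_regions, embeddings_must_be_covered=True):
    """Return names of candidates that satisfy the hard residency constraints."""
    required = set(required_regions)
    shortlist = []
    for candidate in candidates:
        # Every required region must be offered; a partial match is a failure.
        if not required <= set(candidate["regions"]):
            continue
        # A contract that is silent on embeddings is treated as incomplete.
        if embeddings_must_be_covered and not candidate["contract_covers_embeddings"]:
            continue
        shortlist.append(candidate["name"])
    return shortlist


# Hypothetical candidates; real entries come from vendor documentation and contracts.
candidates = [
    {"name": "option_a", "regions": {"eu-west", "us-east"}, "contract_covers_embeddings": True},
    {"name": "option_b", "regions": {"us-east"}, "contract_covers_embeddings": True},
    {"name": "option_c", "regions": {"eu-west"}, "contract_covers_embeddings": False},
]

print(residency_shortlist(candidates, {"eu-west"}))
```

Only the survivors of this filter are worth benchmarking. Running it first keeps the team from investing in a technically attractive option that legal review would reject anyway.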

Operational shape is the team’s own capacity. A team of three engineers running one application service should choose the option that adds the least operational surface: pgvector, managed Pinecone or Qdrant Cloud, or embedded Chroma. A team of thirty can absorb the complexity of an open-source distributed system for the cost and capability advantages. The mistake is choosing a system mismatched to the team’s actual capacity to run it.

Total cost over the system’s lifetime includes provisioning, monitoring, backup, restore drills, capacity planning, upgrade work, and the proportional cost of on-call time. The honest framing asks the cost in three scenarios (current workload, 10x, and 100x) because the slope of the curve matters as much as the starting point.
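The three-scenario framing can be sketched as a small cost model. All figures below are invented for illustration and do not reflect any vendor's pricing. The linear model, with a fixed monthly baseline plus a per-million-vectors rate, is itself a simplifying assumption, as is the 36-month horizon.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    name: str
    fixed_monthly: float        # baseline: minimum cluster, ops and on-call share
    per_million_vectors: float  # monthly cost that scales with corpus size

    def monthly(self, million_vectors):
        return self.fixed_monthly + self.per_million_vectors * million_vectors


def scenario_table(models, baseline_million_vectors, multipliers=(1, 10, 100), months=36):
    """Total cost per model at each workload multiple over the given horizon."""
    return {
        m.name: [
            round(m.monthly(baseline_million_vectors * x) * months, 2)
            for x in multipliers
        ]
        for m in models
    }


# Hypothetical numbers chosen only to show two differently shaped curves.
models = [
    CostModel("usage_priced", fixed_monthly=0.0, per_million_vectors=50.0),
    CostModel("self_operated", fixed_monthly=2000.0, per_million_vectors=5.0),
]

for name, totals in scenario_table(models, baseline_million_vectors=1).items():
    print(name, totals)
```

With these made-up inputs the usage-priced option is far cheaper at the current workload and at 10x, and the self-operated option overtakes it somewhere before 100x. That crossover is the slope the paragraph above refers to, and locating it with real numbers is the purpose of the exercise.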

Worth holding onto: write a one-page decision memo before choosing. Name the residency requirements that are non-negotiable, the team’s operational capacity in concrete terms, and the expected cost over a three-year horizon at three workload sizes. The act of writing it surfaces assumptions that would otherwise stay implicit, and teams who circulate it to one or two skeptical reviewers usually catch at least one material problem before commitment. The memo, archived and updated annually, is the cheapest insurance against re-litigating decisions that have already been made well.

What Chapter 4 sets up

The vector database determines what the storage layer can hold, how fast it can be queried, and what filter and metadata patterns it supports. None of these properties alone decides retrieval quality; together they decide what is possible to build on top. What gets built on top is the retrieval pipeline, where the gains compound: hybrid search combining dense vectors with BM25, reciprocal rank fusion across heterogeneous retrievers, cross-encoder reranking, and the query-understanding layer that bridges how users ask and how documents answer.

Next — Chapter 5: Architecting the Retrieval Pipeline. How dense vector search and keyword retrieval combine through reciprocal rank fusion, the cross-encoder reranking step that closes most of the remaining quality gap, and the query-understanding layer that does the rest.

Want the full picture? The book walks each candidate system with concrete cost models at three workload scales, includes the residency checklist that survives a security review, and treats Vespa as the rank-profile-driven hybrid engine the rest of the category is slowly evolving toward.