Vector Database Management Systems: Architecture and Internals

A technical overview of how Vector Database Management Systems work, covering vectorization, indexing, hardware tradeoffs, and the core challenges in building systems that search by meaning rather than exact match.

This post covers the generalized architecture of Vector Database Management Systems (VDBMS). It is meant to give a clear mental model of how these systems are structured, not a deep dive into any specific implementation. If you already know what a vector is and have worked with a relational or document database, you have everything you need to follow along.

By the end, you should be able to answer: what a vector database actually stores, why regular databases fall short for AI workloads, how the core pipeline works, and what the active pain points are.

What is a Vector Database?

A vector database stores high-dimensional numerical representations of data, called embeddings, rather than the raw data itself. The key distinction is that it is optimized not for exact lookups, but for approximate nearest neighbor (ANN) search: given a query vector, find the stored vectors most similar to it.

These embeddings are generated by machine learning models. A sentence encoder produces a vector that captures the semantic meaning of a sentence. A vision model produces a vector that captures the visual content of an image. Two pieces of data that are semantically similar will have embeddings that are geometrically close in the vector space, even if they share no exact words or pixels.

This is what makes vector databases useful for AI applications. You are not searching for an exact match. You are searching for meaning.

It is important to distinguish between a vector database (stores embeddings, handles retrieval) and a Vector Database Management System (VDBMS), which also handles the embedding generation, indexing, replication, and query pipeline end to end. In practice the terms are used interchangeably, and for the rest of this post I will do the same.

Why Not a Regular Database?

Relational databases are built around exact match and range queries on structured data. They are excellent at answering "find all orders placed after date X where amount > Y." They are not designed for "find all images that look like this one" or "find documents that mean the same thing as this query, even if they use different words."

You could store a vector as a blob or an array column in Postgres. But running a similarity search over millions of 1536-dimensional vectors with a brute-force scan is O(n * d) per query, where d is the dimensionality. At scale, this is unusable without specialized index structures.
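To make that cost concrete, here is what a brute-force scan looks like in plain Python. This is a sketch for intuition only: the toy sizes are far below production scale, and real systems run this in optimized native code.

```python
import math
import random

def cosine_similarity(a, b):
    # O(d): one pass over the dimensions per pair of vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, vectors, k=5):
    # O(n * d): every query scores every stored vector, then sorts
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return scored[:k]

# Toy scale: 2,000 vectors of dimension 64 (production: millions at 768+ dims)
random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(64)] for _ in range(2_000)]
query = vectors[42]  # a stored vector should be its own nearest neighbor

top = brute_force_search(query, vectors, k=3)
print(top[0])  # (similarity ~1.0, index 42)
```

Every query pays the full n * d cost, which is exactly what ANN indexes avoid.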

Vector databases exist to make this fast.

General Architecture of a VDBMS

Every VDBMS, regardless of implementation, has three core functional layers: vectorization, indexing, and hardware/operations. The flow below shows how data and queries move through the system on the insert path and the query path.

Insert path:

  Raw Data (text / image / audio / video)
    → Embedding Model (Word2Vec, FastText, CLIP, BERT, ...)
    → Dense Vector (float32[768], float32[1536], ...)
    → ANN Index (HNSW / IVF / PQ)
    → Vector Store (+ metadata + replicas)

Query path:

  Query ("find images of a cat")
    → Same Embedding Model (must match the insert-time embedding space)
    → Query Vector (same dimensionality as stored vectors)
    → Similarity Search (cosine / dot product / L2 distance)
    → Top-K Results (ranked by similarity score)

Layer 1: Vectorization

Every data point that enters the system is passed through an embedding model, which converts it into a fixed-length dense vector. The embedding model is the most consequential design decision in the system. If you use BERT to embed your documents at insert time, you must use BERT (or a compatible model) to embed queries at search time. The vector space has to be the same.

Common embedding models by modality:

  • Text: Word2Vec, FastText, Doc2Vec (older, shallow); BERT, Sentence-BERT, Ada-002 (modern, transformer-based)
  • Images: ResNet features, CLIP (cross-modal, so the same space works for text and images)
  • Audio: wav2vec, Whisper encoder outputs

The dimensionality of the output vector is determined by the model. OpenAI's text-embedding-ada-002 outputs 1536 dimensions. Sentence-BERT outputs 768. Higher dimensionality generally means richer representations, but it also means more compute at every subsequent step.

Layer 2: Indexing

Once vectors are stored, the indexing layer is what makes retrieval fast. Brute-force search over all stored vectors is exact but slow. Approximate nearest neighbor (ANN) indexes trade a small amount of recall for dramatically faster search.

The three main families of ANN index:

  • HNSW (Hierarchical Navigable Small World): Graph-based. Builds a multi-layer graph where each layer is a coarser approximation of the data. Query traversal starts at the top layer and narrows down. High accuracy, fast queries, but high memory usage.
  • IVF (Inverted File Index): Clusters vectors using k-means, then at query time only searches within the nearest clusters. Fast, but accuracy depends on cluster quality.
  • PQ (Product Quantization): Compresses vectors by splitting them into subvectors and quantizing each independently. Drastically reduces memory footprint. Often combined with IVF (IVF-PQ).
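To make the IVF idea concrete, here is a toy inverted-file index in plain Python. This is a sketch of the general technique, not the FAISS implementation: a few rounds of Lloyd's k-means build the clusters, and queries probe only the `nprobe` nearest ones.

```python
import math
import random

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=5):
    # A few rounds of Lloyd's algorithm: assign points, recompute centroids
    centroids = random.sample(vectors, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            buckets[min(range(k), key=lambda i: l2(v, centroids[i]))].append(v)
        for i, b in enumerate(buckets):
            if b:
                centroids[i] = [sum(col) / len(b) for col in zip(*b)]
    return centroids

class ToyIVF:
    def __init__(self, vectors, n_clusters=8):
        self.centroids = kmeans(vectors, n_clusters)
        # One inverted list per cluster, holding (id, vector) pairs
        self.lists = [[] for _ in range(n_clusters)]
        for idx, v in enumerate(vectors):
            c = min(range(n_clusters), key=lambda i: l2(v, self.centroids[i]))
            self.lists[c].append((idx, v))

    def search(self, query, k=3, nprobe=2):
        # Probe only the nprobe nearest clusters instead of scanning everything;
        # raising nprobe trades speed for recall
        order = sorted(range(len(self.centroids)),
                       key=lambda i: l2(query, self.centroids[i]))
        candidates = [p for c in order[:nprobe] for p in self.lists[c]]
        candidates.sort(key=lambda p: l2(query, p[1]))
        return [idx for idx, _ in candidates[:k]]

random.seed(1)
data = [[random.gauss(0, 1) for _ in range(16)] for _ in range(500)]
index = ToyIVF(data)
print(index.search(data[7], k=1))  # the query vector finds itself
```

The approximation error comes entirely from the `nprobe` cutoff: a true neighbor that landed in an unprobed cluster is simply missed.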

Similarity between the query vector and stored vectors is computed using a distance or similarity metric. Cosine similarity is common for semantic embeddings because it measures the angle between vectors, ignoring magnitude. L2 (Euclidean) distance is more appropriate when magnitude carries information.
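The difference between the two metrics is easy to see on a pair of vectors that point in the same direction but differ in length (a minimal illustration):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the magnitude

print(cosine(a, b))  # ~1.0: identical direction, magnitude ignored
print(l2(a, b))      # ~3.74: the magnitude difference shows up as distance
```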

Layer 3: Hardware and Operations

As the number of stored vectors and their dimensionality grow, compute requirements scale significantly. Embedding generation and ANN index construction are GPU-accelerated in most production systems. Libraries like FAISS (Facebook AI Similarity Search) provide GPU implementations of brute-force (flat) and IVF-based indexes; graph indexes like HNSW run on the CPU in FAISS.

On the operational side, vector databases face the same challenges as any distributed database:

  • Replication: Data is replicated across nodes to prevent loss on hardware failure.
  • Sharding: Large vector collections are partitioned across multiple nodes.
  • Consistency vs. availability tradeoffs: Most VDBMSs lean toward availability given that approximate results are already acceptable.
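A minimal sketch of how sharded search can work, assuming simple hash-by-id placement and brute-force shards (real systems use an ANN index per shard and network RPCs rather than in-process calls). The key property: merging each shard's local top-k yields the correct global top-k for exact search.

```python
import heapq
import math
import random

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class Shard:
    """One node's slice of the collection (brute force for clarity)."""
    def __init__(self):
        self.vectors = {}  # global id -> vector

    def add(self, vid, vec):
        self.vectors[vid] = vec

    def search(self, query, k):
        # Each shard returns its own local top-k as (distance, id) pairs
        return heapq.nsmallest(
            k, ((l2(query, v), vid) for vid, v in self.vectors.items()))

class ShardedStore:
    def __init__(self, n_shards=4):
        self.shards = [Shard() for _ in range(n_shards)]

    def add(self, vid, vec):
        # Route by id; real systems may also partition by cluster assignment
        self.shards[vid % len(self.shards)].add(vid, vec)

    def search(self, query, k):
        # Scatter to every shard, then gather and merge the partial results
        partials = [s.search(query, k) for s in self.shards]
        return heapq.nsmallest(k, (p for part in partials for p in part))

random.seed(2)
store = ShardedStore()
vecs = [[random.gauss(0, 1) for _ in range(8)] for _ in range(200)]
for i, v in enumerate(vecs):
    store.add(i, v)

best = store.search(vecs[50], k=1)
print(best[0][1])  # global id of the nearest vector
```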

Use Cases

The core operation of a VDBMS (find semantically similar items) maps onto a surprisingly wide range of applications:

  • Retrieval-Augmented Generation (RAG): Store document chunks as embeddings. At inference time, embed the user query, retrieve the most relevant chunks, and pass them to an LLM as context. This is how ChatGPT plugins and most enterprise AI search systems work.
  • Image search: Google Photos uses visual embeddings to find "cat" images without any tags. The query embedding and image embeddings live in the same space via models like CLIP.
  • Music recognition: Shazam generates an audio fingerprint (a kind of embedding) from a short clip and finds the nearest match in its database.
  • Recommendation systems: User behavior is encoded as a vector. Items similar to what a user has engaged with are retrieved via ANN search.
  • Chatbot memory / context: Long conversation histories are stored as embeddings. Relevant past context is retrieved at each turn rather than stuffed into a fixed context window.
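The RAG retrieval step above can be sketched as follows. The `embed` function here is a toy word-hashing stand-in for a real embedding model (e.g. Sentence-BERT), and the chunk texts are made up; the one requirement it preserves is that the same model embeds both chunks and queries.

```python
import math

def embed(text, dim=64):
    # Toy stand-in for a real embedding model: hash each word into a
    # bucket, then L2-normalize. Texts sharing words land near each other.
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    # Vectors are unit length, so the dot product is the cosine similarity
    return sum(x * y for x, y in zip(a, b))

# Document chunks stored at "insert time" (contents are illustrative)
chunks = [
    "The mitochondria is the powerhouse of the cell",
    "Paris is the capital of France",
    "HNSW is a graph-based ANN index",
]
chunk_vecs = [embed(c) for c in chunks]

def retrieve(query, k=1):
    # Embed the query with the SAME model used at insert time
    qv = embed(query)
    ranked = sorted(range(len(chunks)),
                    key=lambda i: cosine(qv, chunk_vecs[i]), reverse=True)
    return [chunks[i] for i in ranked[:k]]

# Retrieve context, then hand it to an LLM as part of the prompt
context = retrieve("what is the capital of France")
prompt = (f"Answer using only this context:\n{context[0]}\n\n"
          f"Question: what is the capital of France")
print(context[0])
```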

Challenges and Limitations

Speed vs. Accuracy

ANN indexes are approximate by design. The recall vs. latency tradeoff is parameterized (e.g., the ef_search parameter in HNSW controls how many candidates are explored). Higher recall means more computation. In domains like medical diagnosis or legal document retrieval, where missing the right result is costly, this tradeoff is an open problem. Exact search is always an option, but it does not scale.

The Curse of Dimensionality

As vector dimensionality grows, the distance between any two random vectors converges. In high-dimensional spaces, everything is roughly equidistant from everything else, which makes distance-based similarity less discriminative. Dimensionality reduction (PCA, UMAP) can help, but projection always loses some information. This is an inherent property of the geometry, not a bug in any particular implementation.
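The concentration effect is easy to demonstrate with a toy experiment: for uniform random points, the ratio between the farthest and nearest neighbor distance from a query point shrinks toward 1 as dimensionality grows.

```python
import math
import random

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrast(dim, n=200):
    # Ratio of the farthest to the nearest neighbor distance from one query
    # point to n random points; values near 1 mean distances have concentrated
    query = [random.random() for _ in range(dim)]
    points = [[random.random() for _ in range(dim)] for _ in range(n)]
    dists = [l2(query, p) for p in points]
    return max(dists) / min(dists)

random.seed(3)
ratios = {dim: contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, r in ratios.items():
    print(dim, round(r, 2))  # the ratio shrinks toward 1 as dim grows
```

At low dimensionality the nearest point is far closer than the farthest; at dim=1000 the two are nearly the same distance, which is what makes ranking by distance less discriminative.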

Ecosystem Maturity

Vector databases as a distinct product category emerged around 2019 to 2020 with systems like Pinecone, Weaviate, and Milvus. Chroma and Deep Lake followed as developer-friendly alternatives. Postgres extensions like pgvector have added ANN support directly to relational databases, blurring the line between systems. The tooling is moving fast, and production best practices are still being established.

Summary

A VDBMS solves a specific problem: fast approximate retrieval over high-dimensional embeddings. Its architecture has three layers: vectorization (embedding generation), indexing (ANN data structures for fast search), and hardware/ops (GPU compute, replication, sharding). The core algorithmic tradeoff in the indexing layer is recall versus latency, and this tradeoff is not yet solved in the general case.

For a deeper technical treatment, the original paper linked below covers the formal definitions and benchmarks in detail.

References

  • Article: VDBMS: Fundamental concepts, use-cases, and current challenges
  • Video: Building Production-Ready RAG Applications — Jerry Liu
  • Pinecone Documentation