RAG & Knowledge Systems
Retrieval-augmented generation, embeddings, and working with vector databases.
What You'll Learn
- Explain why LLMs need external knowledge retrieval and what problems RAG solves
- Describe how embeddings convert text into numerical vectors and why similarity search works
- Evaluate vector database options and understand how to choose the right one for a given use case
- Design a RAG pipeline including chunking strategy, retrieval configuration, and generation step
- Identify common RAG failure modes and apply techniques like hybrid search to improve retrieval quality
Why LLMs Need External Knowledge
Language models have two fundamental knowledge limitations that no amount of prompt engineering can fix.
The first is the training cutoff. An LLM's knowledge is frozen at the point its training data was collected. Ask it about events after that date and it either hallucinates or admits ignorance. For many use cases (support documentation, internal company knowledge, research on recent developments) this is a dealbreaker.
The second is context window capacity. Even with very large context windows, you cannot simply paste your entire knowledge base into every request. A 200-page technical manual, a database of thousands of product specs, or years of customer support tickets: these exceed what any context window can hold, and even if they fit, forcing the model to read irrelevant material degrades response quality and drives up cost.
Retrieval-augmented generation (RAG) solves both problems by separating knowledge storage from knowledge use. Instead of baking facts into a model's weights or cramming them into a prompt, you store knowledge in an external database and retrieve only the most relevant pieces at query time. The model then generates its response using those retrieved pieces as grounded context.
The result is a system that can work with arbitrarily large knowledge bases, stay current with new information, cite specific sources, and produce answers grounded in your actual data rather than the model's potentially stale or hallucinated recollections. These properties make RAG the most widely deployed architecture for enterprise AI applications.
Embeddings: Turning Text into Numbers
The magic behind RAG (and behind semantic search generally) is embeddings. An embedding is a numerical representation of text: a list of floating-point numbers, usually hundreds or thousands of values long, called a vector. The key property is that the numbers encode meaning. Texts with similar meaning produce vectors that are mathematically close to each other.
Here is a useful analogy. Imagine a giant map where every piece of text occupies a location. Related concepts cluster together: "dog" and "canine" are near neighbors, "machine learning" and "neural network" live in the same neighborhood, "Paris" and "capital of France" are extremely close. Unrelated concepts are far apart. An embedding model learns to assign coordinates on this map.
When you search this map by meaning rather than keywords, you find relevant content even when the exact words do not match. A query about "fixing a login error" will retrieve documentation about "authentication failures" because their embedding vectors are close on the map, even though they share no words.
Embedding models are separate from chat models and are significantly cheaper to run. Common options include OpenAI's text-embedding-3-small and text-embedding-3-large, Cohere's embedding models, and open-source options like the sentence-transformers family. Each model produces vectors of a fixed dimension (the length of the list of numbers), and you must use the same model at both indexing time (when you store documents) and retrieval time (when you embed the query), because mixing models breaks the math.
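The "mathematically close" idea is usually measured with cosine similarity. Here is a minimal sketch using made-up 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions, and you would get the vectors from the model's API rather than writing them by hand):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors chosen by hand to illustrate the geometry.
dog = [0.9, 0.1, 0.0]
canine = [0.85, 0.15, 0.05]
spreadsheet = [0.0, 0.2, 0.95]

print(cosine_similarity(dog, canine))       # high: near neighbors on the map
print(cosine_similarity(dog, spreadsheet))  # low: far apart on the map
```

Vector databases implement this same comparison (or a close relative like dot product or Euclidean distance), just with specialized indexes so it stays fast across millions of vectors.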
One important constraint: embedding models have their own token limits, typically between 512 and 8,192 tokens depending on the model. Documents longer than this limit must be split before embedding, which brings us to chunking.
Embedding distance is not percentage similarity
When vector databases return a similarity score, a score of 0.85 does not mean 85% accurate. Similarity scores are relative, not absolute. A score of 0.85 might be excellent for one type of content and mediocre for another. Always calibrate what score thresholds mean for your specific data by testing with representative queries.
Vector Databases: Storing and Searching at Scale
A vector database is a storage system purpose-built for storing embedding vectors and searching them efficiently. Standard databases (SQL, NoSQL) store and query structured or textual data well, but running a similarity search across millions of floating-point vectors is a fundamentally different computational problem, one that requires specialized indexing algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) to do at practical speed.
The main options in production use today:
Pinecone is a managed, cloud-native vector database. Zero infrastructure to manage, scales automatically, and has a clean API. The tradeoff is cost at scale and vendor lock-in. It is the fastest path to production for teams that do not want to run their own infrastructure.
Weaviate is open-source with a managed cloud option. It supports multi-tenancy natively, has a rich query language, and combines vector search with structured filtering out of the box. Good choice for complex applications that need both semantic and attribute-based filtering.
ChromaDB is open-source and designed for simplicity. It runs in-process (no separate server needed) and is the fastest option to get started with for prototyping and smaller-scale applications. For a RAG proof of concept, ChromaDB gets you running in minutes.
pgvector is a PostgreSQL extension that adds vector similarity search to a standard relational database. If you are already running Postgres, this lets you keep everything in one system. It scales well for hundreds of thousands of vectors; at tens of millions, dedicated vector databases tend to outperform it.
For most production applications using vector-only RAG, the choice comes down to: prototype with ChromaDB, ship to production with Pinecone or Weaviate, consider pgvector if you want to minimize infrastructure complexity and your scale fits it.
Beyond vectors: Graph RAG. Vector databases excel at finding semantically similar chunks of text, but they struggle with questions that require understanding relationships between entities. For example: "Which authors cited by paper X also worked with researcher Y?" or "What products does our largest customer also buy from competitors?" For these relationship-heavy queries, knowledge graphs combined with retrieval form a pattern called Graph RAG.
Neo4j is the most established graph database and has become the default for Graph RAG implementations. It stores data as nodes and relationships, making it natural to model complex networks of entities and their connections. Microsoft GraphRAG is an open-source framework that automatically extracts entities and relationships from documents and builds a knowledge graph, then uses both the graph structure and text chunks for retrieval. FalkorDB is a high-performance graph database optimized for AI workloads, combining graph queries with vector similarity search in a single system.
The practical guidance: use vector-only RAG when your questions are about finding relevant passages in documents. Use Graph RAG when your questions involve relationships, connections, and multi-hop reasoning across entities. Many production systems are increasingly combining both approaches: vector search for content relevance, graph queries for structural relationships.
Building a RAG Pipeline: Chunking, Retrieval, Generation
A RAG pipeline has two phases that operate at different times: indexing (happens once, or on a schedule) and retrieval + generation (happens on every query).
Indexing phase:
You start with your source documents: PDFs, web pages, database records, markdown files, whatever. These get loaded, cleaned, and split into chunks: smaller pieces of text that will each get their own embedding. Each chunk, along with its vector and metadata (source document, page number, date, etc.), gets stored in the vector database.
Chunking strategy is one of the most important decisions in the entire pipeline. Common approaches:
- Fixed-size chunking: Split every 500 tokens with a 50-token overlap between adjacent chunks. Simple and predictable, but may split ideas awkwardly mid-sentence.
- Semantic chunking: Split at natural boundaries (paragraphs, sections, sentences) to keep coherent ideas together. Produces variable-length chunks, better for conceptual content.
- Hierarchical chunking: Store both a summary chunk and its detailed sub-chunks. Retrieve summaries first, then fetch detail on demand. Powerful for long documents but complex to implement.
Overlap between chunks (repeating a small amount of text at chunk boundaries) prevents the system from missing information that spans a boundary.
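Fixed-size chunking with overlap can be sketched in a few lines. This version uses whitespace-separated words as a stand-in for real tokenizer tokens (an assumption for brevity; production code would count tokens with the embedding model's tokenizer):

```python
def chunk_fixed(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks where each chunk repeats the last
    `overlap` words of its predecessor, so boundary-spanning ideas survive."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 1200-word document with step 450 yields chunks starting at 0, 450, 900.
doc = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_fixed(doc, chunk_size=500, overlap=50)
print(len(chunks))  # 3
```

Note how the last 50 words of each chunk reappear at the start of the next one; that repetition is the overlap doing its job.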
Retrieval + Generation phase:
When a query arrives, you embed the query using the same model used at indexing time, run a similarity search to find the top-k most relevant chunks (k is typically 3-10), and assemble those chunks into the prompt alongside the original query. The model generates its response grounded in the retrieved context.
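The retrieval step reduces to scoring every stored vector against the query vector and keeping the top k. A toy sketch with hand-written vectors (a real system would store vectors produced by the embedding model and use a vector database's index instead of a full scan):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy index of (chunk text, vector) pairs. In practice the vectors come from
# the same embedding model used to embed the query.
index = [
    ("Resetting your password", [0.9, 0.1, 0.1]),
    ("Billing and invoices",    [0.1, 0.9, 0.1]),
    ("Login troubleshooting",   [0.8, 0.2, 0.2]),
]

def retrieve(query_vector, k=2):
    scored = [(cosine(query_vector, vec), text) for text, vec in index]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

query_vec = [0.85, 0.15, 0.1]  # pretend this is the embedded user query
print(retrieve(query_vec, k=2))
```

The two authentication-related chunks win; the billing chunk is filtered out by its distance, exactly the behavior you want before assembling the prompt.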
The prompt structure matters. Be explicit about what the retrieved context is and instruct the model to cite it. A system prompt such as "Answer the user's question using only the context provided below. If the context does not contain enough information to answer, say so" dramatically reduces hallucination by anchoring the model to the retrieved material.
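Prompt assembly is plain string construction. A minimal sketch, with the `[Source N]` labeling convention being one reasonable choice rather than a standard:

```python
def build_rag_prompt(question, chunks):
    """Assemble a grounded prompt: instructions, labeled context, question."""
    context = "\n\n".join(
        f"[Source {i + 1}] {chunk}" for i, chunk in enumerate(chunks)
    )
    system = (
        "Answer the user's question using only the context provided below. "
        "If the context does not contain enough information to answer, say so."
    )
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {question}"

prompt = build_rag_prompt(
    "How do I reset my password?",
    ["To reset your password, open Settings and choose Security.",
     "Passwords must be at least 12 characters long."],
)
print(prompt)
```

Labeling each chunk with a source number also lets you instruct the model to cite `[Source N]` in its answer, which makes responses auditable.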
Profile your chunk size
The optimal chunk size depends on your content and queries. Before building a full pipeline, test three chunk sizes (e.g., 256, 512, 1024 tokens) against 20 representative queries and score the retrieval relevance manually. You will almost always find a clear winner for your specific domain. Do this experiment before writing production indexing code.
Hybrid Search and Retrieval Quality
Pure semantic search has a well-known weakness: it struggles with precise, specific queries. If a user asks about "RFC 7231" or a specific product SKU like "SKU-9042-X", semantic search may return loosely related results instead of the exact document that mentions those terms, because the embedding for an unfamiliar identifier is not meaningfully different from other identifiers.
This is why production RAG systems almost always use hybrid search: a combination of semantic (vector) search and traditional keyword (BM25 or TF-IDF) search, with results merged and re-ranked.
The keyword component catches exact matches and rare terms. The semantic component catches conceptual matches and handles paraphrasing. A reranker model (a cross-encoder that scores query-document pairs directly rather than comparing vectors) can then re-rank the merged results to surface the most genuinely relevant chunks before passing them to the language model.
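One widely used way to merge the two result lists is reciprocal rank fusion (RRF): each document earns a score of 1/(k + rank) from every list it appears in, so documents ranked well by both retrievers rise to the top. A minimal sketch (the doc IDs and rankings are invented for illustration):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists into one.
    k=60 is the smoothing constant from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_a", "doc_c", "doc_b"]   # from vector search
keyword  = ["doc_d", "doc_a", "doc_b"]   # from BM25
print(reciprocal_rank_fusion([semantic, keyword]))
```

Note that `doc_a`, which both retrievers ranked highly, beats documents that only one retriever found, which is exactly the behavior hybrid search is after. A cross-encoder reranker can then re-score the fused shortlist before it reaches the language model.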
Retrieval quality is measured in two dimensions:
Precision: Of the chunks you retrieved, what fraction were actually relevant? Low precision means the model gets distracted by irrelevant context, degrading response quality.
Recall: Of all the relevant chunks in the database, what fraction did you retrieve? Low recall means the model cannot answer questions that the data actually supports.
These metrics trade off against each other through the retrieval threshold (how similar does a chunk need to be to be retrieved?) and k (how many chunks to retrieve). Raising k improves recall but hurts precision. Finding the right balance for your use case requires testing with real queries and real users.
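Both metrics are straightforward to compute per query once you have labeled a set of relevant chunks for each test question:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given sets of chunk IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Retrieved 4 chunks; 2 of them are among the 3 actually relevant ones.
p, r = precision_recall({"c1", "c2", "c3", "c4"}, {"c2", "c4", "c9"})
print(p, r)  # precision 0.5, recall 2/3
```

Averaging these over a representative query set gives you the numbers to tune the similarity threshold and k against.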
Beyond retrieval metrics, evaluate the full pipeline end-to-end with frameworks like RAGAS (an open-source evaluation framework for RAG systems) that measure faithfulness (does the answer stick to the retrieved context?), answer relevancy (does the answer address the question?), and context relevance (were the retrieved chunks actually useful?).
Common RAG Failure Modes
Knowing where RAG goes wrong lets you design around the failure modes rather than debugging them in production.
Retrieval misses: The right document exists in the database but does not get retrieved. Causes include poor chunking (the relevant text got split across chunks), inadequate overlap, or a query that uses different vocabulary than the source document. Mitigation: test with a diverse query set during development, improve chunking strategy, add hybrid search.
Context overload: You retrieve too many chunks and the model loses focus, generating a response that averages over conflicting sources rather than answering precisely. Mitigation: use a reranker to trim to the most relevant 3-5 chunks rather than dumping 10 or 20 into the prompt.
Stale data: Your index was built months ago and the knowledge base has been updated. Users ask questions about recent changes and get outdated answers with full confidence. Mitigation: implement incremental indexing (re-embed updated or new documents on a schedule), store ingestion timestamps in metadata, and filter by recency when appropriate.
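A common way to implement incremental indexing is to store a content hash alongside each document's metadata and re-embed only documents whose hash has changed. A sketch of that check (the document-store shape here is an assumption; adapt it to wherever your source documents live):

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reindex(current_docs, indexed_hashes):
    """Return IDs of documents that are new, or whose content changed since
    the last run, so only those get re-chunked and re-embedded."""
    return [
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]

indexed = {"faq": content_hash("old refund policy")}
current = {"faq": "new refund policy", "intro": "welcome guide"}
print(docs_to_reindex(current, indexed))  # both need (re-)indexing
```

This keeps scheduled re-indexing cheap: unchanged documents are skipped entirely, and the stored hashes double as an audit trail of when each document was last ingested.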
Hallucination despite retrieval: The model receives relevant context but still generates information not in that context. This happens when the prompt does not strongly enough anchor the model to the retrieved material, or when the model's prior training knowledge "bleeds through." Mitigation: use explicit instructions in the system prompt, ask the model to quote from the retrieved context, and flag when a query cannot be answered from the available documents.
Metadata filtering blind spots: A user asks a question that is only relevant to a specific category (e.g., "refund policy for enterprise plans") but the retrieval system searches across all customer tiers. Mitigation: extract and store rich metadata at indexing time, and filter on metadata attributes before running vector search so you only search the relevant subset of the index.
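The filter-then-search pattern looks roughly like this. The index layout and the scoring callback are illustrative stand-ins: a real vector database applies the metadata filter natively and scores candidates by vector similarity rather than with a user-supplied function:

```python
def filtered_search(index, metadata_filter, query_scorer, k=3):
    """Apply metadata filters first, then score only the surviving chunks."""
    candidates = [
        entry for entry in index
        if all(entry["metadata"].get(key) == value
               for key, value in metadata_filter.items())
    ]
    candidates.sort(key=lambda e: query_scorer(e["text"]), reverse=True)
    return [e["text"] for e in candidates[:k]]

index = [
    {"text": "Enterprise refunds within 60 days", "metadata": {"tier": "enterprise"}},
    {"text": "Free-tier refunds not available",   "metadata": {"tier": "free"}},
]
# Stand-in scorer; a real system would use vector similarity here.
results = filtered_search(index, {"tier": "enterprise"}, lambda t: t.count("refund"))
print(results)
```

Because the free-tier chunk never enters the candidate set, it cannot outrank the enterprise policy no matter how semantically similar it is to the query.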
RAG is not a hallucination cure
RAG significantly reduces hallucination by grounding the model in retrieved context, but it does not eliminate it. A model can still hallucinate when the retrieved context is ambiguous, when it contradicts the model's strong priors, or when the user's question falls outside what the knowledge base covers. Always implement an "I don't have enough information to answer this" path in your system prompt.
Key Takeaways
- RAG solves the training cutoff and context size limitations of LLMs by retrieving relevant knowledge at query time rather than storing it in model weights
- Embeddings convert text into numerical vectors where semantic similarity corresponds to mathematical proximity, enabling meaning-based search across large corpora
- Chunking strategy is one of the highest-leverage decisions in a RAG pipeline: chunk size and overlap directly determine whether the right information gets retrieved
- Hybrid search (combining semantic vector search with keyword BM25 search) outperforms either approach alone, especially for precise queries involving specific identifiers or terminology
- RAG quality must be measured end-to-end: retrieval precision and recall at the database level, plus faithfulness and answer relevancy at the generation level