
RAG: Teaching AI What It Doesn't Know

ai · rag · embeddings · ai-fundamentals

This post covers Retrieval-Augmented Generation (RAG) in depth. It builds directly on the embedding and vector concepts from Embeddings and Vector Space and connects to prompting strategies covered in the next post.

Introduction

LLMs have two fundamental limitations: they only know what was in their training data, and that training data has a cutoff date. They know nothing about your company’s internal documentation, your product catalog, your customer records, or anything that happened after training ended.

Retrieval-Augmented Generation (RAG) solves this by retrieving relevant information from your own data and including it in the prompt. Instead of relying solely on what the model memorized during training, RAG gives it reference material to work from.

This is the pattern behind most enterprise AI applications: internal knowledge bases, document Q&A systems, customer support bots that know your products, and coding assistants that understand your codebase. If you are building an AI application that needs to work with private or current data, you are probably building a RAG system.


How RAG Works

The RAG pipeline has two phases: indexing (done once, or periodically) and retrieval + generation (done on every query).

Indexing Phase

  1. Collect your source documents (knowledge base articles, PDFs, wiki pages, code, etc.)
  2. Chunk each document into smaller pieces
  3. Embed each chunk using an embedding model
  4. Store the embeddings and their associated text in a vector database
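The indexing steps above can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: `embed` hashes words into a small vector as a stand-in for a real embedding model (which you would call over an API or load locally), and the "vector database" is a plain list of (embedding, text) pairs.

```python
import hashlib
import math

def embed(text: str, dim: int = 16) -> list[float]:
    # Toy stand-in for a real embedding model: hash each word into a
    # fixed-size vector, then normalize to unit length.
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(text: str, size: int = 50) -> list[str]:
    # Naive fixed-size chunking by word count; better strategies are
    # covered in the chunking section below.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# The "vector database": a list of (embedding, chunk text) pairs.
index: list[tuple[list[float], str]] = []

def index_document(doc: str) -> None:
    # Steps 2-4: chunk, embed, store.
    for piece in chunk(doc):
        index.append((embed(piece), piece))
```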

Query Phase

  1. A user asks a question
  2. Embed the question using the same embedding model
  3. Search the vector database for the most similar chunks
  4. Construct a prompt that includes the retrieved chunks plus the user’s question
  5. Send that prompt to the LLM
  6. The LLM generates a response grounded in the retrieved context
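The query phase can be sketched the same way. `embed` and `llm` are passed in as functions here because they are stand-ins for a real embedding model and a real LLM call; the point of the sketch is the flow, not those components.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

def answer(question, index, embed, llm, top_k=3):
    # Steps 1-2: embed the question with the same model used at index time.
    q_vec = embed(question)
    # Step 3: rank stored chunks by similarity and keep the top k.
    ranked = sorted(index, key=lambda pair: cosine(q_vec, pair[0]), reverse=True)
    context = [text for _, text in ranked[:top_k]]
    # Steps 4-6: build a grounded prompt and hand it to the LLM.
    prompt = "Context:\n" + "\n---\n".join(context) + f"\n\nQuestion: {question}"
    return llm(prompt)
```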

The model is not searching for answers. It is receiving relevant context and using its language capabilities to synthesize a response. The quality of that response depends heavily on the quality of the retrieved context.


Chunking: The Most Underestimated Step

Chunking is where most RAG implementations succeed or fail. How you split your documents into pieces determines what the retrieval system can find and how useful the results are.

Why Chunking Matters

An embedding represents the overall meaning of a piece of text. If the chunk is too large, the embedding becomes a vague average of many topics. If it is too small, the embedding loses context. The goal is chunks that are topically cohesive: each chunk should be about one thing.

Chunking Strategies

Fixed-size chunking splits text into pieces of a set token count (e.g., 512 tokens) with optional overlap between chunks. Simple to implement, but splits can land in the middle of a paragraph or thought.

Document: [=====|=====|=====|=====]
Chunks:    [--1--][--2--][--3--][--4--]
With overlap: [--1--]
                 [--2--]
                    [--3--]
                       [--4--]
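A minimal fixed-size chunker with overlap, using a list of tokens as input (words work as a rough stand-in for real tokenizer output):

```python
def fixed_size_chunks(tokens: list[str], size: int = 512,
                      overlap: int = 50) -> list[list[str]]:
    # Slide a window of `size` tokens forward by (size - overlap) each
    # step, so consecutive chunks share `overlap` tokens.
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```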

Recursive/structural chunking splits along natural boundaries: paragraphs, sections, headers. Falls back to smaller splits only when sections are too large. Preserves document structure better than fixed-size.

Semantic chunking uses embeddings to detect topic shifts within a document. When the embedding similarity between adjacent sentences drops below a threshold, a new chunk starts. More expensive to compute but produces the most topically cohesive chunks.
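A sketch of that idea, with `embed` passed in as a stand-in for a real sentence-embedding model; the threshold is something you would tune empirically:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

def semantic_chunks(sentences, embed, threshold=0.3):
    # Start a new chunk whenever similarity between adjacent sentences
    # drops below the threshold, signaling a topic shift.
    if not sentences:
        return []
    vecs = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, sent, vec in zip(vecs, sentences[1:], vecs[1:]):
        if cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```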

Document-aware chunking uses the document’s own structure. For markdown, split on headers. For code, split on function or class boundaries. For HTML, split on structural elements. This leverages the author’s own organization of the content.
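For markdown, document-aware chunking can be as simple as splitting on headers. A minimal sketch:

```python
import re

def chunk_markdown(doc: str) -> list[str]:
    # Split a markdown document on headers so each chunk covers one
    # section, following the author's own structure.
    chunks, current = [], []
    for line in doc.splitlines():
        if re.match(r"^#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

Very large sections would still need a fallback split (the recursive approach above); this sketch handles only the header-boundary part.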

Chunk Size Guidelines

| Chunk Size       | Tradeoffs                                                       |
|------------------|-----------------------------------------------------------------|
| 128-256 tokens   | High precision, low context. Good for FAQ-style retrieval.      |
| 256-512 tokens   | Good balance for most use cases.                                |
| 512-1,024 tokens | More context per chunk, but embeddings become less precise.     |
| 1,024+ tokens    | Risk of topic mixing. Use only when content is highly cohesive. |

Overlap

Adding overlap between chunks (e.g., 50-100 tokens) helps preserve context that falls on chunk boundaries. Without overlap, information that spans two chunks might not be retrievable because neither chunk has the complete context.

When this matters in practice:

  • Start with recursive chunking at 512 tokens with 50-token overlap. This works well for most document types.
  • Test your chunking with real queries. Retrieve chunks for representative questions and check whether the relevant information appears in the results.
  • Different content types may need different chunking strategies. Your API documentation and your blog posts probably should not be chunked the same way.

Retrieval: Finding the Right Context

Retrieval is the process of finding the most relevant chunks for a given query. The basic approach is vector similarity search (covered in the previous post), but there are several refinements that improve results.

Hybrid Search

Pure vector search finds semantically similar content. But sometimes the user’s query contains specific terms (product names, error codes, technical identifiers) that need exact matching. Hybrid search combines vector similarity with keyword search (like BM25) to get the best of both.

Example: A user asks “How do I fix error ERR_SSL_PROTOCOL in Chrome?” Vector search finds chunks about SSL errors in general. Keyword search finds chunks containing the exact string “ERR_SSL_PROTOCOL.” Hybrid search surfaces both.
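One common way to merge the vector and keyword result lists is reciprocal rank fusion (RRF), which needs only each document's rank in each list, not the raw scores. A minimal sketch (k=60 is the constant conventionally used with RRF):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document scores 1 / (k + rank) for every result list it
    # appears in; documents found by both searches rise to the top.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```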

Re-ranking

Initial retrieval returns the top-k most similar chunks. A re-ranker takes those candidates and re-orders them using a more sophisticated (and slower) model that considers the query and each chunk together. This improves precision at a modest computational cost.

The flow: vector search returns 20 candidates, the re-ranker scores each one against the query, and the top 5 are passed to the LLM.
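That flow can be sketched as follows. Here `score(query, chunk)` stands in for a cross-encoder re-ranking model that reads the query and chunk together, and `vector_search` for the initial retrieval step:

```python
def rerank(query: str, candidates: list[str], score, top_n: int = 5) -> list[str]:
    # Re-order candidates by the (slower, more accurate) scorer and
    # keep only the best few.
    ranked = sorted(candidates, key=lambda c: score(query, c), reverse=True)
    return ranked[:top_n]

def retrieve_and_rerank(query, vector_search, score, fetch_k=20, top_n=5):
    # Over-fetch from the vector index, then let the re-ranker pick
    # the final chunks that go to the LLM.
    candidates = vector_search(query, fetch_k)
    return rerank(query, candidates, score, top_n)
```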

Metadata Filtering

Not all chunks are equally relevant regardless of semantic similarity. Filtering by metadata (date, source, category, author, access level) narrows the search to appropriate content before similarity comparison.

Example: A user asks about the current return policy. Filtering to documents updated in the last 6 months prevents the system from retrieving an outdated policy.

Multi-query Retrieval

A single query might not capture all aspects of what the user needs. Multi-query retrieval generates multiple reformulations of the original query, retrieves chunks for each, and merges the results. This improves recall for complex questions.

When this matters in practice:

  • Start with basic vector search. Add hybrid search, re-ranking, or metadata filtering when retrieval quality is not meeting needs.
  • The number of chunks you retrieve (top-k) is a tradeoff. More chunks provide more context but use more tokens (and cost more). Start with 3-5 chunks and adjust based on results.
  • Always test retrieval independently from generation. If the right chunks are not being retrieved, no amount of prompt tuning will fix the output.

Generation: Using Retrieved Context

Once you have retrieved relevant chunks, you need to present them to the LLM along with the user’s question. This is the generation step.

A typical RAG prompt structure:

System: You are a helpful assistant that answers questions
based on the provided context. If the context does not contain
enough information to answer the question, say so.

Context:
[Retrieved chunk 1]
[Retrieved chunk 2]
[Retrieved chunk 3]

User: [Original question]
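Assembling that prompt is plain string construction. A sketch, using XML-style tags to delimit each retrieved chunk (one option among several; headers or other markers work too):

```python
def build_rag_prompt(question: str, chunks: list[str]) -> tuple[str, str]:
    # Returns (system, user) messages. Tagging each chunk with an id
    # keeps context separate from instructions and enables citation.
    system = (
        "You are a helpful assistant that answers questions based on the "
        "provided context. If the context does not contain enough "
        "information to answer the question, say so."
    )
    context = "\n".join(
        f'<chunk id="{i}">\n{chunk}\n</chunk>'
        for i, chunk in enumerate(chunks, 1)
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return system, user
```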

Prompt Design for RAG

Tell the model to use the context. Without explicit instructions, the model might ignore the retrieved chunks and answer from its training data.

Tell the model what to do when context is insufficient. Without this instruction, the model will often hallucinate an answer rather than admitting it does not know.

Cite sources. Ask the model to reference which chunks informed its answer. This makes responses verifiable and builds trust.

Separate context from instructions. Use clear delimiters (XML tags, headers, or other markers) so the model can distinguish between the retrieved content and the instructions for how to use it.

Handling No Results

When retrieval returns no relevant chunks (or chunks with low similarity scores), the system needs a fallback. Options include:

  • Tell the user the information was not found
  • Fall back to the model’s general knowledge with a disclaimer
  • Suggest related topics that do have coverage
  • Route to a human agent

The worst option is generating an answer without relevant context and presenting it as authoritative.
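A minimal fallback sketch, assuming retrieval returns (score, chunk) pairs and `generate` wraps the LLM call; the 0.7 threshold is an illustrative value to tune against your own data, not a standard:

```python
def answer_with_fallback(question, results, generate, min_score=0.7):
    # Drop chunks below the similarity threshold; if nothing survives,
    # say so instead of generating an ungrounded answer.
    relevant = [chunk for score, chunk in results if score >= min_score]
    if not relevant:
        return "I couldn't find information about that in the knowledge base."
    return generate(question, relevant)
```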

When this matters in practice:

  • RAG quality is a function of retrieval quality AND prompt quality. Both need tuning.
  • Including too many chunks dilutes the signal. The model might latch onto a less relevant chunk instead of the best one. Quality over quantity.
  • Test end-to-end: same question, check what was retrieved, check what was generated. Debug failures at each step independently.

RAG vs. Fine-Tuning vs. Long Context

RAG is not the only way to give a model access to information. Understanding when to use each approach is important.

| Approach           | Best For                                                             | Tradeoffs                                                                              |
|--------------------|----------------------------------------------------------------------|----------------------------------------------------------------------------------------|
| RAG                | Large, changing knowledge bases. Private data. Current information.  | Requires infrastructure (vector DB, embedding pipeline). Quality depends on retrieval. |
| Fine-tuning        | Changing model behavior or style. Teaching domain-specific patterns. | Expensive to create and maintain. Does not handle frequently changing data well.       |
| Long context       | Small, stable document sets. Single-session analysis.                | Uses more tokens per request (higher cost). Retrieval is implicit, not controlled.     |
| Prompt engineering | Small amounts of reference information. Behavioral instructions.     | Limited by context window. Manual curation required.                                   |

These approaches are not mutually exclusive. A well-designed system might use fine-tuning for domain-specific behavior, RAG for knowledge retrieval, and careful prompting for output formatting.

When this matters in practice:

  • If your data changes frequently (daily or weekly), RAG is almost always the right choice. Fine-tuning cannot keep up with that pace of change.
  • If your entire knowledge base fits comfortably in the context window, you might not need RAG at all. Just include it in the prompt.
  • RAG adds latency (embedding the query + vector search + LLM call). For latency-sensitive applications, measure the full pipeline.

Building a RAG System: Step by Step

Here is a practical path for building your first RAG system:

  1. Start with your data. Identify the documents, articles, or records the system needs access to.
  2. Choose an embedding model. OpenAI’s text-embedding-3-small is a reasonable default. Open source options like BGE-large work well if you want to self-host.
  3. Choose a vector database. pgvector if you already use PostgreSQL. Chroma for prototyping. Pinecone or Qdrant for production scale.
  4. Implement chunking. Start with recursive chunking at 512 tokens with overlap.
  5. Index your documents. Chunk, embed, and store.
  6. Build the retrieval pipeline. Query embedding, vector search, return top-k chunks.
  7. Build the generation prompt. System instructions, retrieved context, user question.
  8. Test with real questions. Evaluate both retrieval accuracy and generation quality.
  9. Iterate. Adjust chunk size, retrieval count, prompt wording, and add hybrid search or re-ranking as needed.

Common RAG Failures

Wrong chunks retrieved. The most common failure. The embedding model does not capture the right semantic relationship, or the chunking split relevant information across multiple chunks. Fix: test retrieval independently, try different chunking strategies, consider hybrid search.

Right chunks retrieved, wrong answer generated. The model has the right context but misinterprets it, ignores it, or adds information from its own training. Fix: improve the generation prompt, add explicit instructions to only use provided context.

Stale data. The indexed documents are out of date. Fix: build a reindexing pipeline that runs on a schedule or is triggered by document changes.

Context overflow. Too many chunks are stuffed into the prompt, exceeding the context window or degrading quality. Fix: reduce top-k, improve retrieval precision so fewer chunks are needed, summarize retrieved chunks before including them.


What Comes Next

This post covered the RAG pattern from chunking through retrieval to generation. The next post in this series explores Prompting and Inference: how to structure prompts effectively, control model output, and get consistent results.


Closing Thoughts

RAG is the most practical pattern in the AI stack for enterprise applications. It solves the fundamental problem of getting models to work with your data without retraining them. The concept is simple: retrieve relevant context, include it in the prompt, generate a response.

The execution has depth. Chunking strategy, embedding model choice, retrieval tuning, prompt design, and failure handling all affect the quality of the final output. But the barrier to entry is low. A basic RAG system can be built in a day. Making it production-quality takes iteration and attention to each stage of the pipeline.

If you are building one AI-powered feature for your organization, it is probably a RAG system. Understanding this pattern well is the highest-leverage investment you can make in applied AI.
