Retrieval-augmented generation (RAG) is an architecture pattern that grounds LLM responses in external, up-to-date knowledge. Instead of relying solely on the model's training data, RAG systems retrieve relevant documents from a knowledge base and inject them into the prompt as context, enabling the model to generate responses informed by specific, verifiable information.
The RAG pipeline has three stages. First, the retrieval stage takes the user's query and searches a knowledge base — typically a vector database containing embedded documents — to find the most semantically relevant passages. Second, the augmentation stage formats the retrieved documents and inserts them into the prompt alongside the user's query. Third, the generation stage sends the augmented prompt to the LLM, which produces a response grounded in the retrieved context.
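The three stages above can be sketched in a few lines. This is a toy illustration, not a production pipeline: it substitutes naive word-overlap scoring for real embedding search, and `call_llm` is a placeholder for an actual model call; all function names here are illustrative assumptions.

```python
# Toy sketch of the three RAG stages. Word overlap stands in for
# embedding similarity, and call_llm() is a stub, not a real API.

def retrieve(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Stage 1: rank documents by naive term overlap with the query."""
    q_terms = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(q_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def augment(query: str, docs: list[str]) -> str:
    """Stage 2: format retrieved passages and insert them into the prompt."""
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        f"Answer using only the context above."
    )

def call_llm(prompt: str) -> str:
    """Stage 3 placeholder: a real system would call the model here."""
    return f"(model response, grounded in the provided prompt)"

def rag_answer(query: str, knowledge_base: list[str]) -> str:
    return call_llm(augment(query, knowledge_base := retrieve(query, knowledge_base) or knowledge_base) if False else augment(query, retrieve(query, knowledge_base)))

kb = [
    "RAG retrieves documents and injects them into the prompt.",
    "Fine-tuning updates model weights on domain data.",
    "Vector databases store document embeddings for semantic search.",
]
print(rag_answer("How does RAG use the prompt and retrieved documents?", kb))
```

A real system would replace `retrieve` with a vector-database query over embedded chunks; the stage boundaries stay the same.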
RAG solves several fundamental LLM limitations. It addresses knowledge cutoff by providing access to information beyond the model's training date. It reduces hallucinations by grounding responses in verifiable source material. It enables domain-specific expertise without fine-tuning by injecting specialized documents at query time.
The quality of a RAG system depends heavily on the retrieval component. Chunking strategy — how documents are split into searchable passages — affects whether relevant information is found. Embedding model selection impacts semantic search accuracy. Re-ranking algorithms can improve precision by reordering retrieved results before they enter the prompt.
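To make the chunking point concrete, here is a minimal sketch of one common strategy: fixed-size word windows with overlap, so a passage that straddles a chunk boundary is still retrievable from the neighboring chunk. The sizes are illustrative; real systems tune chunk size and overlap per corpus, and often split on sentence or section boundaries instead of raw word counts.

```python
# Fixed-size chunking with overlap: each chunk shares `overlap` words
# with the previous one, so boundary-spanning content appears in both.

def chunk_words(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the end of the text
    return chunks

doc = " ".join(f"word{i}" for i in range(250))
chunks = chunk_words(doc, chunk_size=100, overlap=20)
print(len(chunks))              # → 3
print(chunks[1].split()[0])     # → word80 (second chunk starts 80 words in)
```

Too-small chunks lose surrounding context; too-large chunks dilute the embedding with unrelated content, so retrieval quality is sensitive to this choice.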
Prompt design is critical in RAG systems. The system prompt must instruct the model to base its responses on the provided context, cite sources when possible, and acknowledge when the retrieved documents don't contain sufficient information to answer the question. Allocating the context window between retrieved documents and the rest of the prompt also requires care: too many passages crowd out instructions and conversation history, while too few starve the model of evidence.
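A sketch of this prompt assembly, with a simple context budget, might look as follows. The wording of the system prompt is illustrative, and word count stands in for real token counting, which would use the model's tokenizer; the budget figure is an arbitrary assumption.

```python
# Assemble a RAG prompt under a context budget. Words approximate
# tokens here; a real system would count with the model's tokenizer.

SYSTEM_PROMPT = (
    "Answer using ONLY the provided context. Cite sources as [n]. "
    "If the context does not contain enough information to answer, "
    "say so explicitly."
)

def budget_docs(docs: list[str], max_tokens: int) -> list[str]:
    """Keep highest-ranked docs (assumed pre-sorted) until the budget is spent."""
    kept, used = [], 0
    for doc in docs:
        cost = len(doc.split())  # crude token estimate
        if used + cost > max_tokens:
            break
        kept.append(doc)
        used += cost
    return kept

def build_prompt(query: str, docs: list[str], max_context_tokens: int = 50) -> str:
    kept = budget_docs(docs, max_context_tokens)
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(kept))
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {query}"

docs = ["passage one " * 10, "passage two " * 10, "passage three " * 10]
print(build_prompt("What does the context say?", docs))
```

Because the documents are assumed sorted by relevance, truncation drops the least relevant passages first; the explicit "say so" instruction gives the model a sanctioned way to decline rather than hallucinate.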