
Context Window

The maximum number of tokens (input plus output) that an LLM can process in a single request, which determines how much information can be included in a prompt and response.

The context window is the total token capacity of an LLM for a single request — it includes both the input (system prompt, user message, retrieved documents) and the generated output. Understanding and managing the context window is fundamental to building reliable LLM applications.

Context window sizes vary dramatically across models. The original GPT-3 offered roughly 2K tokens (later variants reached 4K). GPT-4 Turbo expanded to 128K tokens, and Claude models offer up to 200K. These larger windows enable new use cases, such as processing entire codebases, analyzing long documents, and maintaining extended conversation histories, but they come with tradeoffs in cost and latency.

Effective context window management involves budgeting tokens across competing needs. A typical production prompt allocates tokens to the system prompt (instructions, guardrails, examples), retrieved context (RAG documents, conversation history), the user's input, and reserved space for the model's response. If the total exceeds the window, something must be truncated or summarized.
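This budgeting can be made explicit in code. The sketch below uses a hypothetical 128K-token window and an illustrative split; the category names and numbers are assumptions for the example, not a recommendation.

```python
# Hypothetical token budget for a 128K-token context window.
# The split below is illustrative only.
CONTEXT_WINDOW = 128_000

budget = {
    "system_prompt": 2_000,       # instructions, guardrails, examples
    "retrieved_context": 100_000, # RAG documents, conversation history
    "user_input": 6_000,
    "response_reserve": 20_000,   # space kept free for the model's output
}

# The allocation must never exceed the window.
assert sum(budget.values()) <= CONTEXT_WINDOW

def fits(prompt_tokens: int, reserve: int = budget["response_reserve"]) -> bool:
    """Check whether a prompt leaves enough room for the response."""
    return prompt_tokens + reserve <= CONTEXT_WINDOW
```

If `fits` returns `False`, some component of the prompt has to be truncated or summarized before the request is sent.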

Strategies for managing context window limits include prompt compression (removing redundant instructions, shortening examples), selective retrieval (fetching only the most relevant documents in RAG), conversation summarization (compressing earlier messages into a summary), and hierarchical processing (splitting long documents into chunks processed separately, then combining results).
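The simplest of these strategies, trimming conversation history to fit a budget, can be sketched as a sliding window that drops the oldest messages first. The `estimate_tokens` helper here is a rough characters-divided-by-four heuristic standing in for a real tokenizer.

```python
# Minimal sliding-window truncation: drop the oldest messages until the
# estimated token count fits the budget.
def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 chars/token for English), not a real tokenizer.
    return max(1, len(text) // 4)

def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose combined estimate fits `budget`."""
    kept: list[str] = []
    total = 0
    for msg in reversed(messages):   # walk newest-first
        cost = estimate_tokens(msg)
        if total + cost > budget:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))      # restore chronological order
```

A production variant would typically summarize the dropped messages rather than discard them outright, so earlier context survives in compressed form.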

The "lost in the middle" phenomenon is an important consideration. Research shows that LLMs pay more attention to information at the beginning and end of the context window, with reduced attention to content in the middle. This means that critical instructions and the most relevant context should be placed at the boundaries of the prompt for optimal performance.
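One way to act on this finding when assembling a RAG prompt is to keep instructions at the top and move the top-ranked document to the end, next to the question. The function below is an illustrative sketch; the argument names and ordering policy are assumptions for the example.

```python
# Illustrative prompt assembly that respects "lost in the middle":
# critical instructions go first, the most relevant retrieved document
# goes last (next to the question), lower-ranked documents fill the middle.
def assemble_prompt(instructions: str, docs_ranked: list[str], question: str) -> str:
    """`docs_ranked` is ordered most-relevant-first."""
    if docs_ranked:
        best, middle = docs_ranked[0], docs_ranked[1:]
        docs_in_order = middle + [best]  # best doc ends up at the boundary
    else:
        docs_in_order = []
    return "\n\n".join([instructions, *docs_in_order, question])
```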

Token estimation helps teams plan their context budgets before runtime. A rough approximation of 4 characters per token works for English text, though exact counts depend on the model's tokenizer.
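The four-characters-per-token heuristic is a one-liner. It is suitable only for pre-runtime budgeting; exact counts require the model's own tokenizer.

```python
# Rough token estimate using the ~4 characters/token heuristic for English.
# Exact counts depend on the model's tokenizer; use this only for planning.
def rough_token_count(text: str) -> int:
    return -(-len(text) // 4)  # ceiling division
```

For example, `rough_token_count("Hello, context window!")` estimates 6 tokens for the 22-character string.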
