
Context Window

The maximum number of tokens (input plus output) that an LLM can process in a single request, which determines how much information can be included in a prompt and response.

The context window is the total token capacity of an LLM for a single request — it includes both the input (system prompt, user message, retrieved documents) and the generated output. Understanding and managing the context window is fundamental to building reliable LLM applications.

Context window sizes vary dramatically across models. The original GPT-3 offered roughly 2K tokens (later variants reached 4K). GPT-4 Turbo expanded to 128K tokens, and Claude models offer up to 200K. These larger windows enable new use cases, such as processing entire codebases, analyzing long documents, and maintaining extended conversation histories, but they come with tradeoffs in cost and latency.

Effective context window management involves budgeting tokens across competing needs. A typical production prompt allocates tokens to the system prompt (instructions, guardrails, examples), retrieved context (RAG documents, conversation history), the user's input, and reserved space for the model's response. If the total exceeds the window, something must be truncated or summarized.
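This budgeting can be made explicit in code. The sketch below uses a hypothetical 128K-token window and an illustrative split; the category names and numbers are assumptions for the example, not a recommendation.

```python
# Hypothetical token budget for a 128K-token context window.
# The split below is illustrative only.
CONTEXT_WINDOW = 128_000

budget = {
    "system_prompt": 2_000,       # instructions, guardrails, examples
    "retrieved_context": 100_000, # RAG documents, conversation history
    "user_input": 6_000,
    "response_reserve": 20_000,   # space kept free for the model's output
}

# The allocation must never exceed the window.
assert sum(budget.values()) <= CONTEXT_WINDOW

def fits(prompt_tokens: int, reserve: int = budget["response_reserve"]) -> bool:
    """Check whether a prompt leaves enough room for the response."""
    return prompt_tokens + reserve <= CONTEXT_WINDOW
```

If `fits` returns `False`, some component of the prompt has to be truncated or summarized before the request is sent.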

Strategies for managing context window limits include prompt compression (removing redundant instructions, shortening examples), selective retrieval (fetching only the most relevant documents in RAG), conversation summarization (compressing earlier messages into a summary), and hierarchical processing (splitting long documents into chunks processed separately, then combining results).
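The simplest of these strategies, trimming conversation history to fit a budget, can be sketched as a sliding window that drops the oldest messages first. The `estimate_tokens` helper here is a rough characters-divided-by-four heuristic standing in for a real tokenizer.

```python
# Minimal sliding-window truncation: drop the oldest messages until the
# estimated token count fits the budget.
def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 chars/token for English), not a real tokenizer.
    return max(1, len(text) // 4)

def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose combined estimate fits `budget`."""
    kept: list[str] = []
    total = 0
    for msg in reversed(messages):   # walk newest-first
        cost = estimate_tokens(msg)
        if total + cost > budget:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))      # restore chronological order
```

A production variant would typically summarize the dropped messages rather than discard them outright, so earlier context survives in compressed form.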

The "lost in the middle" phenomenon is an important consideration. Research shows that LLMs pay more attention to information at the beginning and end of the context window, with reduced attention to content in the middle. This means that critical instructions and the most relevant context should be placed at the boundaries of the prompt for optimal performance.
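One way to act on this finding when assembling a RAG prompt is to keep instructions at the top and move the top-ranked document to the end, next to the question. The function below is an illustrative sketch; the argument names and ordering policy are assumptions for the example.

```python
# Illustrative prompt assembly that respects "lost in the middle":
# critical instructions go first, the most relevant retrieved document
# goes last (next to the question), lower-ranked documents fill the middle.
def assemble_prompt(instructions: str, docs_ranked: list[str], question: str) -> str:
    """`docs_ranked` is ordered most-relevant-first."""
    if docs_ranked:
        best, middle = docs_ranked[0], docs_ranked[1:]
        docs_in_order = middle + [best]  # best doc ends up at the boundary
    else:
        docs_in_order = []
    return "\n\n".join([instructions, *docs_in_order, question])
```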

Token estimation helps teams plan their context budgets before runtime. A rough approximation of 4 characters per token works for English text, though exact counts depend on the model's tokenizer.
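The four-characters-per-token heuristic is a one-liner. It is suitable only for pre-runtime budgeting; exact counts require the model's own tokenizer.

```python
# Rough token estimate using the ~4 characters/token heuristic for English.
# Exact counts depend on the model's tokenizer; use this only for planning.
def rough_token_count(text: str) -> int:
    return -(-len(text) // 4)  # ceiling division
```

For example, `rough_token_count("Hello, context window!")` estimates 6 tokens for the 22-character string.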
