Tokens and Context Windows
This post explores tokenization and context windows in depth. It builds on the foundation established in How LLMs Actually Work and provides context that is essential for understanding RAG, prompting, and cost management in later posts.
Introduction
The previous post explained that models do not read words. They read tokens. This post explores what that actually means in practice: how tokenization works, why different models tokenize differently, what context windows are, and how these constraints shape the way you build with AI.
Tokens and context windows are not abstract concepts. They directly determine what you can do with a model, how much it costs, and where things break. If you have ever hit a context limit, gotten an unexpectedly large API bill, or wondered why a model seemed to forget something you told it three messages ago, the answer is usually here.
What Is a Token?
A token is the basic unit of text that a model processes. It is not a word, not a character, and not a sentence. It is a chunk of text determined by the model’s tokenizer, a preprocessing step that breaks input text into pieces the model can work with.
Common patterns:
- Short, common words are usually one token: “the”, “is”, “and”
- Longer words get split: “understanding” might become “under” + “standing”
- Spaces often attach to the following word: “ hello” (with a leading space) is one token
- Punctuation is typically its own token: “.”, “,”, “!”
- Numbers can be unpredictable: “2024” might be one token, “12345” might be “123” + “45”
- Code tokens follow similar patterns: variable names, operators, and keywords each become tokens
A rough rule of thumb for English text: one token is approximately 3/4 of a word. A 1,000-word document is roughly 1,300 tokens. But this varies by content. Code tends to use more tokens per line than prose. Languages with longer words or non-Latin scripts can use more tokens per word.
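The 3/4-words-per-token rule can be turned into a quick back-of-the-envelope estimator. This is a sketch only; the function name and the 4/3 ratio are illustrative assumptions, and exact counts always require the provider's actual tokenizer:

```python
def estimate_tokens(text: str, tokens_per_word: float = 4 / 3) -> int:
    """Rough token estimate for English prose using the ~3/4 word-per-token rule.

    A heuristic only: code, non-English text, and unusual words can
    deviate substantially. Use the provider's tokenizer for exact counts.
    """
    return round(len(text.split()) * tokens_per_word)
```

For a 1,000-word document this yields roughly 1,333 tokens, in line with the rule of thumb above.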
How Tokenizers Work
Each model family has its own tokenizer. GPT-4, Claude, Llama, and Mistral all tokenize differently. The most common approach is Byte-Pair Encoding (BPE).
BPE works by starting with individual characters and iteratively merging the most frequently occurring pairs into new tokens. After training the tokenizer on a large corpus:
- Very common words become single tokens
- Less common words get split into subword pieces
- Rare words get split into smaller fragments or even individual characters
This is why a model can handle any input, even words it has never seen. It just breaks them into smaller pieces it does recognize.
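The merge procedure above can be sketched in a few lines. This is a toy illustration of the BPE training loop, not a production tokenizer (real implementations operate on bytes, track word frequencies, and use special end-of-word markers):

```python
from collections import Counter

def bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a toy corpus.

    Each word starts as a sequence of characters; on every step the
    most frequent adjacent pair is merged into a new, longer token.
    """
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        # Apply the chosen merge to every word in the corpus.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [w[i] + w[i + 1]]
                else:
                    i += 1
    return merges
```

On a corpus like `["low", "low", "lower", "lowest"]` the first merges produce “lo” and then “low”, showing how frequent substrings become single tokens.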
Different tokenizers, different results. The same sentence can produce different token counts across models:
| Text | GPT-4 Tokens | Claude Tokens | Llama 3 Tokens |
|---|---|---|---|
| “Hello, world!” | 4 | 4 | 4 |
| “Tokenization is fascinating” | 4 | 4 | 5 |
| “Pneumonoultramicroscopicsilicovolcanoconiosis” | 9 | 8 | 11 |
The exact counts vary, but the principle is the same: common text is efficient, unusual text costs more tokens.
When this matters in practice:
- If you are comparing costs between providers, you cannot just compare per-token pricing. You also need to compare how many tokens each provider uses for the same input.
- Tokenizer differences explain why the same prompt might work fine on one model but exceed the context window on another.
Why Tokenization Matters for Cost
API-based models charge per token, both input and output. Understanding tokenization is understanding your bill.
A typical pricing structure (using approximate values):
| Model tier | Input Tokens | Output Tokens |
|---|---|---|
| Frontier model (e.g., Claude Opus, GPT-4) | $10-15 / 1M tokens | $30-75 / 1M tokens |
| Mid-tier model (e.g., Claude Sonnet, GPT-4o) | $3 / 1M tokens | $15 / 1M tokens |
| Efficient model (e.g., Claude Haiku, GPT-4o mini) | $0.25-0.80 / 1M tokens | $1-4 / 1M tokens |
Output tokens are more expensive than input tokens because generation requires more computation per token than processing input.
Practical cost example: You are building a customer support chatbot. Each conversation averages 2,000 input tokens (system prompt + conversation history + user message) and 500 output tokens (the response). At mid-tier pricing ($3/$15 per million tokens):
- Input cost: 2,000 tokens x $3/1M = $0.006
- Output cost: 500 tokens x $15/1M = $0.0075
- Total per conversation: $0.0135
- At 10,000 conversations per day: ~$135/day
That system prompt is included in every single request. If your system prompt is 500 tokens, that is 25% of your input cost on every call. Optimizing prompt length has direct budget impact.
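The arithmetic above is worth wrapping in a helper so you can compare scenarios quickly. A minimal sketch, where the function name and default prices are illustrative (the defaults match the mid-tier example above):

```python
def conversation_cost(input_tokens: int, output_tokens: int,
                      input_price_per_m: float = 3.0,
                      output_price_per_m: float = 15.0) -> float:
    """Dollar cost of one request at per-million-token prices.

    Defaults correspond to the mid-tier pricing used in the example
    ($3 input / $15 output per 1M tokens); check current provider
    pricing before relying on these numbers.
    """
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000
```

For the example conversation, `conversation_cost(2000, 500)` returns 0.0135, and scaling to 10,000 daily conversations gives $135/day.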
When this matters in practice:
- Long system prompts multiply across every request. A 2,000-token system prompt that could be trimmed to 800 tokens saves 60% of that cost on every call.
- Conversation history grows with each turn. Without management (summarization, truncation, or windowing), costs increase throughout a conversation.
- Choosing the right model tier for the task matters more than micro-optimizing tokens. If a cheaper model handles the task well, that is the biggest cost lever.
Context Windows: The Model’s Working Memory
A context window is the maximum number of tokens a model can process in a single request. This includes everything: the system prompt, conversation history, any retrieved documents, the user’s message, and the model’s response.
Think of it as the model’s working memory. Everything the model can “see” and reason about must fit within this window.
Current context window sizes:
| Model | Context Window |
|---|---|
| GPT-4o | 128K tokens |
| Claude Sonnet/Opus | 200K tokens |
| Gemini 1.5 Pro | 1M+ tokens |
| Llama 3 (8B/70B) | 8K-128K tokens |
| Mistral Large | 128K tokens |
128K tokens is roughly 96,000 words, or about 300 pages of text. That sounds like a lot. In practice, it fills up faster than you expect.
How Context Windows Fill Up
Here is how a typical AI application uses context:
| Component | Typical Size | Purpose |
|---|---|---|
| System prompt | 200-2,000 tokens | Sets model behavior, role, constraints |
| Retrieved documents (RAG) | 1,000-10,000 tokens | Context from your data |
| Conversation history | 500-50,000+ tokens | Prior messages in the conversation |
| User message | 50-2,000 tokens | The current request |
| Model response | 100-4,000 tokens | The output |
A chatbot with a detailed system prompt, RAG context, and a 20-turn conversation history can easily consume 30,000-50,000 tokens per request. At that rate, even a 128K window only has room for a limited number of additional turns before something needs to give.
The hidden cost of conversation history: In a multi-turn conversation, the model re-processes the entire history on every turn. Turn 1 might use 1,000 tokens. Turn 10 might use 10,000. Turn 30 might use 40,000. Both cost and latency increase with every message.
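Because the full history is resent on every turn, total input tokens grow quadratically with conversation length. A sketch of that arithmetic, under the simplifying assumption that every turn adds one user message and one assistant reply of equal size:

```python
def cumulative_input_tokens(turns: int, tokens_per_message: int = 500,
                            system_prompt: int = 500) -> int:
    """Total input tokens processed across a conversation when the
    entire history is resent on every turn.

    Assumes each turn adds one user message and one assistant reply
    of tokens_per_message each (a simplification for illustration).
    """
    total = 0
    for t in range(1, turns + 1):
        # On turn t the model re-reads the system prompt, all (t - 1)
        # prior exchanges, and the new user message.
        history = (t - 1) * 2 * tokens_per_message
        total += system_prompt + history + tokens_per_message
    return total
```

With these defaults, 3 turns cost 6,000 input tokens in total but 10 turns cost 55,000: the per-turn cost keeps climbing even though each individual message stays the same size.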
What Happens When You Hit the Limit
When your input exceeds the context window, one of several things happens depending on the system:
- The API returns an error. The request is rejected outright. You need to reduce the input.
- The system truncates. Some implementations silently drop older messages or content to fit within the window. This can cause the model to lose important context without warning.
- The model degrades. Some models accept long inputs but perform worse as they approach the window boundary. Information in the middle of very long contexts can receive less attention than information at the beginning or end. This is sometimes called the “lost in the middle” problem.
None of these outcomes is ideal. Managing context window usage is a core part of building reliable AI applications.
Strategies for Managing Context
Conversation History Management
Sliding window: Keep only the most recent N messages. Simple, but loses early context that might be important.
Summarization: Periodically summarize older conversation history into a shorter form and replace the full messages with the summary. Preserves key information while reducing token count.
Selective inclusion: Only include messages that are relevant to the current query. Requires more logic but uses tokens most efficiently.
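The sliding-window strategy can be sketched in a few lines. This version assumes each message dict carries a precomputed `"tokens"` count (an illustrative shape, not any particular SDK's message format):

```python
def apply_sliding_window(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages that fit within a token budget.

    Walking backwards from the newest message favors recency and
    drops the oldest context first, matching the sliding-window
    strategy described above.
    """
    kept = []
    used = 0
    for msg in reversed(messages):
        if used + msg["tokens"] > budget:
            break
        kept.append(msg)
        used += msg["tokens"]
    kept.reverse()  # Restore chronological order.
    return kept
```

In practice you would pin the system prompt outside the window and apply the budget only to conversational turns.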
Prompt Optimization
Trim your system prompt. Every token in the system prompt is repeated on every request. Be concise. Remove examples that are not pulling their weight. Test whether shorter prompts produce equivalent results.
Use structured formats. Bullet points and structured data often convey the same information in fewer tokens than prose paragraphs.
Separate instructions from data. Keep the static instruction portion of your prompt short and put variable data (retrieved documents, user context) in a clearly delineated section.
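The instruction/data separation might look like the following sketch, where the tag names and prompt wording are illustrative choices, not a required format:

```python
# Static instructions: short, reused verbatim on every request.
SYSTEM_PROMPT = (
    "You are a support assistant. Answer only from the provided context. "
    "If the context does not contain the answer, say so."
)

def build_prompt(context_docs: list[str], question: str) -> str:
    """Keep instructions static and put variable data (retrieved
    documents, user input) in a clearly delimited section."""
    context = "\n---\n".join(context_docs)
    return f"<context>\n{context}\n</context>\n\nQuestion: {question}"
```

Because the static portion never changes, trimming it pays off on every single call, while the variable section can be sized per request.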
Chunking for Long Documents
When working with documents that exceed the context window, you need to break them into smaller pieces. This is chunking, and it is a core part of RAG pipelines (covered in RAG: Teaching AI What It Doesn’t Know).
The goal is to split documents into pieces that are small enough to fit in the context window alongside other content, but large enough to preserve meaning.
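A minimal chunking sketch, using word counts as a stand-in for token counts (real pipelines usually chunk by tokens and prefer sentence or paragraph boundaries; the function name and default sizes are illustrative):

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks with overlap between neighbors.

    Overlap preserves context that would otherwise be cut at chunk
    boundaries; chunk_size must be larger than overlap.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap means adjacent chunks share a margin of text, so a sentence split by a boundary still appears whole in at least one chunk.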
Tokenization Quirks and Edge Cases
Tokenization is not perfect, and its quirks show up in model behavior:
Counting and spelling. Ask a model how many letters are in “strawberry” and it might get it wrong. The model does not see individual letters. It sees tokens like “str”, “aw”, “berry”. Counting characters requires the model to reason about something it cannot directly observe.
Non-English languages. Languages like Chinese, Japanese, Korean, and Arabic often require more tokens per word than English. A sentence that is 20 tokens in English might be 35 tokens in Japanese. This means non-English users effectively get smaller context windows and pay more per word.
Code. Code tokenization varies by language and style. Python with descriptive variable names uses more tokens than Python with short names. Whitespace-heavy languages use tokens on indentation. Minified code uses fewer tokens but is harder for the model to reason about.
Special tokens. Models use special tokens that are not visible to users but consume space: beginning-of-sequence markers, end-of-turn markers, role indicators (system/user/assistant). These overhead tokens add up across a conversation.
When this matters in practice:
- If your application serves non-English users, budget for higher token costs and test context window usage with representative content.
- For code-heavy applications, test with realistic code samples to estimate token usage accurately. A “128K context” does not mean “128K characters of code.”
- When debugging unexpected model behavior (bad counting, strange spelling), consider whether tokenization is the root cause.
Counting Tokens Before You Send Them
Most model providers offer tokenizer libraries that let you count tokens before making an API call:
- OpenAI: the `tiktoken` library (Python)
- Anthropic: the `anthropic` SDK includes token counting
- Open source models: the `transformers` library from Hugging Face
Counting tokens before sending is important for:
- Staying within context limits
- Estimating costs before committing to a request
- Deciding when to truncate or summarize conversation history
- Validating that your chunking strategy produces appropriately sized chunks
You do not want to discover you have exceeded the context window by getting an error in production. Count first.
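A pre-flight check is simple once you have a token count. This sketch takes the count as an argument (from whichever tokenizer library you use); the function name and default window size are illustrative:

```python
def fits_context(prompt_tokens: int, max_output_tokens: int,
                 context_window: int = 128_000) -> bool:
    """Check that the input plus a reserved output budget fits the window.

    The model's response counts against the same window, so reserve
    room for it rather than checking the input size alone.
    """
    return prompt_tokens + max_output_tokens <= context_window
```

When the check fails, that is the trigger to truncate, summarize, or re-chunk before sending, rather than letting the API reject the request.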
What Comes Next
This post covered how text becomes tokens and how context windows constrain what models can process. The next post in this series, Embeddings and Vector Space, explores how text is converted into numerical representations that capture meaning, and how those representations power search, classification, and retrieval.
Closing Thoughts
Tokens and context windows are the practical constraints that shape every AI application. They determine cost, capability, and failure modes. Understanding them is not optional for anyone building with LLMs.
The good news: these constraints are manageable. Smart prompt design, conversation history management, and appropriate chunking strategies let you work within the limits effectively. The key is knowing the limits exist and designing for them from the start, not discovering them when something breaks.