Tokens and Context Windows
This post explores tokenization and context windows in depth. It builds on the foundation established in How LLMs Actually Work and provides context that is essential for understanding RAG, prompting, and cost management in later posts.
Introduction
The previous post explained that models do not read words. They read tokens. This post explores what that actually means in practice: how tokenization works, why different models tokenize differently, what context windows are, and how these constraints shape the way you build with AI.
Tokens and context windows are not abstract concepts. They directly determine what you can do with a model, how much it costs, and where things break. If you have ever hit a context limit, gotten an unexpectedly large API bill, or wondered why a model seemed to forget something you told it three messages ago, the answer is usually here.
What Is a Token?
A token is the basic unit of text that a model processes. It is not a word, not a character, and not a sentence. It is a chunk of text determined by the model’s tokenizer, a preprocessing step that breaks input text into pieces the model can work with.
Common patterns:
- Short, common words are usually one token: “the”, “is”, “and”
- Longer words get split: “understanding” might become “under” + “standing”
- Spaces often attach to the following word: “ hello” (with a leading space) is one token
- Punctuation is typically its own token: “.”, “,”, “!”
- Numbers can be unpredictable: “2024” might be one token, “12345” might be “123” + “45”
- Code tokens follow similar patterns: variable names, operators, and keywords each become tokens
A rough rule of thumb for English text: one token is approximately 3/4 of a word. A 1,000-word document is roughly 1,300 tokens. But this varies by content. Code tends to use more tokens per line than prose. Languages with longer words or non-Latin scripts can use more tokens per word.
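The 3/4-words-per-token rule can be turned into a quick back-of-the-envelope estimator. This is a sketch only; the function name and the 4/3 ratio are illustrative assumptions, and exact counts always require the provider's actual tokenizer:

```python
def estimate_tokens(text: str, tokens_per_word: float = 4 / 3) -> int:
    """Rough token estimate for English prose using the ~3/4 word-per-token rule.

    A heuristic only: code, non-English text, and unusual words can
    deviate substantially. Use the provider's tokenizer for exact counts.
    """
    return round(len(text.split()) * tokens_per_word)
```

For a 1,000-word document this yields roughly 1,333 tokens, in line with the rule of thumb above.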
How Tokenizers Work
Each model family has its own tokenizer. GPT-4, Claude, Llama, and Mistral all tokenize differently. The most common approach is Byte-Pair Encoding (BPE).
BPE works by starting with individual characters and iteratively merging the most frequently occurring pairs into new tokens. After training the tokenizer on a large corpus:
- Very common words become single tokens
- Less common words get split into subword pieces
- Rare words get split into smaller fragments or even individual characters
This is why a model can handle any input, even words it has never seen. It just breaks them into smaller pieces it does recognize.
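The merge procedure above can be sketched in a few lines. This is a toy illustration of the BPE training loop, not a production tokenizer (real implementations operate on bytes, track word frequencies, and use special end-of-word markers):

```python
from collections import Counter

def bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a toy corpus.

    Each word starts as a sequence of characters; on every step the
    most frequent adjacent pair is merged into a new, longer token.
    """
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        # Apply the chosen merge to every word in the corpus.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [w[i] + w[i + 1]]
                else:
                    i += 1
    return merges
```

On a corpus like `["low", "low", "lower", "lowest"]` the first merges produce “lo” and then “low”, showing how frequent substrings become single tokens.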
Different tokenizers, different results. The same sentence can produce different token counts across models:
| Text | GPT-4 Tokens | Claude Tokens | Llama 3 Tokens |
|---|---|---|---|
| “Hello, world!” | 4 | 4 | 4 |
| “Tokenization is fascinating” | 4 | 4 | 5 |
| “Pneumonoultramicroscopicsilicovolcanoconiosis” | 9 | 8 | 11 |
The exact counts vary, but the principle is the same: common text is efficient, unusual text costs more tokens.
When this matters in practice:
- If you are comparing costs between providers, you cannot just compare per-token pricing. You also need to compare how many tokens each provider uses for the same input.
- Tokenizer differences explain why the same prompt might work fine on one model but exceed the context window on another.
Why Tokenization Matters for Cost
API-based models charge per token, both input and output. Understanding tokenization is understanding your bill.
A typical pricing structure (using approximate values):
| Model tier | Input Tokens | Output Tokens |
|---|---|---|
| Frontier model (e.g., Claude Opus, GPT-4) | $10-15 / 1M tokens | $30-75 / 1M tokens |
| Mid-tier model (e.g., Claude Sonnet, GPT-4o) | $3 / 1M tokens | $15 / 1M tokens |
| Efficient model (e.g., Claude Haiku, GPT-4o mini) | $0.25-0.80 / 1M tokens | $1-4 / 1M tokens |
Output tokens are more expensive than input tokens because generation requires more computation per token than processing input.
Practical cost example: You are building a customer support chatbot. Each conversation averages 2,000 input tokens (system prompt + conversation history + user message) and 500 output tokens (the response). At mid-tier pricing ($3/$15 per million tokens):
- Input cost: 2,000 tokens x $3/1M = $0.006
- Output cost: 500 tokens x $15/1M = $0.0075
- Total per conversation: $0.0135
- At 10,000 conversations per day: ~$135/day
That system prompt is included in every single request. If your system prompt is 500 tokens, that is 25% of your input cost on every call. Optimizing prompt length has direct budget impact.
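The arithmetic above is worth wrapping in a helper so you can compare scenarios quickly. A minimal sketch, where the function name and default prices are illustrative (the defaults match the mid-tier example above):

```python
def conversation_cost(input_tokens: int, output_tokens: int,
                      input_price_per_m: float = 3.0,
                      output_price_per_m: float = 15.0) -> float:
    """Dollar cost of one request at per-million-token prices.

    Defaults correspond to the mid-tier pricing used in the example
    ($3 input / $15 output per 1M tokens); check current provider
    pricing before relying on these numbers.
    """
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000
```

For the example conversation, `conversation_cost(2000, 500)` returns 0.0135, and scaling to 10,000 daily conversations gives $135/day.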
When this matters in practice:
- Long system prompts multiply across every request. A 2,000-token system prompt that could be trimmed to 800 tokens saves 60% of that cost on every call.
- Conversation history grows with each turn. Without management (summarization, truncation, or windowing), costs increase throughout a conversation.
- Choosing the right model tier for the task matters more than micro-optimizing tokens. If a cheaper model handles the task well, that is the biggest cost lever.
Context Windows: The Model’s Working Memory
A context window is the maximum number of tokens a model can process in a single request. This includes everything: the system prompt, conversation history, any retrieved documents, the user’s message, and the model’s response.
Think of it as the model’s working memory. Everything the model can “see” and reason about must fit within this window.
Current context window sizes:
| Model | Context Window |
|---|---|
| GPT-4o | 128K tokens |
| Claude Sonnet/Opus | 200K tokens |
| Gemini 1.5 Pro | 1M+ tokens |
| Llama 3 (8B/70B) | 8K-128K tokens |
| Mistral Large | 128K tokens |
128K tokens is roughly 96,000 words, or about 300 pages of text. That sounds like a lot. In practice, it fills up faster than you expect.
How Context Windows Fill Up
Here is how a typical AI application uses context:
| Component | Typical Size | Purpose |
|---|---|---|
| System prompt | 200-2,000 tokens | Sets model behavior, role, constraints |
| Retrieved documents (RAG) | 1,000-10,000 tokens | Context from your data |
| Conversation history | 500-50,000+ tokens | Prior messages in the conversation |
| User message | 50-2,000 tokens | The current request |
| Model response | 100-4,000 tokens | The output |
A chatbot with a detailed system prompt, RAG context, and a 20-turn conversation history can easily consume 30,000-50,000 tokens per request. At that rate, even a 128K window only has room for a limited number of additional turns before something needs to give.
The hidden cost of conversation history: In a multi-turn conversation, the model re-processes the entire history on every turn. Turn 1 might use 1,000 tokens. Turn 10 might use 10,000. Turn 30 might use 40,000. Both cost and latency increase with every message.
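Because the full history is resent on every turn, total input tokens grow quadratically with conversation length. A sketch of that arithmetic, under the simplifying assumption that every turn adds one user message and one assistant reply of equal size:

```python
def cumulative_input_tokens(turns: int, tokens_per_message: int = 500,
                            system_prompt: int = 500) -> int:
    """Total input tokens processed across a conversation when the
    entire history is resent on every turn.

    Assumes each turn adds one user message and one assistant reply
    of tokens_per_message each (a simplification for illustration).
    """
    total = 0
    for t in range(1, turns + 1):
        # On turn t the model re-reads the system prompt, all (t - 1)
        # prior exchanges, and the new user message.
        history = (t - 1) * 2 * tokens_per_message
        total += system_prompt + history + tokens_per_message
    return total
```

With these defaults, 3 turns cost 6,000 input tokens in total but 10 turns cost 55,000: the per-turn cost keeps climbing even though each individual message stays the same size.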
What Happens When You Hit the Limit
When your input exceeds the context window, one of several things happens depending on the system:
- The API returns an error. The request is rejected outright. You need to reduce the input.
- The system truncates. Some implementations silently drop older messages or content to fit within the window. This can cause the model to lose important context without warning.
- The model degrades. Some models accept long inputs but perform worse as they approach the window boundary. Information in the middle of very long contexts can receive less attention than information at the beginning or end. This is sometimes called the “lost in the middle” problem.
None of these outcomes is ideal. Managing context window usage is a core part of building reliable AI applications.
Strategies for Managing Context
Conversation History Management
Sliding window: Keep only the most recent N messages. Simple, but loses early context that might be important.
Summarization: Periodically summarize older conversation history into a shorter form and replace the full messages with the summary. Preserves key information while reducing token count.
Selective inclusion: Only include messages that are relevant to the current query. Requires more logic but uses tokens most efficiently.
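The sliding-window strategy can be sketched in a few lines. This version assumes each message dict carries a precomputed `"tokens"` count (an illustrative shape, not any particular SDK's message format):

```python
def apply_sliding_window(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages that fit within a token budget.

    Walking backwards from the newest message favors recency and
    drops the oldest context first, matching the sliding-window
    strategy described above.
    """
    kept = []
    used = 0
    for msg in reversed(messages):
        if used + msg["tokens"] > budget:
            break
        kept.append(msg)
        used += msg["tokens"]
    kept.reverse()  # Restore chronological order.
    return kept
```

In practice you would pin the system prompt outside the window and apply the budget only to conversational turns.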
Prompt Optimization
Trim your system prompt. Every token in the system prompt is repeated on every request. Be concise. Remove examples that are not pulling their weight. Test whether shorter prompts produce equivalent results.
Use structured formats. Bullet points and structured data often convey the same information in fewer tokens than prose paragraphs.
Separate instructions from data. Keep the static instruction portion of your prompt short and put variable data (retrieved documents, user context) in a clearly delineated section.
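The instruction/data separation might look like the following sketch, where the tag names and prompt wording are illustrative choices, not a required format:

```python
# Static instructions: short, reused verbatim on every request.
SYSTEM_PROMPT = (
    "You are a support assistant. Answer only from the provided context. "
    "If the context does not contain the answer, say so."
)

def build_prompt(context_docs: list[str], question: str) -> str:
    """Keep instructions static and put variable data (retrieved
    documents, user input) in a clearly delimited section."""
    context = "\n---\n".join(context_docs)
    return f"<context>\n{context}\n</context>\n\nQuestion: {question}"
```

Because the static portion never changes, trimming it pays off on every single call, while the variable section can be sized per request.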
Chunking for Long Documents
When working with documents that exceed the context window, you need to break them into smaller pieces. This is chunking, and it is a core part of RAG pipelines (covered in RAG: Teaching AI What It Doesn’t Know).
The goal is to split documents into pieces that are small enough to fit in the context window alongside other content, but large enough to preserve meaning.
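A minimal chunking sketch, using word counts as a stand-in for token counts (real pipelines usually chunk by tokens and prefer sentence or paragraph boundaries; the function name and default sizes are illustrative):

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks with overlap between neighbors.

    Overlap preserves context that would otherwise be cut at chunk
    boundaries; chunk_size must be larger than overlap.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap means adjacent chunks share a margin of text, so a sentence split by a boundary still appears whole in at least one chunk.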
Tokenization Quirks and Edge Cases
Tokenization is not perfect, and its quirks show up in model behavior:
Counting and spelling. Ask a model how many letters are in “strawberry” and it might get it wrong. The model does not see individual letters. It sees tokens like “str”, “aw”, “berry”. Counting characters requires the model to reason about something it cannot directly observe.
Non-English languages. Languages like Chinese, Japanese, Korean, and Arabic often require more tokens per word than English. A sentence that is 20 tokens in English might be 35 tokens in Japanese. This means non-English users effectively get smaller context windows and pay more per word.
Code. Code tokenization varies by language and style. Python with descriptive variable names uses more tokens than Python with short names. Whitespace-heavy languages use tokens on indentation. Minified code uses fewer tokens but is harder for the model to reason about.
Special tokens. Models use special tokens that are not visible to users but consume space: beginning-of-sequence markers, end-of-turn markers, role indicators (system/user/assistant). These overhead tokens add up across a conversation.
When this matters in practice:
- If your application serves non-English users, budget for higher token costs and test context window usage with representative content.
- For code-heavy applications, test with realistic code samples to estimate token usage accurately. A “128K context” does not mean “128K characters of code.”
- When debugging unexpected model behavior (bad counting, strange spelling), consider whether tokenization is the root cause.
Counting Tokens Before You Send Them
Most model providers offer tokenizer libraries that let you count tokens before making an API call:
- OpenAI: the `tiktoken` library (Python)
- Anthropic: the `anthropic` SDK includes token counting
- Open source models: the `transformers` library from Hugging Face
Counting tokens before sending is important for:
- Staying within context limits
- Estimating costs before committing to a request
- Deciding when to truncate or summarize conversation history
- Validating that your chunking strategy produces appropriately sized chunks
You do not want to discover you have exceeded the context window by getting an error in production. Count first.
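A pre-flight check is simple once you have a token count. This sketch takes the count as an argument (from whichever tokenizer library you use); the function name and default window size are illustrative:

```python
def fits_context(prompt_tokens: int, max_output_tokens: int,
                 context_window: int = 128_000) -> bool:
    """Check that the input plus a reserved output budget fits the window.

    The model's response counts against the same window, so reserve
    room for it rather than checking the input size alone.
    """
    return prompt_tokens + max_output_tokens <= context_window
```

When the check fails, that is the trigger to truncate, summarize, or re-chunk before sending, rather than letting the API reject the request.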
What Comes Next
This post covered how text becomes tokens and how context windows constrain what models can process. The next post in this series, Embeddings and Vector Space, explores how text is converted into numerical representations that capture meaning, and how those representations power search, classification, and retrieval.
Closing Thoughts
Tokens and context windows are the practical constraints that shape every AI application. They determine cost, capability, and failure modes. Understanding them is not optional for anyone building with LLMs.
The good news: these constraints are manageable. Smart prompt design, conversation history management, and appropriate chunking strategies let you work within the limits effectively. The key is knowing the limits exist and designing for them from the start, not discovering them when something breaks.