How LLMs Actually Work
This post explores how large language models work under the hood. It builds on the concepts introduced in the series overview and provides the foundation for understanding the topics covered in later posts.
Introduction
The first post in this series mapped the components of the modern AI stack. This post goes deeper on the most central one: the large language model itself.
Most people interact with LLMs through a chat interface. You type something, it types something back. That interaction hides a lot of machinery. Understanding what happens between input and output changes how you think about what these models can and cannot do.
This is not a math-heavy deep dive into model architecture. It is a practical explanation of the core concepts: what neural networks are, how transformers work, what attention does, how models are trained, and what actually happens when a model generates a response. By the end, you should have a working mental model of what is going on inside the box.
Neural Networks: The Foundation
Every LLM is built on a neural network. A neural network is a system of interconnected nodes (called neurons) organized in layers. Data enters through the input layer, passes through one or more hidden layers where it gets transformed, and exits through the output layer.
Each connection between neurons has a weight, a number that determines how much influence one neuron has on the next. The network also uses bias values that shift the output of each neuron. Together, weights and biases are what the network “learns.” When someone says a model has 70 billion parameters, they are talking about 70 billion of these weights and biases.
Here is the basic flow:
- Input data enters the network (for a language model, this is text converted to numbers)
- Each layer applies mathematical operations using its weights and biases
- The output represents a prediction (for a language model, this is the next word)
A single neuron does something simple: it takes inputs, multiplies each by a weight, adds a bias, and passes the result through an activation function that determines whether and how much the neuron “fires.” The power comes from having millions or billions of these neurons working together.
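To make that concrete, here is a single neuron in a few lines of Python. The weights, bias, and choice of ReLU as the activation function are all illustrative, not taken from any real model:

```python
def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs plus bias, then activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # ReLU activation: the neuron "fires" only when the weighted sum is positive.
    return max(0.0, z)

# Two inputs with made-up weights and bias:
# ReLU(0.5*0.8 + (-1.0)*0.2 + 0.1) = ReLU(0.3) ≈ 0.3
print(neuron([0.5, -1.0], [0.8, 0.2], 0.1))
```

A real network stacks millions of these, with the outputs of one layer feeding the inputs of the next.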
When this matters in practice:
- When people say a model is “70B parameters” or “405B parameters,” they are describing how many learnable values the network contains. More parameters generally means more capacity to learn patterns, but also more compute required to run.
- Understanding that models are pattern-matching systems (not knowledge databases) helps set realistic expectations about what they can do.
From Words to Numbers
Neural networks operate on numbers, not words. Before text can enter a model, it needs to be converted into numerical form.
This happens in two steps:
Tokenization breaks text into smaller units called tokens. These might be whole words, parts of words, or individual characters. The sentence “Understanding language models” might become [“Under”, “standing”, ” language”, ” models”]. Each token is assigned an ID number from the model’s vocabulary.
Embedding converts each token ID into a dense vector, a list of numbers that represents the token’s meaning in a high-dimensional space. These are not hand-crafted. The embedding values are learned during training, just like the weights in the rest of the network. Tokens that appear in similar contexts end up with similar embedding vectors.
The result: a sentence becomes a sequence of vectors, and the model can start doing math on meaning.
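A toy sketch of those two steps, using the four-token example above with a hand-written vocabulary and made-up three-dimensional embeddings (real models learn these values during training and use vocabularies of tens of thousands of tokens with vectors of thousands of dimensions):

```python
# Toy vocabulary and embedding table; all numbers here are illustrative.
vocab = {"Under": 0, "standing": 1, " language": 2, " models": 3}
embeddings = [
    [0.12, -0.33, 0.71],   # "Under"
    [0.05, 0.41, -0.18],   # "standing"
    [-0.27, 0.09, 0.55],   # " language"
    [-0.31, 0.14, 0.49],   # " models"
]

tokens = ["Under", "standing", " language", " models"]
ids = [vocab[t] for t in tokens]         # step 1: text -> token IDs
vectors = [embeddings[i] for i in ids]   # step 2: token IDs -> dense vectors
print(ids)  # [0, 1, 2, 3]
```

Note that " language" and " models" have similar vectors here; that mirrors how tokens appearing in similar contexts end up close together after training.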
When this matters in practice:
- Different models use different tokenizers. The same text might be 100 tokens in one model and 130 in another. This directly affects cost and context window usage.
- Tokenization explains some model quirks. Models are sometimes bad at counting letters in words because they do not see individual letters. They see tokens.
The Transformer Architecture
The transformer is the architecture behind every major LLM. Introduced in a 2017 research paper titled “Attention Is All You Need,” it replaced earlier approaches that processed text one word at a time.
The key innovation: transformers process all tokens in a sequence simultaneously rather than sequentially. This makes them faster to train and better at capturing relationships between words that are far apart in a sentence.
A transformer is built from a stack of identical layers (called transformer blocks). Each block contains two main components:
- Self-attention mechanism (covered in the next section)
- Feed-forward neural network that processes each token independently
The input flows through these blocks in sequence. Each block refines the model’s understanding of the text. Early layers tend to capture basic patterns like grammar and syntax. Deeper layers capture more abstract concepts like reasoning, sentiment, and factual relationships.
When this matters in practice:
- The transformer architecture is why modern LLMs can handle long documents and maintain coherence across thousands of words. Earlier architectures struggled with this.
- When someone describes a model as a “transformer model,” they are referring to this specific architecture; it is not a generic term. GPT stands for “Generative Pre-trained Transformer.”
Attention: How Models Understand Context
Attention is the mechanism that allows each token to look at every other token in the input and decide which ones are relevant. This is what makes transformers work.
Consider the sentence: “The cat sat on the mat because it was tired.” What does “it” refer to? A human knows “it” means the cat, not the mat. The attention mechanism is how the model figures this out. It assigns higher attention weights between “it” and “cat” than between “it” and “mat.”
Here is how it works at a high level:
For each token, the model creates three vectors:
- Query: “What am I looking for?”
- Key: “What do I contain?”
- Value: “What information do I provide?”
The model compares each token’s Query against every other token’s Key to compute attention scores. Higher scores mean stronger relevance. These scores determine how much of each token’s Value gets included in the output representation.
This happens multiple times in parallel through multi-head attention. Each “head” learns to focus on different types of relationships. One head might focus on grammatical structure. Another might focus on semantic meaning. Another might track coreference (what “it” refers to). The outputs of all heads are combined.
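Here is a minimal single-head version of that Query/Key/Value computation in plain Python. The vectors are made up, and real models use learned projection matrices, many heads, and heavily optimized matrix math, but the mechanics are the same:

```python
import math

def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one sequence, one head."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Compare this token's Query against every token's Key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # attention weights for this token
        # Blend every token's Value, weighted by relevance.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Three toy tokens with 2-dimensional Q/K/V vectors (illustrative numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(attention(Q, K, V))
```

The nested loop over every Query-Key pair is also why attention cost grows with the square of the sequence length.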
When this matters in practice:
- Attention is why LLMs can follow instructions that reference earlier parts of a long prompt. The model can “look back” at any token in the context window.
- It is also why longer contexts cost more. Attention computes relationships between every pair of tokens. Doubling the input length roughly quadruples the attention computation.
- When a model seems to “forget” something from earlier in a conversation, it is often because the attention mechanism is distributing its focus across too many tokens, not because the information is literally gone.
Training: How Models Learn
Training an LLM is the process of adjusting all those billions of parameters so the model produces useful outputs. This happens in two main phases.
Pre-training
Pre-training is the expensive part. The model is shown massive amounts of text (books, websites, code, articles) and learns to predict the next token given all the preceding tokens. This is called next-token prediction or causal language modeling.
The process is iterative:
- Feed the model a sequence of tokens
- The model predicts the next token
- Compare the prediction to the actual next token
- Calculate the error (called loss)
- Adjust the parameters slightly to reduce that error (using backpropagation and gradient descent)
- Repeat, billions of times
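The loop above can be sketched end to end at a toy scale. This example trains a tiny bigram “model” (a table of logits standing in for a real network) with cross-entropy loss and gradient descent; the corpus, learning rate, and step count are all arbitrary choices for illustration:

```python
import math
import random

# Toy corpus; a real model trains on trillions of tokens.
corpus = "the cat sat on the mat the cat sat".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
n = len(vocab)

# The "parameters": an n x n table of logits for P(next token | current token).
random.seed(0)
W = [[random.gauss(0, 0.1) for _ in range(n)] for _ in range(n)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

lr = 0.5
for step in range(200):
    total_loss = 0.0
    for cur, nxt in zip(corpus, corpus[1:]):
        i, j = idx[cur], idx[nxt]
        probs = softmax(W[i])                 # the model's prediction
        total_loss += -math.log(probs[j])     # cross-entropy loss vs. reality
        # Gradient of softmax cross-entropy: predicted probs minus one-hot target.
        for k in range(n):
            grad = probs[k] - (1.0 if k == j else 0.0)
            W[i][k] -= lr * grad              # gradient descent step

# After training, "sat" is the most likely token after "cat" in this corpus.
probs_after_cat = softmax(W[idx["cat"]])
print(vocab[probs_after_cat.index(max(probs_after_cat))])  # sat
```

Real training differs in scale, not in kind: backpropagation pushes gradients through billions of parameters instead of a small table, but the predict-compare-adjust loop is the same.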
Through this process, the model learns grammar, facts, reasoning patterns, code structure, and much more. It is not memorizing text. It is learning statistical patterns about how language works.
Pre-training requires enormous compute resources. Training a frontier model can cost tens of millions of dollars and take weeks or months on thousands of GPUs.
Post-training
After pre-training, the model can predict text, but it is not yet useful as an assistant. It might continue a sentence, but it will not answer a question helpfully or follow instructions reliably. Post-training fixes this.
Supervised Fine-Tuning (SFT) shows the model examples of good behavior: question-answer pairs, instruction-following demonstrations, and other examples of the kind of output humans want.
Reinforcement Learning from Human Feedback (RLHF) takes it further. Human evaluators rank different model outputs by quality, and the model is trained to prefer the higher-ranked responses. This is how models learn to be helpful, refuse harmful requests, and match the style and safety standards their creators intend.
When this matters in practice:
- The pre-training data has a cutoff date. The model does not know about events after that date unless given external information (through RAG or other means).
- When a model “hallucinates” (generates plausible but false information), it is because the statistical patterns it learned produce confident-sounding text that happens to be wrong. The model is not lying. It is generating the most probable next tokens based on its training.
- The distinction between pre-training and post-training explains why a base model (pre-trained only) behaves differently from a chat model (pre-trained plus post-trained). Base models are good at text completion. Chat models are good at following instructions.
Parameters: What the Model “Knows”
A model’s parameters are the numerical values learned during training. They encode everything the model has learned about language, facts, reasoning, and patterns. But “knows” deserves quotes because parameters do not store knowledge the way a database stores records.
There is no parameter that contains the fact “Paris is the capital of France.” Instead, that knowledge is distributed across millions of parameters that collectively make the model likely to produce “Paris” when asked about the capital of France. This is called distributed representation, and it is fundamentally different from lookup-based storage.
Model size (parameter count) is one axis of capability, but not the only one:
| Model Size | Examples | Typical Use |
|---|---|---|
| 1-3B parameters | Phi-3 Mini, Gemma 2B | On-device tasks, specific narrow tasks |
| 7-13B parameters | Llama 3 8B, Mistral 7B | Good balance of capability and efficiency |
| 30-70B parameters | Llama 3 70B, Mixtral 8x7B | Strong general performance |
| 100B+ parameters | GPT-4, Claude, Gemini (exact sizes undisclosed) | Frontier capability across broad tasks |
A higher parameter count does not always mean better performance for a given task. A well-trained 8B model can outperform a poorly trained 70B model. Training data quality, training methodology, and post-training alignment all matter as much as, or more than, raw parameter count.
When this matters in practice:
- Parameter count determines hardware requirements. A 70B model needs far more memory (VRAM) to run than an 8B model. This directly affects whether you can self-host and at what cost.
- For many production tasks, a smaller model that is well-suited to the task will outperform a larger general-purpose model while costing less to run. The choice should be driven by the use case, not by the spec sheet.
Inference: How Models Generate Text
Inference is what happens when you use a trained model to generate output. The model takes your input, processes it through all its layers, and produces a prediction for the next token. Then it repeats that process, feeding its own output back in as input, one token at a time.
This is called autoregressive generation: the model generates one token, appends it to the input, and predicts the next one. The response you see streaming in word by word in ChatGPT or Claude is literally the model making one prediction at a time.
At each step, the model does not just pick the single most likely token. It produces a probability distribution over its entire vocabulary (typically 30,000 to 100,000+ tokens). Several parameters control how the final token is selected:
Temperature controls randomness. At temperature 0, the model always picks the most probable token (deterministic). At higher temperatures, less probable tokens have a better chance of being selected (more creative, but also more unpredictable).
Top-p (nucleus sampling) limits the selection to the smallest set of tokens whose combined probability exceeds a threshold (e.g., 0.9). This prevents the model from selecting extremely unlikely tokens while still allowing variety.
Top-k limits the selection to the k most probable tokens.
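A sketch of how these three controls interact, assuming the model has already produced logits for a toy three-token vocabulary (the token names and logit values are made up):

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token from a {token: logit} dict using common sampling controls."""
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if temperature == 0:                       # greedy: always the top token
        return items[0][0]
    scaled = [(t, l / temperature) for t, l in items]
    m = max(l for _, l in scaled)
    probs = [(t, math.exp(l - m)) for t, l in scaled]
    z = sum(p for _, p in probs)
    probs = [(t, p / z) for t, p in probs]     # softmax over scaled logits
    if top_k is not None:                      # keep only the k most likely
        probs = probs[:top_k]
    if top_p is not None:                      # smallest set with mass >= top_p
        kept, mass = [], 0.0
        for t, p in probs:
            kept.append((t, p))
            mass += p
            if mass >= top_p:
                break
        probs = kept
    z = sum(p for _, p in probs)               # renormalize after truncation
    r, acc = rng.random() * z, 0.0
    for t, p in probs:
        acc += p
        if acc >= r:
            return t
    return probs[-1][0]

logits = {"Paris": 5.0, "London": 3.0, "banana": -2.0}
print(sample_token(logits, temperature=0))     # always "Paris"
```

At temperature 0 the output never varies; raise the temperature and "London" starts to appear, while top-k and top-p trim the unlikely tail ("banana") before anything is sampled.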
These parameters are why you can get different responses to the same prompt. The model is not being inconsistent. It is sampling from a probability distribution, and the sampling parameters control how much variation is allowed.
When this matters in practice:
- For code generation or factual Q&A, lower temperature (0 to 0.3) gives more consistent, predictable results.
- For creative writing or brainstorming, higher temperature (0.7 to 1.0) produces more varied and surprising output.
- Inference speed depends on model size and hardware. Each token requires a full forward pass through the network. This is why larger models respond more slowly and why GPU selection matters for self-hosted deployments.
- Streaming is not a bolted-on feature. It is a reflection of how generation works. The model produces tokens one at a time, and streaming shows them as they are generated rather than waiting for the full response.
The Full Picture
Here is what happens end-to-end when you send a message to an LLM:
1. Tokenization: Your text is split into tokens and converted to numerical IDs
2. Embedding: Token IDs are converted to dense vectors
3. Transformer blocks: The vectors pass through dozens or hundreds of transformer layers, each applying self-attention and feed-forward transformations
4. Output layer: The final layer produces a probability distribution over the vocabulary for the next token
5. Sampling: A token is selected based on the probability distribution and sampling parameters (temperature, top-p, etc.)
6. Repeat: The selected token is appended to the input, and steps 2-5 repeat until the model produces a stop token or hits a length limit
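The generation loop can be sketched with a stand-in “model.” Here a small lookup table replaces the real forward pass and sampling, purely to show the autoregressive structure:

```python
def toy_model(tokens):
    """Stand-in for a trained model: returns the next token given the tokens
    so far. A real model would return a probability distribution over its
    entire vocabulary, from which a token is sampled."""
    table = {"The": "cat", "cat": "sat", "sat": "<stop>"}
    return table.get(tokens[-1], "<stop>")

def generate(prompt_tokens, max_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):            # repeat until stop token or limit
        next_token = toy_model(tokens)     # forward pass + sampling
        if next_token == "<stop>":
            break
        tokens.append(next_token)          # feed the output back in as input
    return tokens

print(generate(["The"]))  # ['The', 'cat', 'sat']
```

Every real chat response you have ever received was produced by this loop, just with a vastly more capable function in place of the lookup table.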
This entire process is the model running forward through its trained parameters. No learning happens during inference. The model is not updating its weights based on your conversation. It is applying fixed patterns learned during training to your specific input.
When this matters in practice:
- Models do not learn from your conversations (during inference). If you correct a model and it adjusts, it is because your correction is now part of the context window, not because it learned something new.
- Each message in a conversation is not truly a “continuation.” The model re-processes the entire conversation history on every turn. This is why conversation history consumes tokens and why very long conversations can hit context window limits.
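A quick back-of-the-envelope illustration of that re-processing cost, with made-up per-turn token counts:

```python
# Each turn re-sends the full conversation history, so the tokens the model
# processes grow every turn even though each message is small.
turns = [120, 80, 150, 60]     # tokens added per turn (illustrative sizes)
history = 0
processed = []
for t in turns:
    history += t               # the history now includes this turn
    processed.append(history)  # the model re-reads everything so far
print(processed)               # [120, 200, 350, 410]
print(sum(processed))          # 1080 tokens processed for 410 tokens of content
```

This is why long conversations get more expensive per message and eventually press against the context window limit.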
Common Misconceptions
“LLMs understand language.” They process patterns in language with remarkable effectiveness. Whether that constitutes “understanding” is a philosophical debate. What matters practically: they produce outputs that are useful in many contexts and unreliable in others. Treat them as capable tools, not as thinking entities.
“LLMs are databases of facts.” They are pattern-matching systems trained on text. They can produce correct facts because those facts appeared frequently in training data, but they can also produce incorrect facts with equal confidence. Always verify claims that matter.
“Bigger models are always better.” For a specific task, a smaller model that is well-suited (through fine-tuning, good prompting, or architecture choices) often outperforms a larger general-purpose model at lower cost.
“LLMs are random.” They are deterministic at temperature 0. At higher temperatures, they sample from a probability distribution, which introduces controlled variability. The randomness is a feature, not a flaw.
“The model remembers our previous conversations.” Unless the system explicitly stores and retrieves conversation history, each session starts fresh. What feels like memory is the current conversation’s context window.
What Comes Next
This post covered the internal mechanics of large language models. The next post in this series, Tokens and Context Windows, explores how tokenization works across different models, how context limits affect real-world applications, and strategies for working within those constraints.
Closing Thoughts
LLMs are prediction engines. They take a sequence of tokens, process it through billions of learned parameters, and predict what comes next. The transformer architecture and attention mechanism make this prediction remarkably capable across a wide range of tasks.
Understanding this does not diminish what these models can do. It sharpens your ability to use them well. Knowing that a model is predicting tokens, not retrieving facts, changes how you prompt it. Knowing that attention has computational costs changes how you design your inputs. Knowing that training data has a cutoff changes when you reach for RAG instead of relying on the model’s internal knowledge.
The model is a tool. Understanding how it works makes you better at using it.
Found this useful?
If this post helped you, consider buying me a coffee.