
The Modern AI Stack: A Practical Overview

ai · llm · machine-learning · ai-fundamentals

This post provides a high-level overview of the modern AI stack. Each topic is introduced conceptually and explored in depth in follow-up posts throughout this series.

Introduction

Purpose

AI has moved from research labs into production systems. Developers, architects, and business leaders are now expected to make decisions about tools and approaches they may not fully understand. The terminology alone can be a barrier: LLMs, SLMs, embeddings, vector databases, RAG, fine-tuning, agents. These terms get thrown around in meetings, vendor pitches, and architecture discussions, often without shared understanding of what they mean or when they matter.

This post maps the landscape. It introduces the core components of the modern AI stack, explains how they relate to each other, and provides enough context to make informed decisions about when and where each one applies. It is not a tutorial or implementation guide. It is a starting point for building a mental model that the rest of this series will deepen.

By the end of this post, you should have:

  • A clear picture of the major components in the AI stack and how they connect
  • Enough vocabulary to follow technical AI discussions without getting lost
  • A practical sense of which components solve which problems
  • Context for the deeper explorations in the rest of this series

How This Post Fits Into a Broader Series

This overview introduces concepts that each warrant their own deep dive. Follow-up posts will explore individual topics with concrete examples, implementation details, and practical guidance. This post establishes the map. The rest of the series fills in the territory.


The Problem: Too Many Pieces, Not Enough Context

Most introductions to AI start with one piece: here is what an LLM is, here is how to prompt it. That is useful, but it leaves out the bigger picture. When someone asks “should we use RAG or fine-tuning?” the answer depends on understanding what both of those things actually do and where they sit in the larger system.

The AI stack is not a single technology. It is a set of components that serve different purposes and combine in different ways depending on the problem. Without a map of those components, teams end up making decisions based on vendor marketing, blog hype, or whatever a colleague mentioned in a meeting.

This post provides that map.


The Core Components

The modern AI stack has several layers. Each layer solves a different problem, and not every application needs all of them.

Here is the high-level view:

  1. Models (LLMs and SLMs) generate text, reason, and follow instructions
  2. Tokens are how models break down and process language
  3. Context windows determine how much information a model can consider at once
  4. Embeddings convert text into numerical representations that capture meaning
  5. Vector databases store and search those representations efficiently
  6. RAG (Retrieval-Augmented Generation) feeds relevant information to models at query time
  7. Prompting shapes how models respond to requests
  8. Fine-tuning modifies a model’s behavior by training it on additional data
  9. Agents give models the ability to take actions, use tools, and work autonomously

These components are not independent. They build on each other. Understanding tokens is necessary to understand context windows. Understanding embeddings is necessary to understand RAG. Understanding prompting is necessary to understand agents.

Let’s walk through each one.


Models: LLMs and SLMs

A Large Language Model (LLM) is a neural network trained on massive amounts of text data. It learns patterns in language: grammar, facts, reasoning patterns, code structure, and more. When you interact with ChatGPT, Claude, or Gemini, you are interacting with an LLM.

LLMs are “large” in two senses: they are trained on large datasets, and they have a large number of parameters (the internal values the model learned during training). Frontier models like GPT-4 and Claude are estimated to have hundreds of billions of parameters. More parameters generally mean more capability, but also more computational cost.

A Small Language Model (SLM) is architecturally similar but deliberately smaller. Models like Phi, Gemma, and Llama (in its smaller variants) are designed to run on less hardware while still performing well for specific tasks. The tradeoff is straightforward: SLMs are cheaper and faster, but less capable across broad tasks.

When this matters in practice:

  • Building a customer-facing chatbot that handles complex, open-ended questions? You probably need an LLM.
  • Running sentiment analysis on device without an internet connection? An SLM might be the better choice.
  • Processing thousands of documents per hour where cost matters? The model size decision has direct budget impact.

The choice between LLM and SLM is not about which is “better.” It is about matching capability to the problem, the infrastructure, and the budget. This is explored in depth in Choosing the Right Model.


Tokens: How Models See Text

Models do not read words the way humans do. They break text into tokens, which are pieces of words, whole words, or even punctuation. The word “understanding” might be split into “under” and “standing.” Common words like “the” are usually a single token.

Tokenization matters for two practical reasons:

Cost. API-based models charge per token. Every token in your input (the prompt) and output (the response) costs money. Understanding tokenization helps you estimate costs and optimize spending.

Context limits. Models can only process a fixed number of tokens at once. This is the context window. If your input exceeds the window, the model cannot see all of it. A model with a 128,000-token context window can process roughly 96,000 words at once. That sounds like a lot, but it fills up fast when you are working with codebases, legal documents, or long conversation histories.
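Both points can be made concrete with a back-of-the-envelope calculation. The sketch below uses the common rough heuristic of ~4 characters per token for English text; real tokenizers give exact counts, and the per-token prices here are placeholders, not any vendor's actual rates.

```python
# Rough token and cost estimation using the ~4-characters-per-token heuristic.
# Real tokenizers give exact counts; prices below are illustrative placeholders.

def estimate_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def estimate_cost(prompt: str, expected_output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate the dollar cost of one API call from input and output tokens."""
    input_tokens = estimate_tokens(prompt)
    return (input_tokens / 1000) * price_in_per_1k \
         + (expected_output_tokens / 1000) * price_out_per_1k

prompt = "Summarize the attached contract in three bullet points. " * 50
cost = estimate_cost(prompt, expected_output_tokens=500,
                     price_in_per_1k=0.01, price_out_per_1k=0.03)
```

The same arithmetic answers the context-window question: a 200-page contract at roughly 3,000 characters per page is ~150,000 tokens by this heuristic, which overflows a 128,000-token window and forces chunking.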

When this matters in practice:

  • Submitting a 200-page contract for analysis? You need to know whether it fits in the context window, or whether you need to chunk it.
  • Building an API integration where cost matters? Token counts directly determine your bill.
  • Debugging why a model “forgot” something you told it earlier in a conversation? The context window may have been exceeded.

Tokens and context windows are covered in depth in Tokens and Context Windows.


Embeddings: Capturing Meaning as Numbers

An embedding is a numerical representation of text. It converts words, sentences, or entire documents into lists of numbers (vectors) that capture semantic meaning. Text with similar meaning produces similar vectors.

This is a foundational concept. The sentence “the cat sat on the mat” and “a feline rested on the rug” would produce embedding vectors that are close together in vector space, even though they share almost no words. “The stock market crashed” would produce a vector far from both.

Embeddings are generated by specialized embedding models, typically not the same models you chat with. They are the bridge between human language and mathematical operations. Without embeddings, you cannot do semantic search, clustering, classification, or any of the retrieval techniques that make AI applications practical.
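"Close together in vector space" is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors purely to show the math; real embeddings have hundreds or thousands of dimensions and come from an embedding model.

```python
import math

# Toy illustration of embedding similarity. The 3-dimensional vectors are
# hand-made stand-ins for real embedding vectors.

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cat_on_mat  = [0.9, 0.1, 0.0]   # "the cat sat on the mat"
feline_rug  = [0.8, 0.2, 0.1]   # "a feline rested on the rug"
stock_crash = [0.0, 0.1, 0.9]   # "the stock market crashed"

# Similar sentences score near 1; unrelated sentences score near 0.
similar   = cosine_similarity(cat_on_mat, feline_rug)
unrelated = cosine_similarity(cat_on_mat, stock_crash)
```

With these toy vectors, the two cat sentences score around 0.98 while the stock-market sentence scores near 0.01 against either of them, despite the cat sentences sharing almost no words.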

When this matters in practice:

  • Building a search feature that understands meaning, not just keywords? Embeddings power that.
  • Grouping customer support tickets by topic automatically? Embed the tickets and cluster the vectors.
  • Finding similar products, articles, or documents? Embedding similarity is the standard approach.

Embeddings are explored in depth in Embeddings and Vector Space.


Vector Databases: Storing and Searching Meaning

Once you have embeddings, you need somewhere to store them and a way to search them efficiently. That is what a vector database does.

Traditional databases search by exact match or keyword. Vector databases search by similarity. You provide a query, it gets converted to a vector, and the database returns the vectors (and their associated text) that are most similar.

Popular vector databases include Pinecone, Weaviate, Qdrant, Chroma, and pgvector (a PostgreSQL extension). Each makes different tradeoffs around scale, hosting, and integration.
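At its core, the operation a vector database performs is nearest-neighbor search. The sketch below is a deliberately naive in-memory version using brute-force cosine similarity; real systems like the ones named above use approximate-nearest-neighbor indexes to make this fast at scale, but the interface idea is the same.

```python
import math

# A minimal in-memory "vector database": brute-force nearest-neighbor search
# over stored (vector, text) pairs. The 2-dimensional vectors are toy stand-ins
# for real embeddings.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

class TinyVectorStore:
    def __init__(self):
        self.items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self.items.append((vector, text))

    def search(self, query_vector, top_k=2):
        """Return the top_k stored texts most similar to the query vector."""
        scored = sorted(self.items,
                        key=lambda item: cosine(query_vector, item[0]),
                        reverse=True)
        return [text for _, text in scored[:top_k]]

store = TinyVectorStore()
store.add([0.9, 0.1], "refund policy")
store.add([0.1, 0.9], "shipping times")
store.add([0.8, 0.2], "returns process")

# A query vector near the "refunds" region retrieves the two related documents.
results = store.search([0.85, 0.15], top_k=2)
```

In a real pipeline the query vector would come from embedding the user's question with the same model used to embed the documents.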

When this matters in practice:

  • You have 10,000 internal documents and want to find the ones relevant to a user’s question. A vector database stores the embeddings and retrieves the closest matches.
  • You are building a recommendation system based on content similarity. Vector search is the core operation.
  • You need to search across multiple languages without translation. Embeddings can capture meaning across languages, and vector search works on the embeddings regardless of the source language.

Vector databases are covered alongside embeddings in Embeddings and Vector Space.


RAG: Teaching AI What It Doesn’t Know

Retrieval-Augmented Generation (RAG) is a pattern that combines search with generation. Instead of relying solely on what a model learned during training, RAG retrieves relevant information from your own data and includes it in the prompt.

The process works like this:

  1. A user asks a question
  2. The system converts the question into an embedding
  3. A vector database finds the most relevant documents or passages
  4. Those passages are included in the prompt alongside the question
  5. The model generates a response using both its training and the retrieved context

RAG solves a critical problem: LLMs have a knowledge cutoff (they only know what was in their training data) and they do not know anything about your private data. RAG bridges that gap without requiring you to retrain or fine-tune a model.

Chunking is a key part of the RAG pipeline. Documents are too long to embed as single units, so they are split into smaller pieces (chunks). How you chunk matters. Split in the wrong place and you lose context. Split too small and you lose meaning. Split too large and your embeddings become vague.
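A minimal chunker makes the tradeoff visible. The sketch below splits by character count with an overlap so that context spanning a chunk boundary is not lost; production chunkers usually measure in tokens and prefer sentence or section boundaries, and the sizes here are arbitrary.

```python
# A minimal fixed-size chunker with overlap. The overlap preserves context
# that would otherwise be cut at chunk edges. Sizes are in characters for
# simplicity; real pipelines typically measure in tokens.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunk_size-character pieces, each overlapping the next."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "".join(str(i % 10) for i in range(500))  # stand-in for a real document
pieces = chunk_text(doc, chunk_size=200, overlap=50)
# Each chunk's last 50 characters repeat as the next chunk's first 50.
```

Tuning `chunk_size` and `overlap` is exactly the tradeoff described above: too small loses meaning, too large produces vague embeddings.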

When this matters in practice:

  • Building a chatbot that answers questions about your company’s internal documentation? RAG is the standard approach.
  • Creating a coding assistant that understands your specific codebase? RAG retrieves relevant code files and includes them as context.
  • Answering questions about recent events that the model was not trained on? RAG retrieves up-to-date information.

RAG is covered in depth in RAG: Teaching AI What It Doesn’t Know.


Prompting: Shaping Model Behavior

Prompting is how you communicate with a model. The text you send to a model (the prompt) determines what you get back. This sounds simple, but the difference between a vague prompt and a well-structured one can be the difference between a useless response and a useful one.

Prompting has several layers:

  • System prompts set the model’s role, constraints, and behavior for an entire conversation
  • Few-shot examples show the model what good output looks like by including examples in the prompt
  • Chain-of-thought prompting asks the model to reason step by step before answering
  • Structured output instructs the model to respond in a specific format (JSON, XML, markdown)
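The layers above can be sketched as a chat-style message list, the shape most chat APIs accept. The exact message format varies by provider; this is the general pattern, not any specific vendor's API.

```python
# The prompting layers sketched as a chat-style message list. The format is
# the common pattern, not a specific provider's schema.

messages = [
    # System prompt: role, constraints, and output format for the session.
    {"role": "system",
     "content": "You are a support assistant. Always reply with JSON: "
                '{"sentiment": "positive" | "negative", "summary": "..."}'},

    # Few-shot example: show the model what good output looks like.
    {"role": "user",
     "content": "The update broke my login, very frustrating."},
    {"role": "assistant",
     "content": '{"sentiment": "negative", "summary": "Login broken after update"}'},

    # The actual request.
    {"role": "user", "content": "Love the new dashboard, much faster!"},
]

# Inference parameters travel alongside the messages; a low temperature makes
# structured output more deterministic.
request = {"messages": messages, "temperature": 0.2}
```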

Inference is the process of the model generating a response. Parameters like temperature (how creative vs. deterministic the output is) and top-p (which restricts sampling to the smallest set of likely tokens whose combined probability reaches p) give you control over this process.

When this matters in practice:

  • Getting inconsistent results from an AI feature? Better prompting is usually the first fix, before considering fine-tuning or other approaches.
  • Need a model to reliably return JSON for an API integration? Structured output prompting handles that.
  • Want a model to show its reasoning for auditing or debugging? Chain-of-thought prompting makes the reasoning visible.

Prompting and inference are covered in depth in Prompting and Inference.


Fine-Tuning: Changing How a Model Behaves

Fine-tuning takes a pre-trained model and trains it further on a specific dataset. This changes the model’s behavior, making it better at particular tasks, more aligned with a specific tone, or more knowledgeable about a specialized domain.

Fine-tuning is not the same as RAG. RAG provides information at query time without changing the model. Fine-tuning changes the model itself. Think of it this way: RAG is like giving someone a reference book to consult. Fine-tuning is like sending them to a specialized training program.

There are several approaches to fine-tuning:

  • Full fine-tuning updates all model parameters. Expensive and resource-intensive.
  • LoRA (Low-Rank Adaptation) updates a small subset of parameters. Much cheaper, often nearly as effective.
  • RLHF (Reinforcement Learning from Human Feedback) uses human preferences to align model behavior. Techniques like this are part of how models such as ChatGPT and Claude were trained to be helpful and safe.
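The parameter counts explain why LoRA is so much cheaper. Instead of updating every entry of a d×d weight matrix, LoRA learns two small matrices A (d×r) and B (r×d) whose product approximates the update. The numbers below are illustrative, not taken from any specific model.

```python
# Why LoRA is cheap, in parameter counts. Full fine-tuning updates every entry
# of each weight matrix; LoRA learns a low-rank approximation of the update.
# The dimensions are illustrative.

d = 4096   # hidden dimension of one d x d weight matrix
r = 8      # LoRA rank (much smaller than d)

full_update_params = d * d        # every entry of the matrix
lora_params = d * r + r * d       # the two low-rank matrices A and B

reduction = full_update_params / lora_params
# 16,777,216 vs 65,536 trainable values for this one matrix: a 256x reduction.
```

This is per weight matrix; applied across a model's attention layers, it is the difference between needing a GPU cluster and fine-tuning on a single card.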

When this matters in practice:

  • Your company has a specific writing style that the model needs to match consistently? Fine-tuning can teach that.
  • You need a model that understands specialized terminology in medicine, law, or finance? Fine-tuning on domain-specific data helps.
  • You have tried better prompts and RAG, and performance still is not where it needs to be? Fine-tuning is the next lever to pull.

Fine-tuning is covered in depth in Fine-Tuning and Model Customization.


Agents: AI That Takes Action

An agent is a model that can use tools, make decisions, and take actions. Instead of just generating text, an agent can search the web, run code, call APIs, read files, and chain multiple steps together to accomplish a goal.

The simplest form of this is function calling (also called tool use). The model decides that it needs to perform an action, generates a structured request, and the system executes it. The result is fed back to the model, which can then decide on the next step.
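The loop described above can be sketched end to end. To keep the sketch self-contained, the "model" below is a scripted stand-in that first requests a tool call and then answers; the tool and order data are invented for illustration. A real agent would send the conversation history to an LLM at each step.

```python
# A stripped-down tool-use loop. The model is a scripted fake so the example
# runs standalone; the tool and its data are invented for illustration.

def get_shipping_status(order_id: str) -> str:
    """Stand-in tool; a real one would call an internal API."""
    return f"Order {order_id} shipped on 2024-06-01"

TOOLS = {"get_shipping_status": get_shipping_status}

def fake_model(history: list[dict]) -> dict:
    """Scripted stand-in for an LLM: first request a tool, then answer."""
    if not any(msg["role"] == "tool" for msg in history):
        return {"type": "tool_call", "tool": "get_shipping_status",
                "args": {"order_id": "A-123"}}
    tool_result = [m for m in history if m["role"] == "tool"][-1]["content"]
    return {"type": "answer", "content": f"Good news: {tool_result}."}

def run_agent(question: str) -> str:
    history = [{"role": "user", "content": question}]
    while True:
        step = fake_model(history)
        if step["type"] == "answer":
            return step["content"]
        # Execute the requested tool and feed the result back to the model.
        result = TOOLS[step["tool"]](**step["args"])
        history.append({"role": "tool", "content": result})

answer = run_agent("Where is my order A-123?")
```

The structure, not the fake model, is the point: generate, check for a tool call, execute, append the result, repeat until the model produces a final answer.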

More advanced agents can plan multi-step workflows, recover from errors, and operate with minimal human oversight. This is an active area of development, and the boundaries of what agents can handle reliably are still being defined.

When this matters in practice:

  • Building a customer support system that can look up orders, check shipping status, and process returns? An agent can chain those actions together.
  • Want AI to handle code changes across multiple files based on a specification? Coding tools like Claude Code, Cursor, and GitHub Copilot are built on tool-calling agents.
  • Need to automate a research workflow that involves searching, reading, summarizing, and compiling? Agents can orchestrate that pipeline.

Agents and tool use are covered in depth in AI Agents and Tool Use.


Putting It Together: Which Components Solve Which Problems

Not every AI application needs every component. Here is a practical guide:

“I want to add a chatbot to my product.” Start with a model (LLM), good prompting, and a system prompt. If it needs to know about your data, add RAG. If it needs to take actions, add tool use.

“I want to search our internal documents by meaning, not keywords.” Embeddings, a vector database, and a search interface. You may not need a generative model at all.

“I want AI to answer questions about our documentation.” RAG. Embed your documents, store them in a vector database, retrieve relevant chunks, and pass them to an LLM with the user’s question.

“I want the AI to match our brand voice perfectly.” Try prompting first (system prompt with examples and guidelines). If that is not consistent enough, fine-tune.

“I want AI to automate a multi-step workflow.” Agents with tool use. Define the tools, give the agent clear instructions, and build in oversight for high-stakes actions.

“I need to run AI on edge devices without internet.” SLMs. Choose a model that fits your hardware constraints and optimize for the specific task.

“I need to classify thousands of support tickets per hour.” Embeddings for classification, possibly an SLM for cost efficiency. Fine-tuning if the categories are specialized.


What Comes Next

Planned Deep-Dive Topics

This overview introduces several concepts that warrant deeper exploration. Follow-up posts in this series will each focus on an individual topic.

Each post will examine its topic in depth, with concrete examples and practical guidance.

How These Topics Build on This Overview

This post establishes the conceptual foundation for the series. Subsequent posts will assume familiarity with the ideas introduced here and build upon them, moving from understanding toward application.

The series follows a deliberate structure. Posts 2 through 4 cover the foundational mechanics: how models work, how they process language, and how meaning is represented numerically. Posts 5 through 7 cover the patterns for applying those mechanics: retrieval, prompting, and customization. Posts 8 and 9 cover the decisions: which model to choose and how to build autonomous systems. Post 10 makes the case for organizational adoption.


Closing Thoughts

The AI stack is not one thing. It is a set of components that solve different problems and combine in different ways. Understanding those components, what they do, when they matter, and how they relate, is the foundation for making good decisions about AI adoption and implementation.

This is not about chasing the latest model release or vendor announcement. It is about building a mental model that holds up as the technology evolves. The specifics will change. The fundamentals will not.

This overview is the starting point. The rest of the series fills in the details.

Found this useful?

If this post helped you, consider buying me a coffee.
