Embeddings and Vector Space
This post explores how text is converted into numerical representations that capture meaning. It builds on the tokenization concepts from Tokens and Context Windows and provides the foundation for understanding RAG in the next post.
Introduction
The previous posts explained how models process text as tokens and operate within context windows. This post covers a different but related concept: how text is converted into numerical representations that capture meaning, not just structure.
Embeddings are the bridge between human language and mathematical operations. They are what make semantic search, document retrieval, classification, clustering, and recommendation systems possible. If you want AI to understand that “car” and “automobile” mean the same thing, or that a customer complaint about “slow delivery” is related to one about “shipping delays,” embeddings are how that happens.
What Is an Embedding?
An embedding is a list of numbers (a vector) that represents a piece of text in a way that captures its meaning. The text can be a word, a sentence, a paragraph, or an entire document. The output is always the same: a fixed-length list of floating-point numbers.
For example, the sentence “The weather is nice today” might produce an embedding like:
[0.023, -0.142, 0.891, 0.034, ..., -0.567]
A typical embedding has a few hundred to a few thousand dimensions (that many numbers in the list); the common models below range from 384 to 3,072. Each dimension captures some aspect of meaning, though individual dimensions are not human-interpretable. You cannot look at dimension 47 and say “this represents sentiment.” The meaning is distributed across all dimensions collectively.
The key property: texts with similar meaning produce similar embeddings. “The weather is nice today” and “It’s a beautiful day outside” would produce vectors that are close together in the high-dimensional space. “The stock market crashed” would produce a vector far from both.
How Embeddings Are Created
Embeddings are generated by embedding models, which are different from the chat models you interact with in ChatGPT or Claude. Embedding models are trained specifically to produce vectors that place semantically similar text near each other.
Common embedding models:
| Model | Dimensions | Provider |
|---|---|---|
| text-embedding-3-small | 1,536 | OpenAI |
| text-embedding-3-large | 3,072 | OpenAI |
| voyage-3 | 1,024 | Voyage AI |
| embed-v4 | 1,024 | Cohere |
| BGE-large | 1,024 | Open source (BAAI) |
| all-MiniLM-L6-v2 | 384 | Open source (Sentence Transformers) |
The process is straightforward: you send text to the embedding model, and it returns a vector. There is no generation, no chat, no prompt engineering. Input text, output numbers.
When this matters in practice:
- Embedding models are much cheaper and faster than chat models. Embedding a document costs a fraction of a cent.
- You choose an embedding model once for a given application and stick with it. Embeddings from different models are not compatible. You cannot search OpenAI embeddings with a Cohere query vector.
- Larger embedding dimensions capture more nuance but use more storage and compute for similarity search. For most applications, 1,024 dimensions is a good balance.
Vector Similarity: Measuring Meaning
Once you have embeddings, you need a way to compare them. This is vector similarity, and it is the mathematical operation that powers semantic search.
The most common similarity measure is cosine similarity. It measures the angle between two vectors, ignoring their magnitude. A cosine similarity of 1.0 means the vectors point in the same direction (identical meaning). A similarity of 0 means the vectors are orthogonal (no meaningful relationship). A similarity of -1 means they point in opposite directions.
In practice, exact scores vary by embedding model, but related text typically falls between 0.7 and 0.95, and unrelated text usually scores below 0.3.
Example:
| Text A | Text B | Cosine Similarity |
|---|---|---|
| "How do I reset my password?" | "I forgot my login credentials" | ~0.89 |
| "How do I reset my password?" | "What are your business hours?" | ~0.21 |
| "The cat sat on the mat" | "A feline rested on the rug" | ~0.85 |
| "The cat sat on the mat" | "Stock prices fell sharply" | ~0.08 |
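The math behind cosine similarity fits in a few lines. A minimal pure-Python sketch, using toy 3-dimensional vectors as stand-ins for real model output (actual embeddings have hundreds or thousands of dimensions and come from an embedding model, not hand-picked numbers):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-d "embeddings" chosen to illustrate the geometry.
v1 = [0.9, 0.1, 0.2]   # stand-in for "reset my password"
v2 = [0.8, 0.2, 0.3]   # stand-in for "forgot my login"
v3 = [0.1, 0.9, -0.5]  # stand-in for an unrelated sentence

print(round(cosine_similarity(v1, v2), 3))  # close to 1.0
print(round(cosine_similarity(v1, v3), 3))  # much lower
```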
Other similarity measures exist (Euclidean distance, dot product), but cosine similarity is the most widely used for text embeddings because it handles vectors of different magnitudes well. For vectors normalized to unit length, which many embedding models output by default, dot product and cosine similarity produce the same ranking.
When this matters in practice:
- Similarity thresholds are application-specific. A search application might show results above 0.5, while a duplicate detection system might require 0.9+.
- Similarity is not the same as entailment or correctness. Two sentences can be similar in topic but say opposite things (“I love this product” vs. “I hate this product” might have relatively high similarity because both are product opinions).
Vector Databases: Storing and Searching at Scale
Comparing every embedding against every other embedding works fine for small datasets. With thousands or millions of embeddings, you need a vector database that can search efficiently.
Vector databases are optimized for one operation: given a query vector, find the most similar vectors in the collection. They use specialized indexing structures (like HNSW, IVF, or product quantization) to make this search fast without comparing against every stored vector.
Popular Vector Databases
Managed services:
- Pinecone: Fully managed, simple API, scales well. Good for teams that do not want to manage infrastructure.
- Weaviate: Managed or self-hosted, supports hybrid search (combining vector and keyword search).
- Qdrant: Open source, self-hosted or cloud. Strong filtering capabilities.
Embedded/lightweight:
- Chroma: Open source, designed for prototyping and smaller datasets. Runs in-process.
- FAISS: Meta’s library for efficient similarity search. Not a database, but a building block.
Extensions to existing databases:
- pgvector: PostgreSQL extension. Use your existing Postgres database for vector search without adding another system.
- MongoDB Atlas Vector Search: Vector search built into MongoDB.
- Elasticsearch: Supports dense vector fields and approximate nearest neighbor search.
Choosing a Vector Database
The choice depends on your constraints:
| Consideration | Recommendation |
|---|---|
| Already using PostgreSQL? | Start with pgvector |
| Prototyping or small dataset? | Chroma or FAISS |
| Need managed infrastructure at scale? | Pinecone or Weaviate Cloud |
| Need hybrid search (vector + keyword)? | Weaviate or Elasticsearch |
| Need maximum control and self-hosting? | Qdrant or Weaviate |
For many teams, pgvector is the right starting point. It avoids adding a new database to the stack and handles moderate scale well. Move to a specialized vector database when pgvector’s performance is no longer sufficient.
When this matters in practice:
- Vector databases are not replacements for traditional databases. You still need your relational or document database for structured data. The vector database handles the similarity search layer.
- Index build time matters. Adding or updating embeddings is not instant. Plan for reindexing time as your dataset grows.
- Filtering (e.g., “find similar documents, but only from the last 30 days”) is important in most real applications. Not all vector databases handle filtering equally well.
Practical Applications
Semantic Search
Traditional search relies on keyword matching. Searching for “how to fix a leaky faucet” will not find a document titled “repairing a dripping tap” unless it shares the right keywords. Semantic search using embeddings finds it because the meanings are similar.
The implementation: embed all your documents at index time, store the vectors in a vector database, embed the user’s query at search time, and return the nearest vectors.
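A sketch of that pipeline's search step in plain Python. The vectors are hypothetical 3-dimensional stand-ins for real embeddings, and the brute-force scan over every stored vector is exactly the linear work a vector database's index (HNSW, IVF) replaces at scale:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def search(query_vector, index, top_k=3):
    """Brute-force nearest-neighbor search: score every stored vector, return the best."""
    scored = [(cosine(query_vector, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for score, doc_id in scored[:top_k]]

# Toy "index": doc id -> precomputed embedding. In a real system these vectors
# come from an embedding model at index time, and the query vector must come
# from the same model at search time.
index = {
    "faucet-repair": [0.9, 0.1, 0.0],
    "tap-dripping":  [0.8, 0.2, 0.1],
    "stock-report":  [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # hypothetical embedding of "how to fix a leaky faucet"
print(search(query, index, top_k=2))
```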
Classification
Embeddings can classify text without training a custom model. Embed examples of each category, embed the text to classify, and assign it to the category with the most similar examples. This is sometimes called zero-shot or few-shot classification because it requires minimal or no training data for the classifier itself.
Example: Classifying customer support tickets into categories (billing, technical, shipping, general). Embed a few example tickets for each category, then classify new tickets by finding the nearest category.
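A minimal sketch of this nearest-category approach, again using hypothetical 3-dimensional vectors in place of real embeddings of labeled tickets:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def classify(ticket_vector, examples):
    """Assign the ticket to the category whose example embeddings are most
    similar on average (one of several reasonable scoring choices)."""
    def avg_similarity(vectors):
        return sum(cosine(ticket_vector, v) for v in vectors) / len(vectors)
    return max(examples, key=lambda category: avg_similarity(examples[category]))

# Hypothetical example embeddings per category; real ones would come from
# embedding a few labeled tickets with your chosen embedding model.
examples = {
    "billing":  [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "shipping": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.2]],
}
new_ticket = [0.85, 0.15, 0.05]  # stand-in embedding of "I was charged twice"
print(classify(new_ticket, examples))  # → billing
```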
Clustering
Group similar items together by embedding them and running a clustering algorithm (like k-means) on the vectors. This surfaces patterns and groupings that might not be obvious from keywords alone.
Example: Analyzing thousands of product reviews to find common themes. Embed all reviews, cluster them, and examine each cluster to identify topics like “battery life,” “build quality,” or “customer service.”
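A compact k-means sketch shows the mechanics. The 2-dimensional "review embeddings" are hypothetical stand-ins; in practice you would run a library implementation (such as scikit-learn's KMeans) on real embedding vectors:

```python
import math
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Minimal k-means: assign each vector to its nearest centroid, move each
    centroid to the mean of its assigned vectors, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: math.dist(v, centroids[i]))
            clusters[nearest].append(v)
        # Recompute centroids; keep the old one if a cluster is empty.
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Toy review embeddings with two well-separated themes.
reviews = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],   # e.g. a "battery life" theme
           [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]   # e.g. a "build quality" theme
clusters = kmeans(reviews, k=2)
print([len(c) for c in clusters])  # two groups of three with this toy data
```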
Recommendation
“Users who liked this also liked that” can be powered by embedding similarity. Embed items (products, articles, songs) and recommend items whose embeddings are close to items the user has engaged with.
Duplicate Detection
Find near-duplicate documents, support tickets, or database entries by comparing embedding similarity. This catches duplicates that differ in wording but convey the same information.
Embedding Strategies for Different Content Types
Not all text should be embedded the same way.
Short text (queries, titles, labels): Embed directly. Short text works well with most embedding models out of the box.
Long documents (articles, reports, contracts): Break into chunks first, then embed each chunk individually. A single embedding of an entire document loses detail because the meaning gets averaged across too many concepts. Chunking strategies are covered in detail in the next post on RAG.
Structured data (tables, forms, metadata): Convert to natural language descriptions before embedding. A row with “status: overdue, amount: $5,000, customer: Acme Corp” embeds better as “Acme Corp has an overdue payment of $5,000.”
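A sketch of that conversion step. The field names and sentence template are hypothetical; the point is that the output reads like the prose the embedding model was trained on:

```python
def row_to_sentence(row: dict) -> str:
    """Render a structured record as a natural-language sentence before embedding."""
    return (f"{row['customer']} has an {row['status']} payment "
            f"of ${row['amount']:,}.")

row = {"status": "overdue", "amount": 5000, "customer": "Acme Corp"}
print(row_to_sentence(row))  # → Acme Corp has an overdue payment of $5,000.
```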
Code: Use embedding models trained on code (such as CodeBERT, or general-purpose models whose training data includes large amounts of code). Embedding models trained only on natural language perform poorly on code because the relationship between syntax and meaning is different from natural language.
When this matters in practice:
- The quality of your embeddings depends on how you prepare the text before embedding. Garbage in, garbage out applies here.
- Embedding models have maximum input lengths (typically 512 to 8,192 tokens). Text longer than the limit gets truncated. This is another reason long documents need chunking.
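A naive chunker illustrates the idea. This sketch counts words for simplicity; a production version would count tokens with the model's own tokenizer, since the limits above are expressed in tokens:

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Fixed-size chunking by word count, with overlapping words between
    consecutive chunks so meaning is not cut off at a boundary."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]

# A fake 450-word "document".
doc = " ".join(f"word{i}" for i in range(450))
chunks = chunk_words(doc, max_words=200, overlap=20)
print(len(chunks))  # → 3
```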
Common Pitfalls
Mixing embedding models. Embeddings from different models live in different vector spaces. You cannot embed your documents with one model and your queries with another. Pick one model and use it consistently.
Ignoring embedding model updates. When an embedding model gets updated (e.g., text-embedding-ada-002 to text-embedding-3-small), the new model produces different vectors. You need to re-embed all your stored documents if you switch models.
Embedding too much at once. A single embedding of an entire 50-page document produces a vague vector that matches everything loosely and nothing well. Chunk your content.
Not testing with real queries. Embedding quality varies by domain. A model that works well for general text might perform poorly on legal documents, medical records, or highly technical content. Test with representative queries from your actual use case.
Over-indexing on benchmark scores. Embedding model leaderboards test on specific benchmarks that may not reflect your use case. A model ranked lower on a general leaderboard might still be the best choice for your specific domain.
What Comes Next
This post covered how text is converted into numerical representations and how those representations enable similarity-based operations. The next post in this series explores RAG: Teaching AI What It Doesn’t Know, the pattern that combines embedding-based retrieval with language model generation to answer questions using your own data.
Closing Thoughts
Embeddings turn the fuzzy concept of “meaning” into something a computer can work with. They are not perfect representations. They lose nuance, they can encode biases from training data, and they do not capture every aspect of what text means. But they are good enough to power search, classification, clustering, and retrieval at scale.
The practical value is straightforward: embeddings let you find things by meaning instead of by keywords. For most AI applications, that capability is foundational. Understanding how embeddings work, what they can and cannot capture, and how to use them effectively is essential for building systems that go beyond simple prompt-and-response interactions.
Found this useful?
If this post helped you, consider buying me a coffee.