Choosing the Right Model
This post provides a framework for choosing between models. It draws on concepts from across the series, particularly How LLMs Actually Work, Prompting and Inference, and Fine-Tuning and Model Customization.
Introduction
The previous posts covered what models are, how they work, and how to customize their behavior. This post answers the practical question: which model should you use?
The answer is never just “the biggest one” or “the cheapest one.” It depends on the task, the constraints, the budget, and the deployment environment. A model that is perfect for one use case might be wasteful or insufficient for another. This post provides a framework for making that decision.
The Model Landscape
The AI model ecosystem has expanded rapidly. Choosing a model now involves decisions across several dimensions:
- Size: Large (70B+ parameters) vs. small (1-13B parameters)
- Access: Closed/proprietary (API access only) vs. open (weights available for download)
- Deployment: Cloud API vs. self-hosted vs. on-device
- Specialization: General-purpose vs. domain-specific
- Provider: OpenAI, Anthropic, Google, Meta, Mistral, and others
Each dimension involves tradeoffs. Understanding those tradeoffs is the foundation for making good choices.
LLMs vs. SLMs: Size Matters, But Not Always
Large Language Models (LLMs)
Models with 70B+ parameters. Examples: GPT-4o, Claude Opus/Sonnet, Gemini Pro.
Strengths:
- Broad knowledge across many domains
- Strong reasoning and multi-step problem solving
- Better at following complex instructions
- Better at nuanced, ambiguous tasks
- State-of-the-art performance on benchmarks
Weaknesses:
- Higher cost per token
- Higher latency (slower responses)
- Require significant compute (cloud-only for most)
- Overkill for simple tasks
Small Language Models (SLMs)
Models with 1-13B parameters. Examples: Phi-3, Gemma 2, Llama 3 8B, Mistral 7B.
Strengths:
- Lower cost per token (often 10-50x cheaper than frontier LLMs)
- Lower latency
- Can run on consumer hardware or edge devices
- Can run without internet connectivity
- Easier to fine-tune
- Sufficient for many focused tasks
Weaknesses:
- Less capable at complex reasoning
- Narrower knowledge
- Worse at following complex or nuanced instructions
- May need fine-tuning to match LLM quality on specific tasks
When to Use Which
| Task | Recommended Size | Reasoning |
|---|---|---|
| Complex code generation | LLM | Requires broad knowledge and multi-step reasoning |
| Customer support chatbot (general) | LLM or mid-tier | Needs to handle diverse, unpredictable queries |
| Sentiment classification | SLM | Focused task, well-defined output |
| Text summarization | Mid-tier or SLM | Depends on content complexity |
| On-device autocomplete | SLM | Latency and connectivity constraints |
| Data extraction from forms | SLM | Structured task, consistent format |
| Research and analysis | LLM | Requires synthesis across broad knowledge |
| Translation | Mid-tier | Well-defined task, but nuance matters |
When this matters in practice:
- Start with a frontier LLM to establish a quality baseline. Then test whether a smaller model produces acceptable results for your specific task. If it does, you save money and latency.
- A fine-tuned SLM can match or exceed a general-purpose LLM on a narrow task. If your use case is focused, this is often the best path.
- The “mid-tier” sweet spot (Claude Sonnet, GPT-4o mini) handles a wide range of tasks well at moderate cost. This is where most production applications land.
Open vs. Closed Models
Closed Models (Proprietary)
Models accessed only through a provider’s API. You cannot download the weights, modify the model, or run it on your own infrastructure.
Examples: GPT-4o, Claude, Gemini Pro
Advantages:
- Best-in-class performance (frontier models are typically closed)
- No infrastructure to manage
- Continuous improvements from the provider
- Enterprise support and SLAs
Disadvantages:
- Vendor lock-in
- Data sent to a third party (privacy/compliance concerns)
- No control over model behavior changes
- Usage-based pricing can be unpredictable at scale
- Provider can deprecate or change the model
Open Models
Models with publicly available weights that you can download, inspect, modify, and deploy.
Examples: Llama 3, Mistral, Gemma, Phi-3, Qwen
Advantages:
- Full control over deployment and data
- No per-token API costs (you pay for infrastructure instead)
- Can fine-tune without restrictions
- No vendor lock-in
- Data stays on your infrastructure (important for regulated industries)
- Community ecosystem of tools, fine-tunes, and optimizations
Disadvantages:
- Generally lower performance than frontier closed models
- You manage the infrastructure (GPUs, scaling, monitoring)
- You handle security, patching, and updates
- Requires ML engineering expertise for deployment and optimization
The Open Model Ecosystem
Open models have matured rapidly. The current landscape:
| Model Family | Provider | Notable Sizes | Strengths |
|---|---|---|---|
| Llama 3 | Meta | 8B, 70B, 405B | Strong general performance, large community |
| Mistral | Mistral AI | 7B, 8x7B (Mixtral), Large | Efficient, good multilingual support |
| Gemma 2 | Google | 2B, 9B, 27B | Strong for its size, good for on-device |
| Phi-3/4 | Microsoft | 3.8B, 14B | Punches above weight class, good for edge |
| Qwen 2.5 | Alibaba | 0.5B to 72B | Strong multilingual, competitive performance |
How to Choose
| Situation | Recommendation |
|---|---|
| Need best possible quality, no infrastructure team | Closed model API |
| Regulated industry, data cannot leave your network | Open model, self-hosted |
| High volume, cost is primary concern | Open model or mid-tier closed |
| Rapid prototyping, unknown requirements | Closed model API (fastest to start) |
| Need to fine-tune with full control | Open model |
| On-device or edge deployment | Open SLM |
When this matters in practice:
- Many teams use a hybrid approach: closed model APIs for development and prototyping, open models for production at scale or where data privacy matters.
- Open models require infrastructure investment that closed models do not. Factor in GPU costs, engineering time, and operational overhead when comparing.
- The gap between open and closed models is narrowing. Evaluate based on current benchmarks and your specific use case, not assumptions from a year ago.
Deployment Options
Cloud APIs
The simplest deployment model. Send requests to a provider’s API, receive responses. No infrastructure to manage.
Best for: Teams without ML infrastructure. Applications where latency of 0.5-2 seconds per request is acceptable. Variable or unpredictable workloads.
Cost model: Pay per token. Predictable per-request cost, but total cost scales linearly with usage.
Self-Hosted Cloud
Run open models on your own cloud infrastructure (AWS, GCP, Azure) using GPU instances.
Best for: High-volume applications where per-token API costs exceed infrastructure costs. Organizations with data residency requirements. Teams that need full control over the model and infrastructure.
Cost model: Fixed infrastructure cost (GPU instances by the hour) regardless of usage. Economical at high volume, expensive at low volume.
On-Device / Edge
Run small models directly on end-user devices (phones, laptops, IoT devices).
Best for: Offline-capable applications. Latency-sensitive use cases. Privacy-critical applications where data cannot leave the device.
Constraints: Limited to small models (typically under 7B parameters). Limited by device memory and compute. Battery impact on mobile devices.
Tools: Ollama (desktop), llama.cpp, MLX (Apple Silicon), MediaPipe (mobile), ONNX Runtime.
Cost Comparison Framework
Cost is often the deciding factor. Here is a framework for comparison:
API Pricing (Per Million Tokens)
| Tier | Input | Output | Example Models |
|---|---|---|---|
| Frontier | $10-15 | $30-75 | GPT-4o, Claude Opus |
| Mid-tier | $1-3 | $5-15 | Claude Sonnet |
| Efficient | $0.10-0.80 | $0.40-4 | Claude Haiku, GPT-4o mini |
Self-Hosted Cost Estimate
Running a 70B parameter model requires approximately:
- 2x A100 80GB GPUs (~$4-6/hour on cloud)
- At sustained throughput, this can serve thousands of requests per hour
- Break-even vs. API pricing typically occurs at 10,000-50,000+ requests per day
Decision Rule
Calculate your monthly API cost at projected volume. If it exceeds $2,000-5,000/month, evaluate self-hosting economics. Below that, the operational overhead of self-hosting rarely justifies the savings.
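The decision rule above amounts to a back-of-the-envelope calculation. Here is a minimal sketch; the prices and request profile are illustrative assumptions drawn from the tables above, not live quotes:

```python
# Sketch: compare monthly API cost vs. self-hosted GPU cost at a given volume.
# All prices here are illustrative assumptions, not current provider quotes.

def api_monthly_cost(requests_per_day, input_tokens, output_tokens,
                     price_in_per_m, price_out_per_m):
    """Monthly API cost in dollars for a given per-request token profile."""
    per_request = (input_tokens * price_in_per_m +
                   output_tokens * price_out_per_m) / 1_000_000
    return requests_per_day * 30 * per_request

def self_hosted_monthly_cost(gpu_hourly_rate, gpu_count=2):
    """Fixed monthly cost of always-on GPU instances (usage-independent)."""
    return gpu_hourly_rate * gpu_count * 24 * 30

# Example: 20,000 requests/day, 1,000 input + 500 output tokens each,
# at mid-tier pricing ($3 in / $15 out per million tokens).
api = api_monthly_cost(20_000, 1_000, 500, 3.0, 15.0)      # $6,300/month
hosted = self_hosted_monthly_cost(gpu_hourly_rate=2.5)     # $3,600/month for 2 GPUs

print(f"API: ${api:,.0f}/month, self-hosted: ${hosted:,.0f}/month")
```

At this volume self-hosting is cheaper on paper, but remember to add the engineering and operational overhead before calling it a win.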
Evaluating Models for Your Use Case
Benchmarks and leaderboards are a starting point, not a decision. Here is a practical evaluation process:
- Define your task. What specific inputs and outputs does your application need?
- Create a test set. 50-100 representative input/output pairs from your actual use case.
- Test 3-4 models. Include at least one frontier, one mid-tier, and one open model.
- Measure what matters. Accuracy, format compliance, latency, cost per request.
- Test edge cases. How does each model handle unusual inputs, ambiguity, or missing information?
- Calculate total cost. Include tokens, infrastructure, engineering time, and operational overhead.
- Choose the cheapest model that meets your quality threshold. Not the best model. The cheapest one that is good enough.
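The core of this process (steps 2 through 7) can be sketched as a small evaluation loop. `call_model` is a placeholder you would replace with real API calls; the model names, per-request costs, test set, and 90% threshold are all illustrative assumptions:

```python
# Sketch of the evaluation loop: score each candidate model on a test set,
# then pick the cheapest one that clears the quality bar.

def call_model(model_name, prompt):
    # Stand-in for a real API call; returns canned answers for the demo.
    canned = {"frontier": "positive", "mid-tier": "positive", "small": "negative"}
    return canned[model_name]

def evaluate(model_name, cost_per_request, test_set, threshold=0.9):
    """Score a model on (input, expected) pairs; report accuracy and pass/fail."""
    correct = sum(call_model(model_name, x) == expected for x, expected in test_set)
    accuracy = correct / len(test_set)
    return {"model": model_name, "accuracy": accuracy,
            "cost": cost_per_request, "passes": accuracy >= threshold}

# In practice this would be 50-100 representative pairs, not two.
test_set = [("Great product!", "positive"), ("Loved it", "positive")]
results = [evaluate(m, c, test_set)
           for m, c in [("frontier", 0.05), ("mid-tier", 0.01), ("small", 0.002)]]

# The decision rule: the cheapest model that meets the quality threshold.
passing = [r for r in results if r["passes"]]
best = min(passing, key=lambda r: r["cost"]) if passing else None
```

The same loop extends naturally to the other metrics listed above (latency, format compliance) by adding fields to the result dict.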
When this matters in practice:
- A model that scores 95% on your test set at $0.002 per request is usually better than one that scores 98% at $0.05 per request. The extra 3% rarely justifies a 25x cost increase.
- Evaluate on YOUR data. A model that leads a public benchmark might rank differently on your specific domain.
- Re-evaluate quarterly. Model capabilities and pricing change quickly. The right choice today might not be the right choice in six months.
Multi-Model Architectures
Many production systems use multiple models for different tasks:
Router pattern: A fast, cheap model classifies the incoming request and routes it to the appropriate model. Simple queries go to an SLM. Complex queries go to an LLM.
Pipeline pattern: Different models handle different stages. An SLM extracts data, an LLM reasons about it, another SLM formats the output.
Fallback pattern: Start with a cheaper model. If confidence is low or the output fails validation, retry with a more capable model.
These patterns optimize cost without sacrificing quality where it matters.
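A minimal sketch of the router and fallback patterns, assuming stand-in components throughout: the word-count classifier, model names, and validation hook are placeholders for whatever your system actually uses.

```python
# Sketch: router and fallback patterns with placeholder components.

def classify_complexity(query):
    # Stand-in router heuristic; a real system might use a small classifier model.
    return "complex" if len(query.split()) > 20 or "why" in query.lower() else "simple"

def call_model(model_name, query):
    # Placeholder for a real API call.
    return f"[{model_name}] answer to: {query}"

def route(query):
    """Router pattern: cheap model for simple queries, LLM for complex ones."""
    model = "small-model" if classify_complexity(query) == "simple" else "large-model"
    return call_model(model, query)

def answer_with_fallback(query, validate):
    """Fallback pattern: try the cheap model first, escalate if validation fails."""
    draft = call_model("small-model", query)
    return draft if validate(draft) else call_model("large-model", query)
```

The pipeline pattern is just these calls chained in sequence, with each stage's output feeding the next stage's prompt.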
When this matters in practice:
- The router pattern alone can reduce costs by 50-80% in applications where most queries are simple.
- Multi-model architectures add complexity. Only adopt them when the cost savings justify the engineering investment.
What Comes Next
This post covered how to choose the right model for your use case. The next post in this series explores AI Agents and Tool Use: how models move beyond text generation to take actions, use tools, and orchestrate multi-step workflows.
Closing Thoughts
Model selection is not a permanent decision. It is a hypothesis you test and revise. Start with a model that is easy to use (typically a cloud API with a mid-tier model), validate that it meets your quality requirements, and optimize from there.
The most common mistake is over-indexing on model capability and under-indexing on cost and operational complexity. For most applications, the frontier model is not necessary. A mid-tier or even small model, combined with good prompting and RAG, delivers results that are good enough at a fraction of the cost.
Choose the smallest, cheapest model that meets your quality bar. Save the frontier models for the tasks that truly need them.