Choosing the Right Model
This post provides a framework for choosing between models. It draws on concepts from across the series, particularly How LLMs Actually Work, Prompting and Inference, and Fine-Tuning and Model Customization.
Introduction
The previous posts covered what models are, how they work, and how to customize their behavior. This post answers the practical question: which model should you use?
The answer is never just “the biggest one” or “the cheapest one.” It depends on the task, the constraints, the budget, and the deployment environment. A model that is perfect for one use case might be wasteful or insufficient for another. This post provides a framework for making that decision.
The Model Landscape
The AI model ecosystem has expanded rapidly. Choosing a model now involves decisions across several dimensions:
- Size: Large (70B+ parameters) vs. small (1-13B parameters)
- Access: Closed/proprietary (API access only) vs. open (weights available for download)
- Deployment: Cloud API vs. self-hosted vs. on-device
- Specialization: General-purpose vs. domain-specific
- Provider: OpenAI, Anthropic, Google, Meta, Mistral, and others
Each dimension involves tradeoffs. Understanding those tradeoffs is the foundation for making good choices.
LLMs vs. SLMs: Size Matters, But Not Always
Large Language Models (LLMs)
Models with 70B+ parameters. Examples: GPT-4o, Claude Opus/Sonnet, Gemini Pro.
Strengths:
- Broad knowledge across many domains
- Strong reasoning and multi-step problem solving
- Better at following complex instructions
- Better at nuanced, ambiguous tasks
- State-of-the-art performance on benchmarks
Weaknesses:
- Higher cost per token
- Higher latency (slower responses)
- Require significant compute (cloud-only for most)
- Overkill for simple tasks
Small Language Models (SLMs)
Models with 1-13B parameters. Examples: Phi-3, Gemma 2, Llama 3 8B, Mistral 7B.
Strengths:
- Lower cost per token (often 10-50x cheaper than frontier LLMs)
- Lower latency
- Can run on consumer hardware or edge devices
- Can run without internet connectivity
- Easier to fine-tune
- Sufficient for many focused tasks
Weaknesses:
- Less capable at complex reasoning
- Narrower knowledge
- Worse at following complex or nuanced instructions
- May need fine-tuning to match LLM quality on specific tasks
When to Use Which
| Task | Recommended Size | Reasoning |
|---|---|---|
| Complex code generation | LLM | Requires broad knowledge and multi-step reasoning |
| Customer support chatbot (general) | LLM or mid-tier | Needs to handle diverse, unpredictable queries |
| Sentiment classification | SLM | Focused task, well-defined output |
| Text summarization | Mid-tier or SLM | Depends on content complexity |
| On-device autocomplete | SLM | Latency and connectivity constraints |
| Data extraction from forms | SLM | Structured task, consistent format |
| Research and analysis | LLM | Requires synthesis across broad knowledge |
| Translation | Mid-tier | Well-defined task, but nuance matters |
When this matters in practice:
- Start with a frontier LLM to establish a quality baseline. Then test whether a smaller model produces acceptable results for your specific task. If it does, you save money and latency.
- A fine-tuned SLM can match or exceed a general-purpose LLM on a narrow task. If your use case is focused, this is often the best path.
- The “mid-tier” sweet spot (Claude Sonnet, GPT-4o mini) handles a wide range of tasks well at moderate cost. This is where most production applications land.
Open vs. Closed Models
Closed Models (Proprietary)
Models accessed only through a provider’s API. You cannot download the weights, modify the model, or run it on your own infrastructure.
Examples: GPT-4o, Claude, Gemini Pro
Advantages:
- Best-in-class performance (frontier models are typically closed)
- No infrastructure to manage
- Continuous improvements from the provider
- Enterprise support and SLAs
Disadvantages:
- Vendor lock-in
- Data sent to a third party (privacy/compliance concerns)
- No control over model behavior changes
- Usage-based pricing can be unpredictable at scale
- Provider can deprecate or change the model
Open Models
Models with publicly available weights that you can download, inspect, modify, and deploy.
Examples: Llama 3, Mistral, Gemma, Phi-3, Qwen
Advantages:
- Full control over deployment and data
- No per-token API costs (you pay for infrastructure instead)
- Can fine-tune without restrictions
- No vendor lock-in
- Data stays on your infrastructure (important for regulated industries)
- Community ecosystem of tools, fine-tunes, and optimizations
Disadvantages:
- Generally lower performance than frontier closed models
- You manage the infrastructure (GPUs, scaling, monitoring)
- You handle security, patching, and updates
- Requires ML engineering expertise for deployment and optimization
The Open Model Ecosystem
Open models have matured rapidly. The current landscape:
| Model Family | Provider | Notable Sizes | Strengths |
|---|---|---|---|
| Llama 3 | Meta | 8B, 70B, 405B | Strong general performance, large community |
| Mistral | Mistral AI | 7B, 8x7B (Mixtral), Large | Efficient, good multilingual support |
| Gemma 2 | Google | 2B, 9B, 27B | Strong for its size, good for on-device |
| Phi-3/4 | Microsoft | 3.8B, 14B | Punches above weight class, good for edge |
| Qwen 2.5 | Alibaba | 0.5B to 72B | Strong multilingual, competitive performance |
How to Choose
| Situation | Recommendation |
|---|---|
| Need best possible quality, no infrastructure team | Closed model API |
| Regulated industry, data cannot leave your network | Open model, self-hosted |
| High volume, cost is primary concern | Open model or mid-tier closed |
| Rapid prototyping, unknown requirements | Closed model API (fastest to start) |
| Need to fine-tune with full control | Open model |
| On-device or edge deployment | Open SLM |
When this matters in practice:
- Many teams use a hybrid approach: closed model APIs for development and prototyping, open models for production at scale or where data privacy matters.
- Open models require infrastructure investment that closed models do not. Factor in GPU costs, engineering time, and operational overhead when comparing.
- The gap between open and closed models is narrowing. Evaluate based on current benchmarks and your specific use case, not assumptions from a year ago.
Deployment Options
Cloud APIs
The simplest deployment model. Send requests to a provider’s API, receive responses. No infrastructure to manage.
Best for: Teams without ML infrastructure. Applications where latency of 0.5-2 seconds per request is acceptable. Variable or unpredictable workloads.
Cost model: Pay per token. Predictable per-request cost, but total cost scales linearly with usage.
Self-Hosted Cloud
Run open models on your own cloud infrastructure (AWS, GCP, Azure) using GPU instances.
Best for: High-volume applications where per-token API costs exceed infrastructure costs. Organizations with data residency requirements. Teams that need full control over the model and infrastructure.
Cost model: Fixed infrastructure cost (GPU instances by the hour) regardless of usage. Economical at high volume, expensive at low volume.
On-Device / Edge
Run small models directly on end-user devices (phones, laptops, IoT devices).
Best for: Offline-capable applications. Latency-sensitive use cases. Privacy-critical applications where data cannot leave the device.
Constraints: Limited to small models (typically under 7B parameters). Limited by device memory and compute. Battery impact on mobile devices.
Tools: Ollama (desktop), llama.cpp, MLX (Apple Silicon), MediaPipe (mobile), ONNX Runtime.
Cost Comparison Framework
Cost is often the deciding factor. Here is a framework for comparison:
API Pricing (Per Million Tokens)
| Tier | Input | Output | Example Models |
|---|---|---|---|
| Frontier | $10-15 | $30-75 | GPT-4o, Claude Opus |
| Mid-tier | $1-3 | $5-15 | Claude Sonnet |
| Efficient | $0.10-0.80 | $0.40-4 | Claude Haiku, GPT-4o mini |
Self-Hosted Cost Estimate
Running a 70B parameter model requires approximately:
- 2x A100 80GB GPUs (~$4-6/hour on cloud)
- At sustained throughput, this can serve thousands of requests per hour
- Break-even vs. API pricing typically occurs at 10,000-50,000+ requests per day
Decision Rule
Calculate your monthly API cost at projected volume. If it exceeds $2,000-5,000/month, evaluate self-hosting economics. Below that, the operational overhead of self-hosting rarely justifies the savings.
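The decision rule above amounts to a back-of-the-envelope calculation. Here is a minimal sketch; the prices and request profile are illustrative assumptions drawn from the tables above, not live quotes:

```python
# Sketch: compare monthly API cost vs. self-hosted GPU cost at a given volume.
# All prices here are illustrative assumptions, not current provider quotes.

def api_monthly_cost(requests_per_day, input_tokens, output_tokens,
                     price_in_per_m, price_out_per_m):
    """Monthly API cost in dollars for a given per-request token profile."""
    per_request = (input_tokens * price_in_per_m +
                   output_tokens * price_out_per_m) / 1_000_000
    return requests_per_day * 30 * per_request

def self_hosted_monthly_cost(gpu_hourly_rate, gpu_count=2):
    """Fixed monthly cost of always-on GPU instances (usage-independent)."""
    return gpu_hourly_rate * gpu_count * 24 * 30

# Example: 20,000 requests/day, 1,000 input + 500 output tokens each,
# at mid-tier pricing ($3 in / $15 out per million tokens).
api = api_monthly_cost(20_000, 1_000, 500, 3.0, 15.0)      # $6,300/month
hosted = self_hosted_monthly_cost(gpu_hourly_rate=2.5)     # $3,600/month for 2 GPUs

print(f"API: ${api:,.0f}/month, self-hosted: ${hosted:,.0f}/month")
```

At this volume self-hosting is cheaper on paper, but remember to add the engineering and operational overhead before calling it a win.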
Evaluating Models for Your Use Case
Benchmarks and leaderboards are a starting point, not a decision. Here is a practical evaluation process:
- Define your task. What specific inputs and outputs does your application need?
- Create a test set. 50-100 representative input/output pairs from your actual use case.
- Test 3-4 models. Include at least one frontier, one mid-tier, and one open model.
- Measure what matters. Accuracy, format compliance, latency, cost per request.
- Test edge cases. How does each model handle unusual inputs, ambiguity, or missing information?
- Calculate total cost. Include tokens, infrastructure, engineering time, and operational overhead.
- Choose the cheapest model that meets your quality threshold. Not the best model. The cheapest one that is good enough.
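The core of this process (steps 2 through 7) can be sketched as a small evaluation loop. `call_model` is a placeholder you would replace with real API calls; the model names, per-request costs, test set, and 90% threshold are all illustrative assumptions:

```python
# Sketch of the evaluation loop: score each candidate model on a test set,
# then pick the cheapest one that clears the quality bar.

def call_model(model_name, prompt):
    # Stand-in for a real API call; returns canned answers for the demo.
    canned = {"frontier": "positive", "mid-tier": "positive", "small": "negative"}
    return canned[model_name]

def evaluate(model_name, cost_per_request, test_set, threshold=0.9):
    """Score a model on (input, expected) pairs; report accuracy and pass/fail."""
    correct = sum(call_model(model_name, x) == expected for x, expected in test_set)
    accuracy = correct / len(test_set)
    return {"model": model_name, "accuracy": accuracy,
            "cost": cost_per_request, "passes": accuracy >= threshold}

# In practice this would be 50-100 representative pairs, not two.
test_set = [("Great product!", "positive"), ("Loved it", "positive")]
results = [evaluate(m, c, test_set)
           for m, c in [("frontier", 0.05), ("mid-tier", 0.01), ("small", 0.002)]]

# The decision rule: the cheapest model that meets the quality threshold.
passing = [r for r in results if r["passes"]]
best = min(passing, key=lambda r: r["cost"]) if passing else None
```

The same loop extends naturally to the other metrics listed above (latency, format compliance) by adding fields to the result dict.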
When this matters in practice:
- A model that scores 95% on your test set at $0.002 per request is usually better than one that scores 98% at $0.05 per request. The extra 3% rarely justifies a 25x cost increase.
- Evaluate on YOUR data. A model that leads a public benchmark might rank differently on your specific domain.
- Re-evaluate quarterly. Model capabilities and pricing change quickly. The right choice today might not be the right choice in six months.
Multi-Model Architectures
Many production systems use multiple models for different tasks:
Router pattern: A fast, cheap model classifies the incoming request and routes it to the appropriate model. Simple queries go to an SLM. Complex queries go to an LLM.
Pipeline pattern: Different models handle different stages. An SLM extracts data, an LLM reasons about it, another SLM formats the output.
Fallback pattern: Start with a cheaper model. If confidence is low or the output fails validation, retry with a more capable model.
These patterns optimize cost without sacrificing quality where it matters.
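A minimal sketch of the router and fallback patterns, assuming stand-in components throughout: the word-count classifier, model names, and validation hook are placeholders for whatever your system actually uses.

```python
# Sketch: router and fallback patterns with placeholder components.

def classify_complexity(query):
    # Stand-in router heuristic; a real system might use a small classifier model.
    return "complex" if len(query.split()) > 20 or "why" in query.lower() else "simple"

def call_model(model_name, query):
    # Placeholder for a real API call.
    return f"[{model_name}] answer to: {query}"

def route(query):
    """Router pattern: cheap model for simple queries, LLM for complex ones."""
    model = "small-model" if classify_complexity(query) == "simple" else "large-model"
    return call_model(model, query)

def answer_with_fallback(query, validate):
    """Fallback pattern: try the cheap model first, escalate if validation fails."""
    draft = call_model("small-model", query)
    return draft if validate(draft) else call_model("large-model", query)
```

The pipeline pattern is just these calls chained in sequence, with each stage's output feeding the next stage's prompt.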
When this matters in practice:
- The router pattern alone can reduce costs by 50-80% in applications where most queries are simple.
- Multi-model architectures add complexity. Only adopt them when the cost savings justify the engineering investment.
What Comes Next
This post covered how to choose the right model for your use case. The next post in this series explores AI Agents and Tool Use: how models move beyond text generation to take actions, use tools, and orchestrate multi-step workflows.
Closing Thoughts
Model selection is not a permanent decision. It is a hypothesis you test and revise. Start with a model that is easy to use (typically a cloud API with a mid-tier model), validate that it meets your quality requirements, and optimize from there.
The most common mistake is over-indexing on model capability and under-indexing on cost and operational complexity. For most applications, the frontier model is not necessary. A mid-tier or even small model, combined with good prompting and RAG, delivers results that are good enough at a fraction of the cost.
Choose the smallest, cheapest model that meets your quality bar. Save the frontier models for the tasks that truly need them.