
Fine-Tuning and Model Customization

ai · fine-tuning · llm · ai-fundamentals

This post covers fine-tuning and model customization. It builds on the prompting strategies from Prompting and Inference and the RAG concepts from RAG: Teaching AI What It Doesn’t Know, providing a framework for deciding when each approach is appropriate.

Introduction

The previous two posts covered two ways to shape model behavior: prompting (changing the input) and RAG (providing external context). This post covers the third: fine-tuning (changing the model itself).

Fine-tuning takes a pre-trained model and trains it further on a specific dataset. This modifies the model’s parameters, changing how it behaves at a fundamental level. The result is a model that is better at specific tasks, more aligned with a particular style, or more knowledgeable about a specialized domain.

Fine-tuning is powerful, but it is also the most expensive and complex customization approach. Understanding when it is the right choice, and when prompting or RAG would serve better, is one of the most important decisions in applied AI.


What Fine-Tuning Actually Does

When you fine-tune a model, you are resuming the training process with new data. The model’s parameters (weights and biases) are adjusted to minimize prediction error on your specific dataset.

This is different from prompting and RAG in a fundamental way:

  • Prompting changes what the model sees. The model itself is unchanged.
  • RAG gives the model reference material. The model itself is unchanged.
  • Fine-tuning changes the model itself. Its parameters are modified.

An analogy: prompting is like giving someone instructions for a task. RAG is like giving them a reference book. Fine-tuning is like sending them through a training program that changes how they approach the work.

After fine-tuning, the model behaves differently even without special prompts or retrieved context. The new behavior is baked into the parameters.
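The core mechanic, resuming gradient descent so parameters shift to reduce error on new data, can be shown with a toy model. This is a deliberately tiny illustration (one weight, one bias, squared error), not how LLM training is actually implemented:

```python
# Toy illustration: fine-tuning resumes gradient descent, nudging
# existing parameters to reduce prediction error on new data.

def sgd_step(w, b, x, y_true, lr=0.1):
    """One update of a 1-parameter linear model y = w*x + b
    against squared error. Returns the adjusted parameters."""
    y_pred = w * x + b
    grad_w = 2 * (y_pred - y_true) * x   # d(error)/dw
    grad_b = 2 * (y_pred - y_true)       # d(error)/db
    return w - lr * grad_w, b - lr * grad_b

# "Pre-trained" parameters...
w, b = 1.0, 0.0
# ...trained further on a new example the base model gets wrong.
for _ in range(50):
    w, b = sgd_step(w, b, x=2.0, y_true=7.0)

print(w * 2.0 + b)  # prediction is now 7.0
```

The same loop, scaled up to billions of parameters and batches of token sequences, is what a fine-tuning run does. The changed behavior persists because the parameters themselves changed.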


Approaches to Fine-Tuning

Full Fine-Tuning

All model parameters are updated during training. This provides the most flexibility but requires the most resources.

Requirements:

  • The full model loaded in memory (a 70B parameter model needs hundreds of GB of GPU memory)
  • A training dataset (typically thousands to tens of thousands of examples)
  • Significant compute time (hours to days on high-end GPUs)

When to use: When you need deep behavioral changes across a wide range of tasks and have the infrastructure to support it. This is rare outside of organizations with dedicated ML teams.

LoRA (Low-Rank Adaptation)

LoRA is the most practical fine-tuning approach for most teams. Instead of updating all parameters, LoRA adds small trainable layers (adapters) alongside the frozen base model. Only the adapter weights are trained.

Advantages:

  • Uses a fraction of the memory (often 10-20% of full fine-tuning)
  • Training is faster and cheaper
  • The adapter is small (megabytes vs. gigabytes) and can be swapped
  • Multiple adapters can be created for different tasks using the same base model
  • Results are often comparable to full fine-tuning

Example: You want a customer support model and a technical writing model. Fine-tune two LoRA adapters on the same base model. Swap adapters based on the use case. The base model is loaded once.

QLoRA takes this further by quantizing the base model (reducing its precision from 16-bit to 4-bit) while training the LoRA adapters in higher precision. This allows fine-tuning large models on consumer GPUs.
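The memory savings follow from simple arithmetic: a full d_out × d_in weight matrix is replaced, for training purposes, by two thin matrices B (d_out × r) and A (r × d_in). A back-of-envelope sketch, using a 4096 × 4096 projection as an illustrative layer size:

```python
# Why LoRA is cheap: count trainable parameters for one weight matrix
# under full fine-tuning vs. a rank-r LoRA adapter (delta_W = B @ A).

def lora_trainable_params(d_out, d_in, rank):
    full = d_out * d_in            # parameters updated by full fine-tuning
    lora = rank * (d_out + d_in)   # parameters in the B and A adapters
    return full, lora

full, lora = lora_trainable_params(d_out=4096, d_in=4096, rank=8)
print(full, lora, f"{lora / full:.2%}")  # 16777216 65536 0.39%
```

At rank 8 the adapter holds well under 1% of the layer's parameters, which is why adapter files are megabytes rather than gigabytes and why several can share one base model.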

RLHF (Reinforcement Learning from Human Feedback)

RLHF is not a technique you typically apply yourself. It is the process model providers use to align base models into helpful assistants. The process:

  1. Human evaluators compare multiple model outputs for the same input
  2. A reward model is trained on those preferences
  3. The LLM is trained to maximize the reward model’s scores

RLHF is how models learn to be helpful rather than just predictive. It is why ChatGPT and Claude behave like assistants rather than autocomplete engines.

DPO (Direct Preference Optimization) is a newer alternative that simplifies the RLHF process by eliminating the separate reward model. It achieves similar results with less infrastructure.

When this matters in practice:

  • For most teams, LoRA is the right approach. It balances capability with practicality.
  • Full fine-tuning is for organizations with dedicated ML infrastructure and specific requirements that LoRA cannot meet.
  • You do not need to implement RLHF yourself. Use models that have already been aligned via RLHF, and fine-tune from there.

Preparing Training Data

Fine-tuning quality depends entirely on training data quality. The format depends on the task:

Instruction Fine-Tuning

For teaching the model to follow instructions in a specific way:

{
  "messages": [
    {"role": "system", "content": "You are a medical coding assistant."},
    {"role": "user", "content": "Patient presents with acute bronchitis."},
    {"role": "assistant", "content": "ICD-10: J20.9 - Acute bronchitis, unspecified"}
  ]
}

Completion Fine-Tuning

For teaching the model to continue text in a specific style:

{
  "prompt": "Q3 Revenue Summary:\n",
  "completion": "Total revenue reached $4.2M, representing a 15% increase over Q2..."
}

Data Requirements

Dataset Size           | Expected Outcome
50-100 examples        | Minimal behavior change. May work for simple format/style adjustments.
500-1,000 examples     | Noticeable improvement for focused tasks.
1,000-10,000 examples  | Strong performance for the target task.
10,000+ examples       | Diminishing returns unless the task is very complex or diverse.

Data Quality Guidelines

  • Consistent format. Every example should follow the same structure.
  • Representative distribution. Your training data should reflect the real-world distribution of inputs and outputs.
  • High-quality outputs. The model will learn to produce what you show it. Bad examples produce bad behavior.
  • Diverse inputs. Cover the range of inputs the model will encounter. Edge cases in training prevent edge case failures in production.
  • Clean data. Remove duplicates, fix errors, and ensure accuracy. Fine-tuning amplifies whatever is in the data, including mistakes.
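A minimal cleaning pass reflecting these guidelines, consistent structure and no exact duplicates, can be sketched in a few lines. The format check here (chat examples must end with an assistant turn) is one illustrative rule, not a complete validator:

```python
import json

# Sketch of a data-cleaning pass: enforce a consistent structure
# and drop exact duplicates before fine-tuning.

def clean_dataset(examples):
    """Keep well-formed chat examples, removing exact duplicates."""
    seen, cleaned = set(), []
    for ex in examples:
        msgs = ex.get("messages")
        if not msgs or msgs[-1].get("role") != "assistant":
            continue  # inconsistent format: must end with an assistant reply
        key = json.dumps(ex, sort_keys=True)  # canonical form for dedup
        if key in seen:
            continue  # exact duplicate
        seen.add(key)
        cleaned.append(ex)
    return cleaned

raw = [
    {"messages": [{"role": "user", "content": "hi"},
                  {"role": "assistant", "content": "hello"}]},
    {"messages": [{"role": "user", "content": "hi"},
                  {"role": "assistant", "content": "hello"}]},  # duplicate
    {"messages": [{"role": "user", "content": "dangling turn"}]},  # malformed
]
print(len(clean_dataset(raw)))  # 1
```

Real pipelines add more checks (near-duplicate detection, length limits, manual spot review), but even this much catches the errors fine-tuning would otherwise amplify.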

When this matters in practice:

  • Data preparation is usually the most time-consuming part of fine-tuning. Budget accordingly.
  • Start with a small, high-quality dataset and iterate. 200 excellent examples often beat 5,000 mediocre ones.
  • Generate training data with a more capable model and review it manually. This is a common and effective shortcut: use GPT-4 or Claude to generate draft examples, then have humans verify and correct them.

When to Fine-Tune (and When Not To)

This is the critical decision. Fine-tuning is expensive in time, compute, and maintenance. It should be a deliberate choice, not a default.

Fine-Tune When

  • Style consistency matters. You need the model to consistently match a specific voice, format, or communication style that prompting cannot reliably achieve.
  • Domain-specific behavior is required. The model needs to handle specialized terminology, workflows, or reasoning patterns that general-purpose models handle poorly.
  • Prompt engineering has hit its ceiling. You have optimized your prompts and the model still does not produce the quality you need.
  • Latency or cost matters. A fine-tuned smaller model can replace a larger model with good prompting, reducing both latency and cost per request.
  • You need consistent structured output. If the model needs to reliably produce a complex format, fine-tuning on examples of that format is more reliable than prompting alone.

Do Not Fine-Tune When

  • You need the model to know specific facts. Use RAG instead. Fine-tuning is bad at injecting specific, retrievable knowledge. The model might learn the facts, or it might not. RAG places the facts directly in context, so they are reliably available at inference time.
  • Your data changes frequently. Fine-tuning bakes information into the model at training time. If your product catalog changes weekly, fine-tuning cannot keep up. RAG can.
  • Better prompting would solve the problem. Always try prompting and few-shot examples before fine-tuning. Fine-tuning for something that prompting can handle wastes time and money.
  • You lack quality training data. Fine-tuning with bad data produces bad models. If you cannot produce a clean, representative dataset, do not fine-tune.
  • You are not prepared to maintain it. Fine-tuned models need retraining when the base model updates, when your requirements change, or when performance drifts. If you do not have a plan for ongoing maintenance, the fine-tuned model will degrade over time.

The Decision Sequence

Try these in order:

  1. Better prompting. Optimize your system prompt, add few-shot examples, use chain-of-thought. This is free and fast.
  2. RAG. If the model needs information it does not have, give it that information at query time.
  3. Fine-tuning. If the model needs to behave differently at a fundamental level, and prompting and RAG are not enough, fine-tune.
  4. Custom model. If none of the above work, you may need a model trained from scratch. This is extremely rare and extremely expensive.
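The sequence above can be encoded as a toy helper. The boolean inputs are illustrative questions you would answer for your own use case, not an API from any library:

```python
# The decision sequence as a toy helper: try each lever in order
# of increasing cost, stopping at the first one that suffices.

def customization_approach(prompting_sufficient,
                           needs_external_knowledge,
                           needs_behavior_change):
    if prompting_sufficient:
        return "better prompting"      # free and fast; always try first
    if needs_external_knowledge:
        return "RAG"                   # missing information, not behavior
    if needs_behavior_change:
        return "fine-tuning"           # behavior change prompting can't reach
    return "custom model"              # extremely rare, extremely expensive

print(customization_approach(False, True, False))  # RAG
```

The point is the ordering: each step is only worth its cost if the cheaper steps before it have genuinely failed.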

Fine-Tuning with Providers

Most model providers offer fine-tuning APIs:

Provider     | Models Available for Fine-Tuning          | Approach
OpenAI       | GPT-4o, GPT-4o mini, GPT-3.5 Turbo        | Upload data, API handles training
Anthropic    | Claude (select models)                    | Custom arrangements
Google       | Gemini models                             | Vertex AI fine-tuning
Together AI  | Open source models (Llama, Mistral, etc.) | Upload data, managed training
Self-hosted  | Any open source model                     | Full control, your infrastructure

Provider-managed fine-tuning is the simplest path. You upload training data, configure parameters, and the provider handles the infrastructure. The tradeoff is less control and potential vendor lock-in.

Self-hosted fine-tuning (using tools like Hugging Face, Axolotl, or LLaMA-Factory) provides full control but requires GPU infrastructure and ML engineering expertise.


Evaluating Fine-Tuned Models

After fine-tuning, you need to verify the model improved without breaking other capabilities.

Hold out a test set. Never evaluate on the same data you trained on. Split your data: 80-90% for training, 10-20% for evaluation.
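A held-out split is a few lines: shuffle once with a fixed seed, then carve off the evaluation fraction so no training example leaks into the test set. A minimal sketch:

```python
import random

# Hold out a test set: deterministic shuffle, then split so the
# evaluation examples never appear in the training set.

def train_test_split(examples, eval_fraction=0.1, seed=42):
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)         # reproducible shuffle
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]   # (train, eval)

train, evaluation = train_test_split(list(range(1000)))
print(len(train), len(evaluation))  # 900 100
```

Fixing the seed matters: it lets you rerun training experiments against the same held-out set and compare results fairly.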

Test the target task. Does the model perform better on the specific task you fine-tuned for? Measure with concrete metrics: accuracy, format compliance, similarity to reference outputs.
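Two of those metrics, exact-match accuracy against reference outputs and format compliance, can be sketched directly. The ICD-10-style regex here is illustrative for the medical-coding example earlier, not a complete validator:

```python
import re

# Two concrete evaluation metrics: exact-match accuracy against
# reference outputs, and format compliance checked with a regex.

def exact_match_accuracy(predictions, references):
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

def format_compliance(predictions, pattern=r"^ICD-10: [A-Z]\d{2}(\.\d+)?"):
    return sum(bool(re.match(pattern, p)) for p in predictions) / len(predictions)

preds = ["ICD-10: J20.9 - Acute bronchitis, unspecified", "Not sure."]
refs  = ["ICD-10: J20.9 - Acute bronchitis, unspecified", "ICD-10: J45.909"]
print(exact_match_accuracy(preds, refs), format_compliance(preds))  # 0.5 0.5
```

For free-form outputs where exact match is too strict, similarity to reference outputs (embedding similarity or LLM-as-judge scoring) fills the same role.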

Test for regression. Does the model still handle general tasks well? Fine-tuning can cause the model to “forget” capabilities it had before (catastrophic forgetting). Test a representative sample of general-purpose queries.

Compare to prompting baseline. Run the same test set through the base model with your best prompts. If the fine-tuned model does not outperform the prompted base model, fine-tuning was not worth the investment.

Test with real users. Automated metrics do not capture everything. Have real users interact with the fine-tuned model and provide feedback before deploying to production.


What Comes Next

This post covered when and how to fine-tune models. The next post in this series explores Choosing the Right Model: how to select between LLMs and SLMs, open and closed models, and cloud versus on-device deployment.


Closing Thoughts

Fine-tuning is a powerful tool that is often used too early. The most common mistake is reaching for fine-tuning before exhausting simpler approaches. Better prompts and RAG solve most problems. Fine-tuning solves the rest.

When fine-tuning is the right choice, LoRA makes it practical for most teams. The barrier is not compute or infrastructure. It is training data. If you can produce a clean, representative dataset of a few hundred to a few thousand examples, you can fine-tune a model that performs well on your specific task.

The key insight: prompting changes what the model sees, RAG changes what the model knows, and fine-tuning changes how the model behaves. Understanding which lever to pull, and in what order, is the skill that separates effective AI applications from expensive experiments.
