Prompting and Inference
This post covers prompting strategies and inference parameters in depth. It connects to concepts introduced in How LLMs Actually Work and provides context for understanding when to reach for fine-tuning instead.
Introduction
Prompting is how you communicate with a model. It is also the most accessible lever you have for controlling output quality. Before reaching for fine-tuning, RAG, or a different model, the first question should always be: can better prompting solve this?
Often, it can. The difference between a vague prompt and a well-structured one is often the difference between a useless response and a useful one. This post covers the strategies and parameters that give you control over what a model produces.
The Anatomy of a Prompt
A prompt sent to an LLM typically has several components:
System prompt: Sets the model’s role, behavior, constraints, and style for the entire conversation. This is processed before any user messages and shapes every response.
User message: The current input from the user. A question, instruction, or piece of content to process.
Assistant message: Previous responses from the model. In multi-turn conversations, the alternating user/assistant messages form the conversation history.
Context: Any additional information included to inform the response. Retrieved documents (in RAG), examples, data to analyze, or reference material.
The model processes all of these together as a single sequence of tokens. It does not treat the system prompt as fundamentally different from user messages at a technical level. But the convention of placing behavioral instructions in the system prompt helps organize the prompt and gives the model a consistent frame for the conversation.
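In code, this usually takes the shape of a list of role-tagged messages. A minimal sketch, assuming the common OpenAI-style `role`/`content` convention; check your provider's docs for the exact field names:

```python
# Sketch of assembling the full sequence the model will see:
# system prompt first, then conversation history, then the new input.
# The {"role": ..., "content": ...} shape is the common convention,
# not a specific library's API.

def build_messages(system_prompt, history, user_input):
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history)  # alternating user/assistant dicts
    messages.append({"role": "user", "content": user_input})
    return messages

msgs = build_messages(
    "You are a support agent for Acme.",
    [{"role": "user", "content": "Hi"},
     {"role": "assistant", "content": "Hello! How can I help?"}],
    "My order is late.",
)
```

The system prompt stays pinned at the front on every call, which is why its token cost recurs per request.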
System Prompts: Setting the Frame
A system prompt defines who the model is and how it should behave. It is the most underused tool in AI applications.
A weak system prompt:
You are a helpful assistant.
A strong system prompt:
You are a senior technical support agent for Acme Cloud Platform.
Your responsibilities:
- Answer questions about Acme's products and services
- Help users troubleshoot technical issues
- Escalate billing and account issues to the billing team
Rules:
- Only answer questions related to Acme's products
- If you do not know the answer, say so. Do not guess.
- Never share internal pricing or roadmap information
- Respond in the same language the user writes in
Tone: Professional but approachable. Concise. No filler.
The second prompt produces more consistent, appropriate, and constrained responses because it gives the model clear boundaries and expectations.
What to Include in a System Prompt
- Role: What the model is and is not
- Scope: What topics it should and should not address
- Constraints: Rules it must follow
- Tone and style: How it should communicate
- Output format: Structure expectations for responses
- Failure behavior: What to do when it does not know or cannot help
What to Keep Out of a System Prompt
- Information that changes per request (put this in the user message or context)
- Very long reference documents (use RAG instead)
- Contradictory instructions (the model will follow some and ignore others unpredictably)
When this matters in practice:
- A well-crafted system prompt can eliminate entire categories of bad output without any code changes.
- System prompts are included in every request, so they consume tokens on every call. Keep them as concise as possible without losing effectiveness. Test whether removing a line changes output quality. If not, remove it.
- Test your system prompt adversarially. Ask it questions outside its scope. Try to get it to break its own rules. Fix the gaps.
Few-Shot Prompting: Teaching by Example
Few-shot prompting includes examples of the desired input-output pattern directly in the prompt. Instead of explaining what you want, you show it.
Zero-shot (no examples):
Classify the following customer message as positive, negative,
or neutral:
"The product arrived on time and works great!"
Few-shot (with examples):
Classify customer messages as positive, negative, or neutral.
Message: "Love this product, will buy again!"
Classification: positive
Message: "Terrible experience, requesting a refund."
Classification: negative
Message: "The package arrived today."
Classification: neutral
Message: "The product arrived on time and works great!"
Classification:
Few-shot prompting works because the model recognizes the pattern from the examples and continues it. The more consistent and representative your examples, the more reliable the output.
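Assembling few-shot prompts by hand gets tedious and error-prone; a small helper keeps the formatting identical across examples. A sketch (the function name and prompt format are illustrative, not any particular library's API):

```python
def few_shot_prompt(instruction, examples, query):
    """Build a few-shot classification prompt. Consistent formatting
    across examples is what makes the pattern learnable."""
    lines = [instruction, ""]
    for message, label in examples:
        lines.append(f'Message: "{message}"')
        lines.append(f"Classification: {label}")
        lines.append("")
    lines.append(f'Message: "{query}"')
    lines.append("Classification:")
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Classify customer messages as positive, negative, or neutral.",
    [("Love this product, will buy again!", "positive"),
     ("Terrible experience, requesting a refund.", "negative"),
     ("The package arrived today.", "neutral")],
    "The product arrived on time and works great!",
)
```

Ending the prompt with the bare `Classification:` label nudges the model to complete the pattern rather than add commentary.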
Guidelines for Few-Shot Examples
- Include edge cases. If ambiguous inputs are common, show how they should be handled.
- Match the distribution. If 80% of real inputs are positive, do not make 80% of your examples negative.
- Be consistent. Use the same format in every example. The model will replicate formatting inconsistencies.
- Use 3-5 examples for most tasks. More examples use more tokens but often hit diminishing returns.
When this matters in practice:
- Few-shot prompting is the fastest way to get consistent output formatting without fine-tuning.
- For classification, extraction, and transformation tasks, few-shot examples are often more effective than lengthy natural language instructions.
- Each example consumes tokens. Balance the benefit of more examples against the cost and context window usage.
Chain-of-Thought: Making Models Reason
Chain-of-thought (CoT) prompting asks the model to show its reasoning before producing a final answer. This improves accuracy on tasks that require multi-step reasoning, math, logic, or complex analysis.
Without chain-of-thought:
Q: A store has 45 apples. They sell 12 in the morning and
receive a shipment of 30 in the afternoon. Then they sell 18
more. How many apples do they have?
A: 63
With chain-of-thought:
Q: A store has 45 apples. They sell 12 in the morning and
receive a shipment of 30 in the afternoon. Then they sell 18
more. How many apples do they have? Think step by step.
A: Let me work through this step by step:
1. Starting: 45 apples
2. Sell 12 in the morning: 45 - 12 = 33
3. Receive shipment of 30: 33 + 30 = 63
4. Sell 18 more: 63 - 18 = 45
The store has 45 apples.
The phrase “think step by step” (or similar) is often enough to trigger this behavior. For more complex tasks, you can provide a structured reasoning template.
When Chain-of-Thought Helps
- Math and calculations
- Multi-step logic problems
- Code debugging (reason about what each line does)
- Complex analysis where the conclusion depends on multiple factors
- Any task where showing work makes errors visible and correctable
When Chain-of-Thought Hurts
- Simple factual lookups (“What is the capital of France?”)
- Tasks where speed matters more than accuracy
- When the extra tokens in the reasoning are not worth the cost
When this matters in practice:
- Chain-of-thought adds output tokens (and cost). Use it when accuracy matters enough to justify the overhead.
- You can ask for reasoning in a structured format (“Reasoning: … Answer: …”) and then parse only the answer programmatically. The user does not need to see the reasoning.
- For auditing and debugging, chain-of-thought is invaluable. When the model gets something wrong, the reasoning shows you where it went off track.
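Separating the reasoning from the answer is straightforward in code. A sketch of parsing the "Reasoning: … Answer: …" format described above, assuming the model follows the requested template:

```python
import re

def extract_answer(model_output):
    """Pull just the final answer out of a 'Reasoning: ... Answer: ...'
    response, so the reasoning stays available for logging but is
    never shown to the user."""
    match = re.search(r"Answer:\s*(.+)", model_output, re.DOTALL)
    return match.group(1).strip() if match else None

output = (
    "Reasoning: 45 - 12 = 33, then 33 + 30 = 63, then 63 - 18 = 45.\n"
    "Answer: 45"
)
print(extract_answer(output))  # → 45
```

Returning `None` when the template is missing lets the caller retry or fall back instead of showing raw reasoning to the user.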
Structured Output: Getting Predictable Formats
For programmatic use, you often need the model to return data in a specific format: JSON, XML, CSV, or a custom structure. Structured output prompting achieves this.
Basic approach:
Extract the following information from the customer message
and return it as JSON:
- sentiment (positive, negative, neutral)
- topic (billing, technical, shipping, general)
- urgency (low, medium, high)
Message: "My order #12345 hasn't arrived and I need it by
Friday for a presentation!"
JSON:
With schema definition:
Return a JSON object matching this schema:
{
"sentiment": "positive" | "negative" | "neutral",
"topic": "billing" | "technical" | "shipping" | "general",
"urgency": "low" | "medium" | "high",
"order_id": "string or null"
}
Most model APIs now support structured output modes or JSON mode that constrain the model to produce valid JSON. Use these when available. They are more reliable than relying on prompt instructions alone.
Tool Use / Function Calling
A related capability is function calling (covered in depth in AI Agents and Tool Use). The model returns a structured object that specifies which function to call and with what arguments. The system executes the function and returns the result to the model.
This is the bridge between generating text and taking action, and it relies on the same structured output capabilities.
When this matters in practice:
- If you are parsing model output in code, always use the provider’s structured output mode when available. Prompt-based JSON generation occasionally produces invalid JSON.
- Define your schema explicitly. Do not rely on the model inferring the structure from examples alone.
- Test with edge cases. What happens when a field should be null? When the input does not contain the requested information? Handle these in your schema and instructions.
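A defensive parser makes those failure modes explicit. A sketch against the extraction schema shown earlier; `parse_extraction` and the `ALLOWED` table are illustrative names, not a library API:

```python
import json

# Allowed values from the schema defined in the prompt.
ALLOWED = {
    "sentiment": {"positive", "negative", "neutral"},
    "topic": {"billing", "technical", "shipping", "general"},
    "urgency": {"low", "medium", "high"},
}

def parse_extraction(raw):
    """Parse and validate model output. Returns None on invalid JSON
    or out-of-range values so the caller can retry or fall back."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    for field, allowed in ALLOWED.items():
        if data.get(field) not in allowed:
            return None
    return data

good = '{"sentiment": "negative", "topic": "shipping", "urgency": "high", "order_id": "12345"}'
bad = '{"sentiment": "angry", "topic": "shipping", "urgency": "high"}'
```

Even with a provider's JSON mode enabled, validating values against the schema catches the model inventing labels outside the allowed set.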
Inference Parameters: Controlling Generation
When a model generates text, several parameters control the process. Understanding these gives you fine-grained control over output behavior.
Temperature
Temperature controls the randomness of token selection. Most APIs accept values from 0 to 2, where 0 selects the single most probable token at every step (greedy decoding, effectively deterministic).
| Temperature | Behavior | Use Case |
|---|---|---|
| 0 | Always picks the most probable token | Code generation, factual Q&A, data extraction |
| 0.1-0.3 | Slight variation, mostly predictable | Business writing, summarization |
| 0.5-0.7 | Balanced creativity and coherence | General conversation, explanations |
| 0.8-1.0 | More creative, less predictable | Brainstorming, creative writing |
| 1.0+ | High variation, risk of incoherence | Experimental, rarely useful in production |
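The mechanics are simple enough to sketch directly: divide the logits by the temperature before the softmax, then sample. A toy illustration over a handful of tokens, not any provider's actual implementation:

```python
import math, random

def sample_with_temperature(logits, temperature, rng=random):
    """Scale logits by 1/temperature, softmax, then sample.
    Temperature 0 degenerates to argmax (greedy decoding)."""
    if temperature == 0:
        return max(logits, key=logits.get)
    scaled = {tok: lg / temperature for tok, lg in logits.items()}
    m = max(scaled.values())  # subtract max for numerical stability
    weights = {tok: math.exp(lg - m) for tok, lg in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}
    r = rng.random()
    cum = 0.0
    for tok, p in probs.items():
        cum += p
        if r <= cum:
            return tok
    return tok  # guard against floating-point rounding

logits = {"the": 2.0, "a": 1.0, "banana": -1.0}
```

Low temperatures sharpen the distribution toward the top token; high temperatures flatten it, which is why outputs get more varied and eventually incoherent.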
Top-p (Nucleus Sampling)
Top-p limits selection to the smallest set of tokens whose cumulative probability reaches the threshold. At top-p 0.9, the model samples only from the most probable tokens that together account for at least 90% of the probability mass.
Lower top-p values produce more focused output. Higher values allow more diversity. Most applications use top-p between 0.8 and 1.0.
Top-k
Top-k limits selection to the k most probable tokens. At top-k 50, only the 50 most likely next tokens are candidates, regardless of their probabilities.
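Both filters can be sketched as operations on a token-probability distribution. A toy illustration; real implementations operate on full-vocabulary logit tensors:

```python
def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens, then renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p (nucleus sampling), then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cum += prob
        if cum >= p:
            break
    total = sum(pr for _, pr in kept)
    return {tok: pr / total for tok, pr in kept}

probs = {"the": 0.5, "a": 0.25, "banana": 0.125, "qux": 0.125}
```

Note the difference: top-p adapts the candidate set to the shape of the distribution (few tokens when the model is confident, more when it is not), while top-k keeps a fixed count regardless.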
Max Tokens
Max tokens caps the length of the model’s response. If the model has not finished its response when the limit is reached, the output is truncated mid-sentence. Set this high enough for your expected output but low enough to prevent runaway generation and cost.
Stop Sequences
Stop sequences tell the model to stop generating when a specific string appears. Useful for structured output (stop at the closing bracket) or for preventing the model from continuing past the desired endpoint.
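APIs apply stop sequences server-side, but a client-side guard is a one-liner and useful when post-processing streamed or cached output. A sketch:

```python
def truncate_at_stop(text, stop_sequences):
    """Cut generated text at the earliest stop sequence, mirroring
    what the API does server-side."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

print(truncate_at_stop("Answer: 42\nQ: next question", ["\nQ:"]))  # → Answer: 42
```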
Practical Defaults
For most applications, start with:
- Temperature: 0 for deterministic tasks, 0.5 for conversational
- Top-p: 1.0 (let temperature do the work)
- Max tokens: Set based on expected output length plus buffer
Avoid changing temperature and top-p simultaneously. They interact in non-obvious ways. Pick one and tune it.
When this matters in practice:
- For production applications that need consistency (classification, extraction, structured output), use temperature 0.
- For user-facing conversations where variety matters, temperature 0.5-0.7 prevents the model from giving identical responses to similar questions.
- Max tokens is a safety net, not a quality control. The model does not write better because you set a lower limit. It just stops sooner.
Advanced Prompting Patterns
Role Prompting
Assigning the model a specific expert role can improve output quality for specialized tasks:
You are a senior database administrator with 15 years of
experience in PostgreSQL performance tuning. A junior developer
has asked you the following question. Explain clearly and
provide specific, actionable recommendations.
The model draws on patterns from its training data associated with that expertise. This is not guaranteed to produce expert-quality output, but it often improves specificity and depth.
Prompt Chaining
Break complex tasks into a sequence of simpler prompts, where each step’s output feeds into the next:
- Extract key information from a document
- Analyze the extracted information for patterns
- Generate recommendations based on the analysis
Each step uses a focused prompt that does one thing well. This is more reliable than a single prompt that tries to do everything at once.
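The three steps above can be sketched as a simple pipeline. `call_model` is a stub standing in for whatever client you use, so the flow is runnable as-is:

```python
def call_model(prompt):
    # Placeholder for a real API call.
    return f"[model output for: {prompt[:40]}...]"

def analyze_document(document):
    # Step 1: extraction, with a narrow prompt.
    facts = call_model(f"Extract the key facts from:\n{document}")
    # Step 2: analysis, operating only on step 1's output.
    patterns = call_model(f"Identify patterns in these facts:\n{facts}")
    # Step 3: recommendations, grounded in step 2's output.
    return call_model(f"Recommend actions based on:\n{patterns}")
```

Because each step's output is plain text you control, you can log, validate, or cache the intermediate results between steps.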
Self-Consistency
For tasks where accuracy matters, run the same prompt multiple times with temperature > 0 and take the majority answer. If 4 out of 5 runs produce the same answer, confidence is high.
This trades cost for reliability. It is appropriate for high-stakes decisions, not routine queries.
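A sketch of the voting logic, with a stubbed model so it runs as-is; the vote share doubles as a rough confidence signal:

```python
from collections import Counter

def self_consistent_answer(ask, prompt, runs=5):
    """Run the same prompt several times (with temperature > 0) and
    return the majority answer plus its vote share."""
    answers = [ask(prompt) for _ in range(runs)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / runs

# Stub model that answers correctly 4 times out of 5.
fake_outputs = iter(["45", "45", "63", "45", "45"])
answer, confidence = self_consistent_answer(lambda p: next(fake_outputs),
                                            "How many apples?")
```

This works best when answers are short and canonical (a number, a label); free-form text rarely repeats verbatim, so normalize answers before counting.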
Debugging Prompts
When a prompt produces bad output, diagnose systematically:
- Is the instruction clear? Remove ambiguity. Be specific about what you want and do not want.
- Is the context sufficient? Does the model have the information it needs to produce a good answer?
- Are there conflicting instructions? The model cannot follow two contradictory rules simultaneously.
- Is the task too complex for one prompt? Consider breaking it into a chain.
- Is the model the right size? Smaller models struggle with complex reasoning. If prompting is not working, a more capable model might be the answer.
When adjusting prompts, change one thing at a time and test with multiple inputs. A prompt that works for one example might fail for another.
What Comes Next
This post covered how to communicate effectively with LLMs through prompting and inference parameters. The next post in this series explores Fine-Tuning and Model Customization: when prompting is not enough and you need to change the model’s behavior at a deeper level.
Closing Thoughts
Prompting is the highest-leverage skill in applied AI. It requires no infrastructure, no training data, and no engineering beyond editing text. Yet it is often treated as an afterthought.
The pattern is consistent: teams build complex AI systems, get disappointing results, and look at the prompt last. Inverting that order saves time and money. Start with the prompt. Get it right. Then add complexity only when the prompt alone is not enough.
Good prompting is not about tricks or magic phrases. It is about clear communication: telling the model what you want, showing it what good output looks like, giving it the context it needs, and controlling the parameters that shape generation. The same skills that make you a clear communicator with humans make you effective at prompting.