Few-Shot Prompting: Teaching LLMs by Example

How few-shot prompting and structured output make LLM results predictable and reliable in production environments.

Jean-Pierre Broeders

Freelance DevOps Engineer

February 27, 2026 · 7 min. read


Zero-shot prompts work surprisingly well for simple tasks. But the moment output needs to follow a specific format or requires domain knowledge, they fall apart. The model guesses, and that guess isn't always right.

Few-shot prompting fixes this by giving the model concrete examples. No need to explain what to do — just show it.

Why Examples Beat Instructions

Take a classification task. A support ticket needs a category: billing, technical, feature-request, or other.

The zero-shot approach:

Classify the following support ticket into one of these categories:
billing, technical, feature-request, other.

Ticket: "My invoice is wrong, there's a double charge."

Does this work? Usually. But the output format is unpredictable. Sometimes Billing, sometimes billing, sometimes a paragraph explaining the reasoning. In a pipeline expecting JSON, that breaks immediately.

With few-shot:

Classify support tickets. Return only the category, lowercase.

Ticket: "I can't log in since the update"
Category: technical

Ticket: "Can you add dark mode?"
Category: feature-request

Ticket: "My invoice is wrong, there's a double charge."
Category:

Now the model gets the format. No explanation needed. The examples say everything.
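In code, this pattern is just string templating. A minimal sketch (the helper name and example list are illustrative, not part of any SDK) that assembles the few-shot prompt from example pairs:

```python
# Hypothetical helper: build a few-shot classification prompt
# from (ticket, category) example pairs.
EXAMPLES = [
    ("I can't log in since the update", "technical"),
    ("Can you add dark mode?", "feature-request"),
]

def build_prompt(ticket: str, examples=EXAMPLES) -> str:
    header = "Classify support tickets. Return only the category, lowercase.\n"
    shots = "\n".join(
        f'\nTicket: "{text}"\nCategory: {label}' for text, label in examples
    )
    # End with a bare "Category:" so the model completes the label.
    return f'{header}{shots}\n\nTicket: "{ticket}"\nCategory:'

print(build_prompt("My invoice is wrong, there's a double charge."))
```

Keeping examples in a plain list also makes them easy to swap out while iterating.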

Finding the Right Number of Examples

More examples aren't automatically better. For most models:

  • 1-2 examples — enough for simple formatting tasks
  • 3-5 examples — ideal for classification and extraction
  • 5+ — rarely necessary unless there are subtle edge cases

Every example costs tokens. With GPT-4 or Claude, that adds up fast. A prompt with twenty examples is expensive and slow, while five well-chosen examples usually deliver the same result.

Quality of examples matters more than quantity. Two poorly chosen examples do more harm than no examples at all.
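A rough budget check helps keep the example count honest. This sketch uses the common ~4 characters-per-token heuristic instead of a real tokenizer, so treat the numbers as estimates only:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

examples = [
    'Ticket: "I can\'t log in since the update"\nCategory: technical',
    'Ticket: "Can you add dark mode?"\nCategory: feature-request',
]

# This cost is paid on every single request, so it scales with traffic.
example_tokens = sum(estimate_tokens(e) for e in examples)
print(f"~{example_tokens} tokens per request just for the examples")
```

For exact counts, the model provider's tokenizer is the authoritative source.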

Enforcing Structured Output

Few-shot helps a lot, but for production systems, hoping the model follows the format isn't enough. Modern APIs provide tools to enforce output structure.

OpenAI's Response Format

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "ticket_classification",
            "schema": {
                "type": "object",
                "properties": {
                    "category": {
                        "type": "string",
                        "enum": ["billing", "technical",
                                 "feature-request", "other"]
                    },
                    "confidence": {
                        "type": "number"
                    }
                },
                "required": ["category", "confidence"]
            }
        }
    },
    messages=[
        {"role": "system", "content": "Classify support tickets."},
        {"role": "user", "content": "My invoice is incorrect."}
    ]
)

The output is guaranteed valid JSON matching the schema. No parsing hacks, no regex to extract the category.
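Because the content is schema-conforming JSON, downstream handling reduces to a `json.loads` plus a sanity check. A sketch with the API reply simulated as a string so it runs standalone (the function name is illustrative):

```python
import json

VALID = {"billing", "technical", "feature-request", "other"}

def parse_classification(content: str) -> tuple[str, float]:
    """Parse the model's JSON reply into (category, confidence)."""
    data = json.loads(content)
    category, confidence = data["category"], float(data["confidence"])
    if category not in VALID:  # defense in depth; the schema should prevent this
        raise ValueError(f"unexpected category: {category}")
    return category, confidence

# In real code: content = response.choices[0].message.content
simulated = '{"category": "billing", "confidence": 0.93}'
print(parse_classification(simulated))  # ('billing', 0.93)
```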

Anthropic's Tool Use as Structured Output

With Claude, the same principle works through tool definitions:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=256,
    tools=[{
        "name": "classify_ticket",
        "description": "Classify a support ticket",
        "input_schema": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": ["billing", "technical",
                             "feature-request", "other"]
                },
                "confidence": {"type": "number"}
            },
            "required": ["category", "confidence"]
        }
    }],
    tool_choice={"type": "tool", "name": "classify_ticket"},
    messages=[
        {"role": "user",
         "content": "Classify: My invoice is incorrect."}
    ]
)

Same idea, different API. The model is forced to respond through the tool, ensuring the output always matches the schema.
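Reading the result means pulling the input off the tool_use content block. Here the SDK response objects are mimicked with plain dicts so the sketch runs standalone; the real SDK exposes the same fields as attributes:

```python
def extract_tool_input(content_blocks, tool_name="classify_ticket"):
    """Return the input of the first matching tool_use block."""
    for block in content_blocks:
        if block["type"] == "tool_use" and block["name"] == tool_name:
            return block["input"]
    raise ValueError("no tool_use block found")

# Shape mirrors response.content from the Messages API.
simulated = [{
    "type": "tool_use",
    "name": "classify_ticket",
    "input": {"category": "billing", "confidence": 0.9},
}]
print(extract_tool_input(simulated))
```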

Combining Few-Shot with Structured Output

The sweet spot is the combination. Few-shot examples in the system prompt teach the model the content rules. Structured output enforces the format.

system_prompt = """Classify support tickets.

Examples:
- "Can't log in" → technical, confidence 0.95
- "Invoice is wrong" → billing, confidence 0.9
- "Please add dark mode" → feature-request, confidence 0.85
- "You guys are awesome!" → other, confidence 0.7

Note: vague compliments or feedback without action = other.
Uncertain cases get lower confidence."""

The examples steer the classification logic. The JSON schema prevents format issues. Two layers of reliability.
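One way to wire the two layers together is to carry the few-shot rules in the system message and keep the schema beside it, then hand both to whichever structured-output API is in use. This sketch only builds the request pieces, so it runs without an API key; the variable and function names are illustrative:

```python
SYSTEM_PROMPT = """Classify support tickets.

Examples:
- "Can't log in" -> technical, confidence 0.95
- "Invoice is wrong" -> billing, confidence 0.9
"""

TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {
            "type": "string",
            "enum": ["billing", "technical", "feature-request", "other"],
        },
        "confidence": {"type": "number"},
    },
    "required": ["category", "confidence"],
}

def build_request(ticket: str) -> dict:
    """Bundle the few-shot system prompt (content rules) with the schema (format rules)."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ticket},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "ticket_classification",
                            "strict": True, "schema": TICKET_SCHEMA},
        },
    }

req = build_request("My invoice is incorrect.")
print(req["messages"][0]["content"][:25])
```

Keeping prompt and schema in one place also means a single diff when either layer changes.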

Common Mistakes

Examples that look too similar. If every example is about invoices, the model learns nothing about the other categories. Variation is essential.

Contradictory examples. One example classifying "can't log in" as technical while another labels "login problem" as billing. The model gets confused, and output becomes random.

Overly long examples. A ten-sentence example when real input is typically two sentences. The model starts generating lengthy outputs too. Keep examples representative of actual data.
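All three pitfalls are cheap to lint for. A hedged sketch (the function is hypothetical, not from any library) that checks an example set for missing categories and contradictory labels before it ever reaches a prompt:

```python
def lint_examples(examples, categories):
    """Flag example sets that skip categories or label the same text twice."""
    problems = []
    seen = {}
    for text, label in examples:
        if text in seen and seen[text] != label:
            problems.append(f"contradiction: {text!r} labeled {seen[text]} and {label}")
        seen[text] = label
    missing = set(categories) - {label for _, label in examples}
    if missing:
        problems.append(f"no examples for: {sorted(missing)}")
    return problems

examples = [
    ("I can't log in", "technical"),
    ("Invoice is wrong", "billing"),
]
print(lint_examples(examples, ["billing", "technical", "feature-request", "other"]))
```

A length check against typical real input would slot naturally into the same function.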

When Few-Shot Isn't Enough

Some tasks are too complex for prompt engineering alone. When classification depends on company-specific knowledge that doesn't fit in a few examples, or when accuracy needs to exceed 95%, fine-tuning is the better option.

But for 80% of use cases? Few-shot with structured output is quick to set up, easy to iterate on, and good enough. No training data required, no GPU time, and changes are a prompt edit instead of a retrain.

Choose pragmatically. Not every screw needs a sledgehammer.
