Testing LLM Prompts: From Guesswork to Measurable Quality

How to systematically test and measure prompt quality. Evaluation strategies, regression tests, and scoring for LLM applications in production.

Jean-Pierre Broeders

Freelance DevOps Engineer

March 19, 2026 · 7 min. read


Writing a prompt is easy. Knowing whether that prompt actually works well — that's a completely different story. Most teams test their prompts by manually throwing a few inputs at them, concluding it "looks fine," and shipping to production. Until the complaints roll in.

The issue isn't bad prompts. The issue is having no way to tell whether a change makes the output better or worse. Prompt development without tests is like writing software without unit tests — works until it doesn't, and then things go sideways fast.

Why Manual Testing Doesn't Scale

With three use cases and a handful of edge cases, it's still manageable to quickly eyeball the output. But once an application has dozens of prompt variants, different user inputs, and edge cases stacking up, manual checking becomes a full-time job.

Then there's the non-determinism problem. The same prompt with the same input doesn't always produce the same output. Setting temperature to 0 helps, but even then there are subtle differences between API calls. That makes "quick manual checks" fundamentally unreliable as a testing strategy.

Setting Up a Prompt Test Suite

The approach really isn't that different from what already exists for regular software. A prompt test suite consists of three components:

Test cases — a collection of inputs with expected outputs or criteria the output must meet. Not necessarily exact matches, but measurable properties.

Evaluation functions — the logic that determines whether an output is "good." Sometimes simple (does the output contain the right JSON schema?), sometimes complex (is the tone professional but not overly formal?).

A runner — something that feeds all test cases through the prompt and scores the results.

In Python, that looks roughly like this:

import json
from openai import OpenAI

client = OpenAI()

test_cases = [
    {
        "input": "My invoice is wrong, the amount doesn't match",
        "expected_category": "billing",
        "expected_sentiment": "negative"
    },
    {
        "input": "Could you add a dark mode option?",
        "expected_category": "feature-request",
        "expected_sentiment": "neutral"
    },
    {
        "input": "The app crashes when I click export",
        "expected_category": "technical",
        "expected_sentiment": "negative"
    }
]

def run_prompt(user_input: str) -> dict:
    """Classify a single support ticket and return the parsed JSON result."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": """Classify the support ticket.
Return JSON with: category (billing|technical|feature-request|other)
and sentiment (positive|neutral|negative)."""},
            {"role": "user", "content": user_input}
        ]
    )
    return json.loads(response.choices[0].message.content)

def evaluate(test_case: dict, result: dict) -> dict:
    """Compare the model output against the expected labels for one test case."""
    return {
        "category_correct": result["category"] == test_case["expected_category"],
        "sentiment_correct": result["sentiment"] == test_case["expected_sentiment"]
    }

Nothing fancy. Just inputs, expected outputs, and a check.
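The third component, the runner, was only described above. A minimal sketch that ties the pieces together — the `run_prompt` and `evaluate` names come from the snippet above, and the aggregate `accuracy` metric is my own choice:

```python
def run_suite(test_cases: list[dict], run_prompt, evaluate) -> dict:
    """Feed every test case through the prompt and aggregate the scores."""
    results = []
    for case in test_cases:
        output = run_prompt(case["input"])
        results.append(evaluate(case, output))
    # Fraction of individual checks (across all cases) that passed.
    checks = [passed for r in results for passed in r.values()]
    return {
        "results": results,
        "accuracy": sum(checks) / len(checks) if checks else 0.0,
    }
```

Passing `run_prompt` and `evaluate` in as parameters keeps the runner itself testable with a stub, without burning API calls.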

Scoring: Beyond Pass/Fail

Binary pass/fail works for structured output — the JSON schema is valid or it isn't. But for free-form text, a more nuanced system is needed.

| Evaluation Type | When to Use | Example |
| --- | --- | --- |
| Exact match | Classification, labels, categories | Output must be exactly "billing" |
| Contains check | Required elements in output | Summary must include the company name |
| Regex validation | Format checks | Date must be in YYYY-MM-DD format |
| LLM-as-judge | Subjective quality | Rate whether the tone is professional (1-5) |
| Embedding similarity | Semantic similarity | Output should semantically match reference answer |
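The first three evaluation types need nothing beyond the standard library. Hedged sketches — the function names are my own:

```python
import re

def exact_match(output: str, expected: str) -> bool:
    """Classification labels: the output must equal the expected value."""
    return output.strip() == expected

def contains(output: str, required: str) -> bool:
    """Required elements: the output must mention a given string."""
    return required.lower() in output.lower()

def matches_format(output: str, pattern: str = r"\d{4}-\d{2}-\d{2}") -> bool:
    """Format checks: e.g. the output must be a YYYY-MM-DD date."""
    return re.fullmatch(pattern, output.strip()) is not None
```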

The LLM-as-judge approach is particularly useful. A second model evaluates the output of the first. Sounds circular, but in practice it works surprisingly well — especially when the judging model is stronger than the one generating the output.

def llm_judge(output: str, criteria: str) -> int:
    """Score a piece of text 1-5 against free-form criteria using a second model."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": f"""Evaluate the following text on: {criteria}
Give a score from 1-5. Reply with only the number."""},
            {"role": "user", "content": output}
        ]
    )
    # int() raises ValueError if the judge ignores the instruction and adds prose;
    # better to fail loudly than to record a garbage score.
    score = int(response.choices[0].message.content.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"Judge returned out-of-range score: {score}")
    return score

Regression Tests: Where the Real Value Lives

The biggest payoff from prompt tests isn't the initial evaluation. It's regression testing. Every time a prompt gets modified — adding a sentence, rewriting instructions, trying a different model — the entire suite runs again. That makes it immediately visible whether the change improves or degrades quality.

This fits nicely into a CI/CD pipeline:

# .github/workflows/prompt-tests.yml
name: Prompt Regression Tests
on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run prompt tests
        run: python -m pytest tests/prompts/ -v
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check score threshold
        run: |
          python scripts/check_scores.py --min-accuracy 0.85

On every PR that touches prompts, tests run automatically. If accuracy drops below the threshold, the pipeline blocks. No more debates about whether a prompt "feels better" — the numbers speak for themselves.
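The `check_scores.py` referenced in the workflow isn't shown in this article; a minimal sketch of what it could look like, assuming the test run wrote its per-case results to a `results.json` file (that filename and structure are assumptions):

```python
import argparse
import json
import sys

def accuracy(results: list[dict]) -> float:
    """Fraction of individual checks that passed across all test cases."""
    checks = [passed for r in results for passed in r.values()]
    return sum(checks) / len(checks) if checks else 0.0

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--min-accuracy", type=float, required=True)
    parser.add_argument("--results", default="results.json")
    args = parser.parse_args()

    with open(args.results) as f:
        score = accuracy(json.load(f))

    print(f"accuracy: {score:.2%} (threshold: {args.min_accuracy:.2%})")
    if score < args.min_accuracy:
        sys.exit(1)  # non-zero exit code is what blocks the pipeline

if __name__ == "__main__":
    main()
```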

Practical Tips

Start small. Five to ten solid test cases per prompt are enough to get going. Add cases when bugs show up in production — same as regular software.

Log everything. Every prompt call in production deserves logging: the input, the full output, the model, temperature, and latency. Those are tomorrow's test cases.
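A hedged sketch of what such a log entry could look like, written as one JSON line per call (the field names are my own choice):

```python
import json
import time
import uuid

def log_prompt_call(user_input: str, output: str, model: str,
                    temperature: float, latency_ms: float) -> str:
    """Serialize one production prompt call as a JSON line for later replay."""
    entry = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "input": user_input,
        "output": output,
        "model": model,
        "temperature": temperature,
        "latency_ms": latency_ms,
    }
    return json.dumps(entry)
```

Append each line to a file or log sink; turning a logged entry into a new test case is then a matter of copying the input and verifying the output.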

Version your prompts. Prompts belong in version control, not in a database or hardcoded in the application. Changing a prompt is a code change and deserves the same review process.

Test with real data. Synthetic test cases make a fine starting point, but the real edge cases come from production data. Anonymize where needed and build a golden dataset of verified input-output pairs.

Watch out for overfitting. A prompt that scores perfectly on the test suite but fails on new inputs is just as worthless as an overfit ML model. Maintain a separate holdout set that isn't used during prompt development.
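One way to keep that separation honest is a deterministic split of the golden dataset — a sketch, where the 80/20 ratio and the seed are arbitrary choices:

```python
import random

def split_holdout(cases: list, holdout_fraction: float = 0.2,
                  seed: int = 42) -> tuple[list, list]:
    """Deterministically split cases into a dev set and an untouched holdout set."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split stable
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]
```

Because the seed is fixed, every run produces the same split, so the holdout set stays the same across prompt iterations.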

The Investment Pays Off

Setting up prompt testing takes an afternoon. Maybe two if the evaluation functions are more complex. But that investment pays for itself many times over the first time a "small tweak" to a prompt breaks the entire output — and the test suite catches it before production.

It's ultimately the same discipline as unit testing. Not glamorous, not exciting, but the difference between a robust application and one that regularly falls apart.

Want to stay updated?

Subscribe to my newsletter or get in touch for freelance projects.
