Prompt Chaining: Breaking Complex Tasks Into LLM Pipelines

Why a single prompt rarely cuts it for complex tasks, and how prompt chaining and multi-step pipelines deliver reliable results.

Jean-Pierre Broeders

Freelance DevOps Engineer

March 9, 2026 · 8 min. read


One prompt to rule them all. That's what most teams try when they first start working with LLMs. And it works — until it doesn't. As complexity grows, that single mega-prompt becomes an unmanageable beast that sometimes produces brilliant output and sometimes produces garbage.

The fix? Prompt chaining. The same approach that's been standard in software engineering for decades: divide and conquer.

What is prompt chaining exactly?

Prompt chaining breaks a complex task into multiple steps, where the output of step N becomes the input of step N+1. Each step has a focused prompt that does one thing well.

Say a technical RFC needs to be generated from a Slack thread. One prompt doing that in a single pass? That's going to break. But split it up:

  1. Extraction — Pull the key points from the Slack messages
  2. Structuring — Organize those points into RFC sections
  3. Generation — Write the actual RFC
  4. Review — Check for consistency and missing information

Each step is small, testable, and debuggable.
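The four steps above can be sketched as a tiny chain runner, where each step's output becomes the next step's input. The step functions here are illustrative stand-ins; in a real pipeline each one would make a focused LLM call:

```python
from typing import Callable

def run_chain(steps: list[Callable[[str], str]], data: str) -> str:
    """Feed each step's output into the next step's input."""
    for step in steps:
        data = step(data)
    return data

# Stand-ins for the four RFC steps; real versions would each call an LLM.
def extract(thread: str) -> str:
    return f"key points from [{thread}]"

def structure(points: str) -> str:
    return f"RFC sections for [{points}]"

def generate(sections: str) -> str:
    return f"RFC draft from [{sections}]"

def review(draft: str) -> str:
    return f"reviewed [{draft}]"

rfc = run_chain([extract, structure, generate, review], "slack-thread")
```

The runner itself is deliberately dumb: all the intelligence lives in the individual steps.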

Real-world example: Code Review Pipeline

A pipeline that actually runs in production:

import json

import openai

def analyze_diff(diff: str) -> dict:
    """Step 1: Analyze the code diff."""
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Analyze this git diff. Return: "
                       "changed_files, complexity_score (1-10), "
                       "risk_areas. JSON output only."
        }, {
            "role": "user",
            "content": diff
        }],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

def generate_review(analysis: dict, diff: str) -> str:
    """Step 2: Generate review based on analysis."""
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": f"Write a code review. Focus on these "
                       f"risk areas: {analysis['risk_areas']}. "
                       f"Complexity: {analysis['complexity_score']}/10. "
                       f"Be specific, reference line numbers."
        }, {
            "role": "user",
            "content": diff
        }]
    )
    return response.choices[0].message.content

def prioritize_findings(review: str) -> str:
    """Step 3: Prioritize and categorize findings."""
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Categorize these review findings into: "
                       "MUST_FIX, SHOULD_FIX, NICE_TO_HAVE. "
                       "Add a brief rationale for each finding."
        }, {
            "role": "user",
            "content": review
        }]
    )
    return response.choices[0].message.content

Three steps, each with a sharp focus. The analysis step doesn't need to write prose; the writing step doesn't need to analyze. The result: better output at every step.

When to chain and when not to

Not every task needs chaining. A simple translation or summary? Just one prompt. But as soon as multiple cognitive steps are involved, it pays off.

Scenario                      Approach             Why
Summarize an email            Single prompt        Simple task, low error margin
Bug report → fix suggestion   2-step chain         Understand first, then solve
Document a codebase           Multi-step pipeline  Analysis, structuring, writing, validation
Data extraction + report      Multi-step pipeline  Keep extraction, transformation, presentation separate

Gate checks between steps

The real advantage of chaining is the ability to validate between steps. No blind trust in the output — check it.

def run_pipeline(diff: str) -> str:
    # Step 1
    analysis = analyze_diff(diff)
    
    # Gate check: is the analysis usable?
    if analysis.get("complexity_score", 0) < 1:
        raise ValueError("Analysis produced no usable score")
    
    if not analysis.get("risk_areas"):
        # No risks found? Skip the expensive review
        return "No significant risks detected."
    
    # Step 2
    review = generate_review(analysis, diff)
    
    # Gate check: does the review contain concrete points?
    if len(review) < 100:
        # Too short, probably hallucinated "looks good"
        review = generate_review(analysis, diff)  # retry
    
    # Step 3
    return prioritize_findings(review)

Those gate checks prevent bad output from propagating to the next step. With a single mega-prompt, that's impossible.
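Gate checks can go beyond length and presence checks. A stricter variant validates the shape of the step-1 JSON before it reaches step 2. This is a minimal sketch: the field names match the analyze_diff prompt above, but the bounds and type checks are illustrative choices, not part of any API:

```python
def validate_analysis(analysis: dict) -> dict:
    """Gate check: reject malformed step-1 output before step 2 runs."""
    score = analysis.get("complexity_score")
    if not isinstance(score, (int, float)) or not 1 <= score <= 10:
        raise ValueError(f"complexity_score out of range: {score!r}")
    if not isinstance(analysis.get("changed_files"), list):
        raise ValueError("changed_files must be a list")
    if not isinstance(analysis.get("risk_areas"), list):
        raise ValueError("risk_areas must be a list")
    return analysis
```

Calling this right after analyze_diff turns "the model returned something weird" into a loud, debuggable exception instead of a silently bad review.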

Parallel chains

Not everything needs to be sequential. Sometimes steps can run in parallel and results get merged afterwards:

import asyncio

async def parallel_analysis(code: str):
    security, performance, readability = await asyncio.gather(
        analyze_security(code),
        analyze_performance(code),
        analyze_readability(code)
    )
    
    # Merge step
    return await merge_reviews(security, performance, readability)

Three analyses running simultaneously, then one merge step combining everything. Faster than sequential, and each analysis gets its own optimized prompt.
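The fan-out pattern generalizes. A small helper like the hypothetical fan_out below runs any list of async analyzers concurrently and returns their results in order; in the example above, each analyzer would wrap one focused LLM call:

```python
import asyncio
from typing import Awaitable, Callable

async def fan_out(
    code: str, analyzers: list[Callable[[str], Awaitable[str]]]
) -> list[str]:
    """Run all analyzers concurrently; results come back in input order."""
    return await asyncio.gather(*(analyze(code) for analyze in analyzers))
```

Because asyncio.gather preserves order, the merge step can rely on knowing which result is the security analysis and which is the performance analysis.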

Cost and latency

Prompt chaining means more API calls. That's a fact. But the total token cost is often comparable to a single mega-prompt, or even lower, because each step carries less context.

Latency is another story: sequential chains are slower, and parallel chains only partially compensate. In practice it's a tradeoff. For real-time chat a single prompt is often the better choice; for background processes the extra second doesn't matter.
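A back-of-envelope calculation makes the cost claim concrete. The token counts here are purely hypothetical; plug in your own measurements:

```python
# Hypothetical token counts per call -- measure your own pipeline.
mega_prompt = {"input": 6000, "output": 2000}  # one call carries all context
chain_steps = [
    {"input": 2500, "output": 400},  # extraction: full source, short output
    {"input": 1200, "output": 900},  # generation: only the extracted points
    {"input": 1000, "output": 500},  # prioritization: only the review
]

mega_total = mega_prompt["input"] + mega_prompt["output"]
chain_total = sum(s["input"] + s["output"] for s in chain_steps)
# mega_total: 8000 tokens, chain_total: 6500 tokens
```

The chain wins here because only the first step sees the full context; later steps work on compact intermediate results. Whether that holds for your pipeline depends on how much each step can shed.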

Error handling

Every step can fail. Plan for it.

import logging

from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def robust_step(prompt: str, input_data: str) -> str:
    """Each pipeline step with retry logic."""
    try:
        result = call_llm(prompt, input_data)
        validate_output(result)  # raises ValidationError if output doesn't check out
        return result
    except ValidationError:
        logger.warning("Validation failed, retrying...")
        raise

Retries per step, not for the entire pipeline. If step 3 fails, steps 1 and 2 don't need to rerun.

Logging and observability

In production, logging every step is essential. Not just the final output, but the intermediate results too. When something goes wrong — and it will — you want to know exactly where it broke.

Store the input and output of each step, including the model, temperature, and tokens used. It sounds like overhead, but it saves hours of debugging down the line.
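One lightweight way to get there is a wrapper that runs any step function and emits a structured log line. This is a sketch, not a full observability setup; the metadata keys (model, temperature, and so on) are whatever you choose to track:

```python
import json
import logging
import time

logger = logging.getLogger("pipeline")

def logged_step(name: str, fn, input_data: str, **metadata) -> str:
    """Run one pipeline step and log its input size, output size, and timing.

    fn is any step function (like the analyze/review steps above);
    metadata holds whatever you want recorded per call.
    """
    start = time.monotonic()
    output = fn(input_data)
    logger.info(json.dumps({
        "step": name,
        "input_chars": len(input_data),
        "output_chars": len(str(output)),
        "duration_s": round(time.monotonic() - start, 3),
        **metadata,
    }))
    return output
```

Structured (JSON) log lines pay off the first time you need to grep a week of pipeline runs for the step that started producing empty output.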

The takeaway

Prompt chaining isn't rocket science. It's just solid software architecture applied to LLM integrations. Small, focused steps. Validation in between. Retry on failure. Parallelization where possible.

Next time a prompt gets too complex and the results become inconsistent: break it apart. It takes a bit more setup, but reliability improves dramatically.

Want to stay updated?

Subscribe to my newsletter or get in touch for freelance projects.
