Prompt Chaining: Breaking Complex Tasks Into LLM Pipelines
Why a single prompt rarely cuts it for complex tasks, and how prompt chaining and multi-step pipelines deliver reliable results.
Jean-Pierre Broeders
Freelance DevOps Engineer
One prompt to rule them all. That's what most teams try when they first start working with LLMs. And it works — until it doesn't. As complexity grows, that single mega-prompt becomes an unmanageable beast that sometimes produces brilliant output and sometimes produces garbage.
The fix? Prompt chaining. The same approach that's been standard in software engineering for decades: divide and conquer.
What is prompt chaining exactly?
Prompt chaining breaks a complex task into multiple steps, where the output of step N becomes the input of step N+1. Each step has a focused prompt that does one thing well.
Say a technical RFC needs to be generated from a Slack thread. One prompt doing that in a single pass? That's going to break. But split it up:
- Extraction — Pull the key points from the Slack messages
- Structuring — Organize those points into RFC sections
- Generation — Write the actual RFC
- Review — Check for consistency and missing information
Each step is small, testable, and debuggable.
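Stripped of the LLM calls, the pattern is just function composition. A minimal sketch, with toy stand-ins for the extraction and structuring steps (the step functions here are illustrative, not real prompts):

```python
from typing import Callable

def run_chain(steps: list[Callable[[str], str]], data: str) -> str:
    """Feed the output of each step into the next."""
    for step in steps:
        data = step(data)
    return data

# Toy stand-ins for the first two steps of the RFC example:
def extraction(text: str) -> str:
    return f"key_points({text})"

def structuring(points: str) -> str:
    return f"sections({points})"

rfc_draft_input = run_chain([extraction, structuring], "slack_thread")
# "sections(key_points(slack_thread))": each step wraps the previous output
```

In a real pipeline each step function wraps an LLM call, but the control flow stays this simple.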
Real-world example: Code Review Pipeline
A pipeline that actually runs in production:
```python
import json

import openai

def analyze_diff(diff: str) -> dict:
    """Step 1: Analyze the code diff."""
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Analyze this git diff. Return: "
                       "changed_files, complexity_score (1-10), "
                       "risk_areas. JSON output only."
        }, {
            "role": "user",
            "content": diff
        }],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
```
```python
def generate_review(analysis: dict, diff: str) -> str:
    """Step 2: Generate review based on analysis."""
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": f"Write a code review. Focus on these "
                       f"risk areas: {analysis['risk_areas']}. "
                       f"Complexity: {analysis['complexity_score']}/10. "
                       "Be specific, reference line numbers."
        }, {
            "role": "user",
            "content": diff
        }]
    )
    return response.choices[0].message.content
```
```python
def prioritize_findings(review: str) -> str:
    """Step 3: Prioritize and categorize findings."""
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Categorize these review findings into: "
                       "MUST_FIX, SHOULD_FIX, NICE_TO_HAVE. "
                       "Add a brief rationale for each finding."
        }, {
            "role": "user",
            "content": review
        }]
    )
    return response.choices[0].message.content
```
Three steps, each with a sharp focus. The analysis step doesn't need to write prose; the writing step doesn't need to analyze. The result: better output at every step.
When to chain and when not to
Not every task needs chaining. A simple translation or summary? Just one prompt. But as soon as multiple cognitive steps are involved, it pays off.
| Scenario | Approach | Why |
|---|---|---|
| Summarize an email | Single prompt | Simple task, low error margin |
| Bug report → fix suggestion | 2-step chain | Understand first, then solve |
| Document a codebase | Multi-step pipeline | Analysis, structuring, writing, validation |
| Data extraction + report | Multi-step pipeline | Keep extraction, transformation, presentation separate |
Gate checks between steps
The real advantage of chaining is the ability to validate between steps. No blind trust in the output — check it.
```python
def run_pipeline(diff: str) -> str:
    # Step 1
    analysis = analyze_diff(diff)

    # Gate check: is the analysis usable?
    if analysis.get("complexity_score", 0) < 1:
        raise ValueError("Analysis produced no usable score")
    if not analysis.get("risk_areas"):
        # No risks found? Skip the expensive review
        return "No significant risks detected."

    # Step 2
    review = generate_review(analysis, diff)

    # Gate check: does the review contain concrete points?
    if len(review) < 100:
        # Too short, probably a hallucinated "looks good"
        review = generate_review(analysis, diff)  # retry

    # Step 3
    return prioritize_findings(review)
```
Those gate checks prevent bad output from propagating to the next step. With a single mega-prompt, that's impossible.
Parallel chains
Not everything needs to be sequential. Sometimes steps can run in parallel and results get merged afterwards:
```python
import asyncio

async def parallel_analysis(code: str):
    # analyze_security / analyze_performance / analyze_readability are
    # async LLM calls, each with its own specialized prompt
    security, performance, readability = await asyncio.gather(
        analyze_security(code),
        analyze_performance(code),
        analyze_readability(code),
    )
    # Merge step
    return await merge_reviews(security, performance, readability)
```
Three analyses running simultaneously, then one merge step combining everything. Faster than sequential, and each analysis gets its own optimized prompt.
Cost and latency
Prompt chaining costs more API calls. That's a fact. But total token costs are often comparable or even lower than a single mega-prompt, because each step needs less context.
Latency-wise, sequential chains are slower, and parallel chains only partially compensate. In practice it's a tradeoff: for real-time chat a single prompt is often better; for background processes the extra seconds don't matter.
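A quick back-of-the-envelope comparison makes the cost point concrete. All numbers below are made up for illustration; real token counts depend on your prompts, context, and model pricing:

```python
# Hypothetical price and token counts, purely for illustration
PRICE_PER_1K_INPUT_TOKENS = 0.0025  # dollars

mega_prompt_tokens = 6000               # full context + every instruction, one call
chain_step_tokens = [1500, 2000, 800]   # each step carries only the context it needs

mega_cost = mega_prompt_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
chain_cost = sum(chain_step_tokens) / 1000 * PRICE_PER_1K_INPUT_TOKENS
# Three calls instead of one, yet fewer total input tokens: 4300 vs 6000
```

Whether the chain actually comes out cheaper depends on how much shared context each step must repeat, so measure your own pipeline before assuming savings.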
Error handling
Every step can fail. Plan for it.
```python
import logging

from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def robust_step(prompt: str, input_data: str) -> str:
    """Each pipeline step with retry logic."""
    try:
        # call_llm and validate_output are the pipeline's own helpers;
        # validate_output raises ValidationError if the output doesn't check out
        result = call_llm(prompt, input_data)
        validate_output(result)
        return result
    except ValidationError:
        logger.warning("Validation failed, retrying...")
        raise
```
Retries per step, not for the entire pipeline. If step 3 fails, steps 1 and 2 don't need to rerun.
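One way to get that behavior is to checkpoint each step's output, so a retried run resumes after the last successful step. A sketch (in-memory here; persisting the checkpoints to disk or a database is up to you):

```python
def run_with_checkpoints(steps, data, checkpoints=None):
    """Run steps in order, resuming from previously completed ones."""
    checkpoints = {} if checkpoints is None else checkpoints
    for i, step in enumerate(steps):
        if i in checkpoints:
            # Completed in an earlier run: reuse the stored output
            data = checkpoints[i]
            continue
        data = step(data)
        checkpoints[i] = data  # record before moving on
    return data, checkpoints
```

On a retry, pass in the checkpoints from the failed run and only the steps after the failure execute again.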
Logging and observability
In production, logging every step is essential. Not just the final output, but the intermediate results. When something goes wrong — and it will — knowing exactly where it broke saves hours.
Store the input and output of each step, along with the model, temperature, and token counts. It sounds like overhead until the first production incident.
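A thin wrapper is usually enough. A sketch that emits one structured log line per step; in real calls you'd pull token counts from `response.usage` instead of the character lengths used here:

```python
import json
import logging
import time

logger = logging.getLogger("pipeline")

def logged_step(name: str, step, data: str, model: str = "gpt-4o") -> str:
    """Run one pipeline step and log its input, output size, and timing."""
    start = time.monotonic()
    result = step(data)
    logger.info(json.dumps({
        "step": name,
        "model": model,
        "input_chars": len(data),     # swap for prompt token count in production
        "output_chars": len(result),  # swap for completion token count
        "duration_s": round(time.monotonic() - start, 3),
    }))
    return result
```

Wrap each call in the pipeline, e.g. `logged_step("analyze", analyze_diff, diff)`, and every run leaves a trace you can replay when something breaks.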
The takeaway
Prompt chaining isn't rocket science. It's just solid software architecture applied to LLM integrations. Small, focused steps. Validation in between. Retry on failure. Parallelization where possible.
Next time a prompt gets too complex and the results become inconsistent: break it apart. It takes a bit more setup, but reliability improves dramatically.
