Self-Healing Infrastructure with AI: Automatic Incident Response
How AI-driven systems automatically detect, diagnose, and resolve incidents — without anyone getting paged at 3 AM.
Jean-Pierre Broeders
Freelance DevOps Engineer
Three in the morning. The pager goes off. A database connection pool is maxing out, the API throws 503s, and somewhere in Slack three people type "looking into it" at the same time. Sound familiar? This scene plays out weekly on teams still running fully manual incident response.
What if the system fixed itself before anyone even woke up?
What Self-Healing Actually Means
The term sounds buzzwordy, but the concept is straightforward. A self-healing system combines three components: detection, diagnosis, and action. AI adds a fourth layer — it learns from previous incidents which action was most effective.
A traditional setup looks something like this:
```yaml
# Prometheus alert rule - classic approach
groups:
  - name: database
    rules:
      - alert: HighConnectionCount
        expr: pg_stat_activity_count > 80
        for: 5m
        annotations:
          summary: "Connection pool nearly exhausted"
```
This fires an alert. Someone has to wake up, log in, assess the situation, and manually intervene. With an AI layer on top, that changes fundamentally.
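The entry point for that AI layer is usually the alert webhook itself. As a minimal sketch: Alertmanager posts a JSON payload with an `alerts` array, and the handler normalizes each firing alert before passing it on. The `Alert` dataclass here is illustrative, not a standard type.

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    """Normalized view of one Alertmanager webhook alert (illustrative)."""
    name: str
    labels: dict = field(default_factory=dict)
    annotations: dict = field(default_factory=dict)

def parse_webhook(payload: dict) -> list[Alert]:
    """Turn an Alertmanager webhook payload into Alert objects.

    Only firing alerts are returned; resolved ones need no action.
    """
    alerts = []
    for raw in payload.get("alerts", []):
        if raw.get("status") != "firing":
            continue
        labels = raw.get("labels", {})
        alerts.append(Alert(
            name=labels.get("alertname", "unknown"),
            labels=labels,
            annotations=raw.get("annotations", {}),
        ))
    return alerts
```

From here, each `Alert` can be fed into the diagnosis pipeline described below.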
The Architecture Behind Automatic Recovery
A working self-healing pipeline needs at minimum these components:
```python
class IncidentHandler:
    def __init__(self, llm_client, runbook_store):
        self.llm = llm_client
        self.runbooks = runbook_store
        self.history = IncidentHistory()

    async def handle(self, alert: Alert):
        # Gather context
        metrics = await self.gather_metrics(alert)
        logs = await self.gather_logs(alert, window="15m")
        similar = self.history.find_similar(alert, limit=5)

        # Let AI diagnose
        diagnosis = await self.llm.diagnose(
            alert=alert,
            metrics=metrics,
            logs=logs,
            past_incidents=similar,
        )

        # Pick a runbook based on diagnosis
        runbook = self.runbooks.match(diagnosis)
        if runbook and runbook.confidence > 0.85:
            result = await runbook.execute(dry_run=False)
            self.history.record(alert, diagnosis, runbook, result)
        else:
            # Below the threshold: fall back to a human
            await self.escalate(alert, diagnosis)
```
The critical piece: the confidence threshold. Set it too low and the system starts executing actions recklessly. Too high and everything falls back to human intervention. In practice, 0.85 works as a solid starting point, though it varies by incident type.
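Since the right threshold varies by incident type, a single global number rarely holds up. One way to sketch this, with hypothetical category names and values you would tune from your own dry-run data:

```python
# Hypothetical per-category thresholds; tune these from dry-run data.
CONFIDENCE_THRESHOLDS = {
    "disk_space": 0.80,            # cheap, easily reversible cleanup
    "connection_pool": 0.85,
    "deployment_rollback": 0.95,   # high blast radius, be conservative
}
DEFAULT_THRESHOLD = 0.90           # unknown categories get the strict default

def should_auto_execute(category: str, confidence: float) -> bool:
    """Execute automatically only above the category's threshold."""
    return confidence >= CONFIDENCE_THRESHOLDS.get(category, DEFAULT_THRESHOLD)
```

The pattern is simple: the cheaper and more reversible the action, the lower the bar for letting the system act on its own.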
Patterns That Work Well
Not every incident lends itself to automatic remediation. These categories have proven track records:
| Incident Type | Automatic Action | Success Rate |
|---|---|---|
| Connection pool exhaustion | Kill idle connections, resize pool | ~92% |
| Disk space full | Log rotation, temp file cleanup | ~95% |
| Memory leak (gradual) | Rolling pod restart | ~88% |
| Certificate expiry | Trigger auto-renewal | ~99% |
| Deployment rollback | Error rate spike → previous version | ~78% |
Deployment rollbacks score lower because the root cause tends to be more nuanced. An error rate spike can be expected during a migration, and the system has to distinguish that expected spike from a genuine regression.
Runbooks as Code
The real power lies in structured runbooks that AI can interpret and execute:
```yaml
# runbooks/db-connection-pool.yaml
name: database-connection-exhaustion
triggers:
  - alert: HighConnectionCount
    threshold: 80
steps:
  - name: identify_idle
    command: |
      psql -c "SELECT pid, state, query_start
               FROM pg_stat_activity
               WHERE state = 'idle'
               AND query_start < now() - interval '10 minutes'"
  - name: terminate_idle
    command: |
      psql -c "SELECT pg_terminate_backend(pid)
               FROM pg_stat_activity
               WHERE state = 'idle'
               AND query_start < now() - interval '10 minutes'"
    requires_confirmation: false
  - name: verify
    command: |
      psql -c "SELECT count(*) FROM pg_stat_activity"
    expected: "count < 60"
```
The difference from classic scripts? AI can adapt these runbooks based on context. If idle connections aren't the actual problem, the system can analyze logs and pick an alternative approach.
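That fallback behavior can be sketched as a simple loop: the diagnosis ranks candidate runbooks, and the system tries them in order, re-verifying the triggering condition after each attempt. The `diagnosis.candidates` attribute and `verify` callback here are assumptions for illustration, not part of any standard API.

```python
async def remediate(diagnosis, runbooks: dict, verify):
    """Try candidate runbooks in order of diagnosis confidence.

    `diagnosis.candidates` is assumed to be a ranked list of runbook
    names; `verify` re-checks the triggering condition after each attempt.
    Returns the name of the runbook that resolved the incident, or None.
    """
    for name in diagnosis.candidates:
        runbook = runbooks.get(name)
        if runbook is None:
            continue
        await runbook.execute()
        if await verify():
            return name   # condition cleared: incident resolved
    return None           # nothing worked -> page a human after all
```

Returning `None` is the signal to escalate: if no runbook clears the condition, automation has run out of safe options.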
Where It Goes Wrong
One team had their self-healing system configured so aggressively it started restarting pods during a database migration. The migration was slow, the system interpreted the high latency as a problem, and began recycling pods. Result: a half-finished migration and three hours of manual recovery.
The lesson: exclusion windows and maintenance modes aren't optional. Implement them from day one.
```python
class SafetyGuard:
    def should_execute(self, action: Action) -> bool:
        if self.maintenance_mode:
            return False
        if action.blast_radius > self.max_blast_radius:
            return False
        if self.recent_actions_count(minutes=10) > 3:
            # Too many actions in a short window = structural issue
            return False
        return True
```
Getting Started
Start small. Take the three most common alerts from the past month. Write runbooks for them. Run them in dry-run mode first — the system reports what it would do without actually intervening. After two weeks, evaluate: how often would the system have done the right thing?
That data is gold. It builds confidence to gradually increase the level of automation. And meanwhile, the team just sleeps through the night.
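Evaluating those two weeks of dry-run data can be as simple as comparing what the system proposed against what the on-call engineer actually did. A minimal sketch, assuming each record is an `(alert_name, proposed_action, operator_action)` tuple:

```python
from collections import Counter

def dry_run_report(records) -> dict:
    """Per-alert agreement rate between dry-run proposals and operators.

    Each record is assumed to be a (alert_name, proposed_action,
    operator_action) tuple; agreement means the system would have
    done what the human actually did.
    """
    totals = Counter()
    agreed = Counter()
    for alert, proposed, actual in records:
        totals[alert] += 1
        if proposed == actual:
            agreed[alert] += 1
    return {alert: agreed[alert] / total for alert, total in totals.items()}
```

Alerts with a high agreement rate are the safest candidates to flip from dry-run to live automation first.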
