Debugging Azure Functions with Application Insights: From Black Box to Crystal Clear
Serverless feels like a black box until Application Insights is properly configured. Practical KQL queries, alerting, and diagnostics for Azure Functions.
Jean-Pierre Broeders
Freelance DevOps Engineer
Serverless is great. Until something breaks and nobody can figure out where the problem is. No server logs to dig through, no IIS to quickly inspect. Azure Functions run somewhere in the cloud, and when they fail, finding the cause is like searching for a needle in a haystack — unless Application Insights is set up properly.
The basics: more than just flipping a switch
Most tutorials stop at "enable Application Insights in the portal." That's like buying a smoke detector and leaving it in the drawer. Default telemetry catches some things, but the real value comes from custom configuration.
The logging section in host.json controls how much data actually flows to Application Insights:
{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "maxTelemetryItemsPerSecond": 20,
        "excludedTypes": "Request"
      }
    },
    "logLevel": {
      "default": "Information",
      "Host.Results": "Error",
      "Function": "Information",
      "Host.Aggregator": "Trace"
    }
  }
}
That `excludedTypes: "Request"` setting matters. Without it, requests get sampled and some invocations disappear from telemetry entirely. When debugging sporadic failures, those are exactly the data points needed.
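One way to verify how aggressively sampling is thinning the data is the itemCount column: every retained telemetry item records how many original items it stands in for. A sketch (the column and tables are standard Application Insights schema):

```kusto
// Approximate retained percentage per telemetry type, per hour.
// itemCount > 1 means the item represents sampled-out siblings.
union requests, dependencies, traces
| where timestamp > ago(24h)
| summarize RetainedPercentage = 100.0 / avg(itemCount) by bin(timestamp, 1h), itemType
| order by timestamp asc
```

If requests show anything below 100%, they are being sampled despite the intent of the configuration above.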
KQL queries that actually help
The Log Analytics workspace behind Application Insights uses Kusto Query Language. Not SQL, but the learning curve is gentle. A few queries that come up regularly:
All failed function executions in the last 24 hours:
requests
| where timestamp > ago(24h)
| where success == false
| summarize count() by cloud_RoleName, resultCode
| order by count_ desc
Spotting slow executions (above 5 seconds):
requests
| where timestamp > ago(7d)
| where duration > 5000
| project timestamp, name, duration, resultCode
| order by duration desc
| take 50
Dependency failures — when a downstream service gives up:
dependencies
| where timestamp > ago(1h)
| where success == false
| summarize failCount=count() by target, type, resultCode
| order by failCount desc
That last one is gold. An Azure Function calling an API or database might run fine on its own, but when the dependency goes down, it's the dependency logs that tell the story. Not the function itself.
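To tie a failed invocation directly to the dependency call that sank it, the two tables can be joined on operation_Id. A sketch — note that after the join, columns from the right side get a `1` suffix, so `resultCode1` is the dependency's result code:

```kusto
// Failed requests joined to the failed dependency calls inside them.
requests
| where timestamp > ago(1h) and success == false
| join kind=inner (
    dependencies
    | where timestamp > ago(1h) and success == false
  ) on operation_Id
| project timestamp, functionName = name, target, dependencyResult = resultCode1
| order by timestamp desc
```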
Adding custom telemetry
Default metrics cover about 80% of cases. For the remaining 20%, TelemetryClient fills the gap:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.ApplicationInsights;
using Microsoft.Azure.Functions.Worker;

public class OrderProcessor
{
    private readonly TelemetryClient _telemetry;

    public OrderProcessor(TelemetryClient telemetry)
    {
        _telemetry = telemetry;
    }

    [Function("ProcessOrder")]
    public async Task Run(
        [QueueTrigger("orders")] OrderMessage order)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            await ProcessAsync(order);

            _telemetry.TrackMetric("OrderProcessingMs",
                stopwatch.ElapsedMilliseconds);
            _telemetry.TrackEvent("OrderProcessed", new Dictionary<string, string>
            {
                ["OrderId"] = order.Id,
                ["ProductCount"] = order.Items.Count.ToString()
            });
        }
        catch (Exception ex)
        {
            _telemetry.TrackException(ex, new Dictionary<string, string>
            {
                ["OrderId"] = order.Id,
                ["Stage"] = "Processing"
            });
            throw; // rethrow so the runtime marks the invocation as failed
        }
    }
}
Those custom properties on TrackException make the difference between "something went wrong" and "order 12345 failed during the processing stage." In a production environment handling hundreds of invocations per minute, that saves hours of guesswork.
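For that constructor injection to work in the isolated worker model, TelemetryClient has to be registered at startup. A minimal Program.cs sketch, assuming the Microsoft.Azure.Functions.Worker.ApplicationInsights package is referenced:

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var host = new HostBuilder()
    .ConfigureFunctionsWorkerDefaults()
    .ConfigureServices(services =>
    {
        // Registers TelemetryClient and the worker-service telemetry pipeline.
        services.AddApplicationInsightsTelemetryWorkerService();
        services.ConfigureFunctionsApplicationInsights();
    })
    .Build();

host.Run();
```

Without this registration, the constructor above fails at resolution time and every invocation errors before the function body runs.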
Alerting: not everything is equally urgent
A common mistake is setting up an alert for every exception. Within a week, the inbox is flooded and all alerts get ignored. More effective: layered alerts.
| Level | Condition | Action |
|---|---|---|
| Critical | Function failure rate > 25% in 5 min | SMS + PagerDuty |
| Warning | P95 latency > 10s in 15 min | Teams/Slack notification |
| Info | Daily failure summary | Email digest |
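Whether a 10-second P95 threshold is realistic can be checked against historical data before wiring up the alert. A sketch, using the standard `duration` column (milliseconds):

```kusto
// P95 latency per function app in 15-minute buckets over the last week;
// only buckets that would have tripped the Warning threshold.
requests
| where timestamp > ago(7d)
| summarize p95 = percentile(duration, 95) by cloud_RoleName, bin(timestamp, 15m)
| where p95 > 10000
| order by p95 desc
```

If this returns rows every day under normal load, the threshold is too tight and the alert will train people to ignore it.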
In the Azure Portal this goes through Alerts → New Alert Rule, but the ARM template approach is reusable and version-controllable:
{
  "type": "Microsoft.Insights/metricAlerts",
  "apiVersion": "2018-03-01",
  "properties": {
    "severity": 1,
    "evaluationFrequency": "PT5M",
    "windowSize": "PT5M",
    "criteria": {
      "odata.type": "Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria",
      "allOf": [
        {
          "name": "HighFailureRate",
          "metricName": "Http5xx",
          "operator": "GreaterThan",
          "threshold": 10,
          "timeAggregation": "Total"
        }
      ]
    }
  }
}
Live Metrics Stream for real-time debugging
Some problems only show up during peak hours. Live Metrics Stream displays what's happening in real time: incoming requests, failures, dependency calls, everything with sub-second delay.
It doesn't replace KQL queries for post-mortem analysis, but for monitoring a deployment in real time or tracking down an active issue, it's indispensable. One tip: don't leave it open all day. Live Metrics adds extra resource consumption to the function app.
Distributed tracing across multiple functions
A queue-triggered function that makes an HTTP call to another function — without tracing, that becomes chaos. Application Insights automatically assigns an operation_id to related telemetry, but only when the Activity context propagates correctly.
union requests, dependencies
| where operation_Id == "abc123"
| project timestamp, itemType, name, duration, success
| order by timestamp asc
This query shows the entire chain: from the first trigger to the last dependency call. Useful for pinpointing exactly where latency hides.
Keeping costs under control
Application Insights charges per GB of ingested data. With high-throughput functions, costs add up fast. Sampling reduces costs but also reduces visibility. Finding a balance is necessary.
Rule of thumb: sample everything except failures. Error scenarios should always be captured completely. Successful requests can be sampled — the patterns remain visible in aggregated data anyway.
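That rule of thumb translates directly to samplingSettings: the excludedTypes field takes a semicolon-separated list of telemetry types, so both requests and exceptions can be kept in full while everything else is sampled harder. A sketch (the rate of 5 items/second is an illustrative value, not a recommendation):

```json
{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "maxTelemetryItemsPerSecond": 5,
        "excludedTypes": "Request;Exception"
      }
    }
  }
}
```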
The difference between a serverless setup that works and one that inspires confidence? Monitoring. Not as an afterthought when things go wrong, but from day one as part of the architecture.
