Root Cause Analysis

TL;DR
  • Hypothesis-driven investigation, not random log searching
  • Full evidence chain showing why this is the cause
  • Memory recalls similar past incidents and their fixes

The problem: Log searching is not investigation

Most debugging starts with "show me the errors." You query logs, scroll through results, copy a timestamp, switch tools, run another query. You're not investigating—you're correlating data manually, holding the reasoning in your head.

The real problem isn't finding logs. It's knowing which questions to ask, which tools to check, and how to connect the dots across logs, metrics, deployments, and past incidents. That mental model lives in the heads of your senior engineers, and they can't be on every call. New team members spend hours on issues that veterans solve in minutes, because the reasoning isn't documented anywhere.

How Azure SRE Agent solves this

[Diagram: Root cause analysis flow. Symptoms (errors, alerts) → Gather context (logs, metrics, deploys) → Form hypotheses (test & validate) → Root cause (with evidence) → Fix (resolve). Data sources: App Insights, Azure Monitor, Deployments, Source code, Memory. Agent queries multiple data sources and correlates evidence to identify root cause.]

Your agent investigates like an expert SRE. It doesn't just search logs—it forms hypotheses about what went wrong and systematically validates each one using evidence.

  1. Gathers context — queries Application Insights, Azure Monitor, deployment history, activity logs, and resource properties
  2. Forms hypotheses — generates theories based on the evidence pattern
  3. Validates each one — tests hypotheses systematically, ruling out false leads
  4. Explains the conclusion — shows the full reasoning trail with supporting evidence and citations
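
The four steps above can be sketched as a small loop. This is a minimal illustration, not the agent's implementation: the `Evidence` and `Hypothesis` types, the check functions, and the thresholds (a one-hour deployment window, 90% DTU, 2x query duration) are all hypothetical names and values chosen for the sketch. The sample figures mirror the database timeout example on this page.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Tuple


@dataclass
class Evidence:
    """Context gathered in step 1 (fields are illustrative assumptions)."""
    last_deployment: datetime
    error_onset: datetime
    dtu_percent: float
    query_duration_ratio: float  # current duration divided by normal duration


@dataclass
class Hypothesis:
    name: str
    check: Callable[[Evidence], Tuple[bool, str]]  # returns (validated, reason)


def deployment_broke_it(ev: Evidence) -> Tuple[bool, str]:
    # Assumed rule: a deployment is suspect only if errors began within an hour of it.
    gap = ev.error_onset - ev.last_deployment
    validated = timedelta(0) <= gap <= timedelta(hours=1)
    return validated, f"last deployment {gap} before error onset"


def database_overloaded(ev: Evidence) -> Tuple[bool, str]:
    # Assumed thresholds: DTU at or above 90% and queries at least 2x slower.
    validated = ev.dtu_percent >= 90 and ev.query_duration_ratio >= 2
    reason = f"DTU at {ev.dtu_percent:.0f}%, query duration {ev.query_duration_ratio:g}x normal"
    return validated, reason


def investigate(ev: Evidence, hypotheses: List[Hypothesis]) -> List[Tuple[str, str, str]]:
    """Steps 3 and 4: test every hypothesis and keep the reasoning trail."""
    trail = []
    for h in hypotheses:
        validated, reason = h.check(ev)
        trail.append((h.name, "VALIDATED" if validated else "INVALIDATED", reason))
    return trail


now = datetime(2024, 1, 10, 12, 0)  # arbitrary fixed timestamp for the sketch
evidence = Evidence(
    last_deployment=now - timedelta(days=3),
    error_onset=now - timedelta(minutes=30),
    dtu_percent=98.0,
    query_duration_ratio=4.0,
)
trail = investigate(
    evidence,
    [
        Hypothesis("Recent deployment broke something", deployment_broke_it),
        Hypothesis("Database overloaded", database_overloaded),
    ],
)
for name, result, reason in trail:
    print(f"{name}: {result} ({reason})")
```

Keeping the reason string next to each verdict is what makes the trail auditable: a ruled-out hypothesis stays in the output with the evidence that ruled it out.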

What makes this different

Unlike log searching, your agent reasons about the problem. "Show me errors" gives you data to interpret. Your agent interprets the data for you—forming theories, testing them, and explaining conclusions.

Unlike static dashboards, your agent adapts to the specific incident. It doesn't just show you metrics—it decides which metrics matter, correlates them with other evidence, and tells you why.

Unlike scripts, your agent handles novel situations. A script runs the same steps every time. Your agent reasons about what's different this time and adjusts its investigation accordingly.


Before and after

                       | Before                                | After
-----------------------|---------------------------------------|----------------------------------
Investigation approach | Search logs, hope you find something  | Agent forms and tests hypotheses
Tools opened           | 4+ portals, manual correlation        | 0 (agent queries all sources)
Reasoning              | "I think it's the database..."        | "Database DTU at 98%, validated"
Evidence trail         | In your head                          | Full chain with explanation
Next time              | Start from scratch                    | Memory recalls similar incidents

Example: Database timeout investigation

Symptom: "500 errors on /api/orders endpoint"

HYPOTHESIS 1: Recent deployment broke something
├─ Checked: Last deployment was 3 days ago
├─ Evidence: Error rate stable until 30 minutes ago
└─ Result: INVALIDATED

HYPOTHESIS 2: Database overloaded
├─ Checked: Azure SQL metrics (CPU, DTU, connections)
├─ Evidence: DTU at 98%, query duration 4x normal
├─ Traced: SELECT * FROM orders WHERE... taking 8.2s
└─ Result: VALIDATED

ROOT CAUSE: Orders table missing index on customer_id column.
Query plan shows full table scan on 2.1M rows.

RECOMMENDED ACTION: Add index on orders.customer_id
Similar fix applied in INC-2341 (3 weeks ago)
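
To see why a missing index produces a full table scan, the sketch below reproduces the pattern locally. It uses SQLite from Python's standard library as a stand-in for Azure SQL, so the plan text and tooling differ from what the agent inspects. The table layout, the index name `idx_orders_customer_id`, and the row count are assumptions, and because the WHERE clause is elided in the trace above, filtering on `customer_id` is inferred from the root cause statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only customer_id matters for this illustration.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 500, i * 1.5) for i in range(10_000)],
)


def query_plan(connection: sqlite3.Connection) -> str:
    """Return SQLite's plan for the assumed lookup by customer_id."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = ?", (42,)
    ).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # no index yet: every row is scanned
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")
plan_after = query_plan(conn)  # index available: direct lookup

print("before:", plan_before)
print("after: ", plan_after)
```

The exact plan wording varies by SQLite version, but the first plan reports a scan of `orders` and the second reports a search using the new index. On Azure SQL the equivalent check would be done against the actual execution plan.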

Get started

Root cause analysis works out of the box with Azure's built-in tools. To enable deeper analysis:

Enhancement      | What it enables                                  | Setup
-----------------|--------------------------------------------------|--------------------------
Source control   | Error-to-code correlation, semantic code search  | Connect source code →
Knowledge base   | Context for hypothesis generation                | Upload knowledge →
Custom telemetry | Business metrics in Kusto                        | Set up Kusto connector →

Capability               | What it adds
-------------------------|----------------------------------
Incident Response →      | Full incident handling workflow
Azure Observability →    | Built-in Azure diagnostic tools
External Observability → | Datadog, Splunk, custom systems