
Intermediate · 2-4 hours

Debugging with AI Agents

Systematically debug complex issues using AI coding agents with structured workflows and MCP integrations.

Last reviewed Feb 27, 2026

Overview

This cookbook walks through a systematic, AI-powered debugging workflow for complex production issues. Rather than relying on ad-hoc prompting, you'll use a structured five-phase approach with observability tools, MCP servers, and multi-agent patterns to find and fix bugs faster.

Tools covered: Cursor Debug Mode, Claude Code, Sentry Seer, MCP servers

Target audience: Developers dealing with production bugs, hard-to-reproduce issues, and complex multi-service failures


The 5-Phase Agentic Debugging Workflow

Phase 1: Reproduce & Contextualize

Before involving AI, confirm the bug is reproducible locally (or in a staging environment). Then assemble a complete context package:

  • Full error message and stack trace
  • Relevant logs (last 50–100 lines around the error)
  • Screenshots or screen recordings (if UI-related)
  • Recent git diff (what changed since it last worked?)
  • Environment details (OS, runtime version, dependency versions)

Tip: GitHub Copilot CLI supports image input as of late 2025 — paste a screenshot of a visual bug directly into the conversation.

Prompt template:

Here is the full context for a bug I'm debugging:

Error: [paste error]
Stack trace: [paste trace]
Recent git diff: [paste diff]
Logs: [paste logs]

Before writing any code, list 5–7 possible root causes.
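Assembling the context package can itself be scripted. Below is a minimal sketch in Python; the file names, the `HEAD~5` diff range, and the 100-line log tail are illustrative assumptions, not prescriptions:

```python
# Sketch: collect the Phase 1 context package into one pasteable file.
import subprocess
import sys
from pathlib import Path

def build_context(log_path="app.log", lines=100, out="bug_context.md"):
    """Gather environment info, a recent git diff, and the log tail."""
    try:
        diff = subprocess.run(
            ["git", "diff", "HEAD~5", "--stat"],
            capture_output=True, text=True,
        ).stdout
    except OSError:
        diff = "(git not available)"
    env = f"Python {sys.version.split()[0]} on {sys.platform}"
    log_file = Path(log_path)
    logs = log_file.read_text().splitlines()[-lines:] if log_file.exists() else []
    Path(out).write_text(
        f"## Environment\n{env}\n\n"
        f"## Recent diff (HEAD~5)\n{diff}\n\n"
        f"## Last {len(logs)} log lines\n" + "\n".join(logs) + "\n"
    )
    return out
```

Paste the resulting file into the prompt template above, adding the error and stack trace by hand.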

Phase 2: Reason Before Acting

Force the AI to hypothesize before touching code. This prevents speculative rewrites. Prompt:

List 5–7 possible causes for this bug. Do not write or modify any code yet.
For each hypothesis, rate your confidence (Low/Medium/High) and explain why.

Then evaluate the hypotheses using your domain knowledge — you know which services were recently changed, which dependencies are flaky, and which code paths are most complex. Select the top 1–2 hypotheses to pursue first.

This step alone eliminates the most common AI debugging failure: the model immediately rewrites unrelated code because it "seems cleaner."

Phase 3: Instrument for Observability

Once you have your top hypotheses, ask the AI to add targeted logging — not a rewrite, just instrumentation. Prompt:

Hypothesis: [your chosen hypothesis]

Add structured logging around the suspected code path in [file:line range].
Log: input values, intermediate state, and the return value.
Do not change any logic. Only add logging statements.
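What that instrumentation might look like in a Python codebase — a hypothetical `process_order` function stands in for your suspected code path; only logging statements are added, no logic changes:

```python
# Hypothetical instrumented version of a suspected code path.
# Structured (JSON) log lines make the output easy to feed back to the AI.
import json
import logging

logger = logging.getLogger("debug.instrumentation")

def process_order(order):
    logger.debug(json.dumps({"stage": "input", "order_id": order["id"],
                             "total": order["total"]}))
    discounted = order["total"] * (1 - order.get("discount", 0))
    logger.debug(json.dumps({"stage": "intermediate",
                             "discounted": discounted}))
    result = round(discounted, 2)
    logger.debug(json.dumps({"stage": "return", "result": result}))
    return result
```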

Run the application, reproduce the bug, capture the output, and feed it back to the AI:

Here is the output from the instrumented code run:
[paste log output]

Does this confirm or rule out the hypothesis? What do the values tell us?

Claude Code supports live terminal output piping — you can stream the log output directly into the conversation without copy-pasting.

Phase 4: Root Cause & Fix

Use a two-stage fix loop: explain first, then fix. Stage A — Explain only:

Based on the logs, what is the root cause? Explain exactly what is broken
and why. Do not modify any files yet.

Review the explanation. If it matches your understanding, proceed: Stage B — Apply fix:

Now implement the fix you described. Make the minimal change necessary.
Do not refactor unrelated code.

After each change:

  1. Run the existing test suite
  2. Manually reproduce the original scenario
  3. Write a new regression test that would have caught this bug

Phase 5: Cleanup & Prevention

  1. Remove debug logging added in Phase 3
  2. Write a regression test that fails on the original bug and passes on the fix
  3. Commit with a descriptive message:
git commit -m "fix(auth): prevent token expiry race condition in refresh handler

Root cause: concurrent refresh calls could both pass the expiry check before
either updated the token, causing a double-refresh and session invalidation.

Fix: add mutex lock around the token refresh critical section.
Tests: added test_concurrent_token_refresh to auth_test.go"
  4. Update CLAUDE.md with any new system knowledge the AI should carry forward:
## Known Issues & Gotchas
- Token refresh handler requires mutex; race condition fixed in commit abc123
- The user service caches responses for 60s; don't expect immediate consistency
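The regression test from the example commit message can be sketched in Python (the original references Go, but the shape is the same). `TokenManager` here is a hypothetical stand-in for the real refresh handler:

```python
# Sketch of a regression test for the token-refresh race described above.
# Without the mutex, several threads could pass the expiry check and all
# refresh; with it, exactly one refresh happens per expired version.
import threading

class TokenManager:
    """Minimal stand-in for a token store with a mutex-guarded refresh."""
    def __init__(self):
        self._lock = threading.Lock()
        self.token_version = 0
        self.refresh_count = 0

    def refresh_if_expired(self, seen_version):
        with self._lock:                          # the fix: critical section
            if self.token_version == seen_version:  # expiry check
                self.refresh_count += 1             # simulated network refresh
                self.token_version += 1

def test_concurrent_token_refresh():
    mgr = TokenManager()
    threads = [threading.Thread(target=mgr.refresh_if_expired, args=(0,))
               for _ in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert mgr.refresh_count == 1  # fails on the original racy code
```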

Cursor Debug Mode (v2.2)

Released December 2025, Cursor's dedicated debug mode implements the hypothesis-driven workflow natively. The cycle:

  1. Hypothesis generation — Cursor analyzes the error and proposes ranked causes
  2. Targeted logging — adds structured logs to suspected code paths
  3. Runtime analysis — reads live terminal output after you reproduce the bug
  4. Fix generation — produces "precise 2–3 line fixes instead of hundreds of lines of speculative code" (Cursor changelog)

Parallel worktrees: Cursor v2.2 can spin up multiple agents, each with a different hypothesis, working in parallel git worktrees. You compare approaches and merge the winner — particularly powerful for bugs with multiple plausible causes.

How to activate:

Cmd/Ctrl + Shift + D  →  Debug Mode
Paste error + stack trace
Select hypothesis count (3–7)

Claude Code Debugging Workflow

CLAUDE.md setup for debugging context:

# Debugging Context

## Architecture
- Frontend: Next.js 14, deployed on Vercel
- Backend: FastAPI, deployed on Railway
- Database: PostgreSQL 15 via Supabase
- Queue: Redis + Celery for async tasks

## Common Failure Points
- Auth tokens expire after 1 hour; refresh logic is in auth/refresh.py
- The recommendation engine has a 10s timeout; failures are silent
- DB connection pool limit: 20 connections

## Debugging Commands
- Run tests: `pytest tests/ -v --tb=short`
- View logs: `railway logs --tail 100`
- Check queue: `celery -A app.worker inspect active`

Iterative log-feed workflow:

# In your terminal alongside Claude Code:
tail -f app.log | claude -p "Analyze these logs as they come in. \
Alert me to anomalies and hypothesize causes."

Agent hooks for extended debugging sessions:

| Hook type | Use case | Max duration |
| --- | --- | --- |
| Command hooks | Run after specific shell commands | Instant |
| Prompt hooks | Triggered before Claude responds | Instant |
| Agent hooks | Long-running: test suites, build pipelines | Up to 10 minutes |
Example: autonomous debug loop kickoff
I have a flaky test that fails ~20% of the time: test_payment_webhook.
Here is the test file and the implementation: [paste code]

Your task:
1. Run the test 10 times and collect all failure outputs
2. Identify the pattern in failures (timing? data? environment?)
3. Hypothesize root cause
4. Propose a fix — do not apply it yet, present it for my review

Use agent hooks to run the full test suite in the background.

Docs: Claude Code Agent Hooks
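Step 1 of that prompt — running the flaky test repeatedly and keeping only failing output — can also be done deterministically yourself. A Python sketch; the pytest invocation is an assumption, so swap in your own runner:

```python
# Sketch: run a flaky test N times and collect only the failing outputs.
import subprocess

def run_pytest(test_id):
    """Run one test via pytest; return (exit_code, combined_output)."""
    r = subprocess.run(["pytest", test_id, "-q"],
                       capture_output=True, text=True)
    return r.returncode, r.stdout + r.stderr

def collect_failures(test_id, runs=10, runner=run_pytest):
    """Return [(run_index, output)] for every failing run."""
    failures = []
    for i in range(runs):
        code, output = runner(test_id)
        if code != 0:
            failures.append((i, output))
    return failures
```

Feed the collected failure outputs back to the agent and ask it to cluster them by timing, data, and environment.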


Multi-Agent Debugging Patterns

Pattern A: Orchestrator + Specialist Agents

An orchestrator agent coordinates four specialists:

| Agent | Responsibility |
| --- | --- |
| Log Reader | Parses and clusters log output, surfaces anomalies |
| Code Search | Finds relevant code paths via semantic search |
| Fix Generator | Proposes targeted patches |
| Test Runner | Executes tests and reports pass/fail |
Best for: large codebases where the relevant code is hard to locate manually.

Pattern B: Parallel Hypothesis Testing

Spin up N agents, each pursuing a different hypothesis in its own git branch. Set a time limit (5–10 min); whichever agent finds a passing test suite wins. Merge that branch. Best for: bugs with multiple equally plausible causes.
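The "first passing suite wins" selection step can be sketched as a small Python harness. Worktree paths and the pytest command are assumptions; this only shows the race-to-green logic, not agent orchestration:

```python
# Sketch: run the test suite in N hypothesis worktrees in parallel and
# pick the first one whose suite passes.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_suite(worktree):
    """Run the test suite in one worktree; True if it passes."""
    result = subprocess.run(["pytest", "tests/", "-x", "-q"],
                            cwd=worktree, capture_output=True)
    return result.returncode == 0

def first_passing(worktrees, runner=run_suite):
    """Run all hypothesis worktrees in parallel; return the first winner."""
    with ThreadPoolExecutor(max_workers=len(worktrees)) as pool:
        for worktree, passed in zip(worktrees, pool.map(runner, worktrees)):
            if passed:
                return worktree
    return None
```

Create the worktrees beforehand with `git worktree add ../wt-hypothesis-1 -b hypothesis-1` (and so on), then merge the winning branch.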

Pattern C: Checkpoint + Reset

From the CHI 2025 AGDebugger paper, this pattern saves agent state at each step. If the agent goes down a wrong path, reset to the last good checkpoint rather than starting over. Best for: complex multi-step debugging where agents tend to compound errors.

Pattern D: MCP-Mediated Agent Communication

Agents communicate via MCP protocol messages rather than shared memory. Each agent exposes a set of tools; the orchestrator calls them. This makes the workflow auditable and reproducible. Best for: production debugging systems where you need a full audit trail.


MCP Servers for Debugging

| MCP server | What it provides | Setup |
| --- | --- | --- |
| Sentry MCP | Issues, stack traces, releases, 94.5% root cause accuracy | `npx @sentry/mcp-server` |
| GitHub MCP | PRs, commits, diffs, blame, issues | `npx @github/mcp-server` |
| Postgres MCP | Live DB queries, explain plans, slow query log | `npx @benborla/mcp-server-postgresql` |
| New Relic MCP | APM traces, error rates, throughput metrics | Via New Relic AI dashboard |
| CloudWatch MCP | AWS log groups, metrics, alarms | `npx @aws-samples/cloudwatch-mcp` |
| Python Debug MCP | Live Python debugger (breakpoints, stack frames) | `pip install mcp-server-python-debugger` |
| MCP Inspector | Debug MCP connections themselves | `npx @modelcontextprotocol/inspector` |
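To register a server persistently rather than launching it ad hoc, Claude Code reads a project-scoped `.mcp.json`. A sketch for the Sentry server, using the env vars named in the setup section below (token and org values are placeholders):

```json
{
  "mcpServers": {
    "sentry": {
      "command": "npx",
      "args": ["@sentry/mcp-server"],
      "env": {
        "SENTRY_AUTH_TOKEN": "your-token-here",
        "SENTRY_ORG": "your-org"
      }
    }
  }
}
```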

Sentry Seer Deep Dive

Sentry Seer is Sentry's AI debugging layer, trained on over 38,000 real issues. Key capabilities:

  • 94.5% root cause accuracy on issues with sufficient context (Sentry benchmark, 2025)
  • Cross-service tracing — connects a React TypeError to the ASP.NET backend commit that caused it
  • Auto-generated fix PRs — Seer doesn't just identify the cause, it opens a pull request with the fix
  • Regression detection — flags when a fix reintroduces a previously resolved issue

Example flow:

  1. React frontend throws TypeError: Cannot read property 'userId' of undefined
  2. Seer traces the full distributed trace: frontend → API gateway → user service
  3. Identifies that a 3-day-old ASP.NET commit changed the user response schema, dropping userId
  4. Opens a PR that adds backward-compatible userId back to the response

Setup:
npx @sentry/mcp-server
# Configure with SENTRY_AUTH_TOKEN and SENTRY_ORG env vars
# Then in Claude Code: "Use the Sentry MCP to analyze issue SENTRY-12345"

Full-Stack Debugging Stack

| Layer | Tool | What to look for |
| --- | --- | --- |
| Frontend | Browser console + LogRocket | JS errors, network failures, UI state |
| Backend | Sentry Seer, New Relic MCP | Exceptions, slow endpoints, error rates |
| Database | Postgres MCP | Slow queries, lock contention, missing indexes |
| Infrastructure | CloudWatch / Grafana MCP | CPU spikes, memory pressure, network I/O |
| Code | GitHub MCP | Recent commits, blame, PR history |

Layer-by-layer debugging prompts:
# Frontend layer
"Analyze these browser console errors and network requests from LogRocket.
What patterns suggest a root cause at the API layer?"

# Backend layer
"Use the Sentry MCP to fetch the last 10 occurrences of SENTRY-5678.
Cluster them by: user agent, endpoint, time of day, and release version."

# Database layer
"Use the Postgres MCP to run EXPLAIN ANALYZE on this slow query [SQL].
Suggest index changes. Do not modify schema yet — show the migration SQL first."

# Infrastructure layer
"Query CloudWatch for CPU and memory metrics on the api-service ECS task
from 2025-02-26 14:00 to 15:00 UTC. Correlate spikes with error rate in Sentry."

AI Debugging Limitations (Honest Assessment)

| Scenario | AI capability | Notes |
| --- | --- | --- |
| Simple bugs (off-by-one, null checks) | Strong | AI excels here; often solves in one shot |
| Stack trace analysis | Strong | Excellent at reading and explaining traces |
| Log clustering / anomaly detection | Strong | Pattern matching in large log sets |
| Memory leaks | Good | Needs heap dumps and allocation traces |
| Distributed system failures (with tracing) | Good | Works well with OpenTelemetry traces |
| Race conditions / concurrency bugs | Limited | Hard to reason about without deterministic replay |
| Cross-file architectural bugs | Limited | Loses context across many files |
| Business logic bugs (wrong behavior, not error) | Limited | Requires deep domain understanding |
| Persistent debugging memory across sessions | Not available | Each session starts fresh (use CLAUDE.md as a workaround) |
A note on benchmarks (as of early 2026): SWE-bench measures end-to-end resolution of real GitHub issues — a reasonable proxy for debugging capability, though real-world bugs often come with more context than the benchmark provides.


Common Pitfalls

Pitfall 1: Letting AI make speculative rewrites
The most common failure. Always use the "explain first, don't modify files" constraint in Phase 4. A targeted 3-line fix beats a 200-line refactor that introduces new bugs.

Pitfall 2: Not providing enough context
AI debugging quality scales directly with context quality. Logs, stack trace, recent git diff, and environment details are all required. A vague "it's broken" produces useless output.

Pitfall 3: Chasing ghost bugs
If you can't reproduce the bug consistently, stop and build reproducibility first. AI cannot reliably debug non-deterministic issues without a reproduction case.

Pitfall 4: Not writing regression tests
Fixes without regression tests get re-broken. Every AI-assisted fix should be paired with a test that would have caught the original bug. Make this non-negotiable.
