Intermediate · 2-4 hours
Debugging with AI Agents
Systematically debug complex issues using AI coding agents with structured workflows and MCP integrations.
Last reviewed Feb 27, 2026
Overview
This cookbook walks through a systematic, AI-powered debugging workflow for complex production issues. Rather than relying on ad-hoc prompting, you'll use a structured 5-phase approach with observability tools, MCP servers, and multi-agent patterns to find and fix bugs faster.
Tools covered: Cursor Debug Mode, Claude Code, Sentry Seer, MCP servers
Target audience: Developers dealing with production bugs, hard-to-reproduce issues, and complex multi-service failures
The 5-Phase Agentic Debugging Workflow
Phase 1: Reproduce & Contextualize
Before involving AI, confirm the bug is reproducible locally (or in a staging environment). Then assemble a complete context package:
- Full error message and stack trace
- Relevant logs (last 50–100 lines around the error)
- Screenshots or screen recordings (if UI-related)
- Recent git diff (what changed since it last worked?)
- Environment details (OS, runtime version, dependency versions)
Tip: As of late 2025, GitHub Copilot CLI supports image input; paste a screenshot of a visual bug directly into the conversation.
Prompt template:
Here is the full context for a bug I'm debugging:
Error: [paste error]
Stack trace: [paste trace]
Recent git diff: [paste diff]
Logs: [paste logs]
Before writing any code, list 5–7 possible root causes.
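Assembling the context package can be scripted so nothing gets forgotten under pressure. A minimal Python sketch; the `HEAD~5` diff range, log path, and function names here are illustrative assumptions, not part of any tool's API:

```python
import subprocess
from pathlib import Path

def tail_lines(text: str, n: int = 100) -> str:
    """Return the last n lines of a log dump (Phase 1 recommends 50-100)."""
    return "\n".join(text.splitlines()[-n:])

def build_context_package(log_path: str, error: str, trace: str) -> str:
    """Assemble the Phase 1 context block to paste into an AI session."""
    # What changed since it last worked? (diff range is an assumption)
    diff = subprocess.run(
        ["git", "diff", "HEAD~5"], capture_output=True, text=True
    ).stdout
    logs = tail_lines(Path(log_path).read_text(), 100)
    return (
        f"Error: {error}\n\n"
        f"Stack trace:\n{trace}\n\n"
        f"Recent git diff:\n{diff}\n\n"
        f"Logs:\n{logs}\n\n"
        "Before writing any code, list 5-7 possible root causes."
    )
```

Pipe the result straight into your agent of choice, or save it to a scratch file for reuse across sessions.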
Phase 2: Reason Before Acting
Force the AI to hypothesize before touching code. This prevents speculative rewrites. Prompt:
List 5–7 possible causes for this bug. Do not write or modify any code yet.
For each hypothesis, rate your confidence (Low/Medium/High) and explain why.
Then evaluate the hypotheses using your domain knowledge — you know which services were recently changed, which dependencies are flaky, and which code paths are most complex. Select the top 1–2 hypotheses to pursue first.
This step alone eliminates the most common AI debugging failure: the model immediately rewrites unrelated code because it "seems cleaner."
Phase 3: Instrument for Observability
Once you have your top hypotheses, ask the AI to add targeted logging — not a rewrite, just instrumentation. Prompt:
Hypothesis: [your chosen hypothesis]
Add structured logging around the suspected code path in [file:line range].
Log: input values, intermediate state, and the return value.
Do not change any logic. Only add logging statements.
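In Python codebases, the kind of instrumentation this prompt asks for often lands as a thin logging wrapper that leaves the logic untouched. A sketch; the `trace_calls` decorator and `apply_discount` function are hypothetical examples, not part of the workflow's tooling:

```python
import functools
import json
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("debug-instrumentation")

def trace_calls(fn):
    """Log inputs and the return value without changing the wrapped logic."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.debug(json.dumps({"fn": fn.__name__, "args": repr(args),
                              "kwargs": repr(kwargs)}))
        result = fn(*args, **kwargs)
        log.debug(json.dumps({"fn": fn.__name__, "return": repr(result)}))
        return result
    return wrapper

@trace_calls  # suspected code path from the hypothesis
def apply_discount(price: float, pct: float) -> float:
    return round(price * (1 - pct / 100), 2)
```

Because the decorator is additive, removing it in Phase 5 is a one-line revert per call site.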
Run the application, reproduce the bug, capture the output, and feed it back to the AI:
Here is the output from the instrumented code run:
[paste log output]
Does this confirm or rule out the hypothesis? What do the values tell us?
Claude Code supports live terminal output piping — you can stream the log output directly into the conversation without copy-pasting.
Phase 4: Root Cause & Fix
Use a two-stage fix loop: explain first, then fix.
Stage A — Explain only:
Based on the logs, what is the root cause? Explain exactly what is broken
and why. Do not modify any files yet.
Review the explanation. If it matches your understanding, proceed.
Stage B — Apply fix:
Now implement the fix you described. Make the minimal change necessary.
Do not refactor unrelated code.
After each change:
- Run the existing test suite
- Manually reproduce the original scenario
- Write a new regression test that would have caught this bug
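As an illustration of the regression-test step, here is a minimal Python sketch for a concurrent token-refresh race of the kind described in Phase 5. The `TokenManager` class is hypothetical; a real implementation would call your auth service rather than mint a string:

```python
import threading

class TokenManager:
    """Hypothetical token store; the mutex is the fix under test."""
    def __init__(self):
        self._lock = threading.Lock()
        self.token = None          # None means expired
        self.refresh_count = 0

    def _refresh(self):
        self.refresh_count += 1
        self.token = f"token-{self.refresh_count}"

    def ensure_fresh(self):
        if self.token is None:              # fast path: no lock if fresh
            with self._lock:
                if self.token is None:      # re-check inside the lock
                    self._refresh()
        return self.token

def test_concurrent_token_refresh():
    """Ten concurrent callers must trigger exactly one refresh."""
    mgr = TokenManager()
    threads = [threading.Thread(target=mgr.ensure_fresh) for _ in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert mgr.refresh_count == 1
```

The double-checked pattern keeps the happy path lock-free while closing the race the bug exploited.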
Phase 5: Cleanup & Prevention
- Remove debug logging added in Phase 3
- Write a regression test that fails on the original bug and passes on the fix
- Commit with a descriptive message:
git commit -m "fix(auth): prevent token expiry race condition in refresh handler
Root cause: concurrent refresh calls could both pass the expiry check before
either updated the token, causing a double-refresh and session invalidation.
Fix: add mutex lock around the token refresh critical section.
Tests: added test_concurrent_token_refresh to auth_test.go"
- Update CLAUDE.md with any new system knowledge the AI should carry forward:
## Known Issues & Gotchas
- Token refresh handler requires mutex; race condition fixed in commit abc123
- The user service caches responses for 60s; don't expect immediate consistency
Cursor Debug Mode (v2.2)
Released December 2025, Cursor's dedicated debug mode implements the hypothesis-driven workflow natively. The cycle:
- Hypothesis generation — Cursor analyzes the error and proposes ranked causes
- Targeted logging — adds structured logs to suspected code paths
- Runtime analysis — reads live terminal output after you reproduce the bug
- Fix generation — produces "precise 2–3 line fixes instead of hundreds of lines of speculative code" (Cursor changelog)
Parallel worktrees: Cursor v2.2 can spin up multiple agents, each with a different hypothesis, working in parallel git worktrees. You compare approaches and merge the winner — particularly powerful for bugs with multiple plausible causes.
How to activate:
Cmd/Ctrl + Shift + D → Debug Mode
Paste error + stack trace
Select hypothesis count (3–7)
Claude Code Debugging Workflow
CLAUDE.md setup for debugging context:
# Debugging Context
## Architecture
- Frontend: Next.js 14, deployed on Vercel
- Backend: FastAPI, deployed on Railway
- Database: PostgreSQL 15 via Supabase
- Queue: Redis + Celery for async tasks
## Common Failure Points
- Auth tokens expire after 1 hour; refresh logic is in auth/refresh.py
- The recommendation engine has a 10s timeout; failures are silent
- DB connection pool limit: 20 connections
## Debugging Commands
- Run tests: `pytest tests/ -v --tb=short`
- View logs: `railway logs --tail 100`
- Check queue: `celery -A app.worker inspect active`
Iterative log-feed workflow:
# In your terminal alongside Claude Code:
tail -f app.log | claude -p "Analyze these logs as they come in. \
Alert me to anomalies and hypothesize causes."
Agent hooks for extended debugging sessions:
| Hook Type | Use Case | Max Duration |
|---|---|---|
| Command hooks | Run after specific shell commands | Instant |
| Prompt hooks | Triggered before Claude responds | Instant |
| Agent hooks | Long-running: test suites, build pipelines | Up to 10 minutes |
Example: autonomous debug loop kickoff prompt:
I have a flaky test that fails ~20% of the time: test_payment_webhook.
Here is the test file and the implementation: [paste code]
Your task:
1. Run the test 10 times and collect all failure outputs
2. Identify the pattern in failures (timing? data? environment?)
3. Hypothesize root cause
4. Propose a fix — do not apply it yet, present it for my review
Use agent hooks to run the full test suite in the background.
Docs: Claude Code Agent Hooks
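The run-and-collect loop in step 1 of that prompt is easy to script yourself, which gives the agent cleaner input. A Python sketch; `run_flaky_test` shells out to pytest, and the clustering heuristic (grouping by the final output line) is an assumption about what distinguishes failures:

```python
import collections
import subprocess

def run_flaky_test(test_id: str, runs: int = 10) -> list[str]:
    """Run a single pytest test repeatedly and collect failure output."""
    failures = []
    for _ in range(runs):
        result = subprocess.run(
            ["pytest", test_id, "--tb=short"],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            failures.append(result.stdout)
    return failures

def cluster_failures(failures: list[str]) -> list[tuple[str, int]]:
    """Group failure outputs by their final line (often the assertion),
    most frequent first, to surface the dominant failure mode."""
    counts = collections.Counter(
        out.strip().splitlines()[-1] for out in failures if out.strip()
    )
    return counts.most_common()
```

Feeding the clustered summary to the agent, instead of ten raw tracebacks, makes the pattern in step 2 much easier to spot.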
Multi-Agent Debugging Patterns
Pattern A: Orchestrator + Specialist Agents
An orchestrator agent coordinates four specialists:
| Agent | Responsibility |
|---|---|
| Log Reader | Parses and clusters log output, surfaces anomalies |
| Code Search | Finds relevant code paths via semantic search |
| Fix Generator | Proposes targeted patches |
| Test Runner | Executes tests and reports pass/fail |
Best for: large codebases where the relevant code is hard to locate manually.
Pattern B: Parallel Hypothesis Testing
Spin up N agents, each pursuing a different hypothesis in its own git branch. Set a time limit (5–10 min); whichever agent finds a passing test suite wins. Merge that branch. Best for: bugs with multiple equally plausible causes.
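The setup step for Pattern B can be scripted with git worktrees so each agent gets an isolated checkout. A Python sketch; the branch and path naming is arbitrary, and the injectable `runner` exists only so the git calls can be stubbed out when testing the script itself:

```python
import subprocess

def spawn_hypothesis_worktrees(hypotheses, base_branch="main",
                               runner=subprocess.run):
    """Create one git worktree per hypothesis so agents work in parallel.

    Returns the plan: one dict per (path, branch, hypothesis) triple.
    """
    plan = []
    for i, hypothesis in enumerate(hypotheses, 1):
        branch = f"debug/hypothesis-{i}"
        path = f"../wt-hypothesis-{i}"
        # `git worktree add -b <branch> <path> <base>` creates the branch
        # and checks it out in a sibling directory in one step.
        runner(["git", "worktree", "add", "-b", branch, path, base_branch],
               check=True)
        plan.append({"path": path, "branch": branch, "hypothesis": hypothesis})
    return plan
```

After the time limit, `git worktree remove` the losers and merge the branch whose test suite passes.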
Pattern C: Checkpoint + Reset
From the CHI 2025 AGDebugger paper, this pattern saves agent state at each step. If the agent goes down a wrong path, reset to the last good checkpoint rather than starting over. Best for: complex multi-step debugging where agents tend to compound errors.
Pattern D: MCP-Mediated Agent Communication
Agents communicate via MCP protocol messages rather than shared memory. Each agent exposes a set of tools; the orchestrator calls them. This makes the workflow auditable and reproducible. Best for: production debugging systems where you need a full audit trail.
MCP Servers for Debugging
| MCP Server | What It Provides | Setup |
|---|---|---|
| Sentry MCP | Issues, stack traces, releases, 94.5% root cause accuracy | npx @sentry/mcp-server |
| GitHub MCP | PRs, commits, diffs, blame, issues | npx @github/mcp-server |
| Postgres MCP | Live DB queries, explain plans, slow query log | npx @benborla/mcp-server-postgresql |
| New Relic MCP | APM traces, error rates, throughput metrics | Via New Relic AI dashboard |
| CloudWatch MCP | AWS log groups, metrics, alarms | npx @aws-samples/cloudwatch-mcp |
| Python Debug MCP | Live Python debugger (breakpoints, stack frames) | pip install mcp-server-python-debugger |
| MCP Inspector | Debug MCP connections themselves | npx @modelcontextprotocol/inspector |
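Most MCP-capable clients register these servers through a JSON config. A sketch of a project-level `.mcp.json` entry for the Sentry server, as used by Claude Code; the exact schema varies by client, and the token and org values are placeholders:

```json
{
  "mcpServers": {
    "sentry": {
      "command": "npx",
      "args": ["-y", "@sentry/mcp-server"],
      "env": {
        "SENTRY_AUTH_TOKEN": "<your-token>",
        "SENTRY_ORG": "<your-org>"
      }
    }
  }
}
```

Use MCP Inspector (last row of the table) to verify the server starts and lists its tools before pointing an agent at it.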
Sentry Seer Deep Dive
Sentry Seer is Sentry's AI debugging layer, trained on over 38,000 real issues. Key capabilities:
- 94.5% root cause accuracy on issues with sufficient context (Sentry benchmark, 2025)
- Cross-service tracing — connects a React TypeError to the ASP.NET backend commit that caused it
- Auto-generated fix PRs — Seer doesn't just identify the cause, it opens a pull request with the fix
- Regression detection — flags when a fix reintroduces a previously resolved issue
Example flow:
- React frontend throws TypeError: Cannot read property 'userId' of undefined
- Seer follows the full distributed trace: frontend → API gateway → user service
- Identifies that a 3-day-old ASP.NET commit changed the user response schema, dropping userId
- Opens a PR that adds a backward-compatible userId back to the response
Setup:
npx @sentry/mcp-server
# Configure with SENTRY_AUTH_TOKEN and SENTRY_ORG env vars
# Then in Claude Code: "Use the Sentry MCP to analyze issue SENTRY-12345"
Full-Stack Debugging Stack
| Layer | Tool | What to Look For |
|---|---|---|
| Frontend | Browser Console + LogRocket | JS errors, network failures, UI state |
| Backend | Sentry Seer • New Relic MCP | Exceptions, slow endpoints, error rates |
| Database | Postgres MCP | Slow queries, lock contention, missing indexes |
| Infrastructure | CloudWatch / Grafana MCP | CPU spikes, memory pressure, network I/O |
| Code | GitHub MCP | Recent commits, blame, PR history |
Layer-by-layer debugging prompts:
# Frontend layer
"Analyze these browser console errors and network requests from LogRocket.
What patterns suggest a root cause at the API layer?"
# Backend layer
"Use the Sentry MCP to fetch the last 10 occurrences of SENTRY-5678.
Cluster them by: user agent, endpoint, time of day, and release version."
# Database layer
"Use the Postgres MCP to run EXPLAIN ANALYZE on this slow query [SQL].
Suggest index changes. Do not modify schema yet — show the migration SQL first."
# Infrastructure layer
"Query CloudWatch for CPU and memory metrics on the api-service ECS task
from 2025-02-26 14:00 to 15:00 UTC. Correlate spikes with error rate in Sentry."
AI Debugging Limitations (Honest Assessment)
| Scenario | AI Capability | Notes |
|---|---|---|
| Simple bugs (off-by-one, null checks) | Strong | AI excels here; often solves in one shot |
| Stack trace analysis | Strong | Excellent at reading and explaining traces |
| Log clustering / anomaly detection | Strong | Pattern matching in large log sets |
| Memory leaks | Good | Needs heap dumps and allocation traces |
| Distributed system failures (with tracing) | Good | Works well with OpenTelemetry traces |
| Race conditions / concurrency bugs | Limited | Hard to reason about without deterministic replay |
| Cross-file architectural bugs | Limited | Loses context across many files |
| Business logic bugs (wrong behavior, not error) | Limited | Requires deep domain understanding |
| Persistent debugging memory across sessions | Not available | Each session starts fresh (use CLAUDE.md as workaround) |
Current benchmarks (as of early 2026):
- Claude 4 Opus: 67.6% on SWE-bench Verified
- GPT-5: 65% on SWE-bench Verified
SWE-bench measures end-to-end resolution of real GitHub issues — a reasonable proxy for debugging capability, though real-world bugs often provide more context than the benchmark.
Common Pitfalls
Pitfall 1: Letting AI make speculative rewrites. The most common failure. Always use the "explain first, don't modify files" constraint in Phase 4. A targeted 3-line fix beats a 200-line refactor that introduces new bugs.
Pitfall 2: Not providing enough context. AI debugging quality scales directly with context quality. Logs, stack trace, recent git diff, and environment details are all required. A vague "it's broken" produces useless output.
Pitfall 3: Chasing ghost bugs. If you can't reproduce the bug consistently, stop and build reproducibility first. AI cannot reliably debug non-deterministic issues without a reproduction case.
Pitfall 4: Not writing regression tests. Fixes without regression tests get re-broken. Every AI-assisted fix should be paired with a test that would have caught the original bug. Make this non-negotiable.
Related cookbooks
AI-Powered Code Review & Quality
Automate code review and enforce quality standards using AI-powered tools and agentic workflows.
Building AI-Powered Applications
Build applications powered by LLMs, RAG, and AI agents using Claude Code, Cursor, and modern AI frameworks.
Building APIs & Backends with AI Agents
Design and build robust APIs and backend services with AI coding agents, from REST to GraphQL.