Intermediate · 2-4 hours
Debugging with AI Agents
Systematically debug complex issues using AI coding agents with structured workflows and MCP integrations.
Last reviewed Feb 27, 2026
Overview
This cookbook walks through a systematic, AI-powered debugging workflow for complex production issues. Rather than relying on ad-hoc prompting, you'll use a structured 5-phase approach with observability tools, MCP servers, and multi-agent patterns to find and fix bugs faster.
Tools covered: Cursor Debug Mode, Claude Code, Sentry Seer, MCP servers
Target audience: Developers dealing with production bugs, hard-to-reproduce issues, and complex multi-service failures
The 5-Phase Agentic Debugging Workflow
Phase 1: Reproduce & Contextualize
Before involving AI, confirm the bug is reproducible locally (or in a staging environment). Then assemble a complete context package:
- Full error message and stack trace
- Relevant logs (last 50–100 lines around the error)
- Screenshots or screen recordings (if UI-related)
- Recent git diff (what changed since it last worked?)
- Environment details (OS, runtime version, dependency versions)
Tip: As of late 2025, GitHub Copilot CLI supports image input; paste a screenshot of a visual bug directly into the conversation.
Prompt template:
Here is the full context for a bug I'm debugging:
Error: [paste error]
Stack trace: [paste trace]
Recent git diff: [paste diff]
Logs: [paste logs]
Before writing any code, list 5–7 possible root causes.
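Assembling the context package can be scripted so nothing gets forgotten under pressure. A minimal Python sketch; the `HEAD~5` diff range, log path, and function names here are illustrative assumptions, not part of any tool's API:

```python
import subprocess
from pathlib import Path

def tail_lines(text: str, n: int = 100) -> str:
    """Return the last n lines of a log dump (Phase 1 recommends 50-100)."""
    return "\n".join(text.splitlines()[-n:])

def build_context_package(log_path: str, error: str, trace: str) -> str:
    """Assemble the Phase 1 context block to paste into an AI session."""
    # What changed since it last worked? (diff range is an assumption)
    diff = subprocess.run(
        ["git", "diff", "HEAD~5"], capture_output=True, text=True
    ).stdout
    logs = tail_lines(Path(log_path).read_text(), 100)
    return (
        f"Error: {error}\n\n"
        f"Stack trace:\n{trace}\n\n"
        f"Recent git diff:\n{diff}\n\n"
        f"Logs:\n{logs}\n\n"
        "Before writing any code, list 5-7 possible root causes."
    )
```

Pipe the result straight into your agent of choice, or save it to a scratch file for reuse across sessions.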
Phase 2: Reason Before Acting
Force the AI to hypothesize before touching code. This prevents speculative rewrites. Prompt:
List 5–7 possible causes for this bug. Do not write or modify any code yet.
For each hypothesis, rate your confidence (Low/Medium/High) and explain why.
Then evaluate the hypotheses using your domain knowledge — you know which services were recently changed, which dependencies are flaky, and which code paths are most complex. Select the top 1–2 hypotheses to pursue first.
This step alone eliminates the most common AI debugging failure: the model immediately rewrites unrelated code because it "seems cleaner."
Phase 3: Instrument for Observability
Once you have your top hypotheses, ask the AI to add targeted logging — not a rewrite, just instrumentation. Prompt:
Hypothesis: [your chosen hypothesis]
Add structured logging around the suspected code path in [file:line range].
Log: input values, intermediate state, and the return value.
Do not change any logic. Only add logging statements.
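In Python codebases, the kind of instrumentation this prompt asks for often lands as a thin logging wrapper that leaves the logic untouched. A sketch; the `trace_calls` decorator and `apply_discount` function are hypothetical examples, not part of the workflow's tooling:

```python
import functools
import json
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("debug-instrumentation")

def trace_calls(fn):
    """Log inputs and the return value without changing the wrapped logic."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.debug(json.dumps({"fn": fn.__name__, "args": repr(args),
                              "kwargs": repr(kwargs)}))
        result = fn(*args, **kwargs)
        log.debug(json.dumps({"fn": fn.__name__, "return": repr(result)}))
        return result
    return wrapper

@trace_calls  # suspected code path from the hypothesis
def apply_discount(price: float, pct: float) -> float:
    return round(price * (1 - pct / 100), 2)
```

Because the decorator is additive, removing it in Phase 5 is a one-line revert per call site.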
Run the application, reproduce the bug, capture the output, and feed it back to the AI:
Here is the output from the instrumented code run:
[paste log output]
Does this confirm or rule out the hypothesis? What do the values tell us?
Claude Code supports live terminal output piping — you can stream the log output directly into the conversation without copy-pasting.
Phase 4: Root Cause & Fix
Use a two-stage fix loop: explain first, then fix.
Stage A — Explain only:
Based on the logs, what is the root cause? Explain exactly what is broken
and why. Do not modify any files yet.
Review the explanation. If it matches your understanding, proceed.
Stage B — Apply fix:
Now implement the fix you described. Make the minimal change necessary.
Do not refactor unrelated code.
After each change:
- Run the existing test suite
- Manually reproduce the original scenario
- Write a new regression test that would have caught this bug
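As an illustration of the regression-test step, here is a minimal Python sketch for a concurrent token-refresh race of the kind described in Phase 5. The `TokenManager` class is hypothetical; a real implementation would call your auth service rather than mint a string:

```python
import threading

class TokenManager:
    """Hypothetical token store; the mutex is the fix under test."""
    def __init__(self):
        self._lock = threading.Lock()
        self.token = None          # None means expired
        self.refresh_count = 0

    def _refresh(self):
        self.refresh_count += 1
        self.token = f"token-{self.refresh_count}"

    def ensure_fresh(self):
        if self.token is None:              # fast path: no lock if fresh
            with self._lock:
                if self.token is None:      # re-check inside the lock
                    self._refresh()
        return self.token

def test_concurrent_token_refresh():
    """Ten concurrent callers must trigger exactly one refresh."""
    mgr = TokenManager()
    threads = [threading.Thread(target=mgr.ensure_fresh) for _ in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert mgr.refresh_count == 1
```

The double-checked pattern keeps the happy path lock-free while closing the race the bug exploited.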
Phase 5: Cleanup & Prevention
- Remove debug logging added in Phase 3
- Write a regression test that fails on the original bug and passes on the fix
- Commit with a descriptive message:
git commit -m "fix(auth): prevent token expiry race condition in refresh handler
Root cause: concurrent refresh calls could both pass the expiry check before
either updated the token, causing a double-refresh and session invalidation.
Fix: add mutex lock around the token refresh critical section.
Tests: added test_concurrent_token_refresh to auth_test.go"
- Update CLAUDE.md with any new system knowledge the AI should carry forward:
## Known Issues & Gotchas
- Token refresh handler requires mutex; race condition fixed in commit abc123
- The user service caches responses for 60s; don't expect immediate consistency
Cursor Debug Mode (v2.2)
Released December 2025, Cursor's dedicated debug mode implements the hypothesis-driven workflow natively. The cycle:
- Hypothesis generation — Cursor analyzes the error and proposes ranked causes
- Targeted logging — adds structured logs to suspected code paths
- Runtime analysis — reads live terminal output after you reproduce the bug
- Fix generation — produces "precise 2–3 line fixes instead of hundreds of lines of speculative code" (Cursor changelog)
Parallel worktrees: Cursor v2.2 can spin up multiple agents, each with a different hypothesis, working in parallel git worktrees. You compare approaches and merge the winner — particularly powerful for bugs with multiple plausible causes.
How to activate:
Cmd/Ctrl + Shift + D → Debug Mode
Paste error + stack trace
Select hypothesis count (3–7)
Claude Code Debugging Workflow
CLAUDE.md setup for debugging context:
# Debugging Context
## Architecture
- Frontend: Next.js 14, deployed on Vercel
- Backend: FastAPI, deployed on Railway
- Database: PostgreSQL 15 via Supabase
- Queue: Redis + Celery for async tasks
## Common Failure Points
- Auth tokens expire after 1 hour; refresh logic is in auth/refresh.py
- The recommendation engine has a 10s timeout; failures are silent
- DB connection pool limit: 20 connections
## Debugging Commands
- Run tests: `pytest tests/ -v --tb=short`
- View logs: `railway logs --tail 100`
- Check queue: `celery -A app.worker inspect active`
Iterative log-feed workflow:
# In your terminal alongside Claude Code:
tail -f app.log | claude -p "Analyze these logs as they come in. \
Alert me to anomalies and hypothesize causes."
Agent hooks for extended debugging sessions:
| Hook Type | Use Case | Max Duration |
|---|---|---|
| Command hooks | Run after specific shell commands | Instant |
| Prompt hooks | Triggered before Claude responds | Instant |
| Agent hooks | Long-running: test suites, build pipelines | Up to 10 minutes |
Example: autonomous debug loop kickoff prompt:
I have a flaky test that fails ~20% of the time: test_payment_webhook.
Here is the test file and the implementation: [paste code]
Your task:
1. Run the test 10 times and collect all failure outputs
2. Identify the pattern in failures (timing? data? environment?)
3. Hypothesize root cause
4. Propose a fix — do not apply it yet, present it for my review
Use agent hooks to run the full test suite in the background.
Docs: Claude Code Agent Hooks
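The run-and-collect loop in step 1 of that prompt is easy to script yourself, which gives the agent cleaner input. A Python sketch; `run_flaky_test` shells out to pytest, and the clustering heuristic (grouping by the final output line) is an assumption about what distinguishes failures:

```python
import collections
import subprocess

def run_flaky_test(test_id: str, runs: int = 10) -> list[str]:
    """Run a single pytest test repeatedly and collect failure output."""
    failures = []
    for _ in range(runs):
        result = subprocess.run(
            ["pytest", test_id, "--tb=short"],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            failures.append(result.stdout)
    return failures

def cluster_failures(failures: list[str]) -> list[tuple[str, int]]:
    """Group failure outputs by their final line (often the assertion),
    most frequent first, to surface the dominant failure mode."""
    counts = collections.Counter(
        out.strip().splitlines()[-1] for out in failures if out.strip()
    )
    return counts.most_common()
```

Feeding the clustered summary to the agent, instead of ten raw tracebacks, makes the pattern in step 2 much easier to spot.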
Multi-Agent Debugging Patterns
Pattern A: Orchestrator + Specialist Agents
An orchestrator agent coordinates four specialists:
| Agent | Responsibility |
|---|---|
| Log Reader | Parses and clusters log output, surfaces anomalies |
| Code Search | Finds relevant code paths via semantic search |
| Fix Generator | Proposes targeted patches |
| Test Runner | Executes tests and reports pass/fail |
Best for: large codebases where the relevant code is hard to locate manually.
Pattern B: Parallel Hypothesis Testing
Spin up N agents, each pursuing a different hypothesis in its own git branch. Set a time limit (5–10 min); whichever agent finds a passing test suite wins. Merge that branch. Best for: bugs with multiple equally plausible causes.
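The setup step for Pattern B can be scripted with git worktrees so each agent gets an isolated checkout. A Python sketch; the branch and path naming is arbitrary, and the injectable `runner` exists only so the git calls can be stubbed out when testing the script itself:

```python
import subprocess

def spawn_hypothesis_worktrees(hypotheses, base_branch="main",
                               runner=subprocess.run):
    """Create one git worktree per hypothesis so agents work in parallel.

    Returns the plan: one dict per (path, branch, hypothesis) triple.
    """
    plan = []
    for i, hypothesis in enumerate(hypotheses, 1):
        branch = f"debug/hypothesis-{i}"
        path = f"../wt-hypothesis-{i}"
        # `git worktree add -b <branch> <path> <base>` creates the branch
        # and checks it out in a sibling directory in one step.
        runner(["git", "worktree", "add", "-b", branch, path, base_branch],
               check=True)
        plan.append({"path": path, "branch": branch, "hypothesis": hypothesis})
    return plan
```

After the time limit, `git worktree remove` the losers and merge the branch whose test suite passes.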
Pattern C: Checkpoint + Reset
From the CHI 2025 AGDebugger paper, this pattern saves agent state at each step. If the agent goes down a wrong path, reset to the last good checkpoint rather than starting over. Best for: complex multi-step debugging where agents tend to compound errors.
Pattern D: MCP-Mediated Agent Communication
Agents communicate via MCP protocol messages rather than shared memory. Each agent exposes a set of tools; the orchestrator calls them. This makes the workflow auditable and reproducible. Best for: production debugging systems where you need a full audit trail.
MCP Servers for Debugging
| MCP Server | What It Provides | Setup |
|---|---|---|
| Sentry MCP | Issues, stack traces, releases, 94.5% root cause accuracy | npx @sentry/mcp-server |
| GitHub MCP | PRs, commits, diffs, blame, issues | npx @github/mcp-server |
| Postgres MCP | Live DB queries, explain plans, slow query log | npx @benborla/mcp-server-postgresql |
| New Relic MCP | APM traces, error rates, throughput metrics | Via New Relic AI dashboard |
| CloudWatch MCP | AWS log groups, metrics, alarms | npx @aws-samples/cloudwatch-mcp |
| Python Debug MCP | Live Python debugger (breakpoints, stack frames) | pip install mcp-server-python-debugger |
| MCP Inspector | Debug MCP connections themselves | npx @modelcontextprotocol/inspector |
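Most MCP-capable clients register these servers through a JSON config. A sketch of a project-level `.mcp.json` entry for the Sentry server, as used by Claude Code; the exact schema varies by client, and the token and org values are placeholders:

```json
{
  "mcpServers": {
    "sentry": {
      "command": "npx",
      "args": ["-y", "@sentry/mcp-server"],
      "env": {
        "SENTRY_AUTH_TOKEN": "<your-token>",
        "SENTRY_ORG": "<your-org>"
      }
    }
  }
}
```

Use MCP Inspector (last row of the table) to verify the server starts and lists its tools before pointing an agent at it.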
Sentry Seer Deep Dive
Sentry Seer is Sentry's AI debugging layer, trained on over 38,000 real issues. Key capabilities:
- 94.5% root cause accuracy on issues with sufficient context (Sentry benchmark, 2025)
- Cross-service tracing — connects a React TypeError to the ASP.NET backend commit that caused it
- Auto-generated fix PRs — Seer doesn't just identify the cause, it opens a pull request with the fix
- Regression detection — flags when a fix reintroduces a previously resolved issue
Example flow:
- React frontend throws TypeError: Cannot read property 'userId' of undefined
- Seer follows the full distributed trace: frontend → API gateway → user service
- Identifies that a 3-day-old ASP.NET commit changed the user response schema, dropping userId
- Opens a PR that adds a backward-compatible userId back to the response
Setup:
npx @sentry/mcp-server
# Configure with SENTRY_AUTH_TOKEN and SENTRY_ORG env vars
# Then in Claude Code: "Use the Sentry MCP to analyze issue SENTRY-12345"
Full-Stack Debugging Stack
| Layer | Tool | What to Look For |
|---|---|---|
| Frontend | Browser Console + LogRocket | JS errors, network failures, UI state |
| Backend | Sentry Seer • New Relic MCP | Exceptions, slow endpoints, error rates |
| Database | Postgres MCP | Slow queries, lock contention, missing indexes |
| Infrastructure | CloudWatch / Grafana MCP | CPU spikes, memory pressure, network I/O |
| Code | GitHub MCP | Recent commits, blame, PR history |
Layer-by-layer debugging prompts:
# Frontend layer
"Analyze these browser console errors and network requests from LogRocket.
What patterns suggest a root cause at the API layer?"
# Backend layer
"Use the Sentry MCP to fetch the last 10 occurrences of SENTRY-5678.
Cluster them by: user agent, endpoint, time of day, and release version."
# Database layer
"Use the Postgres MCP to run EXPLAIN ANALYZE on this slow query [SQL].
Suggest index changes. Do not modify schema yet — show the migration SQL first."
# Infrastructure layer
"Query CloudWatch for CPU and memory metrics on the api-service ECS task
from 2025-02-26 14:00 to 15:00 UTC. Correlate spikes with error rate in Sentry."
AI Debugging Limitations (Honest Assessment)
| Scenario | AI Capability | Notes |
|---|---|---|
| Simple bugs (off-by-one, null checks) | Strong | AI excels here; often solves in one shot |
| Stack trace analysis | Strong | Excellent at reading and explaining traces |
| Log clustering / anomaly detection | Strong | Pattern matching in large log sets |
| Memory leaks | Good | Needs heap dumps and allocation traces |
| Distributed system failures (with tracing) | Good | Works well with OpenTelemetry traces |
| Race conditions / concurrency bugs | Limited | Hard to reason about without deterministic replay |
| Cross-file architectural bugs | Limited | Loses context across many files |
| Business logic bugs (wrong behavior, not error) | Limited | Requires deep domain understanding |
| Persistent debugging memory across sessions | Not available | Each session starts fresh (use CLAUDE.md as workaround) |
Current benchmarks (as of early 2026):
- Claude 4 Opus: 67.6% on SWE-bench Verified
- GPT-5: 65% on SWE-bench Verified
SWE-bench measures end-to-end resolution of real GitHub issues — a reasonable proxy for debugging capability, though real-world bugs often provide more context than the benchmark.
Common Pitfalls
Pitfall 1: Letting AI make speculative rewrites. The most common failure. Always use the "explain first, don't modify files" constraint in Phase 4. A targeted 3-line fix beats a 200-line refactor that introduces new bugs.
Pitfall 2: Not providing enough context. AI debugging quality scales directly with context quality. Logs, stack trace, recent git diff, and environment details are all required. A vague "it's broken" produces useless output.
Pitfall 3: Chasing ghost bugs. If you can't reproduce the bug consistently, stop and build reproducibility first. AI cannot reliably debug non-deterministic issues without a reproduction case.
Pitfall 4: Not writing regression tests. Fixes without regression tests get re-broken. Every AI-assisted fix should be paired with a test that would have caught the original bug. Make this non-negotiable.
Related cookbooks
AI-Powered Code Review & Quality
Automate code review and enforce quality standards using AI-powered tools and agentic workflows.
Building AI-Powered Applications
Build applications powered by LLMs, RAG, and AI agents using Claude Code, Cursor, and modern AI frameworks.
Building APIs & Backends with AI Agents
Design and build robust APIs and backend services with AI coding agents, from REST to GraphQL.