
Intermediate · 3-6 hours

AI-Powered Code Review & Quality

Automate code review and enforce quality standards using AI-powered tools and agentic workflows.

Last reviewed Feb 27, 2026

Overview

This cookbook covers setting up automated AI code review pipelines that catch bugs before they reach production. You'll configure AI reviewers, integrate them into your CI/CD pipeline, and establish quality gates that scale with your team.

Tools covered: GitHub Copilot Code Review, Cursor BugBot, CodeRabbit, Claude Code

Target audience: Development teams wanting automated quality gates on every pull request


Tool Comparison

Bug detection rates from the DevTools Academy benchmark (October 2025), measuring percentage of seeded bugs caught across 500 PRs:

| Tool | Bug Detection Rate | Key Strength | Price |
| --- | --- | --- | --- |
| Macroscope | 48% | Deep semantic analysis | Paid |
| CodeRabbit | 46% | 35+ linters + code graph | Free tier available |
| Cursor BugBot | 42% | Agentic multi-pass analysis | $40/mo add-on |
| Greptile | 24% | Codebase-wide context | Paid |
| Graphite Diamond | 18% | PR workflow integration | Paid |

Important context: Even the best tool catches ~48% of bugs. AI code review is a quality multiplier, not a replacement for human review. Design your process accordingly.


GitHub Copilot Code Review Setup

GitHub Copilot Code Review reached General Availability in December 2025, acquiring 1 million users in its first week. What makes it distinctive:

  • Rich context via tool calling — understands the full project, not just the diff
  • Integrated static analysis — runs CodeQL and ESLint alongside LLM reasoning, reducing false positives
  • Seamless agent handoff — comment @copilot apply this fix and Copilot opens a commit with the change applied
  • Customizable via copilot-instructions.md — teach Copilot your team's conventions

Setup:

  1. Go to repository Settings → Code review → Enable Copilot code review
  2. Create .github/copilot-instructions.md
  3. Configure auto-review on PR creation

Full copilot-instructions.md template:
# Copilot Code Review Instructions

## Priority Areas

### Security (flag always, block merge)
- SQL injection, XSS, path traversal vulnerabilities
- Secrets, API keys, or credentials in code
- Auth bypass or privilege escalation risks
- Insecure deserialization

### Correctness (flag always)
- Logic errors that would cause incorrect behavior
- Off-by-one errors in loops or array access
- Null/undefined access without guards
- Race conditions in async code
- Incorrect error handling (swallowed exceptions)

### Architecture (flag if significant)
- Breaking changes to public APIs without versioning
- Circular dependencies introduced
- Database N+1 query patterns
- Missing database transactions for multi-step writes

## What NOT to Flag
- Code style preferences (we use Prettier for formatting)
- Minor naming conventions unless they cause confusion
- Performance micro-optimizations without evidence of bottleneck
- Personal style differences that don't affect correctness
- Comments about architecture that aren't actionable in this PR

## Project Context
- This is a TypeScript/Next.js application
- We use Prisma for database access — flag raw SQL as a security concern
- Auth is handled by next-auth; do not flag standard next-auth patterns
- All API routes require authentication except those in /api/public/
- We follow the repository pattern: business logic belongs in /lib, not in API routes

## Review Style
- Be specific: cite the exact line and explain why it's a problem
- Provide a code suggestion for every issue flagged
- Distinguish between blocking issues and suggestions

Docs: GitHub Copilot Code Review
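The "Database N+1 query patterns" item in the template above is worth a concrete sketch. The snippet below is illustrative, not Prisma code: a hypothetical in-memory store stands in for the database, and the round-trip counter exists only to make the N+1 shape visible.

```typescript
// Hypothetical in-memory "database"; the counter tracks round trips.
const posts = [
  { id: 1, authorId: 1 },
  { id: 2, authorId: 1 },
  { id: 3, authorId: 2 },
];

let queryCount = 0;

function findPostsByAuthor(authorId: number) {
  queryCount += 1; // one round trip per call
  return posts.filter((p) => p.authorId === authorId);
}

function findPostsByAuthors(authorIds: number[]) {
  queryCount += 1; // a single batched round trip
  return posts.filter((p) => authorIds.includes(p.authorId));
}

// N+1: one query per author — round trips grow with the author list.
function loadNPlusOne(authorIds: number[]) {
  return authorIds.map((id) => findPostsByAuthor(id));
}

// Batched: constant round trips — the fix a reviewer should suggest.
function loadBatched(authorIds: number[]) {
  return findPostsByAuthors(authorIds);
}
```

In a Prisma codebase, the batched form corresponds to a single `findMany` with an `in` filter, or an `include` on the parent query.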


Cursor BugBot Setup

Cursor BugBot received an agentic redesign in fall 2025 that doubled the number of resolved bugs per PR compared to the previous version, according to Cursor's release notes. How it works:

  • Runs 8 parallel analysis passes on every PR: security, correctness, performance, type safety, error handling, test coverage, API contracts, and dependency risk
  • Each pass is a specialized sub-agent — not a single generic prompt
  • Results are aggregated and de-duplicated before surfacing

Setup:

  1. Enable BugBot in Cursor Settings → BugBot ($40/mo add-on)
  2. Connect your GitHub repository
  3. Create BUGBOT.md at the repo root to customize behavior

BUGBOT.md template:
# BugBot Configuration

## Focus Areas
- Security vulnerabilities (OWASP Top 10)
- TypeScript type errors and unsafe casts
- Missing await on async functions
- Unhandled promise rejections

## Ignore
- Test files in __tests__/ directories
- Generated files in /generated/
- Changes to package-lock.json or yarn.lock

## Context
- Payment processing code in /lib/payments/ is PCI-sensitive — apply highest scrutiny
- /scripts/ contains one-off migration scripts — correctness matters, style does not

Docs: Cursor BugBot
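To make the "missing await" focus area concrete, here is the shape of bug that entry targets, as a minimal sketch (the store and function names are invented for illustration):

```typescript
// Simulated async write: the data lands only after a microtask.
async function writeToStore(store: Map<string, string>, id: string): Promise<void> {
  await Promise.resolve();
  store.set(id, "saved");
}

// Buggy: the returned promise is discarded (a "floating promise").
// Errors from writeToStore become unhandled rejections, and callers
// can observe the store before the write has landed.
async function saveUserBuggy(store: Map<string, string>, id: string): Promise<void> {
  writeToStore(store, id); // missing await
}

// Fixed: awaiting ties completion and errors back to the caller.
async function saveUserFixed(store: Map<string, string>, id: string): Promise<void> {
  await writeToStore(store, id);
}
```

The same class of bug is also catchable statically with ESLint's `@typescript-eslint/no-floating-promises` rule, which complements the AI pass.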


CodeRabbit Setup

CodeRabbit is a popular choice for teams that want deep linting integration alongside AI review. Key capabilities:

  • 35+ integrated linters — ESLint, Pylint, Semgrep, Checkov, and more, all auto-configured
  • Code graph analysis — understands call graphs and data flow, not just the diff
  • Iterative fix loop — CodeRabbit comments, developer responds, CodeRabbit validates the fix
  • Free tier available for public repositories and small teams

Setup:

  1. Install the CodeRabbit GitHub App
  2. Create .coderabbit.yaml at the repo root:
# .coderabbit.yaml
reviews:
  auto_review:
    enabled: true
    drafts: false           # don't review draft PRs
    base_branches:          # only review PRs targeting these branches
      - main
      - staging
  profile: 'chill'         # 'chill' = suggestions only, 'assertive' = blocking
  path_filters:
    - '!**/*.lock'
    - '!**/generated/**'
  tools:
    eslint:
      enabled: true
    semgrep:
      enabled: true
      config: 'p/security-audit'
chat:
  auto_reply: true
  3. Optionally integrate with GitHub Copilot: CodeRabbit flags the issue, then @copilot apply implements the fix — a clean division of labor.

Docs: CodeRabbit Configuration


Claude Code as Code Reviewer

For teams using Claude Code, you can configure it as a dedicated review agent that operates on a different axis than automated tools — focusing on higher-level concerns like architectural consistency and business logic correctness.

Setup — create a code-reviewer subagent:

# In your project root, create a review script:
cat > scripts/ai-review.sh << 'EOF'
#!/bin/bash
# Pipe the PR diff to Claude Code in non-interactive (print) mode,
# with the review prompt supplied as the instruction.
git diff main...HEAD | claude -p "$(cat .claude/review-prompt.md)"
EOF
chmod +x scripts/ai-review.sh

.claude/review-prompt.md — Security review:

You are a senior security engineer reviewing a code diff.

For every security issue found:
1. Cite the exact file and line number
2. Classify severity: Critical / High / Medium / Low
3. Explain the attack vector
4. Provide the fix as a code snippet

Focus on: injection vulnerabilities, authentication bypasses, 
insecure data storage, missing input validation, exposed secrets.

.claude/performance-review-prompt.md — Performance review:

You are a performance engineer reviewing a code diff.

Flag:
- N+1 database query patterns
- Missing pagination on list endpoints
- Synchronous I/O in async contexts
- Large in-memory data structures that could be streamed
- Missing caching on expensive repeated computations

For each issue: cite the line, estimate the performance impact, 
and provide the optimized version.

.claude/consistency-review-prompt.md — Pattern consistency:

You are a staff engineer ensuring code consistency across a codebase.

Compare this diff against the patterns in [CODEBASE_CONTEXT].
Flag:
- Divergence from established error handling patterns
- New abstractions that duplicate existing utilities
- Inconsistent naming conventions
- Missing tests for new public functions
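As a concrete example of the top finding the security review prompt targets, compare an interpolated query with a parameterized one. The function names are illustrative; the safe shape (static SQL text plus a values array) matches what drivers like node-postgres accept.

```typescript
// Vulnerable: user input is spliced into the SQL text. The security
// prompt should flag this as Critical with an injection attack vector.
function buildQueryUnsafe(userId: string): string {
  return `SELECT * FROM users WHERE id = '${userId}'`;
}

// Fix: ship the SQL text and the values separately. The driver binds
// $1 to params[0], so input can never change the query's structure.
function buildQuerySafe(userId: string): { text: string; params: string[] } {
  return { text: "SELECT * FROM users WHERE id = $1", params: [userId] };
}
```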

Full CI/CD Code Review Pipeline

The pipeline:

Pre-commit hooks
  └── lint-staged (ESLint, Prettier, type check)
  └── commit-msg validation

PR Created
  └── GitHub Actions: lint + type check + unit tests
  └── CodeRabbit: automated review comment
  └── Cursor BugBot: 8-pass analysis
  └── Copilot Code Review: security + correctness

Human Review
  └── Developer reviews AI feedback
  └── Applies @copilot fixes or manually resolves
  └── Required approval from 1 human reviewer

Merge & Deploy
  └── Integration tests
  └── Sentry monitoring (post-deploy error rate check)

GitHub Actions YAML:

# .github/workflows/ai-review-pipeline.yml
name: AI-Assisted Code Review

on:
  pull_request:
    branches: [main, staging]

jobs:
  static-checks:
    name: Lint & Type Check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm run type-check
      - run: npm run test:unit -- --coverage
      - uses: codecov/codecov-action@v4

  security-scan:
    name: Security Scan
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Semgrep
        uses: semgrep/semgrep-action@v1
        with:
          config: p/security-audit p/owasp-top-ten
      - name: Scan for secrets
        uses: gitleaks/gitleaks-action@v2

  ai-review-gate:
    name: AI Review Status Check
    runs-on: ubuntu-latest
    needs: [static-checks, security-scan]
    steps:
      - name: Check CodeRabbit approval status
        uses: coderabbitai/ai-pr-reviewer@latest
        with:
          review_type: 'incremental'
          block_on_critical: true

AI-Assisted Refactoring

Where AI excels — hygiene refactoring:

  • Renaming variables/functions for clarity
  • Adding missing TypeScript types
  • Extracting magic numbers into named constants
  • Reformatting inconsistent code
  • Adding JSDoc/docstring comments

Where AI fails — architectural refactoring: A 2025 study analyzing 15,451 AI-generated refactorings found that AI refactoring tools:

  • Introduced semantic changes (behavior changes, not just structure) in 23% of cases
  • Broke existing tests in 11% of refactorings marked as "safe"
  • Performed well on single-function scope, poorly on multi-module changes

Safe incremental refactoring workflow:
1. Write tests BEFORE refactoring (if they don't exist)
2. Prompt: "Refactor [specific function] only. Do not change behavior.
             Do not touch other files. Provide the diff for my review."
3. Review diff line by line — don't just run tests
4. Run full test suite
5. Commit each logical refactoring unit separately
6. Repeat for next unit

Key rule: one concern per commit. A refactor commit should contain zero behavior changes. A behavior-change commit should contain zero refactoring.
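A minimal sketch of the "hygiene" scope where AI refactoring is reliable — extracting magic numbers with zero behavior change. The retry-delay function is invented for illustration:

```typescript
// Before: magic numbers inline.
function retryDelayBefore(attempt: number): number {
  return Math.min(1000 * 2 ** attempt, 30000);
}

// After: named constants, identical behavior — the kind of
// single-function change AI tools handle well.
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 30_000;

function retryDelay(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}
```

A test asserting both versions agree across inputs is exactly the "write tests BEFORE refactoring" step from the workflow above.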


AI Security Scanning (SAST)

The SAST false positive problem: Traditional SAST tools generate so many findings that developers ignore them. Research from Gartner (2025) estimates 98% of SAST findings are unexploitable at runtime due to factors like input sanitization upstream or unreachable code paths.

Agentic SAST tools that reduce false positives:

| Tool | Approach | Key Feature |
| --- | --- | --- |
| Corgea | Triage + auto-fix | Validates exploitability before surfacing findings |
| Snyk | Dependency + code scanning | Fix PRs with known-good patches |
| GitHub Advanced Security | CodeQL + AI | Native GitHub integration, semantic analysis |
| Semgrep | Pattern + AI triage | Open source rules + AI-assisted fix suggestions |
| Terra Security | Autonomous agent | Opens fix PRs with exploitability validation |
| Checkmarx | Enterprise SAST + AI | AI agent opens and validates fix PRs at scale |

Recommended approach: Use Semgrep with the p/security-audit and p/owasp-top-ten rulesets in CI for coverage, then use Corgea or Terra Security to triage and auto-fix confirmed findings.
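The "unexploitable at runtime" claim is easiest to see in code. In this sketch (function names are illustrative), a pattern-matching SAST rule flags the interpolated query, but upstream validation makes the finding unexploitable — the kind of case agentic triage is meant to suppress:

```typescript
// Upstream validation: only positive integers get through.
function parseUserId(raw: string): number {
  const id = Number.parseInt(raw, 10);
  if (!Number.isInteger(id) || id <= 0) throw new Error("invalid id");
  return id;
}

// A pattern-based SAST rule flags this template string as SQL
// injection — but the interpolated value is a validated integer,
// so the finding is unexploitable at runtime.
function queryFor(raw: string): string {
  const id = parseUserId(raw);
  return `SELECT * FROM users WHERE id = ${id}`;
}
```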

Test-Driven Development with AI

The updated Red-Green-Refactor cycle for AI-assisted TDD:

Red:     Human writes a failing test (describes desired behavior)
         └── "Here is the failing test. Do NOT write the implementation yet.
              Confirm you understand what behavior is being tested."

Green:   AI implements the minimal code to pass the test
         └── "Now implement the minimum code to make this test pass.
              Do not add features not covered by the test."

Refactor: AI refactors for clarity (no behavior change)
         └── "Refactor for readability. All tests must still pass.
              Do not change behavior. Show me the diff."
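The cycle above, instantiated on a tiny example. The slugify function and its test are invented for illustration:

```typescript
// Red: the human writes the failing test first — behavior only,
// no implementation details.
function testSlugify(): void {
  if (slugify("Hello World!") !== "hello-world") throw new Error("red: not passing yet");
  if (slugify("  A  B  ") !== "a-b") throw new Error("red: not passing yet");
}

// Green: the minimal implementation the AI is asked to produce —
// just enough to pass, no features beyond what the test covers.
function slugify(input: string): string {
  return input
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse non-alphanumeric runs to "-"
    .replace(/^-+|-+$/g, "");    // trim leading/trailing dashes
}
```

The refactor step then cleans this up for readability, with the test required to stay green throughout.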

Common TDD failure modes with AI:

  • AI writes the implementation first (violates TDD) — use explicit constraints in your prompt
  • AI writes tests that are coupled to implementation details — ask for behavior tests, not unit tests
  • AI generates tests that trivially pass (empty implementations) — require tests to fail first
  • AI adds implementation features not in the failing test — enforce one test → one feature

LLM eval tools for CI — for AI-powered features, add LLM evaluation to your test suite:

| Tool | Use Case | Integration |
| --- | --- | --- |
| Braintrust | Eval scoring, prompt regression testing | CI via SDK |
| Promptfoo | Open-source, YAML-based evals | GitHub Actions |
| LangSmith | LangChain-native eval + tracing | Python SDK |

Example Promptfoo CI eval:
# promptfoo.yaml
prompts:
  - file://prompts/classifier.txt

providers:
  - openai:gpt-4o

tests:
  - description: 'Classifies positive sentiment correctly'
    vars:
      input: 'This product is amazing!'
    assert:
      - type: equals
        value: 'positive'
  - description: 'Does not hallucinate company names'
    vars:
      input: 'Who makes this product?'
    assert:
      - type: not-contains
        value: 'OpenAI'
        # Our product is not made by OpenAI
Run in CI:

npx promptfoo eval --ci

Common Pitfalls

Pitfall 1: Not customizing the AI reviewer
Default settings are tuned for the average codebase, not yours. Without copilot-instructions.md or BUGBOT.md, the reviewer doesn't know your auth patterns, your framework conventions, or which files to ignore. You'll get noise about things that aren't problems and miss context-specific issues.

Pitfall 2: Over-relying on AI (it catches 42–48%, not 100%)
The best tool in the DevTools Academy benchmark catches 48% of bugs. AI review is a first-pass filter, not a safety net. Human review of the AI's output — and of things the AI didn't flag — remains essential.

Pitfall 3: Not integrating with CI/CD
AI code review only scales if it's automatic. A review tool that developers have to run manually will be skipped under deadline pressure. Block merges on critical AI findings via GitHub Actions status checks.

Pitfall 4: Ignoring AI review feedback over time
If developers consistently dismiss AI feedback as noise, the tool loses effectiveness. Tune your configuration to reduce false positives. Track which AI findings are accepted vs. dismissed — a high dismissal rate signals misconfiguration, not a useless tool.
