
Intermediate · 4-8 hours

Testing with AI Agents

Generate comprehensive test suites and achieve high coverage using AI coding agents.

Last reviewed Feb 27, 2026

Overview

AI agents are transforming software testing by writing comprehensive test suites, generating edge cases, and maintaining CI/CD pipelines automatically. This cookbook covers how to use AI agents to achieve dramatically better test coverage with far less manual effort.

Key tools covered: Claude Code, Cursor, Playwright, GitHub Copilot, Jest

Audience: Developers who want better test coverage without spending most of their time writing boilerplate tests.


Test-Driven Agentic Development (TDAD)

Test-Driven Agentic Development is TDD adapted for a world where AI writes the implementation. Instead of fighting AI hallucinations, you harness test constraints to guide the agent precisely.

The Updated Red-Green-Refactor Cycle

  1. Red — Human writes a failing test that describes the desired behavior
  2. Green — AI generates the implementation to make the test pass
  3. Refactor — AI cleans up logic: "Refactor this implementation for clarity and performance, but keep all existing tests green"
  4. Validate — Human reviews the output, approves or iterates
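
A minimal sketch of the Red and Green steps in TypeScript. Everything here (`processPayment`, the daily-limit rule, the numbers) is hypothetical, purely for illustration:

```typescript
// Step 1 (Red): the human pins down the desired behavior as executable checks.
// Step 2 (Green): the agent writes processPayment to satisfy them.
type Decision = "approved" | "rejected";

function processPayment(amount: number, dailyTotal: number, dailyLimit = 500): Decision {
  // Agent-written implementation: reject anything that would exceed the daily limit.
  return dailyTotal + amount > dailyLimit ? "rejected" : "approved";
}

// The spec, written before the implementation existed:
console.assert(processPayment(100, 450) === "rejected", "over-limit payment must be rejected");
console.assert(processPayment(50, 400) === "approved", "in-limit payment must be approved");
```

Because the checks are written first, the agent's output is constrained to one observable behavior rather than whatever it guesses the feature means.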

Why TDD Makes AI Coding Better

  • Tests act as prompts. A well-written test is a precise specification that reduces ambiguity and hallucination.
  • Builds confidence. Green tests give you evidence the AI did the right thing, not just plausible-looking code.
  • Reduces scope creep. Tight test scopes prevent AI from over-engineering solutions.
  • Catches regressions instantly. Refactors are safe because the test suite acts as a guardrail.

TDD Plan Generator Prompt Template

You are a senior software engineer practicing test-driven development.

Given the following feature description:
[FEATURE DESCRIPTION]

Generate a comprehensive TDD test plan that includes:
1. Unit tests for all core functions and edge cases
2. Integration tests for component interactions
3. E2E tests for critical user flows
4. Error/failure scenario tests
5. Performance boundary tests

For each test, provide:
- Test name (descriptive, behavior-focused)
- Input conditions
- Expected output/behavior
- Why this test matters

Do NOT write the implementation — only the test plan.

Tips for TDAD

  • Start with high-value behaviors — focus on business logic, not utility functions
  • Write descriptive test names — it('should reject payments over the daily limit'), not it('test payment')
  • Keep scopes tight — one behavior per test, no more
  • Separate agents — use one agent for writing tests, a different session for implementation; they shouldn't share context
  • Commit tests before implementation — this enforces the discipline and makes PRs reviewable

Automated Test Generation with Claude Code

Claude Code can set up an entire testing project from a single prompt, including framework configuration, test structure, and CI integration.

Setting Up a Playwright + Cucumber Project

Set up a complete Playwright + Cucumber BDD testing project for a Next.js e-commerce app.

Include:
- Playwright config with multiple browsers (Chrome, Firefox, Safari)
- Cucumber feature files directory structure
- Step definitions for common e-commerce flows
- Page Object Model pattern
- Test data fixtures
- NPM scripts for running tests
- GitHub Actions CI configuration

The app has: product listing, product detail, cart, checkout, and user auth flows.

Generate Tests for Specific Features with Edge Cases

Generate comprehensive tests for the checkout flow including:
- Happy path: successful purchase
- Empty cart handling
- Invalid payment card formats
- Expired card handling
- Network timeout during payment
- Concurrent session handling
- Price calculation with discounts + tax
- Address validation failures
- Stock exhaustion between cart add and checkout

Run the Full Quality Pipeline

Run the full quality pipeline:
1. Execute all tests and report coverage
2. Identify any coverage gaps below 80%
3. Run security scan with npm audit
4. Generate a summary report with pass/fail, coverage %, and any vulnerabilities
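
The coverage gate in step 2 can also be enforced mechanically by Jest itself, so the pipeline fails without agent intervention. A sketch using Jest's coverageThreshold option (the threshold numbers are examples, not requirements):

```typescript
// jest.config.ts — coverageThreshold is a standard Jest option; a build fails
// when any metric drops below its threshold.
import type { Config } from "jest";

const config: Config = {
  collectCoverage: true,
  coverageThreshold: {
    global: { lines: 80, branches: 80, functions: 80, statements: 80 },
  },
};

export default config;
```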

CLAUDE.md Template for Testing Standards

Place this in your repo root so Claude Code follows your testing standards automatically:

# Testing Standards

## Framework
- Unit/Integration: Jest + Testing Library
- E2E: Playwright
- API: Supertest

## Coverage Requirements
- Minimum 80% line coverage
- 100% coverage for payment and auth flows
- All new features require tests before merge

## Test Organization
- Unit tests: co-located with source files (*.test.ts)
- Integration tests: tests/integration/
- E2E tests: tests/e2e/

## Naming Conventions
- Use behavior descriptions: should [verb] when [condition]
- Group by feature with describe blocks

## Forbidden Patterns
- No test('it works') — be descriptive
- No .only in committed code
- No hardcoded test data — use factories
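
A minimal sketch of the factory pattern the last rule refers to (the User shape and defaults here are assumptions, not from any real schema):

```typescript
// Test-data factory: callers get valid defaults and override only what matters
// to the test, instead of hardcoding whole objects in every file.
interface User {
  id: string;
  email: string;
  role: "customer" | "admin";
}

let seq = 0;

function makeUser(overrides: Partial<User> = {}): User {
  seq += 1;
  return {
    id: `user-${seq}`,
    email: `user${seq}@example.com`,
    role: "customer",
    ...overrides,
  };
}

const admin = makeUser({ role: "admin" });
console.assert(admin.role === "admin" && admin.email.includes("@"));
```

The sequence counter keeps generated IDs unique, which avoids accidental coupling between tests that share fixtures.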

Playwright Agents (Planner / Generator / Healer)

Playwright now supports an agentic loop where three specialized agents collaborate on test creation and maintenance.

Setup

npx playwright init-agents --loop=claude

This creates three agent definition files in .playwright/agents/.

Planner Agent

The Planner explores the running application and produces a structured Markdown test plan.

  • Navigates the app autonomously
  • Identifies user flows, form interactions, and navigation paths
  • Produces a test-plan.md with scenarios, preconditions, and expected outcomes
  • Flags areas of complexity or risk

Customizing the Planner (planner-agent.md):
You are a QA architect. When exploring the app:
- Focus on user-facing flows, not internal implementation
- Identify at least 3 edge cases per major flow
- Note any accessibility issues you observe
- Prioritize tests by business impact

Generator Agent

The Generator transforms the Markdown test plan into working Playwright tests.

  • Reads test-plan.md and generates .spec.ts files
  • Verifies selectors by querying the live DOM
  • Uses getByRole, getByLabel, getByText (resilient selectors first)
  • Falls back to data-testid attributes when semantic selectors aren't available
  • Adds assertions for visual state, network responses, and accessibility

Healer Agent

The Healer automatically fixes broken tests when UI changes break selectors.

  • Monitors CI for test failures
  • Identifies root cause: selector changed, flow changed, or genuine bug
  • For selector changes: updates the test with correct selector
  • For flow changes: rewrites the affected steps
  • For genuine bugs: opens a GitHub issue and leaves the test failing
  • Submits a PR with the fix for human review

All three agents are defined as plain Markdown files — no code required, and fully customizable to your project's conventions.

Visual Regression Testing

Visual regression testing catches UI changes that functional tests miss — layout shifts, color changes, font rendering, and component drift.

| Tool | Accuracy | Key Feature | Pricing |
| --- | --- | --- | --- |
| Applitools | 99.9999% (vendor-claimed) | Visual AI, cross-browser | Enterprise |
| Percy | High | GitHub/CI native | Free tier available |
| Chromatic | High | Storybook native | Free tier available |
| Lost Pixel | Good | Open source | Free |

Applitools reports 99.9999% accuracy and a claimed 10x speedup in visual testing from its Visual AI model, trained on millions of screenshots.

// playwright.config.ts with Applitools
import { defineConfig } from '@playwright/test';
import { getAIConfig } from '@applitools/eyes-playwright';

export default defineConfig({
  use: {
    ...getAIConfig({
      apiKey: process.env.APPLITOOLS_API_KEY,
      appName: 'My App',
      batchName: 'CI Run'
    })
  }
});

Property-Based Testing & Fuzzing

Property-based testing generates hundreds of random inputs to find edge cases you'd never think to write manually. AI supercharges this by helping you define properties and interpret failures.

AI Prompting Pattern for Property Tests

For the function calculateShippingCost(weight, distance, expedited),
identify 5-7 invariant properties that should ALWAYS be true,
regardless of input values. Then write fast-check property tests for each.

Example properties:
- Expedited should never cost less than standard
- Cost should increase monotonically with weight
- Cost should never be negative
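
A dependency-free sketch of checking two of these properties over random inputs. In practice you would express them with fast-check's property API; the calculateShippingCost body here is a stand-in implementation invented for illustration:

```typescript
// Stand-in implementation, purely for illustration.
function calculateShippingCost(weight: number, distance: number, expedited: boolean): number {
  const base = weight * 0.5 + distance * 0.1;
  return expedited ? base * 1.5 + 5 : base;
}

// Property check: run the invariants against many random inputs.
for (let i = 0; i < 200; i++) {
  const weight = Math.random() * 100;
  const distance = Math.random() * 1000;
  const std = calculateShippingCost(weight, distance, false);
  const exp = calculateShippingCost(weight, distance, true);
  console.assert(exp >= std, `expedited cheaper than standard at w=${weight}, d=${distance}`);
  console.assert(std >= 0, "cost must never be negative");
}
```

A library like fast-check adds the important extras this sketch lacks: reproducible seeds and automatic shrinking of failing inputs to a minimal counterexample.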

When to Use Each Approach

| Approach | Best For |
| --- | --- |
| Example-based tests | Known business rules, specific user flows |
| Property-based tests | Pure functions, data transformations, mathematical invariants |
| Fuzzing | Security-sensitive inputs, parsers, file processors |

Agentic CI/CD Pipeline

Agentic CI/CD replaces rigid pipeline scripts with intelligent agents that make decisions based on code context.

The Five-Agent Pipeline

  1. Code Analysis Agent — Reads the diff, identifies changed modules, assesses risk level
  2. Test Selection Agent — Selects the minimal test set that covers changed code
  3. Execution Agent — Runs selected tests, collects results, identifies flaky tests
  4. Quality Decision Agent — Decides pass/fail based on coverage thresholds and risk level
  5. Adaptive Pipeline Agent — Updates pipeline configuration based on patterns
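
A toy sketch of step 2, test selection: map changed files to the minimal covering test set. The coverage map here is a hand-written assumption; real pipelines derive it from instrumentation data:

```typescript
// Which tests exercise which source files (illustrative data only).
const coverageMap: Record<string, string[]> = {
  "src/cart.ts": ["cart.test.ts", "checkout.e2e.ts"],
  "src/auth.ts": ["auth.test.ts"],
  "src/utils.ts": ["cart.test.ts", "auth.test.ts"],
};

// Select the union of tests covering the changed files, deduplicated.
function selectTests(changedFiles: string[]): string[] {
  const selected = new Set<string>();
  for (const file of changedFiles) {
    for (const test of coverageMap[file] ?? []) selected.add(test);
  }
  return [...selected].sort();
}

console.log(selectTests(["src/cart.ts", "src/utils.ts"]));
```

An agent improves on this static lookup by also weighing risk: a one-line change to a payment module may still justify the full suite.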

Results

Teams implementing agentic CI/CD pipelines have reported:

  • 78% reduction in deployment time by eliminating unnecessary full test suite runs
  • 3x faster delivery cycles through intelligent test selection

GitHub Copilot + Actions: Generate CI Workflows from Natural Language

GitHub Copilot can generate complete GitHub Actions workflows from a description:

Generate a GitHub Actions workflow for a Node.js + PostgreSQL app that:
- Runs on push to main and all PRs
- Caches node_modules between runs
- Runs ESLint, then Jest with coverage, then builds Docker image
- Only deploys to production on main branch
- Uses environment secrets for DATABASE_URL and DOCKER_REGISTRY
- Sends Slack notification on failure
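
A trimmed sketch of the kind of workflow such a prompt yields, showing only the test stages (action versions, script names, and branch names are assumptions):

```yaml
name: ci
on:
  push:
    branches: [main]
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npx eslint .
      - run: npx jest --coverage
```

Review generated workflows carefully before merging: they run with repository secrets, so a hallucinated step is a security issue, not just a broken build.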

Real Case Study: Elastic's Self-Healing CI

Elastic deployed Claude as a CI assistant to automatically fix broken PRs:

  • 24 broken PRs fixed automatically in one month
  • 20 engineer-days saved — time engineers would have spent debugging CI failures
  • Most fixes involved: updated snapshots, changed import paths, API signature updates

The Self-Healing Prompt

You are a CI debugging assistant. A test suite is failing.

Failing tests:
[PASTE FAILING TEST OUTPUT]

Recent changes to the codebase:
[PASTE GIT DIFF]

Your task:
1. Identify the root cause of each failure
2. Determine if this is a test issue (test needs updating) or a code bug
3. For test issues: provide the exact fix
4. For code bugs: explain the bug but DO NOT fix it — flag for human review
5. Output your changes as a unified diff

Critical: Do not change test assertions to make tests pass.
Only fix tests when the underlying behavior intentionally changed.

Common Pitfalls

  • Generating tests after the fact — AI-generated tests written after implementation often just verify the current (possibly buggy) behavior. TDD is vastly better: write tests first, let AI write the implementation.
  • Not reviewing AI-generated tests — AI tests can pass while testing the wrong thing. Always read them, and look for tautological tests like expect(add(1,1)).toBe(add(1,1)).
  • Brittle selectors in E2E tests — AI tends to generate page.locator('#submit-btn-v2'). Enforce resilient selectors: getByRole('button', { name: 'Submit' }).
  • Not running tests in CI — tests that only run locally don't prevent broken deployments. Every test suite needs a CI job.
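
The tautological-test pitfall in concrete form (add is a stand-in function; the point is the assertion style):

```typescript
function add(a: number, b: number): number {
  return a + b;
}

// Tautological: compares the function to itself, so it passes even if add is wrong.
console.assert(add(1, 1) === add(1, 1));

// Meaningful: pins the expected value independently of the implementation.
console.assert(add(1, 1) === 2);
```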

