
Intermediate · 4-8 hours

Testing with AI Agents

Generate comprehensive test suites and achieve high coverage using AI coding agents.

Last reviewed Feb 27, 2026

Overview

AI agents are transforming software testing by writing comprehensive test suites, generating edge cases, and maintaining CI/CD pipelines automatically. This cookbook covers how to use AI agents to achieve dramatically better test coverage with far less manual effort.

Key tools covered: Claude Code, Cursor, Playwright, GitHub Copilot, Jest

Audience: Developers who want better test coverage without spending most of their time writing boilerplate tests.


Test-Driven Agentic Development (TDAD)

Test-Driven Agentic Development is TDD adapted for a world where AI writes the implementation. Instead of fighting AI hallucinations, you harness test constraints to guide the agent precisely.

The Updated Red-Green-Refactor Cycle

  1. Red — Human writes a failing test that describes the desired behavior
  2. Green — AI generates the implementation to make the test pass
  3. Refactor — AI cleans up logic: "Refactor this implementation for clarity and performance, but keep all existing tests green"
  4. Validate — Human reviews the output, approves or iterates
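
A minimal sketch of the Red and Green steps in TypeScript. Everything here (`processPayment`, the daily-limit rule, the numbers) is hypothetical, purely for illustration:

```typescript
// Step 1 (Red): the human pins down the desired behavior as executable checks.
// Step 2 (Green): the agent writes processPayment to satisfy them.
type Decision = "approved" | "rejected";

function processPayment(amount: number, dailyTotal: number, dailyLimit = 500): Decision {
  // Agent-written implementation: reject anything that would exceed the daily limit.
  return dailyTotal + amount > dailyLimit ? "rejected" : "approved";
}

// The spec, written before the implementation existed:
console.assert(processPayment(100, 450) === "rejected", "over-limit payment must be rejected");
console.assert(processPayment(50, 400) === "approved", "in-limit payment must be approved");
```

Because the checks are written first, the agent's output is constrained to one observable behavior rather than whatever it guesses the feature means.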

Why TDD Makes AI Coding Better

  • Tests act as prompts. A well-written test is a precise specification that reduces ambiguity and hallucination.
  • Builds confidence. Green tests give you evidence the AI did the right thing, not just plausible-looking code.
  • Reduces scope creep. Tight test scopes prevent AI from over-engineering solutions.
  • Catches regressions instantly. Refactors are safe because the test suite acts as a guardrail.

TDD Plan Generator Prompt Template

You are a senior software engineer practicing test-driven development.

Given the following feature description:
[FEATURE DESCRIPTION]

Generate a comprehensive TDD test plan that includes:
1. Unit tests for all core functions and edge cases
2. Integration tests for component interactions
3. E2E tests for critical user flows
4. Error/failure scenario tests
5. Performance boundary tests

For each test, provide:
- Test name (descriptive, behavior-focused)
- Input conditions
- Expected output/behavior
- Why this test matters

Do NOT write the implementation — only the test plan.

Tips for TDAD

  • Start with high-value behaviors — focus on business logic, not utility functions
  • Write descriptive test names — it('should reject payments over the daily limit'), not it('test payment')
  • Keep scopes tight — one behavior per test, no more
  • Separate agents — use one agent for writing tests, a different session for implementation; they shouldn't share context
  • Commit tests before implementation — this enforces the discipline and makes PRs reviewable

Automated Test Generation with Claude Code

Claude Code can set up an entire testing project from a single prompt, including framework configuration, test structure, and CI integration.

Setting Up a Playwright + Cucumber Project

Set up a complete Playwright + Cucumber BDD testing project for a Next.js e-commerce app.

Include:
- Playwright config with multiple browsers (Chrome, Firefox, Safari)
- Cucumber feature files directory structure
- Step definitions for common e-commerce flows
- Page Object Model pattern
- Test data fixtures
- NPM scripts for running tests
- GitHub Actions CI configuration

The app has: product listing, product detail, cart, checkout, and user auth flows.

Generate Tests for Specific Features with Edge Cases

Generate comprehensive tests for the checkout flow including:
- Happy path: successful purchase
- Empty cart handling
- Invalid payment card formats
- Expired card handling
- Network timeout during payment
- Concurrent session handling
- Price calculation with discounts + tax
- Address validation failures
- Stock exhaustion between cart add and checkout

Run the Full Quality Pipeline

Run the full quality pipeline:
1. Execute all tests and report coverage
2. Identify any coverage gaps below 80%
3. Run security scan with npm audit
4. Generate a summary report with pass/fail, coverage %, and any vulnerabilities
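
The coverage gate in step 2 can also be enforced mechanically by Jest itself, so the pipeline fails without agent intervention. A sketch using Jest's coverageThreshold option (the threshold numbers are examples, not requirements):

```typescript
// jest.config.ts — coverageThreshold is a standard Jest option; a build fails
// when any metric drops below its threshold.
import type { Config } from "jest";

const config: Config = {
  collectCoverage: true,
  coverageThreshold: {
    global: { lines: 80, branches: 80, functions: 80, statements: 80 },
  },
};

export default config;
```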

CLAUDE.md Template for Testing Standards

Place this in your repo root so Claude Code follows your testing standards automatically:

# Testing Standards

## Framework
- Unit/Integration: Jest + Testing Library
- E2E: Playwright
- API: Supertest

## Coverage Requirements
- Minimum 80% line coverage
- 100% coverage for payment and auth flows
- All new features require tests before merge

## Test Organization
- Unit tests: co-located with source files (*.test.ts)
- Integration tests: tests/integration/
- E2E tests: tests/e2e/

## Naming Conventions
- Use behavior descriptions: should [verb] when [condition]
- Group by feature with describe blocks

## Forbidden Patterns
- No test('it works') — be descriptive
- No .only in committed code
- No hardcoded test data — use factories
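
A minimal sketch of the factory pattern the last rule refers to (the User shape and defaults here are assumptions, not from any real schema):

```typescript
// Test-data factory: callers get valid defaults and override only what matters
// to the test, instead of hardcoding whole objects in every file.
interface User {
  id: string;
  email: string;
  role: "customer" | "admin";
}

let seq = 0;

function makeUser(overrides: Partial<User> = {}): User {
  seq += 1;
  return {
    id: `user-${seq}`,
    email: `user${seq}@example.com`,
    role: "customer",
    ...overrides,
  };
}

const admin = makeUser({ role: "admin" });
console.assert(admin.role === "admin" && admin.email.includes("@"));
```

The sequence counter keeps generated IDs unique, which avoids accidental coupling between tests that share fixtures.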

Playwright Agents (Planner / Generator / Healer)

Playwright now supports an agentic loop where three specialized agents collaborate on test creation and maintenance.

Setup

npx playwright init-agents --loop=claude

This creates three agent definition files in .playwright/agents/.

Planner Agent

The Planner explores the running application and produces a structured Markdown test plan.

  • Navigates the app autonomously
  • Identifies user flows, form interactions, and navigation paths
  • Produces a test-plan.md with scenarios, preconditions, and expected outcomes
  • Flags areas of complexity or risk

Customizing the Planner (planner-agent.md):
You are a QA architect. When exploring the app:
- Focus on user-facing flows, not internal implementation
- Identify at least 3 edge cases per major flow
- Note any accessibility issues you observe
- Prioritize tests by business impact

Generator Agent

The Generator transforms the Markdown test plan into working Playwright tests.

  • Reads test-plan.md and generates .spec.ts files
  • Verifies selectors by querying the live DOM
  • Uses getByRole, getByLabel, getByText (resilient selectors first)
  • Falls back to data-testid attributes when semantic selectors aren't available
  • Adds assertions for visual state, network responses, and accessibility

Healer Agent

The Healer automatically fixes broken tests when UI changes break selectors.

  • Monitors CI for test failures
  • Identifies root cause: selector changed, flow changed, or genuine bug
  • For selector changes: updates the test with correct selector
  • For flow changes: rewrites the affected steps
  • For genuine bugs: opens a GitHub issue and leaves the test failing
  • Submits a PR with the fix for human review

All three agents are defined as plain Markdown files — no code required, and fully customizable to your project's conventions.

Visual Regression Testing

Visual regression testing catches UI changes that functional tests miss — layout shifts, color changes, font rendering, and component drift.

| Tool | Accuracy | Key Feature | Pricing |
| --- | --- | --- | --- |
| Applitools | 99.9999% (vendor-claimed) | Visual AI, cross-browser | Enterprise |
| Percy | High | GitHub/CI native | Free tier available |
| Chromatic | High | Storybook native | Free tier available |
| Lost Pixel | Good | Open source | Free |

Applitools reports 99.9999% accuracy and a claimed 10x speedup in visual testing from its Visual AI model, trained on millions of screenshots.

// playwright.config.ts with Applitools
import { defineConfig } from '@playwright/test';
import { getAIConfig } from '@applitools/eyes-playwright';

export default defineConfig({
  use: {
    ...getAIConfig({
      apiKey: process.env.APPLITOOLS_API_KEY,
      appName: 'My App',
      batchName: 'CI Run'
    })
  }
});

Property-Based Testing & Fuzzing

Property-based testing generates hundreds of random inputs to find edge cases you'd never think to write manually. AI supercharges this by helping you define properties and interpret failures.

AI Prompting Pattern for Property Tests

For the function calculateShippingCost(weight, distance, expedited),
identify 5-7 invariant properties that should ALWAYS be true,
regardless of input values. Then write fast-check property tests for each.

Example properties:
- Expedited should never cost less than standard
- Cost should increase monotonically with weight
- Cost should never be negative
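
A dependency-free sketch of checking two of these properties over random inputs. In practice you would express them with fast-check's property API; the calculateShippingCost body here is a stand-in implementation invented for illustration:

```typescript
// Stand-in implementation, purely for illustration.
function calculateShippingCost(weight: number, distance: number, expedited: boolean): number {
  const base = weight * 0.5 + distance * 0.1;
  return expedited ? base * 1.5 + 5 : base;
}

// Property check: run the invariants against many random inputs.
for (let i = 0; i < 200; i++) {
  const weight = Math.random() * 100;
  const distance = Math.random() * 1000;
  const std = calculateShippingCost(weight, distance, false);
  const exp = calculateShippingCost(weight, distance, true);
  console.assert(exp >= std, `expedited cheaper than standard at w=${weight}, d=${distance}`);
  console.assert(std >= 0, "cost must never be negative");
}
```

A library like fast-check adds the important extras this sketch lacks: reproducible seeds and automatic shrinking of failing inputs to a minimal counterexample.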

When to Use Each Approach

| Approach | Best For |
| --- | --- |
| Example-based tests | Known business rules, specific user flows |
| Property-based tests | Pure functions, data transformations, mathematical invariants |
| Fuzzing | Security-sensitive inputs, parsers, file processors |

Agentic CI/CD Pipeline

Agentic CI/CD replaces rigid pipeline scripts with intelligent agents that make decisions based on code context.

The Five-Agent Pipeline

  1. Code Analysis Agent — Reads the diff, identifies changed modules, assesses risk level
  2. Test Selection Agent — Selects the minimal test set that covers changed code
  3. Execution Agent — Runs selected tests, collects results, identifies flaky tests
  4. Quality Decision Agent — Decides pass/fail based on coverage thresholds and risk level
  5. Adaptive Pipeline Agent — Updates pipeline configuration based on patterns
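
A toy sketch of step 2, test selection: map changed files to the minimal covering test set. The coverage map here is a hand-written assumption; real pipelines derive it from instrumentation data:

```typescript
// Which tests exercise which source files (illustrative data only).
const coverageMap: Record<string, string[]> = {
  "src/cart.ts": ["cart.test.ts", "checkout.e2e.ts"],
  "src/auth.ts": ["auth.test.ts"],
  "src/utils.ts": ["cart.test.ts", "auth.test.ts"],
};

// Select the union of tests covering the changed files, deduplicated.
function selectTests(changedFiles: string[]): string[] {
  const selected = new Set<string>();
  for (const file of changedFiles) {
    for (const test of coverageMap[file] ?? []) selected.add(test);
  }
  return [...selected].sort();
}

console.log(selectTests(["src/cart.ts", "src/utils.ts"]));
```

An agent improves on this static lookup by also weighing risk: a one-line change to a payment module may still justify the full suite.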

Results

Teams implementing agentic CI/CD pipelines have reported:

  • 78% reduction in deployment time by eliminating unnecessary full test suite runs
  • 3x faster delivery cycles through intelligent test selection

GitHub Copilot + Actions: Generate CI Workflows from Natural Language

GitHub Copilot can generate complete GitHub Actions workflows from a description:

Generate a GitHub Actions workflow for a Node.js + PostgreSQL app that:
- Runs on push to main and all PRs
- Caches node_modules between runs
- Runs ESLint, then Jest with coverage, then builds Docker image
- Only deploys to production on main branch
- Uses environment secrets for DATABASE_URL and DOCKER_REGISTRY
- Sends Slack notification on failure
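
A trimmed sketch of the kind of workflow such a prompt yields, showing only the test stages (action versions, script names, and branch names are assumptions):

```yaml
name: ci
on:
  push:
    branches: [main]
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npx eslint .
      - run: npx jest --coverage
```

Review generated workflows carefully before merging: they run with repository secrets, so a hallucinated step is a security issue, not just a broken build.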

Real Case Study: Elastic's Self-Healing CI

Elastic deployed Claude as a CI assistant to automatically fix broken PRs:

  • 24 broken PRs fixed automatically in one month
  • 20 engineer-days saved — time engineers would have spent debugging CI failures
  • Most fixes involved: updated snapshots, changed import paths, API signature updates

The Self-Healing Prompt

You are a CI debugging assistant. A test suite is failing.

Failing tests:
[PASTE FAILING TEST OUTPUT]

Recent changes to the codebase:
[PASTE GIT DIFF]

Your task:
1. Identify the root cause of each failure
2. Determine if this is a test issue (test needs updating) or a code bug
3. For test issues: provide the exact fix
4. For code bugs: explain the bug but DO NOT fix it — flag for human review
5. Output your changes as a unified diff

Critical: Do not change test assertions to make tests pass.
Only fix tests when the underlying behavior intentionally changed.

Common Pitfalls

  • Generating tests after the fact — AI-generated tests written after implementation often just verify the current (possibly buggy) behavior. TDD is vastly better: write tests first, let AI write the implementation.
  • Not reviewing AI-generated tests — AI tests can pass while testing the wrong thing. Always read them, and look for tautological tests like expect(add(1,1)).toBe(add(1,1)).
  • Brittle selectors in E2E tests — AI tends to generate page.locator('#submit-btn-v2'). Enforce resilient selectors: getByRole('button', { name: 'Submit' }).
  • Not running tests in CI — tests that only run locally don't prevent broken deployments. Every test suite needs a CI job.
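
The tautological-test pitfall in concrete form (add is a stand-in function; the point is the assertion style):

```typescript
function add(a: number, b: number): number {
  return a + b;
}

// Tautological: compares the function to itself, so it passes even if add is wrong.
console.assert(add(1, 1) === add(1, 1));

// Meaningful: pins the expected value independently of the implementation.
console.assert(add(1, 1) === 2);
```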

