
Advanced · 1-2 weeks

DevOps & Infrastructure with AI Agents

Automate infrastructure provisioning, CI/CD pipelines, and monitoring with AI agents.

Last reviewed Feb 27, 2026

Overview

AI agents are transforming DevOps: they generate infrastructure configs, create CI/CD pipelines, troubleshoot deployments, and remediate security findings automatically. This cookbook covers the full stack: Docker, Kubernetes, Terraform, CI/CD, and cloud deployment.

Key tools covered: Claude Code, Cursor, GitHub Copilot, Firefly.ai, cloud provider MCP servers.

Audience: DevOps engineers and full-stack developers who want to move faster without sacrificing reliability.


Docker Workflows with AI

Generate Dockerfiles from Natural Language

Generate a production-ready Dockerfile for a Node.js 20 Express API that:
- Uses multi-stage build (builder + production stages)
- Runs as non-root user
- Installs only production dependencies in final image
- Includes health check endpoint
- Optimizes layer caching (copy package.json before source)
- Final image should be under 200MB

Example AI-Generated Dockerfile:

# Stage 1: Builder
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev

# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app

# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001

COPY --from=builder /app/node_modules ./node_modules
COPY --chown=nodejs:nodejs . .

USER nodejs
EXPOSE 3000

HEALTHCHECK --interval=30s --timeout=3s --start-period=5s \
  CMD node healthcheck.js

CMD ["node", "server.js"]

Docker Compose for Local Dev Environments

Generate a docker-compose.yml for local development of a full-stack app with:
- Next.js frontend (port 3000, hot reload)
- Express API (port 4000, nodemon)
- PostgreSQL 16 (port 5432, persistent volume)
- Redis (port 6379, for sessions)
- Adminer for DB management (port 8080)
- Shared .env file

Example AI-Generated docker-compose.yml:

services:
  frontend:
    build:
      context: ./frontend
      target: development
    ports:
      - "3000:3000"
    volumes:
      - ./frontend:/app
      - /app/node_modules
    environment:
      - NEXT_PUBLIC_API_URL=http://localhost:4000
    depends_on:
      - api

  api:
    build:
      context: ./api
      target: development
    ports:
      - "4000:4000"
    volumes:
      - ./api:/app
      - /app/node_modules
    env_file: .env
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:16-alpine
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  adminer:
    image: adminer
    ports:
      - "8080:8080"

volumes:
  postgres_data:

Kubernetes with AI Agents

Generate K8s Manifests from App Descriptions

Generate Kubernetes manifests for a Node.js REST API:
- Deployment: 3 replicas, rolling update strategy
- Resources: 256Mi memory, 250m CPU (requests); 512Mi memory, 500m CPU (limits)
- Liveness and readiness probes on /health
- Service: ClusterIP, port 80 to 3000
- Ingress: api.myapp.com, TLS via cert-manager
- HorizontalPodAutoscaler: scale 3-10 replicas on 70% CPU
- ConfigMap for non-secret env vars
- Secret reference for DATABASE_URL

Example AI-Generated Deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  labels:
    app: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: myregistry/api:latest # pin an immutable tag (e.g. the git SHA) in production
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          envFrom:
            - configMapRef:
                name: api-config
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: api-secrets
                  key: database-url

AI-Assisted Troubleshooting of Pod Failures

My Kubernetes pod is in CrashLoopBackOff. Here is the output of:
- kubectl describe pod [pod-name]: [PASTE OUTPUT]
- kubectl logs [pod-name] --previous: [PASTE OUTPUT]

Diagnose the root cause and provide step-by-step remediation commands.

Terraform & Infrastructure as Code

Why Basic LLM Terraform Generation Fails

Basic LLMs generate plausible-looking Terraform that breaks in practice because:

  • They don't know your existing .tfstate — they can't reference real resource IDs
  • They hallucinate resource attribute names from outdated provider documentation
  • They don't account for dependencies between resources
  • They can't run terraform plan to validate output

State-Aware Agents

Firefly.ai is the leading state-aware Terraform agent:

  • Reads your actual .tfstate files to understand current infrastructure
  • Queries provider documentation in real-time (no hallucinated attributes)
  • Generates Terraform modules that reference your real resource IDs
  • Automatically runs terraform plan and surfaces the diff before applying
  • Detects drift between state and actual cloud infrastructure

Rackspace's Auto-Remediation Agent monitors Terraform runs and:
  • Detects failed terraform apply operations
  • Analyzes error messages and stack traces
  • Generates targeted fixes (not full rewrites)
  • Submits fix as a PR for human approval before retrying

Safe Terraform Workflow

# Step 1: Generate with AI (state-aware)
firefly generate "Add an RDS PostgreSQL 16 instance in the same VPC as ecs-cluster-prod"

# Step 2: ALWAYS review the plan before applying
terraform plan -out=tfplan
# Read every line. Understand every change.

# Step 3: Apply only after review
terraform apply tfplan

Rule: Never run terraform apply on AI-generated code without reading the plan. Misconfigured infrastructure can be expensive, irreversible, or a security vulnerability.
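For reference, the Step 1 prompt above might yield a module resembling the following sketch; the subnet group and security group values are placeholders that a state-aware tool would resolve from your actual .tfstate:

```hcl
resource "aws_db_instance" "postgres" {
  identifier        = "app-postgres"
  engine            = "postgres"
  engine_version    = "16.3"
  instance_class    = "db.t4g.medium"
  allocated_storage = 20

  db_name  = var.db_name
  username = var.db_username
  password = var.db_password # prefer a secrets-manager reference over plaintext vars

  # Placeholders -- a state-aware agent resolves these from existing
  # resources in .tfstate rather than inventing them.
  db_subnet_group_name   = var.db_subnet_group_name
  vpc_security_group_ids = var.vpc_security_group_ids

  skip_final_snapshot = false
}
```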


CI/CD Pipeline Creation

Complete GitHub Actions Workflow from Natural Language

Generate a complete GitHub Actions workflow for a Node.js app that:
1. Triggers on push to main and pull_request events
2. Runs in parallel: ESLint + Prettier check, Jest unit tests with coverage
3. Only proceeds to build if both pass
4. Builds Docker image tagged with git SHA and 'latest'
5. Pushes to GitHub Container Registry
6. Deploys to AWS ECS Fargate on main branch only
7. Posts deployment status to Slack channel #deployments
8. Caches node_modules using actions/cache

Example AI-Generated Workflow YAML:

name: CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm run format:check

  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      - run: npm test -- --coverage

  build-and-push:
    needs: [lint, test]
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest

  deploy:
    needs: build-and-push
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Deploy to ECS
        run: |
          aws ecs update-service \
            --cluster production \
            --service api \
            --force-new-deployment
      - name: Notify Slack
        uses: slackapi/slack-github-action@v1
        with:
          channel-id: 'deployments'
          slack-message: "Deployed ${{ github.sha }} to production"
        env:
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}

Cloud Deployment Workflows

AWS

  • ECS Fargate: Serverless containers — no EC2 management. AI generates task definitions, service configs, and ALB rules from app descriptions.
  • Lambda: AI can generate function code, IAM roles, and event source mappings from natural language.
  • AWS MCP Server: Provides sub-24-hour re-indexing of service documentation, so AI agents have current API knowledge.
Example prompt:

Generate an AWS Lambda function + API Gateway setup for a webhook receiver that:
- Accepts POST /webhook
- Validates HMAC-SHA256 signature from header X-Signature
- Stores payload in DynamoDB with timestamp
- Sends SNS notification to operations topic
- Include IAM role with least-privilege permissions
- Include CloudFormation template for deployment

GCP

  • Cloud Run: Zero-config container deployment. AI generates cloudbuild.yaml and Cloud Run service configs.
  • GKE Autopilot: AI generates Kubernetes manifests; GKE Autopilot handles node management.
  • Google Cloud MCP Server: Available for Vertex AI, Cloud Storage, and BigQuery integration.

Vercel

Vercel offers zero-config deployment for frontend frameworks. AI agents generate vercel.json for custom routing, environment variable setup, and preview deployment configurations.
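For example, a minimal vercel.json sketch with a custom rewrite and a response header (the rewrite destination is illustrative):

```json
{
  "rewrites": [
    { "source": "/api/:path*", "destination": "https://api.myapp.com/:path*" }
  ],
  "headers": [
    {
      "source": "/(.*)",
      "headers": [
        { "key": "X-Frame-Options", "value": "DENY" }
      ]
    }
  ]
}
```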


Security Scanning & Compliance

The Problem with Traditional SAST

Research shows that 98% of SAST findings are unexploitable at runtime — they're false positives or theoretical vulnerabilities that can't be triggered in practice. This creates massive alert fatigue.

Agentic SAST Tools

  • Terra Security — Uses AI to validate whether SAST findings are actually exploitable in your specific runtime context. Reduces noise by ~98%.
  • Checkmarx AI Agent — Analyzes vulnerabilities in context and generates autonomous fix PRs with explanations.

Autonomous Fix PR Generation

Analyze this SAST finding and determine if it is exploitable in our application:

Finding: SQL Injection in UserController.getUserById()
Code: [PASTE CODE]
Runtime context: This endpoint requires admin authentication.
  Admin users are internal employees only.

If exploitable: generate a fix PR
If not exploitable: explain why and mark as suppressed with justification
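When a finding like this is exploitable, the fix PR usually amounts to parameterizing the query. A sketch of the shape such a fix takes (the function name is illustrative; placeholder syntax follows pg-style `$1` parameters):

```javascript
// Before (vulnerable): user input interpolated straight into SQL
//   db.query(`SELECT * FROM users WHERE id = '${userId}'`)

// After: SQL text and values travel separately; the driver binds them,
// so input like "1'; DROP TABLE users;--" is treated as data, not SQL.
function getUserByIdQuery(userId) {
  return {
    text: 'SELECT * FROM users WHERE id = $1',
    values: [userId],
  };
}

module.exports = { getUserByIdQuery };
```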

Prompt Library for DevOps

1. Dockerfile Generation

Generate a production-ready Dockerfile for [LANGUAGE/RUNTIME] [VERSION] running [APP TYPE].
Requirements: multi-stage build, non-root user, health check at [ENDPOINT],
final image under [SIZE]MB. Base on [BASE_IMAGE].

2. Kubernetes Manifest Generation

Generate K8s manifests for [APP_NAME]: Deployment ([N] replicas), Service ([TYPE]),
Ingress ([DOMAIN], TLS), HPA (scale [MIN]-[MAX] on [METRIC]% [RESOURCE]).
Resource limits: [MEMORY] memory, [CPU] CPU.

3. GitHub Actions Workflow

Generate a GitHub Actions workflow that: [lint/test/build/push/deploy steps].
Target: [CLOUD PROVIDER], service: [SERVICE NAME].
Trigger on: push to [BRANCH] and PRs.
Notify [SLACK/EMAIL] on failure.

4. Terraform Module Creation

Generate a reusable Terraform module for [RESOURCE TYPE] on [CLOUD PROVIDER].
Inputs: [LIST VARIABLES]. Outputs: [LIST OUTPUTS].
Follow [COMPANY] naming conventions. Include README.md.
Provider version: [VERSION].

5. Security Audit

Audit this [Dockerfile/K8s manifest/Terraform module] for security issues.
Check for: exposed secrets, root user, unnecessary permissions, open ports,
unpinned image versions, network policies, IAM least-privilege violations.
For each finding: severity (CRITICAL/HIGH/MEDIUM/LOW), explanation, fix.

6. Performance Optimization

Analyze this [GitHub Actions workflow/Dockerfile/K8s deployment] for performance.
Identify: slow steps that could be parallelized, missing caching opportunities,
oversized images, resource misconfigurations.
Provide optimized version with explanation of each change.

7. Incident Runbook Generation

Generate an incident runbook for [SERVICE NAME] covering:
- Common failure modes and symptoms
- Diagnostic commands to run first
- Escalation criteria
- Rollback procedure
- Post-incident checklist
Base it on this architecture: [DESCRIBE ARCHITECTURE]

8. Cost Optimization Audit

Review this Terraform/K8s configuration for cost optimization opportunities.
Identify: overprovisioned resources, unused resources, savings plan opportunities,
spot instance candidates, storage tier downgrades.
Estimate monthly savings for each recommendation.

Common Pitfalls

  • Blindly applying AI-generated infrastructure configs — Always review generated Terraform with terraform plan, and always review generated K8s manifests before kubectl apply. The blast radius of infrastructure mistakes is much larger than that of application bugs.
  • Not understanding the generated Terraform before applying — If you can't explain what the Terraform does, don't apply it. Ask the AI to explain each resource before applying.
  • Ignoring security best practices in Dockerfiles — AI may generate Dockerfiles that run as root, use :latest tags, or copy in .env files. Always audit generated Dockerfiles against security checklists.
  • Not testing CI/CD changes in a staging environment — Test pipeline changes on a feature branch before merging to main. A broken main pipeline blocks the entire team.


Further Reading