Advanced · 1-2 weeks
DevOps & Infrastructure with AI Agents
Automate infrastructure provisioning, CI/CD pipelines, and monitoring with AI agents.
Last reviewed Feb 27, 2026
Overview
AI agents are transforming DevOps: they generate infrastructure configs, create CI/CD pipelines, troubleshoot deployments, and remediate security findings automatically. This cookbook covers the full stack: Docker, Kubernetes, Terraform, CI/CD, and cloud deployment.
Key tools covered: Claude Code, Cursor, GitHub Copilot, Firefly.ai, cloud provider MCP servers.
Audience: DevOps engineers and full-stack developers who want to move faster without sacrificing reliability.
Docker Workflows with AI
Generate Dockerfiles from Natural Language
Generate a production-ready Dockerfile for a Node.js 20 Express API that:
- Uses multi-stage build (builder + production stages)
- Runs as non-root user
- Installs only production dependencies in final image
- Includes health check endpoint
- Optimizes layer caching (copy package.json before source)
- Final image should be under 200MB
Example AI-Generated Dockerfile:
# Stage 1: Builder
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app
# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001
COPY --from=builder /app/node_modules ./node_modules
COPY --chown=nodejs:nodejs . .
USER nodejs
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s \
    CMD node healthcheck.js
CMD ["node", "server.js"]
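Two details worth verifying before building: the HEALTHCHECK assumes a healthcheck.js script actually exists in the image, and `COPY . .` will pull in everything not excluded from the build context. A minimal .dockerignore (entries here are typical, adjust to your repo) keeps secrets and bulk out of the image and helps hit the size target:

```
node_modules
npm-debug.log
.env
.git
Dockerfile
docker-compose.yml
*.md
```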
Docker Compose for Local Dev Environments
Generate a docker-compose.yml for local development of a full-stack app with:
- Next.js frontend (port 3000, hot reload)
- Express API (port 4000, nodemon)
- PostgreSQL 16 (port 5432, persistent volume)
- Redis (port 6379, for sessions)
- Adminer for DB management (port 8080)
- Shared .env file
Example AI-Generated docker-compose.yml:
version: '3.9'
services:
  frontend:
    build:
      context: ./frontend
      target: development
    ports:
      - "3000:3000"
    volumes:
      - ./frontend:/app
      - /app/node_modules
    environment:
      - NEXT_PUBLIC_API_URL=http://localhost:4000
    depends_on:
      - api
  api:
    build:
      context: ./api
      target: development
    ports:
      - "4000:4000"
    volumes:
      - ./api:/app
      - /app/node_modules
    env_file: .env
    depends_on:
      postgres:
        condition: service_healthy
  postgres:
    image: postgres:16-alpine
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  adminer:
    image: adminer
    ports:
      - "8080:8080"
volumes:
  postgres_data:
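The compose file reads the POSTGRES_* values from the shared .env. A sketch of that file (values are placeholders, not real credentials; DATABASE_URL and REDIS_URL are assumed names your API might read, only the POSTGRES_* variables are referenced by the compose file itself):

```
POSTGRES_DB=myapp_dev
POSTGRES_USER=myapp
POSTGRES_PASSWORD=change-me-locally
DATABASE_URL=postgres://myapp:change-me-locally@postgres:5432/myapp_dev
REDIS_URL=redis://redis:6379
```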
Kubernetes with AI Agents
Generate K8s Manifests from App Descriptions
Generate Kubernetes manifests for a Node.js REST API:
- Deployment: 3 replicas, rolling update strategy
- Resources: 256Mi memory, 250m CPU (requests); 512Mi memory, 500m CPU (limits)
- Liveness and readiness probes on /health
- Service: ClusterIP, port 80 to 3000
- Ingress: api.myapp.com, TLS via cert-manager
- HorizontalPodAutoscaler: scale 3-10 replicas on 70% CPU
- ConfigMap for non-secret env vars
- Secret reference for DATABASE_URL
Example AI-Generated Deployment YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  labels:
    app: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: myregistry/api:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          envFrom:
            - configMapRef:
                name: api-config
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: api-secrets
                  key: database-url
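The Deployment covers only part of the prompt; the Service and HorizontalPodAutoscaler it asks for might look like the sketch below (the names api and api-config match the Deployment above, everything else follows the prompt's requirements):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  type: ClusterIP
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 3000
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```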
AI-Assisted Troubleshooting of Pod Failures
My Kubernetes pod is in CrashLoopBackOff. Here is the output of:
- kubectl describe pod [pod-name]: [PASTE OUTPUT]
- kubectl logs [pod-name] --previous: [PASTE OUTPUT]
Diagnose the root cause and provide step-by-step remediation commands.
Terraform & Infrastructure as Code
Why Basic LLM Terraform Generation Fails
Basic LLMs generate plausible-looking Terraform that breaks in practice because:
- They don't know your existing .tfstate — they can't reference real resource IDs
- They hallucinate resource attribute names from outdated provider documentation
- They don't account for dependencies between resources
- They can't run terraform plan to validate output
State-Aware Agents
Firefly.ai is the leading state-aware Terraform agent:
- Reads your actual .tfstate files to understand current infrastructure
- Queries provider documentation in real-time (no hallucinated attributes)
- Generates Terraform modules that reference your real resource IDs
- Automatically runs terraform plan and surfaces the diff before applying
- Detects drift between state and actual cloud infrastructure
Rackspace's Auto-Remediation Agent monitors Terraform runs and:
- Detects failed terraform apply operations
- Analyzes error messages and stack traces
- Generates targeted fixes (not full rewrites)
- Submits the fix as a PR for human approval before retrying
Safe Terraform Workflow
# Step 1: Generate with AI (state-aware)
firefly generate "Add an RDS PostgreSQL 16 instance in the same VPC as ecs-cluster-prod"
# Step 2: ALWAYS review the plan before applying
terraform plan -out=tfplan
# Read every line. Understand every change.
# Step 3: Apply only after review
terraform apply tfplan
Rule: Never run terraform apply on AI-generated code without reading the plan. Misconfigured infrastructure can be expensive, irreversible, or a security vulnerability.
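One way to enforce that rule in CI is to run the plan automatically but gate the apply behind a manually approved environment. A hedged GitHub Actions sketch (the environment name and the assumption that your workflow already handles cloud credentials are placeholders):

```yaml
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan -out=tfplan
      # Save the plan so the apply job runs exactly what was reviewed
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: tfplan
  apply:
    needs: plan
    # 'production' must be configured with required reviewers,
    # so a human reads the plan output before this job starts
    environment: production
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - uses: actions/download-artifact@v4
        with:
          name: tfplan
      - run: terraform apply tfplan
```

Applying the saved tfplan (rather than re-planning) guarantees the reviewed diff is exactly what gets applied.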
CI/CD Pipeline Creation
Complete GitHub Actions Workflow from Natural Language
Generate a complete GitHub Actions workflow for a Node.js app that:
1. Triggers on push to main and pull_request events
2. Runs in parallel: ESLint + Prettier check, Jest unit tests with coverage
3. Only proceeds to build if both pass
4. Builds Docker image tagged with git SHA and 'latest'
5. Pushes to GitHub Container Registry
6. Deploys to AWS ECS Fargate on main branch only
7. Posts deployment status to Slack channel #deployments
8. Caches node_modules using actions/cache
Full Working Generated YAML:
name: CI/CD
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm run format:check
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      - run: npm test -- --coverage
  build-and-push:
    needs: [lint, test]
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
  deploy:
    needs: build-and-push
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Deploy to ECS
        run: |
          aws ecs update-service \
            --cluster production \
            --service api \
            --force-new-deployment
      - name: Notify Slack
        uses: slackapi/slack-github-action@v1
        with:
          channel-id: 'deployments'
          slack-message: "Deployed ${{ github.sha }} to production"
        env:
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
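The deploy job above authenticates with long-lived AWS access keys stored as secrets. If your AWS account has a GitHub OIDC identity provider configured, the same action can assume a role with short-lived credentials instead, which removes the static keys entirely (the role ARN below is a placeholder):

```yaml
permissions:
  id-token: write
  contents: read
steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      # Placeholder role: create it with a trust policy for GitHub's OIDC provider
      role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
      aws-region: us-east-1
```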
Cloud Deployment Workflows
AWS
- ECS Fargate: Serverless containers — no EC2 management. AI generates task definitions, service configs, and ALB rules from app descriptions.
- Lambda: AI can generate function code, IAM roles, and event source mappings from natural language.
- AWS MCP Server: Provides sub-24-hour re-indexing of service documentation, so AI agents have current API knowledge.
Generate an AWS Lambda function + API Gateway setup for a webhook receiver that:
- Accepts POST /webhook
- Validates HMAC-SHA256 signature from header X-Signature
- Stores payload in DynamoDB with timestamp
- Sends SNS notification to operations topic
- Include IAM role with least-privilege permissions
- Include CloudFormation template for deployment
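The HMAC-SHA256 signature check at the core of that webhook can be sketched with openssl before wiring it into Lambda (the payload and secret below are illustrative, not real values):

```shell
# Recompute the signature the sender should have put in X-Signature
payload='{"order_id": 42, "status": "paid"}'
secret='example-webhook-secret'   # in production, read from a secrets store

signature=$(printf '%s' "$payload" \
  | openssl dgst -sha256 -hmac "$secret" \
  | awk '{print $NF}')

# A tampered payload yields a different signature, so comparison fails
tampered=$(printf '%s' '{"order_id": 42, "status": "refunded"}' \
  | openssl dgst -sha256 -hmac "$secret" \
  | awk '{print $NF}')

echo "signature=$signature"
[ "$signature" != "$tampered" ] && echo "tamper check: signatures differ"
```

In the Lambda handler itself, compare signatures with a constant-time comparison rather than plain string equality, to avoid timing attacks.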
GCP
- Cloud Run: Zero-config container deployment. AI generates cloudbuild.yaml and Cloud Run service configs.
- GKE Autopilot: AI generates Kubernetes manifests; GKE Autopilot handles node management.
- Google Cloud MCP Server: Available for Vertex AI, Cloud Storage, and BigQuery integration.
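A minimal cloudbuild.yaml of the kind described for Cloud Run might look like this sketch (the service name api and region are placeholders; $PROJECT_ID and $COMMIT_SHA are standard Cloud Build substitutions):

```yaml
steps:
  # Build and push the container image
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/api:$COMMIT_SHA', '.']
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'gcr.io/$PROJECT_ID/api:$COMMIT_SHA']
  # Deploy the new revision to Cloud Run
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: gcloud
    args:
      - run
      - deploy
      - api
      - --image=gcr.io/$PROJECT_ID/api:$COMMIT_SHA
      - --region=us-central1
      - --platform=managed
images:
  - 'gcr.io/$PROJECT_ID/api:$COMMIT_SHA'
```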
Vercel
Vercel offers zero-config deployment for frontend frameworks. AI agents generate vercel.json for custom routing, environment variable setup, and preview deployment configurations.
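A small vercel.json of the kind an agent might generate for custom routing (the backend URL is a placeholder):

```json
{
  "rewrites": [
    { "source": "/api/:path*", "destination": "https://api.example.com/:path*" }
  ],
  "headers": [
    {
      "source": "/(.*)",
      "headers": [
        { "key": "X-Frame-Options", "value": "DENY" }
      ]
    }
  ]
}
```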
Security Scanning & Compliance
The Problem with Traditional SAST
Industry research suggests that as many as 98% of SAST findings are unexploitable at runtime — they're false positives or theoretical vulnerabilities that can't be triggered in practice. This creates massive alert fatigue.
Agentic SAST Tools
- Terra Security — Uses AI to validate whether SAST findings are actually exploitable in your specific runtime context. Reduces noise by ~98%.
- Checkmarx AI Agent — Analyzes vulnerabilities in context and generates autonomous fix PRs with explanations.
Autonomous Fix PR Generation
Analyze this SAST finding and determine if it is exploitable in our application:
Finding: SQL Injection in UserController.getUserById()
Code: [PASTE CODE]
Runtime context: This endpoint requires admin authentication.
Admin users are internal employees only.
If exploitable: generate a fix PR
If not exploitable: explain why and mark as suppressed with justification
Prompt Library for DevOps
1. Dockerfile Generation
Generate a production-ready Dockerfile for [LANGUAGE/RUNTIME] [VERSION] running [APP TYPE].
Requirements: multi-stage build, non-root user, health check at [ENDPOINT],
final image under [SIZE]MB. Base on [BASE_IMAGE].
2. Kubernetes Manifest Generation
Generate K8s manifests for [APP_NAME]: Deployment ([N] replicas), Service ([TYPE]),
Ingress ([DOMAIN], TLS), HPA (scale [MIN]-[MAX] on [METRIC]% [RESOURCE]).
Resource limits: [MEMORY] memory, [CPU] CPU.
3. GitHub Actions Workflow
Generate a GitHub Actions workflow that: [lint/test/build/push/deploy steps].
Target: [CLOUD PROVIDER], service: [SERVICE NAME].
Trigger on: push to [BRANCH] and PRs.
Notify [SLACK/EMAIL] on failure.
4. Terraform Module Creation
Generate a reusable Terraform module for [RESOURCE TYPE] on [CLOUD PROVIDER].
Inputs: [LIST VARIABLES]. Outputs: [LIST OUTPUTS].
Follow [COMPANY] naming conventions. Include README.md.
Provider version: [VERSION].
5. Security Audit
Audit this [Dockerfile/K8s manifest/Terraform module] for security issues.
Check for: exposed secrets, root user, unnecessary permissions, open ports,
unpinned image versions, network policies, IAM least-privilege violations.
For each finding: severity (CRITICAL/HIGH/MEDIUM/LOW), explanation, fix.
6. Performance Optimization
Analyze this [GitHub Actions workflow/Dockerfile/K8s deployment] for performance.
Identify: slow steps that could be parallelized, missing caching opportunities,
oversized images, resource misconfigurations.
Provide optimized version with explanation of each change.
7. Incident Runbook Generation
Generate an incident runbook for [SERVICE NAME] covering:
- Common failure modes and symptoms
- Diagnostic commands to run first
- Escalation criteria
- Rollback procedure
- Post-incident checklist
Base it on this architecture: [DESCRIBE ARCHITECTURE]
8. Cost Optimization Audit
Review this Terraform/K8s configuration for cost optimization opportunities.
Identify: overprovisioned resources, unused resources, savings plan opportunities,
spot instance candidates, storage tier downgrades.
Estimate monthly savings for each recommendation.
Common Pitfalls
- Blindly applying AI-generated infrastructure configs — Always review generated Terraform with terraform plan. Always review generated K8s manifests before kubectl apply. The blast radius of infrastructure mistakes is much larger than application bugs.
- Not understanding the generated Terraform before applying — If you can't explain what the Terraform does, don't apply it. Ask the AI to explain each resource before applying.
- Ignoring security best practices in Dockerfiles — AI may generate Dockerfiles that run as root, use :latest tags, or copy in .env files. Always audit generated Dockerfiles against security checklists.
- Not testing CI/CD changes in a staging environment — Test pipeline changes on a feature branch before merging to main. A broken main pipeline blocks the entire team.
Related cookbooks
AI-Powered Code Review & Quality
Automate code review and enforce quality standards using AI-powered tools and agentic workflows.
Building AI-Powered Applications
Build applications powered by LLMs, RAG, and AI agents using Claude Code, Cursor, and modern AI frameworks.
Building APIs & Backends with AI Agents
Design and build robust APIs and backend services with AI coding agents, from REST to GraphQL.