Operations

Auto review

Deployment, monitoring, and operational readiness

Hats

Review

Auto

Unit Types

Ops, Backend

Inputs

Inception, Product, Development

Dependencies

Inception: discovery

Product: behavioral-spec

Development: code

Development: architecture

Hat Sequence

Ops Engineer

Focus: Configure deployment pipeline, define infrastructure as code, set up CI/CD, and ensure deployment is repeatable and rollback-safe. Every deployment should be automated, auditable, and reversible.

Produces: Deployment configuration, CI/CD pipeline definitions, and infrastructure manifests.

Reads: code and architecture via the unit's ## References section.

Anti-patterns:

Manual deployment steps that require human intervention
Hardcoded secrets or environment-specific values in code
No rollback strategy — every deployment must be reversible
Skipping health checks — the system must verify its own readiness
Creating deployment config without testing it (terraform plan, docker build, etc.)
Mixing infrastructure concerns with application code
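
As an illustration of the "reversible" and "verify its own readiness" points above, here is a minimal sketch of a deploy step that gates on a health check and rolls back automatically when it fails. The `Deployer` class, the `deploy` function, and the version strings are hypothetical stand-ins for whatever deployment tooling the unit actually uses; only the control flow is the point.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Deployer:
    """Hypothetical in-memory stand-in for a real deployment target."""
    current: str
    history: List[str] = field(default_factory=list)

    def release(self, version: str) -> None:
        # Remember the running version so the release can be undone.
        self.history.append(self.current)
        self.current = version

    def rollback(self) -> None:
        # Restore the most recently replaced version.
        self.current = self.history.pop()


def deploy(deployer: Deployer, version: str,
           health_check: Callable[[str], bool]) -> bool:
    """Release `version`; roll back if the health check reports failure."""
    deployer.release(version)
    if health_check(deployer.current):
        return True
    deployer.rollback()
    return False


if __name__ == "__main__":
    target = Deployer(current="v1")
    # A failing health check leaves the previous version in place.
    ok = deploy(target, "v2", health_check=lambda v: False)
    print(ok, target.current)
```

In a real pipeline the health check would call the service's readiness endpoint, and the rollback would redeploy the previous artifact rather than swap a string.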

SRE

Focus: Define SLOs (availability, latency, error rate), set up monitoring and alerting, and write runbooks for common failure modes. The goal is that when something breaks at 3 AM, the on-call engineer has a step-by-step guide.

Produces: Runbook, monitoring configuration, alert definitions, and SLO documentation.

Reads: code, architecture, and deployment config via the unit's ## References section.

Anti-patterns:

Alerting on individual events instead of aggregate signals (alert on error rate, not individual errors)
SLOs without error budgets — an SLO without a budget is just a wish
Runbooks that say "page the oncall" without diagnostic steps
Monitoring that generates noise (alert fatigue makes real alerts invisible)
Not defining what "healthy" looks like before defining what "unhealthy" looks like
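
To make the error-budget point above concrete, the following sketch derives a budget from an SLO target. The function names, the 30-day window, and the request counts are assumptions chosen for illustration, not values taken from this document.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    A negative result means the budget has been overspent.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("no requests observed or SLO leaves no budget")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% availability target over 30 days allows roughly 43.2 minutes.
    print(round(error_budget_minutes(0.999), 1))
    # 250 failures out of 1,000,000 requests spends about a quarter of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

Writing the budget down this way gives the team a number to spend or protect, which is what separates an SLO from a wish.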

Criteria Guidance

Good criteria examples:

"Deployment pipeline runs terraform plan in CI and requires approval before apply"
"Runbook covers: service restart, database failover, cache flush, and certificate rotation with step-by-step commands"
"Alerts fire when error rate exceeds 1% over 5 minutes, with PagerDuty routing"
"Health check endpoint responds within 5 seconds and verifies database connectivity"
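
The alerting criterion above ("error rate exceeds 1% over 5 minutes") is specific enough to express as code. The sketch below is one possible reading of it, using a sliding window over request outcomes; the `ErrorRateAlert` class is hypothetical, it assumes events are recorded in non-decreasing timestamp order, and routing to PagerDuty is out of scope.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateAlert:
    """Fires when the error rate within a sliding window exceeds a threshold."""

    def __init__(self, threshold: float = 0.01, window_seconds: float = 300.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, bool]] = deque()

    def record(self, timestamp: float, is_error: bool) -> None:
        # Assumes timestamps arrive in non-decreasing order.
        self._events.append((timestamp, is_error))

    def should_fire(self, now: float) -> bool:
        # Drop events that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        if not self._events:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        # "Exceeds" is read as strictly greater than the threshold.
        return errors / len(self._events) > self.threshold


if __name__ == "__main__":
    alert = ErrorRateAlert()
    for t in range(100):
        alert.record(float(t), is_error=(t < 2))  # 2 errors in 100 requests
    print(alert.should_fire(now=100.0))
```

A production setup would normally express the same rule in the monitoring system's own query language; the value of a criterion this precise is that it can be checked mechanically either way.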

Bad criteria examples:

"Deployment is automated"
"Runbook exists"
"Monitoring is set up"

Completion Signal

Deployment pipeline defined and validated (builds, plans, and applies successfully). Monitoring covers key metrics (latency, error rate, throughput). Runbook exists for common failure modes with step-by-step remediation commands. SLOs defined with alert thresholds and error budgets.