Best Practice: Implementing Infrastructure as Code Consistently

Context

Infrastructure as Code is more than "writing Terraform". It is a discipline: infrastructure counts as existing only once it is in version control, has been reviewed, and has been deployed through CI/CD. Without this discipline, configuration drift accumulates and the team loses control of its environment.

Target State

Full IaC implementation means:

  • All production resources are defined in Terraform (or Pulumi/CDK)

  • Remote state with locking prevents concurrent state changes

  • Module library enables code reuse without copy-paste

  • Drift detection runs daily and alerts on deviations

  • Disaster recovery: the environment can be rebuilt from IaC in under 2 hours

Technical Implementation

Step 1: Repository Structure

infrastructure/
├── modules/           # Reusable modules
│   ├── networking/    # VPC, Subnets, Security Groups
│   ├── compute/       # ECS Cluster, Auto Scaling, EC2
│   ├── database/      # RDS, ElastiCache
│   ├── observability/ # CloudWatch, X-Ray, Dashboards
│   └── mandatory-tags/# Mandatory tags (shared across teams)
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── backend.tf
│   ├── staging/
│   └── production/
└── live/              # Atlantis/Terragrunt live configs (optional)
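
As a rough sketch of how an environment could consume this layout, environments/production/main.tf might call the shared modules by relative path. The input and output names (environment, vpc_cidr, subnet_ids, private_subnet_ids) and the tag values are assumptions for illustration; the actual module interfaces are not defined in this document.

```hcl
# infrastructure/environments/production/main.tf (illustrative sketch)

provider "aws" {
  region = "eu-central-1"

  # default_tags applies these tags to resources that support tagging.
  # Tag keys and values here are hypothetical examples.
  default_tags {
    tags = {
      Environment = "production"
      ManagedBy   = "terraform"
    }
  }
}

module "networking" {
  source = "../../modules/networking"

  # Hypothetical input variables; the real module interface may differ.
  environment = "production"
  vpc_cidr    = "10.0.0.0/16"
}

module "database" {
  source = "../../modules/database"

  environment = "production"
  # Hypothetical output of the networking module.
  subnet_ids = module.networking.private_subnet_ids
}
```

Keeping the environment directory limited to module calls and provider settings means a review of production changes mostly comes down to reading changed input values.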

Step 2: Configure Remote State

# infrastructure/environments/production/backend.tf
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state-prod"
    key            = "payment-service/production/terraform.tfstate"
    region         = "eu-central-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:eu-central-1:123456789012:key/abc123"  # Placeholder ARN; AWS account IDs have 12 digits
  }
}

# S3 bucket and DynamoDB lock table (one-time bootstrap)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "myorg-terraform-state-prod"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
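
The bootstrap resources above do not show encryption or public-access settings for the bucket itself. A minimal sketch of how they might be added, assuming the KMS key referenced in the backend configuration already exists and is passed in through a hypothetical variable state_kms_key_arn:

```hcl
variable "state_kms_key_arn" {
  description = "ARN of the KMS key used to encrypt state objects (hypothetical variable)"
  type        = string
}

# Encrypt new state objects with the KMS key by default.
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.state_kms_key_arn
    }
  }
}

# State files can contain secrets, so block every form of public access.
resource "aws_s3_bucket_public_access_block" "state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```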

Step 3: Automate Drift Detection

# .github/workflows/drift-detection.yml
name: Terraform Drift Detection

on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 06:00 UTC
  workflow_dispatch:

jobs:
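  drift-check:
    name: Drift Check – Production
    runs-on: ubuntu-latest
    environment: production-readonly
    permissions:
      id-token: write  # Required to assume the AWS role via OIDC
      contents: read   # Required for actions/checkout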

    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "~1.6"
          terraform_wrapper: false  # Run the plain binary so the shell receives terraform's own exit code

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_READONLY_ROLE_ARN }}
          aws-region: eu-central-1

      - name: Terraform Init
        working-directory: infrastructure/environments/production
        run: terraform init

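      - name: Terraform Plan (Drift Detection)
        id: plan
        working-directory: infrastructure/environments/production
        run: |
          # With -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present (drift)
          terraform plan -detailed-exitcode -input=false -no-color -out=plan.out 2>&1 | tee plan.txt
          EXIT_CODE=${PIPESTATUS[0]}
          echo "exit-code=$EXIT_CODE" >> "$GITHUB_OUTPUT"
        continue-on-error: true

      - name: Fail on Plan Error
        # A failed plan must not be mistaken for "no drift"
        if: steps.plan.outputs.exit-code == '1'
        run: exit 1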

      - name: Alert on Drift Detected
        if: steps.plan.outputs.exit-code == '2'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "⚠️ DRIFT DETECTED in production infrastructure!\n${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_OPS_WEBHOOK }}
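
The workflow assumes a read-only role through AWS_READONLY_ROLE_ARN. One way such a role could be defined is sketched here, under the assumptions that the GitHub OIDC provider is already registered in the account and that the repository is myorg/infrastructure (a hypothetical name). The role name is illustrative as well.

```hcl
# Look up the GitHub Actions OIDC provider (assumed to exist in the account already).
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "drift_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Restrict the role to jobs that run in the "production-readonly" environment
    # of the (hypothetical) myorg/infrastructure repository.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:myorg/infrastructure:environment:production-readonly"]
    }
  }
}

resource "aws_iam_role" "drift_readonly" {
  name               = "terraform-drift-readonly" # illustrative name
  assume_role_policy = data.aws_iam_policy_document.drift_assume.json
}

# AWS-managed policy that grants read-only access across services.
resource "aws_iam_role_policy_attachment" "readonly" {
  role       = aws_iam_role.drift_readonly.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}
```

Because terraform plan takes a state lock by default, such a role may additionally need write access to the lock table and decrypt permission on the state KMS key; otherwise the plan would have to run with -lock=false.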

Step 4: Brownfield Migration (Importing Existing Resources)

# Step 1: Write the existing resource as Terraform code
resource "aws_security_group" "app" {
  name        = "payment-service-sg"
  description = "Security group for payment service"
  vpc_id      = aws_vpc.main.id
  # ... rules matching the current state
}

# Step 2: Import block (Terraform 1.5+)
import {
  to = aws_security_group.app
  id = "sg-0123456789abcdef0"  # Actual AWS resource ID
}

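# Step 3: Run the import
# terraform plan   # Review: the summary should show "1 to import" and no other changes
# terraform apply  # Performs the import into state; adjust the code until the plan shows no changes to the resource
#
# Alternatives:
# terraform plan -generate-config-out=generated.tf  # Generates code for import targets that have no resource block yet (skip Step 1 in that case)
# terraform import aws_security_group.app sg-0123456789abcdef0  # Classic CLI import, used instead of the import block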

Common Anti-Patterns

Anti-Pattern                              Problem
Local Terraform state                     No sharing, no locking, lost if laptop breaks
Running terraform apply locally           No review, no audit trail, conflicts with CI/CD
Modules without version pinning           Uncontrolled upstream changes break deployments
Console changes as a "quick fix"          Starts drift accumulation; never gets moved to IaC
All resources in a single main.tf         Hard to review, slow state operations, no modularity
-auto-approve in CI without plan review   Unplanned changes are applied blindly
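
To illustrate the anti-pattern "Modules without version pinning", here is a sketch of pinned Terraform, provider, and module versions. The version numbers are examples only, and the Git URL and tag point to a hypothetical internal repository.

```hcl
terraform {
  # Allows 1.6.x patch releases only.
  required_version = "~> 1.6.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Registry module: pin with the version argument.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "production"
  cidr = "10.0.0.0/16"
}

# Git-hosted module: pin to a tag with ?ref= (hypothetical repository and tag).
module "mandatory_tags" {
  source = "git::https://github.com/myorg/terraform-modules.git//mandatory-tags?ref=v1.2.0"
}
```

Committing the generated .terraform.lock.hcl file alongside these constraints keeps the exact provider versions identical between local runs and CI.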

Metrics

  • IaC coverage: % of production resources under IaC management (target: 100%)

  • Drift rate: % of resources with active drift (target: < 2%)

  • Time-to-detect drift: Time between drift occurrence and detection (target: < 24h)

  • Time-to-remediate drift: Time between detection and resolution (target: within the SLA defined for each severity level)

Maturity Levels

Level     Characteristics
Level 1   No IaC; all infrastructure created manually in the console.
Level 2   Parts of the infrastructure as IaC; manual resources coexist; no remote state.
Level 3   100% production infrastructure as IaC; remote state; drift detection daily; manual changes restricted.
Level 4   GitOps workflow; drift alerts; SLA-based remediation; module library in use.
Level 5   Automatic drift remediation for safe patterns; full GitOps pipeline; 0 unresolved drift > 48h.