Best Practice: Implementing Infrastructure as Code Consistently
Context
Infrastructure as Code is more than "writing Terraform". It is a discipline: infrastructure exists only once it is in version control, has been reviewed, and has been deployed through CI/CD. Without this discipline, configuration drift accumulates and the team loses control over its environment.
Target State
Full IaC implementation means:
- All production resources are defined in Terraform (or Pulumi/CDK)
- Remote state with locking prevents concurrent state changes
- Module library enables code reuse without copy-paste
- Drift detection runs daily and alerts on deviations
- Disaster recovery can be reproduced from IaC in < 2 hours
Technical Implementation
Step 1: Repository Structure
```
infrastructure/
├── modules/              # Reusable modules
│   ├── networking/       # VPC, Subnets, Security Groups
│   ├── compute/          # ECS Cluster, Auto Scaling, EC2
│   ├── database/         # RDS, ElastiCache
│   ├── observability/    # CloudWatch, X-Ray, Dashboards
│   └── mandatory-tags/   # Mandatory tags (shared across teams)
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── backend.tf
│   ├── staging/
│   └── production/
└── live/                 # Atlantis/Terragrunt live configs (optional)
```
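The `mandatory-tags` module in the tree above could be as small as a few validated inputs and one output. The following is a minimal sketch, not the actual module: the variable names, the tag keys, and the list of allowed environments are assumptions chosen for illustration.

```hcl
# modules/mandatory-tags/main.tf (hypothetical contents)

variable "team" {
  description = "Owning team, used for cost allocation and alert routing"
  type        = string
}

variable "environment" {
  description = "Deployment environment"
  type        = string

  # Reject typos such as "prod" or "Production" at plan time.
  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "The environment must be one of: dev, staging, production."
  }
}

variable "service" {
  description = "Name of the service the resources belong to"
  type        = string
}

output "tags" {
  description = "Tag map to merge into every taggable resource"
  value = {
    Team        = var.team
    Environment = var.environment
    Service     = var.service
    ManagedBy   = "terraform"
  }
}
```

An environment would call the module with `source = "../../modules/mandatory-tags"` and pass `module.<name>.tags` to the `tags` argument of its resources or to the other modules. Keeping the tag set in one module means a new mandatory tag is added in a single place.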
Step 2: Configure Remote State
```hcl
# infrastructure/environments/production/backend.tf
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state-prod"
    key            = "payment-service/production/terraform.tfstate"
    region         = "eu-central-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
    # Placeholder ARN; AWS account IDs have 12 digits
    kms_key_id     = "arn:aws:kms:eu-central-1:123456789012:key/abc123"
  }
}

# S3 bucket and DynamoDB lock table (one-time bootstrap).
# These must exist before `terraform init` can use the backend above,
# so they are applied once from a separate bootstrap configuration.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "myorg-terraform-state-prod"

  lifecycle {
    prevent_destroy = true
  }
}

# Versioning keeps earlier state files recoverable
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# The S3 backend expects a string hash key named exactly "LockID"
resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```
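Because state files can contain secrets, the bootstrap configuration could also lock down the bucket itself. The following is a sketch under two assumptions: it lives in the same configuration as the `aws_s3_bucket.terraform_state` resource above, and the KMS key ARN is passed in through a hypothetical `state_kms_key_arn` variable.

```hcl
# Hypothetical input: ARN of the KMS key referenced as kms_key_id in backend.tf
variable "state_kms_key_arn" {
  description = "ARN of the KMS key used to encrypt Terraform state"
  type        = string
}

# Block every form of public access to the state bucket
resource "aws_s3_bucket_public_access_block" "state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Encrypt objects by default, even if a client forgets to request encryption
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.state_kms_key_arn
    }
  }
}
```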
Step 3: Automate Drift Detection
```yaml
# .github/workflows/drift-detection.yml
name: Terraform Drift Detection

on:
  schedule:
    - cron: '0 6 * * *' # Daily at 06:00 UTC
  workflow_dispatch:

jobs:
  drift-check:
    name: Drift Check – Production
    runs-on: ubuntu-latest
    environment: production-readonly
    permissions:
      id-token: write # Required to assume the AWS role via OIDC (no static keys are configured)
      contents: read
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "~1.6"
          terraform_wrapper: false # Let the shell see Terraform's raw exit code

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_READONLY_ROLE_ARN }}
          aws-region: eu-central-1

      - name: Terraform Init
        working-directory: infrastructure/environments/production
        run: terraform init

      - name: Terraform Plan (Drift Detection)
        id: plan
        working-directory: infrastructure/environments/production
        # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present (drift)
        run: |
          terraform plan -detailed-exitcode -out=plan.out 2>&1 | tee plan.txt
          EXIT_CODE=${PIPESTATUS[0]}
          echo "exit-code=$EXIT_CODE" >> "$GITHUB_OUTPUT"
        continue-on-error: true

      - name: Alert on Drift Detected
        if: steps.plan.outputs.exit-code == '2'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "⚠️ DRIFT DETECTED in production infrastructure!\n${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_OPS_WEBHOOK }}

      # A failed plan is not drift, but it must not pass silently either
      - name: Fail on Plan Error
        if: steps.plan.outputs.exit-code == '1'
        run: exit 1
```
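The workflow assumes a role whose ARN is stored in `AWS_READONLY_ROLE_ARN`. One way such a role might be defined is sketched below. Everything here is an assumption rather than part of the original setup: the variable names, the role name, the use of GitHub's OIDC provider, and the choice of the AWS-managed `ReadOnlyAccess` policy. Note that `terraform plan` acquires the state lock by default, so even a read-only role needs write access to the lock table, plus permission to decrypt the state, which a read-only policy typically does not grant.

```hcl
# Hypothetical inputs
variable "github_repository" {
  description = "GitHub repository in owner/name form"
  type        = string
}

variable "lock_table_arn" {
  description = "ARN of the DynamoDB state lock table"
  type        = string
}

variable "state_kms_key_arn" {
  description = "ARN of the KMS key that encrypts the state"
  type        = string
}

# Assumes the GitHub OIDC provider is already registered in the account
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Jobs that set `environment:` present the subject
    # repo:<owner>/<repo>:environment:<name>
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:${var.github_repository}:environment:production-readonly"]
    }
  }
}

resource "aws_iam_role" "drift_detection" {
  name               = "terraform-drift-detection"
  assume_role_policy = data.aws_iam_policy_document.assume.json
}

# Read access for refreshing resources and reading the state object
resource "aws_iam_role_policy_attachment" "read_only" {
  role       = aws_iam_role.drift_detection.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

data "aws_iam_policy_document" "state_access" {
  statement {
    sid       = "StateLock"
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = [var.lock_table_arn]
  }

  statement {
    sid       = "DecryptState"
    actions   = ["kms:Decrypt"]
    resources = [var.state_kms_key_arn]
  }
}

resource "aws_iam_role_policy" "state_access" {
  name   = "terraform-state-access"
  role   = aws_iam_role.drift_detection.id
  policy = data.aws_iam_policy_document.state_access.json
}
```

Restricting the trust policy to the `production-readonly` environment means only jobs that declare that environment, as the drift check does, can assume the role.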
Step 4: Brownfield Migration (Importing Existing Resources)
```hcl
# Step 1: Write the existing resource as Terraform code
resource "aws_security_group" "app" {
  name        = "payment-service-sg"
  description = "Security group for payment service"
  vpc_id      = aws_vpc.main.id
  # ... rules matching the current state
}

# Step 2: Import block (Terraform 1.5+)
import {
  to = aws_security_group.app
  id = "sg-0123456789abcdef0" # Actual AWS resource ID
}

# Step 3: Run the import
# With the import block above (Terraform 1.5+):
#   terraform plan    # Verify: one resource to import, nothing to add, change, or destroy
#   terraform apply   # Records the import in state; changes nothing in AWS if the plan was clean
# If the resource block from Step 1 has not been written yet, Terraform can draft it:
#   terraform plan -generate-config-out=generated.tf
# Classic alternative without an import block:
#   terraform import aws_security_group.app sg-0123456789abcdef0
```
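Brownfield environments usually contain many resources of the same type, and writing one import block per resource gets tedious. From Terraform 1.7 onward, import blocks accept `for_each`. The sketch below assumes that newer version (the drift workflow in this document pins 1.6) and uses placeholder S3 bucket names.

```hcl
# Requires Terraform 1.7 or newer. Bucket names are placeholders.
locals {
  legacy_buckets = {
    logs    = "myorg-legacy-logs"
    exports = "myorg-legacy-exports"
  }
}

# One import per map entry; each.key becomes the resource instance key
import {
  for_each = local.legacy_buckets
  to       = aws_s3_bucket.legacy[each.key]
  id       = each.value
}

resource "aws_s3_bucket" "legacy" {
  for_each = local.legacy_buckets
  bucket   = each.value
}
```

As with a single import, the plan should list only imports. Any planned change or replacement indicates that the written configuration does not yet match the real resource.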
Common Anti-Patterns
| Anti-Pattern | Problem |
|---|---|
| Local Terraform state | No sharing, no locking, lost if laptop breaks |
| Running | No review, no audit trail, conflicts with CI/CD |
| Modules without version pinning | Uncontrolled upstream changes break deployments |
| Console changes as a "quick fix" | Starts drift accumulation; never gets moved to IaC |
| All resources in a single | Hard to review, slow state operations, no modularity |
| | Unplanned changes are applied blindly |
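To illustrate the version-pinning row: pinning applies at three levels, the Terraform CLI, the providers, and any module fetched from outside the repository. The sketch below is illustrative only. The Git URL, the tag, and the version numbers are placeholders, and module inputs are omitted.

```hcl
terraform {
  required_version = "~> 1.6.0" # Any 1.6.x patch release

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # Any 5.x release, never 6.0
    }
  }
}

# Git source pinned to a tag (hypothetical repository and tag)
module "networking" {
  source = "git::https://github.com/myorg/terraform-modules.git//networking?ref=v1.4.2"
}

# Registry modules take a separate version argument
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"
}
```

Modules referenced by a local path, such as `../../modules/networking`, cannot be pinned this way. They are versioned together with the repository commit that contains them.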
Metrics
- IaC coverage: % of production resources under IaC management (target: 100%)
- Drift rate: % of resources with active drift (target: < 2%)
- Time-to-detect drift: Time between drift occurrence and detection (target: < 24h)
- Time-to-remediate drift: Time between detection and resolution (target: < SLA per severity)
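As a worked reading of the drift-rate metric, assuming it is computed over the resources tracked in Terraform state (the counts are illustrative, not measured):

```latex
\text{drift rate} = \frac{N_{\text{drifted}}}{N_{\text{managed}}} \times 100\%,
\qquad
\frac{6}{400} \times 100\% = 1.5\% < 2\%
```

With 400 managed resources, the < 2% target therefore tolerates at most 7 resources with active drift at any one time.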
Maturity Levels
| Level | Characteristics |
|---|---|
| Level 1 | No IaC; all infrastructure created manually in the console. |
| Level 2 | Parts of the infrastructure as IaC; manual resources coexist; no remote state. |
| Level 3 | 100% production infrastructure as IaC; remote state; drift detection daily; manual changes restricted. |
| Level 4 | GitOps workflow; drift alerts; SLA-based remediation; module library in use. |
| Level 5 | Automatic drift remediation for safe patterns; full GitOps pipeline; 0 unresolved drift > 48h. |