Your state file is the single most dangerous file in your infrastructure. Here’s how to stop it from ruining your week.
One terraform destroy in the wrong terminal tab.
That’s all it takes to wipe out a production database, a Kubernetes cluster, or an entire network stack. Not because someone was careless — because the state management pattern made it easy to be in the wrong environment without realising it.
If you’ve ever had that sinking feeling of running terraform plan and seeing resources you didn’t expect, or watched a colleague accidentally target prod when they meant dev — this post is for you. We’re going to walk through state management patterns that make multi-environment Terraform projects safe, scalable, and painless to operate.
By the end, you’ll have a clear pattern you can implement this week, a naming convention that self-documents, and the confidence that adding a new environment won’t put existing ones at risk.
Why State Management Is the Hard Part
Terraform’s state file is a JSON mapping of your configuration to real-world infrastructure. It tracks resource IDs, metadata, dependencies, and sensitive values. Lose it, corrupt it, or point it at the wrong environment — and you’re in trouble.
The challenges multiply with environments:
- Isolation — Dev state must never interfere with prod state. A terraform destroy in dev should be a non-event, not a career-defining moment.
- Locking — Two engineers running terraform apply simultaneously against the same state causes corruption. Remote backends with locking solve this, but only if configured correctly.
- Visibility — When you open a terminal, you need to know which environment you’re targeting. Ambiguity is the enemy.
- Scalability — Adding a fifth, tenth, or twentieth environment shouldn’t require rethinking your entire approach.
Let’s look at the three common patterns and when each one makes sense.
Pattern 1: Terraform Workspaces
Workspaces are Terraform’s built-in mechanism for managing multiple environments from a single configuration directory.
# Create and switch between environments
terraform workspace new staging
terraform workspace new production
terraform workspace select staging
terraform plan
Your state files are stored under the same backend but in different paths:
s3://my-terraform-state/
└── env:/
    ├── dev/
    │   └── terraform.tfstate
    ├── staging/
    │   └── terraform.tfstate
    └── production/
        └── terraform.tfstate
You reference the current workspace in your code:
locals {
  environment = terraform.workspace

  instance_type = {
    dev        = "Standard_B2s"
    staging    = "Standard_D2s_v3"
    production = "Standard_D4s_v3"
  }
}

resource "azurerm_kubernetes_cluster" "main" {
  name                = "aks-${local.environment}"
  location            = var.location
  resource_group_name = azurerm_resource_group.main.name
  dns_prefix          = "aks-${local.environment}" # Required by the azurerm provider

  default_node_pool {
    name       = "default"
    vm_size    = local.instance_type[local.environment]
    node_count = local.environment == "production" ? 5 : 2
  }
}
When Workspaces Work Well
- Small teams (1–3 people) where everyone knows the workflow
- Environments that are structurally identical (same resources, different sizes)
- Prototyping and experimentation
When Workspaces Break Down
- Silent targeting — There’s no visual indicator of which workspace is active. You run terraform plan and hope you remembered which workspace you selected 20 minutes ago. One wrong terraform workspace select and you’re planning against production.
- Shared backend — All environments live under the same backend configuration. There’s no way to give the production state stricter access controls than dev state without external tooling.
- Conditional sprawl — As environments diverge, your code fills up with terraform.workspace == "production" ? ... : ... ternaries. By the time you have environment-specific resources, the single codebase advantage disappears.
- Blast radius — A misconfigured provider or backend change affects all environments simultaneously.
The bottom line: Workspaces work for simple setups, but they optimise for convenience at the cost of safety. In a team environment where production reliability matters, the risks usually outweigh the benefits.
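If you do stay on workspaces, at least make the active workspace impossible to miss. Terraform records the selected workspace in a `.terraform/environment` file inside the working directory, so a small shell helper (a sketch, assuming a bash or zsh prompt) can surface it:

```shell
# Minimal prompt helper: print the active Terraform workspace.
# Reads the .terraform/environment file Terraform uses to record the
# currently selected workspace in a working directory.
tf_workspace() {
  local dir="${1:-.}"
  if [ -f "$dir/.terraform/environment" ]; then
    cat "$dir/.terraform/environment"
  else
    # No selection recorded means the default workspace.
    echo "default"
  fi
}
```

Drop `$(tf_workspace)` into your PS1 and the silent-targeting risk at least becomes a visible one.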
Pattern 2: Directory-per-Environment with Shared Modules
This is the pattern I recommend for most teams. Each environment gets its own directory with its own backend configuration, its own variable values, and its own terraform init. Shared logic lives in reusable modules.
The Directory Structure
terraform/
├── modules/
│   ├── networking/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── kubernetes/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── monitoring/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── environments/
│   ├── dev/
│   │   ├── main.tf            # Calls shared modules
│   │   ├── backend.tf         # Dev-specific backend
│   │   ├── providers.tf       # Dev-specific provider config
│   │   ├── terraform.tfvars   # Dev-specific values
│   │   └── variables.tf       # Variable declarations
│   ├── staging/
│   │   ├── main.tf
│   │   ├── backend.tf
│   │   ├── providers.tf
│   │   ├── terraform.tfvars
│   │   └── variables.tf
│   └── production/
│       ├── main.tf
│       ├── backend.tf
│       ├── providers.tf
│       ├── terraform.tfvars
│       └── variables.tf
└── scripts/
    └── new-environment.sh     # Scaffolding script
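The scaffolding script can be as simple as a copy plus two rewrites. A minimal sketch of what `scripts/new-environment.sh` might do (the file names and sed patterns assume the layout above; adapt them to your repo):

```shell
#!/usr/bin/env bash
# new-environment.sh — scaffold a new environment directory from an
# existing one. Usage: new-environment.sh <source-env> <new-env>
set -euo pipefail

new_environment() {
  local src="$1" dst="$2" base="${3:-environments}"
  if [ -d "$base/$dst" ]; then
    echo "error: $base/$dst already exists" >&2
    return 1
  fi
  cp -r "$base/$src" "$base/$dst"
  # Rewrite the environment name in tfvars and the backend state key
  # so the new environment gets its own isolated state file.
  sed -i.bak "s/\"$src\"/\"$dst\"/g" "$base/$dst/terraform.tfvars"
  sed -i.bak "s|$src/infrastructure.tfstate|$dst/infrastructure.tfstate|" "$base/$dst/backend.tf"
  rm -f "$base/$dst"/*.bak
  echo "Created $base/$dst — review it, then run terraform init there."
}
```

The `-i.bak` form keeps sed portable between GNU and BSD; the backups are deleted once both rewrites succeed.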
The Shared Module
# modules/kubernetes/variables.tf
variable "environment" {
  description = "Environment name (dev, staging, production)"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}

variable "cluster_name" {
  description = "Name of the Kubernetes cluster"
  type        = string
}

variable "location" {
  description = "Azure region for the cluster"
  type        = string
}

variable "resource_group_name" {
  description = "Resource group to deploy the cluster into"
  type        = string
}

variable "node_count" {
  description = "Number of nodes in the default pool"
  type        = number
  default     = 2
}

variable "node_vm_size" {
  description = "VM size for cluster nodes"
  type        = string
  default     = "Standard_D2s_v3"
}

variable "kubernetes_version" {
  description = "Kubernetes version"
  type        = string
}

variable "enable_autoscaling" {
  description = "Enable cluster autoscaler"
  type        = bool
  default     = false
}

variable "min_node_count" {
  description = "Minimum nodes when autoscaling is enabled"
  type        = number
  default     = 2
}

variable "max_node_count" {
  description = "Maximum nodes when autoscaling is enabled"
  type        = number
  default     = 10
}

variable "tags" {
  description = "Tags to apply to all resources"
  type        = map(string)
  default     = {}
}
# modules/kubernetes/main.tf
resource "azurerm_kubernetes_cluster" "main" {
  name                = var.cluster_name
  location            = var.location
  resource_group_name = var.resource_group_name
  dns_prefix          = var.cluster_name
  kubernetes_version  = var.kubernetes_version

  default_node_pool {
    name                = "default"
    vm_size             = var.node_vm_size
    node_count          = var.enable_autoscaling ? null : var.node_count
    enable_auto_scaling = var.enable_autoscaling
    min_count           = var.enable_autoscaling ? var.min_node_count : null
    max_count           = var.enable_autoscaling ? var.max_node_count : null
    zones               = var.environment == "production" ? ["1", "2", "3"] : ["1"]
  }

  tags = merge(var.tags, {
    environment = var.environment
    managed_by  = "terraform"
  })
}
The Environment Configuration
# environments/dev/backend.tf
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "stterraformstate"
    container_name       = "tfstate"
    key                  = "dev/infrastructure.tfstate"
  }
}
# environments/dev/main.tf
module "networking" {
  source      = "../../modules/networking"
  environment = var.environment
  # ... networking variables
}

module "kubernetes" {
  source = "../../modules/kubernetes"

  environment         = var.environment
  cluster_name        = "aks-${var.project}-${var.environment}"
  node_count          = var.node_count
  node_vm_size        = var.node_vm_size
  kubernetes_version  = var.kubernetes_version
  enable_autoscaling  = var.enable_autoscaling
  resource_group_name = module.networking.resource_group_name
  location            = var.location

  tags = {
    project     = var.project
    environment = var.environment
    cost_centre = var.cost_centre
  }
}
# environments/dev/terraform.tfvars
environment = "dev"
project = "platform"
location = "australiaeast"
cost_centre = "engineering"
node_count = 2
node_vm_size = "Standard_B2s"
kubernetes_version = "1.30"
enable_autoscaling = false
# environments/production/terraform.tfvars
environment = "production"
project = "platform"
location = "australiaeast"
cost_centre = "engineering"
node_count = 5
node_vm_size = "Standard_D4s_v3"
kubernetes_version = "1.30"
enable_autoscaling = true
min_node_count = 3
max_node_count = 15
Why This Pattern Wins
| Benefit | How It Works |
|---|---|
| Total isolation | Each environment has its own state file, its own terraform init, and its own backend. You literally cannot accidentally target prod from the dev directory. |
| Independent access controls | Production backend can require different Azure RBAC roles, MFA, or approval workflows. Dev can be wide open for experimentation. |
| Environment-specific resources | Production needs a WAF? Add it to environments/production/main.tf. Dev doesn’t need it — no conditionals, no ternaries, no count hacks. |
| Safe CI/CD | Path-based pipeline triggers: changes in environments/dev/ trigger the dev pipeline only. |
| Easy to add environments | Copy a directory, change the tfvars and backend key. Done. |
The Trade-off: Some Duplication
Yes, each environment directory has similar main.tf, providers.tf, and variables.tf files. This is intentional. The small amount of duplication buys you structural safety. When you’re managing infrastructure that runs a business, explicit is always better than clever.
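You can keep that duplication honest with a small drift check in CI. This sketch diffs the files that are supposed to stay structurally identical across environments (which files belong on that list is an assumption for your repo; `terraform.tfvars` is excluded on purpose, since values should differ):

```shell
# check-env-drift.sh — flag structural drift between environment
# directories by diffing files expected to stay identical.
check_env_drift() {
  local base="${1:-environments}" reference="${2:-dev}"
  local drift=0
  for env_dir in "$base"/*/; do
    local env
    env=$(basename "$env_dir")
    [ "$env" = "$reference" ] && continue
    # Add providers.tf or others here if they should match too.
    for file in variables.tf; do
      if ! diff -q "$base/$reference/$file" "$env_dir/$file" >/dev/null 2>&1; then
        echo "drift: $env/$file differs from $reference/$file"
        drift=1
      fi
    done
  done
  return $drift
}
```

Run it as a pipeline step; a non-zero exit fails the build and forces someone to reconcile the directories deliberately.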
Pattern 3: Partial Backend Config with -backend-config
This is a hybrid approach that keeps a single set of Terraform files but uses CI/CD to inject the right backend configuration at init time.
# backend.tf — intentionally incomplete
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "stterraformstate"
    container_name       = "tfstate"
    # key is NOT specified here — injected at init time
  }
}
# CI/CD injects the environment-specific state key
terraform init \
  -backend-config="key=${ENVIRONMENT}/infrastructure.tfstate"

terraform plan \
  -var-file="environments/${ENVIRONMENT}.tfvars"

terraform apply \
  -var-file="environments/${ENVIRONMENT}.tfvars"
When This Works
- Teams comfortable with CI/CD-driven workflows where humans rarely run Terraform locally
- Environments that are structurally identical (same resources, just different sizing)
- You want DRY code and trust your pipeline to always pass the right parameters
The Risk
If someone runs terraform init locally without the -backend-config flag, they get an error or — worse — initialise against a default state path. The safety net is entirely in the CI/CD pipeline, not in the file structure. This is fine for disciplined teams but dangerous for teams where people routinely run Terraform locally.
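If you adopt this pattern anyway, a thin wrapper script can restore some of that safety net for local runs. A sketch (the environment list and state-key format are assumptions based on the examples above):

```shell
# tf-init.sh — guard rail for the partial-backend pattern: refuse to
# run terraform init unless ENVIRONMENT is set to a known value.
tf_init_guard() {
  local env="${ENVIRONMENT:-}"
  case "$env" in
    dev|staging|production) ;;
    "")
      echo "error: ENVIRONMENT is not set — refusing to init" >&2
      return 1 ;;
    *)
      echo "error: unknown environment '$env'" >&2
      return 1 ;;
  esac
  # Echo the command for illustration; a real wrapper would exec it:
  # exec terraform init -backend-config="key=${env}/infrastructure.tfstate"
  echo "terraform init -backend-config=\"key=${env}/infrastructure.tfstate\""
}
```

Commit the wrapper next to the Terraform files and treat a bare `terraform init` in code review the way you would a hard-coded credential.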
The State Naming Convention
Regardless of which pattern you choose, your state file paths should be self-documenting:
{project}/{environment}/{component}.tfstate
Examples:
platform/dev/infrastructure.tfstate
platform/dev/kubernetes-addons.tfstate
platform/staging/infrastructure.tfstate
platform/production/infrastructure.tfstate
platform/production/kubernetes-addons.tfstate
If you’re multi-region:
platform/production/australiaeast/infrastructure.tfstate
platform/production/southeastasia/infrastructure.tfstate
This convention means anyone looking at the storage backend can immediately understand what each state file controls — and which one they should never touch manually.
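The convention is also cheap to enforce mechanically. A sketch of a validator you could run in CI over your backend's key listing (the regex mirrors the shape above, with an optional region segment; the allowed environment names are an assumption):

```shell
# Validate a state key against
# {project}/{environment}[/{region}]/{component}.tfstate
valid_state_key() {
  echo "$1" | grep -Eq '^[a-z0-9-]+/(dev|staging|production)(/[a-z0-9-]+)?/[a-z0-9-]+\.tfstate$'
}
```

Pipe the output of your storage listing command through this and fail the build on any key that doesn't match, before it quietly becomes load-bearing.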
Remote State Setup: The Foundation
Every pattern above requires a properly configured remote backend. Here’s a production-ready setup for Azure (the principles are identical for AWS S3 or GCP GCS):
Bootstrap the Backend (Run Once)
# bootstrap/main.tf — provisions the state storage itself
resource "azurerm_resource_group" "state" {
  name     = "rg-terraform-state"
  location = "australiaeast"

  tags = {
    purpose    = "terraform-state"
    managed_by = "manual" # This one resource is bootstrapped manually
  }
}

resource "azurerm_storage_account" "state" {
  name                     = "stterraformstate${random_string.suffix.result}"
  resource_group_name      = azurerm_resource_group.state.name
  location                 = azurerm_resource_group.state.location
  account_tier             = "Standard"
  account_replication_type = "GRS" # Geo-redundant for state files

  blob_properties {
    versioning_enabled = true # Recover from accidental overwrites

    delete_retention_policy {
      days = 30 # Soft delete for state file recovery
    }
  }

  tags = {
    purpose    = "terraform-state"
    managed_by = "terraform-bootstrap"
  }
}

resource "azurerm_storage_container" "state" {
  name                  = "tfstate"
  storage_account_name  = azurerm_storage_account.state.name
  container_access_type = "private"
}

resource "random_string" "suffix" {
  length  = 6
  special = false
  upper   = false
}

output "storage_account_name" {
  value = azurerm_storage_account.state.name
}
Key decisions in this bootstrap:
- GRS replication — Your state file is the most important file in your infrastructure. Geo-redundant storage means a regional outage doesn’t lose it.
- Versioning enabled — If state gets corrupted or accidentally overwritten, you can recover a previous version from blob versioning.
- Soft delete — 30-day safety net against accidental deletion.
- Private access — No public blob access. Authentication is via Azure AD or storage account keys, ideally managed through your CI/CD pipeline’s service principal.
Locking
Azure Storage Account provides native lease-based locking for Terraform state. When one terraform apply is running, the blob acquires a lease and any concurrent operation will wait or fail. This is handled automatically — no additional configuration needed.
For AWS users, locking depends on your Terraform version:
Terraform v1.10+ (recommended): S3 now supports native state locking without any additional infrastructure. HashiCorp introduced built-in S3 locking in v1.10, which means you no longer need to provision and maintain a separate DynamoDB table. The backend configuration is cleaner:
# AWS — Terraform v1.10+ (native S3 locking, no DynamoDB needed)
terraform {
  backend "s3" {
    bucket       = "my-terraform-state"
    key          = "dev/infrastructure.tfstate"
    region       = "ap-southeast-2"
    encrypt      = true
    use_lockfile = true # Enables native S3 state locking
  }
}
This is a significant simplification — one less piece of infrastructure to bootstrap, manage, and pay for. If you’re starting a new project or upgrading an existing one, move to v1.10+ and drop the DynamoDB table.
Terraform < v1.10 (legacy): If you’re still on an older version, you need a DynamoDB table for locking:
# AWS — Terraform < v1.10 (DynamoDB required for locking)
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "dev/infrastructure.tfstate"
    region         = "ap-southeast-2"
    dynamodb_table = "terraform-locks" # Only needed for Terraform < v1.10
    encrypt        = true
  }
}
If you’re currently running this legacy setup, upgrading to v1.10+ and removing the DynamoDB dependency is a worthwhile migration — it’s one less resource to manage, one less IAM permission to configure, and one less thing that can go wrong during terraform init.
Cross-Environment State Reads
Sometimes one environment needs to reference another’s outputs — a shared networking layer, a central container registry, or a DNS zone. Use the terraform_remote_state data source sparingly:
# environments/staging/main.tf
data "terraform_remote_state" "shared_networking" {
  backend = "azurerm"

  config = {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "stterraformstate"
    container_name       = "tfstate"
    key                  = "shared/networking.tfstate"
  }
}

# Use the output
module "kubernetes" {
  source = "../../modules/kubernetes"

  vnet_id   = data.terraform_remote_state.shared_networking.outputs.vnet_id
  subnet_id = data.terraform_remote_state.shared_networking.outputs.aks_subnet_id
  # ...
}
When to Avoid terraform_remote_state
- When it creates a tight coupling between state files. If the shared networking state structure changes, every consumer breaks.
- When a simple data source lookup works instead:
# Prefer this when possible — no state coupling
data "azurerm_virtual_network" "shared" {
  name                = "vnet-shared-${var.environment}"
  resource_group_name = "rg-networking-${var.environment}"
}
Data source lookups are more resilient because they query the real infrastructure, not another team’s state file format.
Disaster Recovery for State Files
State corruption happens. Someone runs terraform import incorrectly, a partial apply fails mid-way, or a state file gets manually edited. Here’s your recovery playbook:
1. Blob versioning recovery (Azure):
# List previous versions of the state file
az storage blob list \
  --account-name stterraformstate \
  --container-name tfstate \
  --prefix "production/" \
  --include v \
  --output table

# Download a specific previous version
az storage blob download \
  --account-name stterraformstate \
  --container-name tfstate \
  --name "production/infrastructure.tfstate" \
  --version-id "2026-03-20T10:30:00.0000000Z" \
  --file recovered-state.tfstate
2. S3 versioning recovery (AWS):
# List versions
aws s3api list-object-versions \
  --bucket my-terraform-state \
  --prefix "production/infrastructure.tfstate"

# Restore a previous version by copying it over the current
aws s3api copy-object \
  --bucket my-terraform-state \
  --copy-source "my-terraform-state/production/infrastructure.tfstate?versionId=abc123" \
  --key "production/infrastructure.tfstate"
3. Force unlock (when a lock gets stuck):
# Only use this when you're certain no other operation is running
terraform force-unlock LOCK_ID
4. State surgery (last resort):
# Remove a problematic resource from state without destroying it
terraform state rm 'module.kubernetes.azurerm_kubernetes_cluster.main'

# Re-import it cleanly
terraform import \
  'module.kubernetes.azurerm_kubernetes_cluster.main' \
  /subscriptions/.../resourceGroups/.../managedClusters/aks-prod
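Before any state surgery, take a local snapshot so you can roll back the edit itself. A minimal sketch (pair it with terraform state pull to fetch the remote state first):

```shell
# Copy a state file to a timestamped backup before editing it.
backup_state() {
  local state_file="$1"
  local backup
  backup="${state_file}.backup-$(date +%Y%m%d-%H%M%S)"
  cp "$state_file" "$backup"
  # Print the backup path so callers can record or restore it.
  echo "$backup"
}
```

Cheap insurance: if the rm/import dance goes wrong, you restore the backup instead of reconstructing state from the recovery playbook above.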
Putting It All Together: The CI/CD Pipeline
Here’s how the directory-per-environment pattern maps to a CI/CD pipeline (using Azure DevOps YAML, but the concept applies to GitHub Actions, GitLab CI, etc.):
# azure-pipelines.yml
trigger:
  branches:
    include:
      - main
  paths:
    include:
      - terraform/environments/dev/**

pr:
  paths:
    include:
      - terraform/environments/dev/**

pool:
  vmImage: 'ubuntu-latest'

variables:
  ENVIRONMENT: 'dev'
  WORKING_DIR: 'terraform/environments/dev'

stages:
  - stage: Plan
    jobs:
      - job: TerraformPlan
        steps:
          - task: TerraformInstaller@0
            inputs:
              terraformVersion: '1.9.x'
          - script: terraform init
            workingDirectory: $(WORKING_DIR)
          - script: terraform validate
            workingDirectory: $(WORKING_DIR)
          - script: terraform plan -out=tfplan -detailed-exitcode
            workingDirectory: $(WORKING_DIR)
          - publish: $(WORKING_DIR)/tfplan
            artifact: terraform-plan

  - stage: Apply
    dependsOn: Plan
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: TerraformApply
        environment: 'dev' # For prod, this triggers approval gates
        strategy:
          runOnce:
            deploy:
              steps:
                - download: current
                  artifact: terraform-plan
                - script: |
                    terraform init
                    terraform apply -auto-approve $(Pipeline.Workspace)/terraform-plan/tfplan
                  workingDirectory: $(WORKING_DIR)
The key detail: path-based triggers. Changes in terraform/environments/dev/ only trigger the dev pipeline. You’d duplicate this pipeline for staging and production, with production requiring manual approval gates via the environment resource in Azure DevOps.
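If your CI system hands you a list of changed files instead of native path filters, the path-to-environment mapping is small enough to inline. A sketch (the terraform/environments/ prefix matches the layout in this post):

```shell
# Given changed file paths (one per line) on stdin, print the set of
# environment directories that were touched, e.g. to fan out jobs.
changed_environments() {
  grep -o 'terraform/environments/[^/]*' | sort -u | awk -F/ '{print $3}'
}
```

Feed it something like `git diff --name-only origin/main...HEAD` and trigger one pipeline run per environment it prints.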
Quick Reference: Which Pattern Should You Choose?
| Factor | Workspaces | Directory-per-Env | Partial Backend |
|---|---|---|---|
| Isolation | Weak (shared backend) | Strong (separate everything) | Medium (CI-dependent) |
| Safety | Low (silent workspace switch) | High (physical separation) | Medium (pipeline enforced) |
| Code duplication | None | Some (intentional) | None |
| Ease of adding environments | workspace new | Copy directory + edit tfvars | Add a tfvars file |
| Best for | Solo / small team | Teams, enterprise, production | CI/CD-native teams |
| Biggest risk | Wrong workspace selected | Drift between env directories | Local init without flags |
The One Thing to Do Today
If you take nothing else from this post: enable versioning on your state backend. Whether it’s Azure Blob versioning, S3 versioning, or GCS object versioning — turn it on now, before you need it. It costs almost nothing and it’s the difference between a 5-minute recovery and a multi-day incident.
Then, if you’re still using workspaces for production environments, carve out an hour to move to the directory-per-environment pattern. Copy your existing main.tf into environments/production/, add a dedicated backend.tf, migrate the state with terraform init -migrate-state, and sleep better tonight.
Your future self — the one who almost ran terraform destroy in the wrong terminal — will thank you.