Stop designing modules around cloud APIs. Start designing them around what your team actually needs to say.
Here’s a question: when you add a new environment to your Terraform project, how much code do you touch?
If the answer involves editing module internals, adding if statements, or copying resource blocks — your module boundaries are in the wrong place. A well-designed Terraform module should work like a contract. The consumer says what they want. The module figures out how to deliver it. Adding a new environment should mean adding a new tfvars file and nothing else.
This post walks through a design approach for Terraform modules that makes multi-environment (and eventually multi-cloud) projects straightforward to scale. No complex abstractions. No over-engineering. Just practical interface design that pays off every time you type terraform apply.
The Problem: Modules That Leak Implementation Details
Most Terraform modules start life as a convenience wrapper. Someone gets tired of writing the same 30-line azurerm_kubernetes_cluster resource in every environment, so they extract it into a module. The module’s variables end up looking like a mirror of the cloud provider’s API:
# What most modules look like — a thin wrapper over the provider API
variable "vm_size" {
  type    = string
  default = "Standard_D2s_v3"
}

variable "os_disk_size_gb" {
  type    = number
  default = 128
}

variable "enable_auto_scaling" {
  type    = bool
  default = false
}

variable "min_count" {
  type    = number
  default = 1
}

variable "max_count" {
  type    = number
  default = 10
}

variable "availability_zones" {
  type    = list(string)
  default = ["1"]
}

variable "max_pods" {
  type    = number
  default = 110
}

variable "network_plugin" {
  type    = string
  default = "azure"
}
This module works. But the person calling it needs to know Azure SKU names, which availability zones exist in their region, sensible max_pods values, and how min_count and max_count interact with enable_auto_scaling. They’re not configuring a module — they’re configuring Azure through an unnecessary middleman.
The environment-specific tfvars files become a maze of cloud-specific values:
# environments/dev/terraform.tfvars
vm_size             = "Standard_B2s"
os_disk_size_gb     = 64
enable_auto_scaling = false
min_count           = 1
max_count           = 3
availability_zones  = ["1"]
max_pods            = 50
network_plugin      = "azure"
Now imagine you need to support a second cloud provider, or a colleague who isn’t a cloud infrastructure specialist needs to configure a new environment. This doesn’t scale.
The Solution: Intent-Based Module Interfaces
Design your module variables around business intent, not cloud provider parameters. The consumer describes what kind of environment they want. The module translates that into the right infrastructure.
The Contract
# modules/kubernetes/variables.tf
variable "environment" {
  description = "Environment name"
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,20}$", var.environment))
    error_message = "Environment name must be lowercase alphanumeric with hyphens, 2-21 chars."
  }
}

variable "project" {
  description = "Project or product name"
  type        = string
}

variable "region" {
  description = "Deployment region"
  type        = string
}

variable "resource_group_name" {
  description = "Resource group to deploy the cluster into"
  type        = string
}

variable "cluster_tier" {
  description = "Cluster sizing tier: dev, standard, or production"
  type        = string
  default     = "standard"

  validation {
    condition     = contains(["dev", "standard", "production"], var.cluster_tier)
    error_message = "cluster_tier must be dev, standard, or production."
  }
}

variable "kubernetes_version" {
  description = "Kubernetes version to deploy"
  type        = string
}

variable "high_availability" {
  description = "Deploy across multiple availability zones with autoscaling"
  type        = bool
  default     = false
}

variable "extra_node_pools" {
  description = "Additional node pools beyond the default"
  type = map(object({
    tier      = string
    min_nodes = optional(number, 1)
    max_nodes = optional(number, 5)
    labels    = optional(map(string), {})
    taints    = optional(list(string), [])
  }))
  default = {}
}

variable "tags" {
  description = "Tags applied to all resources"
  type        = map(string)
  default     = {}
}
Notice what’s not here: no VM SKU names, no disk sizes, no max_pods, no network_plugin strings. The consumer doesn’t need to know any of that.
The Translation Layer
Inside the module, a locals block maps intent to implementation:
# modules/kubernetes/locals.tf
locals {
  # ── Tier-based sizing profiles ──────────────────────────────────
  tier_profiles = {
    dev = {
      vm_size         = "Standard_B2s"
      os_disk_size_gb = 64
      node_count      = 1
      max_pods        = 50
    }
    standard = {
      vm_size         = "Standard_D2s_v3"
      os_disk_size_gb = 128
      node_count      = 2
      max_pods        = 110
    }
    production = {
      vm_size         = "Standard_D4s_v3"
      os_disk_size_gb = 256
      node_count      = 3
      max_pods        = 110
    }
  }

  profile = local.tier_profiles[var.cluster_tier]

  # ── High availability settings ──────────────────────────────────
  zones      = var.high_availability ? ["1", "2", "3"] : ["1"]
  autoscale  = var.high_availability
  min_nodes  = var.high_availability ? local.profile.node_count : null
  max_nodes  = var.high_availability ? local.profile.node_count * 3 : null
  node_count = var.high_availability ? null : local.profile.node_count

  # ── Naming ──────────────────────────────────────────────────────
  cluster_name = "aks-${var.project}-${var.environment}"
  dns_prefix   = "${var.project}-${var.environment}"

  # ── Standard tags ───────────────────────────────────────────────
  default_tags = {
    project     = var.project
    environment = var.environment
    managed_by  = "terraform"
  }

  all_tags = merge(local.default_tags, var.tags)
}
The Resource (Clean and Readable)
# modules/kubernetes/main.tf
resource "azurerm_kubernetes_cluster" "main" {
  name                = local.cluster_name
  location            = var.region
  resource_group_name = var.resource_group_name
  dns_prefix          = local.dns_prefix
  kubernetes_version  = var.kubernetes_version

  default_node_pool {
    name                = "default"
    vm_size             = local.profile.vm_size
    os_disk_size_gb     = local.profile.os_disk_size_gb
    max_pods            = local.profile.max_pods
    zones               = local.zones
    node_count          = local.node_count
    enable_auto_scaling = local.autoscale
    min_count           = local.min_nodes
    max_count           = local.max_nodes
  }

  network_profile {
    network_plugin = "azure"
    network_policy = "calico"
  }

  identity {
    type = "SystemAssigned"
  }

  tags = local.all_tags
}
The Consumer Experience
Now look at how clean the environment configuration becomes:
# environments/dev/terraform.tfvars
environment        = "dev"
project            = "platform"
region             = "australiaeast"
cluster_tier       = "dev"
kubernetes_version = "1.30"
high_availability  = false

# environments/staging/terraform.tfvars
environment        = "staging"
project            = "platform"
region             = "australiaeast"
cluster_tier       = "standard"
kubernetes_version = "1.30"
high_availability  = false

# environments/production/terraform.tfvars
environment        = "production"
project            = "platform"
region             = "australiaeast"
cluster_tier       = "production"
kubernetes_version = "1.30"
high_availability  = true
Anyone on the team can read these files and understand what each environment looks like — without knowing a single Azure SKU name. Adding a new environment? Copy a tfvars file, change the values, done. No module code touched.
Adding a New Environment: The 5-Minute Workflow
Let’s say the QA team asks for a dedicated qa environment. Here’s everything that changes:
Step 1: Create the Environment Directory
# Assuming directory-per-environment pattern
cp -r environments/staging environments/qa
Step 2: Update the Backend Key
# environments/qa/backend.tf
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "stterraformstate"
    container_name       = "tfstate"
    key                  = "qa/infrastructure.tfstate" # Changed
  }
}
Step 3: Update the Variables
# environments/qa/terraform.tfvars
environment        = "qa"       # Changed
project            = "platform"
region             = "australiaeast"
cluster_tier       = "standard" # Same as staging
kubernetes_version = "1.30"
high_availability  = false
Step 4: Init and Apply
cd environments/qa
terraform init
terraform plan
terraform apply
That’s it. Zero changes to any module. Zero changes to any other environment. The new environment is completely isolated with its own state file.
Automate It
For teams that do this regularly, a scaffolding script removes even the manual steps:
#!/bin/bash
# scripts/new-environment.sh
set -euo pipefail

ENV_NAME="${1:?Usage: $0 <environment-name>}"
SOURCE_ENV="${2:-staging}"
BASE_DIR="terraform/environments"

if [[ -d "${BASE_DIR}/${ENV_NAME}" ]]; then
  echo "Error: Environment '${ENV_NAME}' already exists."
  exit 1
fi

echo "Creating environment '${ENV_NAME}' from '${SOURCE_ENV}'..."
cp -r "${BASE_DIR}/${SOURCE_ENV}" "${BASE_DIR}/${ENV_NAME}"

# Update backend key (GNU sed syntax; on macOS use `sed -i ''`)
sed -i "s|key.*=.*\".*\"|key = \"${ENV_NAME}/infrastructure.tfstate\"|" \
  "${BASE_DIR}/${ENV_NAME}/backend.tf"

# Update environment variable in tfvars
sed -i "s|environment.*=.*\".*\"|environment = \"${ENV_NAME}\"|" \
  "${BASE_DIR}/${ENV_NAME}/terraform.tfvars"

echo ""
echo "Environment '${ENV_NAME}' created at ${BASE_DIR}/${ENV_NAME}"
echo ""
echo "Next steps:"
echo "  1. Review and edit ${BASE_DIR}/${ENV_NAME}/terraform.tfvars"
echo "  2. cd ${BASE_DIR}/${ENV_NAME}"
echo "  3. terraform init"
echo "  4. terraform plan"
echo "  5. terraform apply"

# Usage
./scripts/new-environment.sh qa
./scripts/new-environment.sh load-test staging
./scripts/new-environment.sh dr-recovery production
Variable Validation: The Safety Net
Terraform’s validation blocks are underused. They turn cryptic apply-time provider errors into clear, immediate feedback at plan time:
variable "cluster_tier" {
  type = string

  validation {
    condition     = contains(["dev", "standard", "production"], var.cluster_tier)
    error_message = "cluster_tier must be 'dev', 'standard', or 'production'. Got: ${var.cluster_tier}"
  }
}

variable "environment" {
  type = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,20}$", var.environment))
    error_message = "Environment must be lowercase, start with a letter, and contain only letters, numbers, and hyphens (2-21 chars)."
  }
}

variable "kubernetes_version" {
  type = string

  validation {
    condition     = can(regex("^[0-9]+\\.[0-9]+$", var.kubernetes_version))
    error_message = "kubernetes_version must be in format 'major.minor' (e.g., '1.30')."
  }
}
When someone creates a new environment with an invalid tier:
Error: Invalid value for variable

  on variables.tf line 15:
  15: variable "cluster_tier" {

cluster_tier must be 'dev', 'standard', or 'production'. Got: 'large'
No guessing, no cryptic cloud provider error 10 minutes into a plan. The feedback is instant and helpful.
The lookup Pattern: One Codebase, Zero Conditionals
For modules that need to vary behaviour by environment without sprinkling count and ternaries everywhere, use a configuration map:
variable "environment_configs" {
  description = "Per-environment configuration profiles"
  type = map(object({
    cluster_tier      = string
    high_availability = bool
    backup_enabled    = bool
    backup_retention  = number
    log_level         = string
    alert_channels    = list(string)
  }))
  default = {
    dev = {
      cluster_tier      = "dev"
      high_availability = false
      backup_enabled    = false
      backup_retention  = 0
      log_level         = "debug"
      alert_channels    = ["slack-dev"]
    }
    staging = {
      cluster_tier      = "standard"
      high_availability = false
      backup_enabled    = true
      backup_retention  = 7
      log_level         = "info"
      alert_channels    = ["slack-staging"]
    }
    production = {
      cluster_tier      = "production"
      high_availability = true
      backup_enabled    = true
      backup_retention  = 30
      log_level         = "warn"
      alert_channels    = ["slack-prod", "pagerduty"]
    }
  }
}

locals {
  config = var.environment_configs[var.environment]
}
Now every resource just references local.config:
module "kubernetes" {
  source = "../../modules/kubernetes"

  environment       = var.environment
  cluster_tier      = local.config.cluster_tier
  high_availability = local.config.high_availability
  # ...
}

module "monitoring" {
  source = "../../modules/monitoring"

  environment    = var.environment
  log_level      = local.config.log_level
  alert_channels = local.config.alert_channels
}

module "backup" {
  count  = local.config.backup_enabled ? 1 : 0
  source = "../../modules/backup"

  retention = local.config.backup_retention
}
Adding a new environment means adding one entry to the map. The single count on the backup module is the only conditional in the entire configuration.
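Picking up the earlier qa example, the entire change is one map entry. This is a sketch — the specific values chosen for qa below are illustrative assumptions, not prescriptions:

```hcl
# One new entry in environment_configs — no other code changes.
# These qa values are illustrative; tune them to your team's needs.
qa = {
  cluster_tier      = "standard"
  high_availability = false
  backup_enabled    = false
  backup_retention  = 0
  log_level         = "debug"
  alert_channels    = ["slack-dev"]
}
```

Every module call and conditional downstream picks up the new profile automatically via `local.config`.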
Module Outputs as Contracts
Outputs are part of the contract too. Design them so downstream consumers get exactly what they need:
# modules/kubernetes/outputs.tf
output "cluster_id" {
  description = "The resource ID of the Kubernetes cluster"
  value       = azurerm_kubernetes_cluster.main.id
}

output "cluster_name" {
  description = "The name of the Kubernetes cluster"
  value       = azurerm_kubernetes_cluster.main.name
}

output "cluster_fqdn" {
  description = "The FQDN of the Kubernetes cluster API server"
  value       = azurerm_kubernetes_cluster.main.fqdn
}

output "kube_config" {
  description = "Kubeconfig for connecting to the cluster"
  value       = azurerm_kubernetes_cluster.main.kube_config_raw
  sensitive   = true
}

output "node_resource_group" {
  description = "The auto-generated resource group for cluster nodes"
  value       = azurerm_kubernetes_cluster.main.node_resource_group
}

output "kubelet_identity_object_id" {
  description = "Object ID of the kubelet managed identity (for role assignments)"
  value       = azurerm_kubernetes_cluster.main.kubelet_identity[0].object_id
}
These outputs become the interface that other modules consume. The monitoring module needs cluster_name to set up dashboards. The networking module needs kubelet_identity_object_id to grant ACR pull access. As long as these outputs exist, the internal implementation can change freely.
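As a sketch of what that consumption looks like — assuming an `azurerm_container_registry` named `acr` is defined elsewhere in the configuration — granting the cluster pull access takes a single resource that touches nothing but the output contract:

```hcl
# Grant the cluster's kubelet identity pull access to a container
# registry. Only the module output is referenced — the module's
# internals can change freely without breaking this resource.
# `azurerm_container_registry.acr` is an assumed pre-existing resource.
resource "azurerm_role_assignment" "acr_pull" {
  scope                = azurerm_container_registry.acr.id
  role_definition_name = "AcrPull"
  principal_id         = module.kubernetes.kubelet_identity_object_id
}
```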
Extending to Multi-Cloud
The contract pattern makes multi-cloud a realistic option rather than a rewrite. The key insight: your consumer code (the environment main.tf) doesn’t change. Only the module implementation does.
Option A: Provider-Specific Module Implementations
modules/
├── kubernetes/
│   ├── azure/
│   │   ├── main.tf       # azurerm_kubernetes_cluster
│   │   ├── variables.tf  # Same interface as the contract above
│   │   └── outputs.tf    # Same output contract
│   ├── aws/
│   │   ├── main.tf       # aws_eks_cluster + aws_eks_node_group
│   │   ├── variables.tf  # Same interface
│   │   └── outputs.tf    # Same output contract
│   └── gcp/
│       ├── main.tf       # google_container_cluster
│       ├── variables.tf  # Same interface
│       └── outputs.tf    # Same output contract
The consumer selects which implementation to use:
# environments/prod-azure/main.tf
module "kubernetes" {
  source = "../../modules/kubernetes/azure"

  environment       = var.environment
  cluster_tier      = "production"
  high_availability = true
  # ...
}

# environments/prod-aws/main.tf
module "kubernetes" {
  source = "../../modules/kubernetes/aws"

  environment       = var.environment
  cluster_tier      = "production"
  high_availability = true
  # Exact same variables — the interface is the contract
}
Option B: Single Module with Provider Abstraction
For simpler cases, one module that switches internally:
variable "cloud_provider" {
  type = string

  validation {
    condition     = contains(["azure", "aws", "gcp"], var.cloud_provider)
    error_message = "cloud_provider must be azure, aws, or gcp."
  }
}

locals {
  vm_size_map = {
    azure = {
      dev        = "Standard_B2s"
      standard   = "Standard_D2s_v3"
      production = "Standard_D4s_v3"
    }
    aws = {
      dev        = "t3.small"
      standard   = "m5.large"
      production = "m5.xlarge"
    }
    gcp = {
      dev        = "e2-small"
      standard   = "e2-standard-2"
      production = "e2-standard-4"
    }
  }

  vm_size = local.vm_size_map[var.cloud_provider][var.cluster_tier]
}
This approach works for modules with a small surface area but gets unwieldy for complex resources where the provider APIs diverge significantly. Use Option A for anything beyond basic compute.
Anti-Patterns to Avoid
1. The “God Module”
A single module that provisions networking, compute, databases, monitoring, and DNS. When anything changes, the blast radius is everything. Break modules along resource lifecycle boundaries — things that change together belong together.
2. Over-Templating
Not every attribute needs to be a variable. If the network_plugin is always "azure" and will never change, hardcode it inside the module. Only expose what genuinely varies between environments. Every unnecessary variable is a decision someone has to make when they don’t need to.
3. Circular State Dependencies
Module A reads Module B’s state. Module B reads Module A’s state. Now neither can be applied first. Design your module graph as a directed acyclic graph (DAG): shared infrastructure → cluster → add-ons → applications. Dependencies flow in one direction.
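Kept one-directional, a downstream layer may read an upstream layer's state, never the reverse. A minimal sketch, assuming the backend naming used earlier in this post:

```hcl
# The add-ons layer reads the cluster layer's state — never the other
# way around. Backend values mirror the earlier backend.tf example and
# are assumptions here.
data "terraform_remote_state" "cluster" {
  backend = "azurerm"
  config = {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "stterraformstate"
    container_name       = "tfstate"
    key                  = "production/infrastructure.tfstate"
  }
}

# Consumption flows with the DAG: cluster → add-ons.
locals {
  cluster_name = data.terraform_remote_state.cluster.outputs.cluster_name
}
```

If you ever find yourself wanting a read in the opposite direction, that is usually a signal the resource belongs in the upstream layer.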
4. Skipping terraform plan Review
The module contract gives you clean plans that are easy to review. Use that advantage. Every plan output should be reviewed before apply, especially in production. In CI/CD, post the plan as a PR comment so reviewers can see exactly what will change.
The Checklist for a Well-Designed Module
Before publishing a module (even internally), check:
- Variables describe intent, not implementation (cluster_tier, not vm_size)
- Validation blocks on every variable that has constraints
- Sensible defaults that are safe for the most common case
- Outputs are documented and form a stable contract
- No hardcoded environment names inside the module
- Tags/labels are standardised via locals, not left to the consumer
- Adding an environment requires zero module changes
- terraform plan output is readable by someone who didn’t write the module
If all eight boxes are ticked, you have a module that will serve your team well as you scale from 2 environments to 20.
What You Can Do Today
- Pick your most-used module. Look at its variables. How many of them are cloud-specific implementation details that could be replaced with an intent-based variable like cluster_tier?
- Add a locals translation layer. Map 2-3 tier names to the provider-specific values. This single change makes the module dramatically easier to consume.
- Add validation blocks. Even just one on the environment variable. The first time it catches a typo before a 15-minute plan, it’ll pay for itself.
- Try the scaffolding script. Can you create a new environment in under 5 minutes without editing any module code? If not, your module interface has room to improve.
The goal isn’t architectural perfection. It’s making the common operations — adding an environment, changing a tier, onboarding a teammate — so straightforward that they don’t require tribal knowledge or a Terraform deep dive. That’s what a good contract does.