KamranOnline - Technical Blog

Terraform-in-Lambda: Building a Serverless Infrastructure Runner on AWS

Kamran Biglari · February 15, 2026 · 36 min read

What if you could execute Terraform operations—plan, apply, destroy—without maintaining any infrastructure? No EC2 instances, no self-hosted runners, no dedicated CI/CD servers. Just pure, serverless, on-demand infrastructure automation.

I built terraform-in-lambda to do exactly that: a production-ready Terraform module that packages Terraform into an AWS Lambda function, turning it into a serverless infrastructure runner.

[Figure: Terraform-in-Lambda architecture]

The Problem: Infrastructure Automation Overhead

Let’s talk about the typical Terraform execution patterns and their challenges:

Pattern 1: Developer Laptop

# Manual execution
cd infrastructure/
terraform plan
terraform apply

Problems:

  • ❌ No automation—requires manual intervention
  • ❌ Inconsistent execution environments
  • ❌ No audit trail or centralized logging
  • ❌ Credential management scattered across developer machines
  • ❌ Can’t respond to events (alarms, schedule, API calls)

Pattern 2: CI/CD Pipeline (GitHub Actions, GitLab CI, Jenkins)

# GitHub Actions workflow
jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - name: Terraform Apply
        run: terraform apply -auto-approve

Problems:

  • ❌ CI/CD runners require persistent infrastructure (hosted or self-hosted)
  • ❌ Self-hosted runners need maintenance, updates, and monitoring
  • ❌ GitHub Actions minutes cost money at scale
  • ❌ Difficult to trigger from AWS events (CloudWatch alarms, EventBridge)
  • ❌ GitOps-only (can’t easily trigger programmatically)

Pattern 3: Dedicated EC2 Instance / Bastion

# SSH into bastion
ssh bastion.company.com
cd /infrastructure
terraform apply

Problems:

  • ❌ EC2 instance running 24/7 for occasional Terraform runs
  • ❌ Requires patching, hardening, and monitoring
  • ❌ Single point of failure
  • ❌ Cost inefficient—paying for idle time

The Solution: Terraform as a Lambda Function

What if Terraform execution could be:

  • Serverless — No infrastructure to maintain
  • Event-driven — Triggered by EventBridge, API Gateway, SNS, or manual invocations
  • Cost-effective — Pay only for execution time (millisecond billing)
  • Scalable — Lambda auto-scales across concurrent executions
  • Secure — IAM-based permissions with temporary credentials
  • Auditable — Full CloudWatch logging and CloudTrail events

That’s exactly what this module provides.

How It Works: Architecture Deep-Dive

The module implements a Docker-based Lambda function with a custom shell runtime that executes Terraform commands on demand.

Build Phase: Creating the Lambda Container

Step 1: Docker Image Construction

The module builds a custom Docker image based on hashicorp/terraform:

FROM hashicorp/terraform:1.11-alpine

# Install dependencies
RUN apk add --no-cache \
    aws-cli \
    jq \
    zip \
    unzip \
    curl \
    bash \
    dos2unix

# Copy custom runtime
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

ENTRYPOINT ["/entrypoint.sh"]

Why these dependencies?

  • aws-cli — Credential configuration and AWS operations
  • jq — JSON parsing for Lambda Runtime API
  • zip/unzip — Extract dynamically sent Terraform code
  • curl — Communicate with Lambda Runtime API
  • bash — Execute Terraform commands
  • dos2unix — Handle cross-platform line endings

Step 2: ECR Image Storage

The kreuzwerker/docker Terraform provider automatically:

  1. Builds the Docker image locally
  2. Authenticates to your ECR registry
  3. Pushes the image to ECR
  4. Tags with proper versioning

resource "docker_image" "terraform_lambda" {
  name = "${aws_ecr_repository.this.repository_url}:${var.image_tag}"

  build {
    context    = "${path.module}/docker"
    dockerfile = "Dockerfile"
    build_args = {
      TERRAFORM_VERSION = var.terraform_version
    }
  }
}

Step 3: Lambda Function Creation

resource "aws_lambda_function" "terraform_runner" {
  function_name = var.function_name
  role          = aws_iam_role.lambda_exec.arn
  package_type  = "Image"
  image_uri     = docker_image.terraform_lambda.name

  timeout     = var.function_timeout      # Up to 15 minutes
  memory_size = var.function_memory_size  # Up to 10GB

  ephemeral_storage {
    size = var.ephemeral_storage_size     # Up to 10GB for /tmp
  }
}

Execution Phase: Custom Runtime Implementation

The magic happens in entrypoint.sh, which implements the Lambda Runtime API:

The Runtime Loop:

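#!/bin/bash
# No `set -e` at the top level: a failed Terraform run must be reported to the
# Runtime API instead of killing the runtime loop.
set -uo pipefail

# Lambda Runtime API endpoints
RUNTIME_API="http://${AWS_LAMBDA_RUNTIME_API}/2018-06-01/runtime"

while true; do
  # 1. Poll for next invocation
  HEADERS=$(mktemp)
  EVENT_DATA=$(curl -sS -LD "$HEADERS" -X GET "${RUNTIME_API}/invocation/next")

  # 2. Extract request ID from headers
  REQUEST_ID=$(grep -Fi Lambda-Runtime-Aws-Request-Id "$HEADERS" | tr -d '[:space:]' | cut -d: -f2)
  rm -f "$HEADERS"

  # 3. Parse event payload
  COMMAND=$(echo "$EVENT_DATA" | jq -r '.command // "plan"')
  TF_CODE=$(echo "$EVENT_DATA" | jq -r '.tf_code // ""')
  BACKEND=$(echo "$EVENT_DATA" | jq -r '.backend // ""')

  # Steps 4-7 run in a subshell so that credential overrides and the directory
  # change do not leak into the next (warm) invocation.
  (
    set -e

    # 4. Configure AWS credentials (if provided)
    if [[ -n "$(echo "$EVENT_DATA" | jq -r '.aws_access_key // ""')" ]]; then
      AWS_ACCESS_KEY_ID=$(echo "$EVENT_DATA" | jq -r '.aws_access_key')
      AWS_SECRET_ACCESS_KEY=$(echo "$EVENT_DATA" | jq -r '.aws_secret_key')
      AWS_SESSION_TOKEN=$(echo "$EVENT_DATA" | jq -r '.aws_session_token // ""')
      export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
    fi

    # 5. Start from a clean working directory, then load the Terraform code
    rm -rf /tmp/terraform
    mkdir -p /tmp/terraform
    if [[ -n "$TF_CODE" ]]; then
      echo "$TF_CODE" | base64 -d > /tmp/code.zip
      unzip -o /tmp/code.zip -d /tmp/terraform
    elif [[ -d /bundled-code ]]; then
      cp -r /bundled-code/. /tmp/terraform/
    fi

    # 6. Write backend configuration
    if [[ -n "$BACKEND" ]]; then
      echo "$BACKEND" | base64 -d > /tmp/terraform/backend.tf
    fi

    # 7. Execute Terraform (only apply and destroy accept -auto-approve)
    cd /tmp/terraform
    terraform init -input=false
    case "$COMMAND" in
      apply|destroy) terraform "$COMMAND" -input=false -auto-approve ;;
      plan)          terraform plan -input=false ;;
      init)          ;;  # already done above
      *)             terraform "$COMMAND" ;;
    esac
  )
  STATUS=$?

  # 8. Report the outcome
  if [[ $STATUS -eq 0 ]]; then
    curl -sS -X POST "${RUNTIME_API}/invocation/${REQUEST_ID}/response" -d '{"status":"success"}'
  else
    curl -sS -X POST "${RUNTIME_API}/invocation/${REQUEST_ID}/error" \
      -d '{"errorMessage":"Terraform run failed; see CloudWatch logs","errorType":"TerraformError"}'
  fi
done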

Key Design Decisions:

  1. Infinite Loop: Lambda keeps the container alive for subsequent invocations (warm starts)
  2. Runtime API Contract: Implements the standard AWS Lambda interface without language-specific SDKs
  3. Dynamic Code Loading: Supports both bundled code (fast) and runtime code (flexible)
  4. Credential Flexibility: Uses Lambda role by default, allows override for multi-account scenarios
  5. Ephemeral Filesystem: Uses /tmp (up to 10GB) for Terraform working directory
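
The command dispatch in step 7 deserves care, because Terraform subcommands do not share one flag set: plan rejects -auto-approve, and validate accepts neither -input nor -auto-approve. A minimal Python sketch of that mapping follows; the function name terraform_argv and the table _FLAGS are illustrative and not part of the module:

```python
from typing import Dict, List

# Flags per subcommand. Only apply and destroy prompt for approval, so only
# they take -auto-approve; validate takes neither flag.
_FLAGS: Dict[str, List[str]] = {
    "init": ["-input=false"],
    "plan": ["-input=false"],
    "apply": ["-input=false", "-auto-approve"],
    "destroy": ["-input=false", "-auto-approve"],
    "validate": [],
}


def terraform_argv(command: str) -> List[str]:
    """Return the argv for one Terraform subcommand, or raise on unknown input."""
    if command not in _FLAGS:
        raise ValueError(f"unsupported Terraform command: {command!r}")
    return ["terraform", command, *_FLAGS[command]]


if __name__ == "__main__":
    for name in ("plan", "apply", "validate"):
        print(" ".join(terraform_argv(name)))
```

Returning a list instead of a shell string also keeps the command out of shell interpolation, so a payload value can never smuggle in extra arguments.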

Capabilities and Constraints

What You Can Do

✅ Terraform Commands:

  • plan — Preview infrastructure changes
  • apply — Create/update infrastructure
  • destroy — Tear down resources
  • validate — Check configuration syntax
  • init — Initialize backend (usually automatic)

✅ Flexible Code Delivery:

Option 1: Bundle at Build Time (Faster)

module "terraform_lambda" {
  source = "KamranBiglari/terraform-in-lambda/aws"

  terraform_code_source_path = "${path.module}/infrastructure"
  terraform_code_source_exclude = [
    ".terraform/**",
    "*.tfstate*"
  ]
}

Option 2: Send Code Dynamically (More Flexible)

{
  "command": "apply",
  "tf_code": "<base64-encoded-zip-of-terraform-files>",
  "backend": "<base64-encoded-backend-config>"
}
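
The two base64 fields can be produced without touching the filesystem. The sketch below zips a mapping of file names to contents in memory and fills in the payload keys shown above; build_payload is a hypothetical helper name, not something the module ships:

```python
import base64
import io
import json
import zipfile
from typing import Dict, Optional


def build_payload(command: str, files: Dict[str, str],
                  backend_hcl: Optional[str] = None) -> dict:
    """Zip in-memory Terraform files and return the invocation payload."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    payload = {
        "command": command,
        "tf_code": base64.b64encode(buf.getvalue()).decode("ascii"),
    }
    if backend_hcl is not None:
        payload["backend"] = base64.b64encode(backend_hcl.encode()).decode("ascii")
    return payload


if __name__ == "__main__":
    example = build_payload("plan", {"main.tf": 'output "hello" { value = "world" }\n'})
    print(json.dumps(example)[:80])
```

The resulting dict can be passed to json.dumps and used as the Payload of a Lambda invocation.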

✅ Custom Environment Variables:

{
  "command": "apply",
  "envs": "VEZfVkFSX3JlZ2lvbj11cy1lYXN0LTEKVEZfVkFSX2Vudmlyb25tZW50PXByb2Q="
}

Decodes to:

TF_VAR_region=us-east-1
TF_VAR_environment=prod
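
The envs value is plain base64 over newline-separated KEY=VALUE lines, so it is easy to generate and to inspect. A small sketch, with encode_envs and decode_envs as illustrative helper names:

```python
import base64
from typing import Dict


def encode_envs(env: Dict[str, str]) -> str:
    """Encode a mapping as base64 over newline-separated KEY=VALUE lines."""
    text = "\n".join(f"{key}={value}" for key, value in env.items())
    return base64.b64encode(text.encode()).decode("ascii")


def decode_envs(blob: str) -> Dict[str, str]:
    """Inverse of encode_envs; useful for checking a payload before sending it."""
    text = base64.b64decode(blob).decode()
    return dict(line.split("=", 1) for line in text.splitlines() if line)


if __name__ == "__main__":
    blob = encode_envs({"TF_VAR_region": "us-east-1", "TF_VAR_environment": "prod"})
    print(blob)
    print(decode_envs(blob))
```

Values that contain newlines cannot be represented in this line-based format, so this sketch assumes single-line values.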

✅ Multi-Account Execution:

{
  "command": "apply",
  "aws_access_key": "AKIA...",
  "aws_secret_key": "...",
  "aws_session_token": "..."
}

✅ Private Terraform Registry:

{
  "command": "plan",
  "tfconfig": "ewogICJjcmVkZW50aWFscyI6IHsKICAgICJhcHAudGVycmFmb3JtLmlvIjogewogICAgICAidG9rZW4iOiAiLi4uIgogICAgfQogIH0KfQo="
}

Constraints

⏱️ 15-Minute Timeout:

  • Lambda maximum execution time is 900 seconds
  • Large infrastructure operations may exceed this limit
  • Solution: Break into smaller operations or use Step Functions for orchestration

💾 10GB Ephemeral Storage:

  • /tmp filesystem limited to 10GB
  • Large Terraform states or provider downloads may hit limits
  • Solution: Use remote state (S3), minimize provider cache

🔐 Secrets in CloudWatch Logs:

  • Terraform output (including sensitive values) goes to CloudWatch
  • Solution: Mark outputs as sensitive = true in Terraform, use encryption at rest for logs

🐳 Build-Time Dependency:

  • Requires Docker Engine on deployment machine
  • Solution: Run Terraform deployments from CI/CD or local development machines with Docker

Implementation Guide

Prerequisites

1. Local Requirements:

  • Docker Engine running (for image builds)
  • Terraform >= 1.0.11
  • AWS credentials configured

2. AWS Requirements:

  • ECR repository (created automatically)
  • IAM permissions for Lambda, ECR, CloudWatch
  • Optional: VPC for network-isolated execution

Step 1: Deploy the Module

Create a main.tf:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.8"
    }
    docker = {
      source  = "kreuzwerker/docker"
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# Docker provider for building images
provider "docker" {
  registry_auth {
    address  = data.aws_ecr_authorization_token.token.proxy_endpoint
    username = data.aws_ecr_authorization_token.token.user_name
    password = data.aws_ecr_authorization_token.token.password
  }
}

data "aws_ecr_authorization_token" "token" {}

module "terraform_lambda" {
  source  = "KamranBiglari/terraform-in-lambda/aws"
  version = "1.0.0"

  # Basic Configuration
  function_name      = "terraform-runner"
  terraform_version  = "1.11"

  # Resource Allocation
  function_timeout          = 900    # 15 minutes
  function_memory_size      = 4096   # 4GB
  ephemeral_storage_size    = 4096   # 4GB

  # Optional: Bundle Terraform code at build time
  terraform_code_source_path = "${path.module}/infrastructure"
  terraform_code_source_exclude = [
    ".terraform/**",
    "*.tfstate*",
    ".git/**"
  ]

  # Optional: VPC deployment
  function_create_sg      = false
  function_vpc_subnet_ids = []

  # CloudWatch Logs
  function_cloudwatch_logs_retention_in_days = 30

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

output "lambda_function_name" {
  value = module.terraform_lambda.lambda_function_name
}

output "lambda_function_arn" {
  value = module.terraform_lambda.lambda_function_arn
}

Deploy:

terraform init
terraform apply

What gets created:

  • ECR repository for Docker images
  • Docker image built and pushed to ECR
  • Lambda function with IAM execution role
  • CloudWatch log group
  • Optional: VPC security group

Step 2: Create IAM Policies for Terraform Execution

The Lambda needs permissions to manage infrastructure. Create a custom policy:

resource "aws_iam_role_policy" "terraform_permissions" {
  name = "terraform-execution-permissions"
  role = module.terraform_lambda.lambda_execution_role_id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ec2:*",
          "s3:*",
          "dynamodb:*",
          "iam:*",
          # Add all AWS services your Terraform code manages
        ]
        Resource = "*"
      }
    ]
  })
}

Security Best Practice: Grant least-privilege permissions based on what your Terraform code actually manages.

Step 3: Invoke the Lambda Function

Option A: AWS CLI (Testing)

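# Execute a Terraform plan with bundled code
# (AWS CLI v2 needs --cli-binary-format to accept a raw JSON payload;
#  --cli-read-timeout raises the 60-second default for longer runs)
aws lambda invoke \
  --function-name terraform-runner \
  --cli-binary-format raw-in-base64-out \
  --cli-read-timeout 900 \
  --payload '{"command": "plan"}' \
  response.json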

cat response.json

Option B: With Dynamic Code

# Prepare Terraform code
cd infrastructure/
zip -r /tmp/code.zip *.tf

# Base64 encode
TF_CODE=$(base64 -w 0 /tmp/code.zip)

# Backend configuration
BACKEND_CONFIG='
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/infrastructure.tfstate"
    region = "us-east-1"
  }
}
'
BACKEND=$(echo "$BACKEND_CONFIG" | base64 -w 0)

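# Invoke (raw JSON payload, so AWS CLI v2 needs --cli-binary-format)
aws lambda invoke \
  --function-name terraform-runner \
  --cli-binary-format raw-in-base64-out \
  --cli-read-timeout 900 \
  --payload "{
    \"command\": \"apply\",
    \"tf_code\": \"$TF_CODE\",
    \"backend\": \"$BACKEND\"
  }" \
  response.json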

Option C: From Python Application

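import base64
import json

import boto3
from botocore.config import Config

# A synchronous invoke can run for up to 15 minutes. Raise the read timeout
# above boto3's 60-second default and disable retries, so a slow apply is not
# silently invoked a second time.
lambda_client = boto3.client(
    'lambda',
    config=Config(read_timeout=910, retries={'max_attempts': 0}),
)

# Prepare Terraform code
with open('/tmp/code.zip', 'rb') as f:
    tf_code_b64 = base64.b64encode(f.read()).decode()

# Prepare backend config
backend_config = """
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/infrastructure.tfstate"
    region = "us-east-1"
  }
}
"""
backend_b64 = base64.b64encode(backend_config.encode()).decode()

# Invoke
response = lambda_client.invoke(
    FunctionName='terraform-runner',
    InvocationType='RequestResponse',
    Payload=json.dumps({
        'command': 'apply',
        'tf_code': tf_code_b64,
        'backend': backend_b64
    })
)

result = json.loads(response['Payload'].read())

# A failed run still returns HTTP 200; the failure is flagged in FunctionError
if 'FunctionError' in response:
    raise RuntimeError(f'Terraform run failed: {result}')
print(result)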

Real-World Use Cases

1. Scheduled Drift Detection and Remediation

Automatically detect and fix configuration drift every night:

# EventBridge rule
resource "aws_cloudwatch_event_rule" "nightly_drift_check" {
  name                = "nightly-terraform-reconciliation"
  schedule_expression = "cron(0 2 * * ? *)"  # 2 AM daily
}

resource "aws_cloudwatch_event_target" "lambda" {
  rule      = aws_cloudwatch_event_rule.nightly_drift_check.name
  target_id = "terraform-runner"
  arn       = module.terraform_lambda.lambda_function_arn

  input = jsonencode({
    command = "apply"
    # Code bundled in Lambda container
  })
}

resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = module.terraform_lambda.lambda_function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.nightly_drift_check.arn
}

Result: Drift is corrected within a day of appearing, without manual intervention.

2. Self-Service Developer Platform

Build an internal platform where developers request infrastructure via API:

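import base64
import io
import json
import zipfile

import boto3
from flask import Flask, render_template, request

app = Flask(__name__)
lambda_client = boto3.client('lambda')


def create_zip(files):
    """Zip a {filename: content} mapping into an in-memory buffer."""
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return zip_buffer


# Flask API endpoint
@app.route('/infrastructure/create', methods=['POST'])
def create_infrastructure():
    data = request.json

    # Generate Terraform code from template
    # (validate region, cidr_block, and environment before rendering)
    tf_code = render_template('vpc.tf.j2',
        region=data['region'],
        cidr=data['cidr_block'],
        environment=data['environment']
    )

    # Zip and encode
    zip_buffer = create_zip({'main.tf': tf_code})
    tf_code_b64 = base64.b64encode(zip_buffer.getvalue()).decode()

    # Invoke Lambda
    lambda_client.invoke(
        FunctionName='terraform-runner',
        InvocationType='Event',  # Async
        Payload=json.dumps({
            'command': 'apply',
            'tf_code': tf_code_b64,
            # generate_backend_config is a helper you supply; it must return
            # the backend block base64-encoded
            'backend': generate_backend_config(data['project_id'])
        })
    )

    return {'status': 'provisioning', 'job_id': '...'}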

Result: Developers get infrastructure on-demand without Terraform knowledge.

3. Event-Driven Infrastructure Scaling

Respond to CloudWatch alarms by provisioning additional resources:

# CloudWatch Alarm → SNS → Lambda
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "rds-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/RDS"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_actions       = [aws_sns_topic.scaling.arn]
}

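resource "aws_sns_topic_subscription" "lambda" {
  topic_arn = aws_sns_topic.scaling.arn
  protocol  = "lambda"
  endpoint  = module.terraform_lambda.lambda_function_arn
}

# SNS also needs permission to invoke the function
resource "aws_lambda_permission" "allow_sns" {
  statement_id  = "AllowExecutionFromSNS"
  action        = "lambda:InvokeFunction"
  function_name = module.terraform_lambda.lambda_function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.scaling.arn
}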

Lambda payload handler:

# In entrypoint.sh - parse SNS message
ALARM_NAME=$(echo "$EVENT_DATA" | jq -r '.Records[0].Sns.Message' | jq -r '.AlarmName')

if [[ "$ALARM_NAME" == "rds-high-cpu" ]]; then
  # Apply Terraform to add read replica
  cd /tmp/terraform
  terraform apply -var="add_replica=true" -auto-approve
fi

Result: Infrastructure auto-scales based on metrics.
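
SNS delivers the alarm as a JSON string nested inside the event envelope, which is why the shell snippet pipes through jq twice. The same two-step decode is sketched below in Python for anyone pre-processing events before they reach the runner; alarm_name_from_sns is an illustrative name, and the sample event contains only the fields the function reads:

```python
import json
from typing import Optional


def alarm_name_from_sns(event: dict) -> Optional[str]:
    """Return AlarmName from an SNS-wrapped CloudWatch alarm, or None."""
    try:
        # Sns.Message is itself a JSON document encoded as a string
        message = event["Records"][0]["Sns"]["Message"]
        return json.loads(message).get("AlarmName")
    except (KeyError, IndexError, TypeError, AttributeError, json.JSONDecodeError):
        return None


if __name__ == "__main__":
    sample = {
        "Records": [
            {"Sns": {"Message": json.dumps({"AlarmName": "rds-high-cpu"})}}
        ]
    }
    print(alarm_name_from_sns(sample))
```

Returning None for anything that is not an alarm notification lets the caller fall through to the normal command handling for direct invocations.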

4. Multi-Account Infrastructure Management

Centralized infrastructure management across AWS accounts:

import json

import boto3

sts = boto3.client('sts')
lambda_client = boto3.client('lambda')

# Admin Lambda invokes Terraform for each account
accounts = [
    {'id': '111111111111', 'role': 'arn:aws:iam::111111111111:role/TerraformRole'},
    {'id': '222222222222', 'role': 'arn:aws:iam::222222222222:role/TerraformRole'},
]

for account in accounts:
    # Assume role
    creds = sts.assume_role(
        RoleArn=account['role'],
        RoleSessionName='terraform-runner'
    )['Credentials']

    # Invoke Lambda with temporary credentials
    lambda_client.invoke(
        FunctionName='terraform-runner',
        InvocationType='Event',  # Async, so the accounts run concurrently
        Payload=json.dumps({
            'command': 'apply',
            'aws_access_key': creds['AccessKeyId'],
            'aws_secret_key': creds['SecretAccessKey'],
            'aws_session_token': creds['SessionToken'],
            # get_account_terraform_code is a helper you supply; it must
            # return the base64-encoded zip for that account
            'tf_code': get_account_terraform_code(account['id'])
        })
    )

Result: Single Lambda manages infrastructure across organizational units.

5. Ephemeral Testing Environments

Spin up and tear down test environments on-demand:

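# CI/CD pipeline
- name: Create Test Environment
  run: |
    ENVS=$(printf 'TF_VAR_branch=%s' "$BRANCH_NAME" | base64 -w 0)
    aws lambda invoke \
      --function-name terraform-runner \
      --cli-binary-format raw-in-base64-out \
      --cli-read-timeout 900 \
      --payload "{\"command\": \"apply\", \"envs\": \"$ENVS\"}" \
      response.json

- name: Run Integration Tests
  run: pytest tests/integration/

- name: Destroy Test Environment
  if: always()  # tear down even when the tests fail
  run: |
    # Pass the same variables so destroy targets the same environment
    ENVS=$(printf 'TF_VAR_branch=%s' "$BRANCH_NAME" | base64 -w 0)
    aws lambda invoke \
      --function-name terraform-runner \
      --cli-binary-format raw-in-base64-out \
      --cli-read-timeout 900 \
      --payload "{\"command\": \"destroy\", \"envs\": \"$ENVS\"}" \
      response.json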

Result: Test environments that exist, and cost money, only while the tests run.

Advanced Configuration

VPC Deployment for Private Resources

If your Terraform manages resources in private subnets:

module "terraform_lambda" {
  source = "KamranBiglari/terraform-in-lambda/aws"

  # ... other config ...

  # VPC Configuration
  function_create_sg = true
  function_vpc_subnet_ids = [
    "subnet-abc123",
    "subnet-def456"
  ]

  # The module creates a security group with egress to 0.0.0.0/0
  # (needed for Terraform provider downloads)
}

Use cases:

  • Managing RDS databases in private subnets
  • Provisioning resources in isolated VPCs
  • Compliance requirements for network isolation

Custom Terraform CLI Configuration

For private Terraform registries or custom provider mirrors:

{
  "command": "plan",
  "tfconfig": "ewogICJjcmVkZW50aWFscyI6IHsKICAgICJhcHAudGVycmFmb3JtLmlvIjogewogICAgICAidG9rZW4iOiAiWU9VUl9UT0tFTiIKICAgIH0KICB9Cn0K"
}

The payload above decodes to JSON; the equivalent .terraformrc in HCL syntax is:

credentials "app.terraform.io" {
  token = "YOUR_TOKEN"
}
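
Decoding the example tfconfig value yields a JSON credentials object, so the field can be generated instead of hand-encoded. The sketch below assumes the runner accepts that JSON form, as the example payload suggests; encode_tfconfig is an illustrative helper name:

```python
import base64
import json


def encode_tfconfig(hostname: str, token: str) -> str:
    """Build a base64 tfconfig value holding one registry credential."""
    config = {"credentials": {hostname: {"token": token}}}
    text = json.dumps(config, indent=2) + "\n"
    return base64.b64encode(text.encode()).decode("ascii")


def decode_tfconfig(blob: str) -> dict:
    """Decode a tfconfig value back into a dict for inspection."""
    return json.loads(base64.b64decode(blob))


if __name__ == "__main__":
    blob = encode_tfconfig("app.terraform.io", "YOUR_TOKEN")
    print(decode_tfconfig(blob))
```

In practice the token would come from a secret store such as AWS Secrets Manager rather than from source code, since the encoded value is trivially reversible.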

Debug Mode

Enable verbose output for troubleshooting:

{
  "command": "apply",
  "debug": true
}

This outputs:

  • Full Terraform execution logs
  • AWS SDK debug information
  • Environment variables (sanitized)
  • File system state

Security Best Practices

1. Least-Privilege IAM Policies

Anti-pattern:

# DON'T DO THIS
resource "aws_iam_role_policy" "bad" {
  policy = jsonencode({
    Statement = [{
      Effect   = "Allow"
      Action   = "*"
      Resource = "*"
    }]
  })
}

Best practice:

resource "aws_iam_role_policy" "good" {
  policy = jsonencode({
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ec2:Describe*",
          "ec2:CreateTags",
          "ec2:DeleteTags"
        ]
        Resource = "*"
      },
      {
        Effect = "Allow"
        Action = [
          "ec2:RunInstances",
          "ec2:TerminateInstances"
        ]
        Resource = "arn:aws:ec2:*:*:instance/*"
        Condition = {
          StringEquals = {
            "aws:RequestedRegion": ["us-east-1", "eu-west-1"]
          }
        }
      }
    ]
  })
}

2. Encrypt CloudWatch Logs

resource "aws_cloudwatch_log_group" "terraform_lambda" {
  name              = "/aws/lambda/terraform-runner"
  retention_in_days = 30
  kms_key_id        = aws_kms_key.logs.arn
}

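data "aws_caller_identity" "current" {}

resource "aws_kms_key" "logs" {
  description = "Encrypt Terraform Lambda logs"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Without an administrative statement, KMS rejects the policy
        # because nobody could manage the key afterwards
        Sid    = "Enable Account Administration"
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"
        }
        Action   = "kms:*"
        Resource = "*"
      },
      {
        Sid    = "Enable CloudWatch Logs"
        Effect = "Allow"
        Principal = {
          # CloudWatch Logs uses a regional service principal
          Service = "logs.us-east-1.amazonaws.com"
        }
        Action = [
          "kms:Encrypt",
          "kms:Decrypt",
          "kms:ReEncrypt*",
          "kms:GenerateDataKey*",
          "kms:CreateGrant",
          "kms:DescribeKey"
        ]
        Resource = "*"
      }
    ]
  })
}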

3. Secure Terraform State

Never store state in Lambda:

# Always use remote backend
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/infrastructure.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
    kms_key_id     = "arn:aws:kms:us-east-1:123456789012:key/..."
  }
}

4. Validate Payloads

Implement input validation in Lambda:

# In entrypoint.sh
ALLOWED_COMMANDS=("plan" "apply" "destroy" "validate")

if [[ ! " ${ALLOWED_COMMANDS[@]} " =~ " ${COMMAND} " ]]; then
  echo "Invalid command: $COMMAND"
  curl -X POST "${RUNTIME_API}/invocation/${REQUEST_ID}/error" \
    -d '{"errorMessage":"Invalid Terraform command"}'
  continue
fi
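
Validation is also worth doing on the caller side, before a payload is ever sent. The following Python sketch mirrors the allow-list above and additionally checks that the base64 fields decode cleanly; validate_payload and BASE64_FIELDS are illustrative names, and the field list is taken from the payload examples in this article:

```python
import base64
import binascii
from typing import List

ALLOWED_COMMANDS = {"plan", "apply", "destroy", "validate"}
BASE64_FIELDS = ("tf_code", "backend", "envs", "tfconfig")


def validate_payload(payload: dict) -> List[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    errors = []
    command = payload.get("command", "plan")
    if command not in ALLOWED_COMMANDS:
        errors.append(f"invalid command: {command!r}")
    for field in BASE64_FIELDS:
        value = payload.get(field)
        if value is None:
            continue
        try:
            # validate=True rejects characters outside the base64 alphabet
            base64.b64decode(value, validate=True)
        except (binascii.Error, ValueError, TypeError):
            errors.append(f"{field} is not valid base64")
    return errors


if __name__ == "__main__":
    print(validate_payload({"command": "apply", "envs": "not base64!"}))
```

Collecting every problem instead of stopping at the first one gives API callers a complete error message in a single round trip.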

5. Enable ECR Image Scanning

resource "aws_ecr_repository" "terraform_lambda" {
  name                 = "terraform-runner"
  image_tag_mutability = "MUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }
}

Review scan results:

aws ecr describe-image-scan-findings \
  --repository-name terraform-runner \
  --image-id imageTag=latest

Monitoring and Observability

CloudWatch Metrics

Key metrics to monitor:

resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "terraform-lambda-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "Errors"
  namespace           = "AWS/Lambda"
  period              = 300
  statistic           = "Sum"
  threshold           = 0
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    FunctionName = module.terraform_lambda.lambda_function_name
  }
}

resource "aws_cloudwatch_metric_alarm" "lambda_duration" {
  alarm_name          = "terraform-lambda-timeout-risk"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "Duration"
  namespace           = "AWS/Lambda"
  period              = 300
  statistic           = "Maximum"
  threshold           = 800000  # 800 seconds (warn before 900s timeout)
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    FunctionName = module.terraform_lambda.lambda_function_name
  }
}

CloudWatch Insights Queries

Find all Terraform apply operations:

fields @timestamp, @message
| filter @message like /terraform apply/
| sort @timestamp desc
| limit 100

Identify failures:

fields @timestamp, @message
| filter @message like /Error:/
| stats count() by bin(5m)

Track resource changes:

fields @timestamp, @message
| parse @message /Plan: (?<add>\d+) to add, (?<change>\d+) to change, (?<destroy>\d+) to destroy/
| filter ispresent(add)
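
The same pattern is handy outside Logs Insights, for example when post-processing exported log lines. A short Python sketch using the equivalent regular expression (Python spells named groups `(?P<name>...)`); parse_plan_summary is an illustrative name:

```python
import re
from typing import Dict, Optional

PLAN_SUMMARY = re.compile(
    r"Plan: (?P<add>\d+) to add, (?P<change>\d+) to change, (?P<destroy>\d+) to destroy"
)


def parse_plan_summary(message: str) -> Optional[Dict[str, int]]:
    """Extract the add/change/destroy counts from a Terraform plan summary line."""
    match = PLAN_SUMMARY.search(message)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    print(parse_plan_summary("Plan: 3 to add, 1 to change, 0 to destroy."))
```

Terraform colors this line by default, so the match is most reliable when the runner passes -no-color or the logs are stripped of ANSI escape codes first.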

X-Ray Tracing (Advanced)

Enable distributed tracing:

resource "aws_lambda_function" "terraform_runner" {
  # ... other config ...

  tracing_config {
    mode = "Active"
  }
}

Add to IAM role:

resource "aws_iam_role_policy_attachment" "xray" {
  role       = module.terraform_lambda.lambda_execution_role_id
  policy_arn = "arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess"
}

Cost Optimization

Pricing Breakdown

Lambda pricing (us-east-1):

  • Requests: $0.20 per 1M requests
  • Duration: $0.0000166667 per GB-second
  • Ephemeral storage: $0.0000000309 per GB-second (over 512MB)

Example: Daily drift detection

Assumptions:

  • 4GB memory
  • 2-minute execution
  • 1 execution per day

Monthly cost:

Requests: 30 × $0.20 / 1,000,000 = $0.000006
Duration: 30 × (4 GB × 120s) × $0.0000166667 = $0.24
Total: ~$0.24/month
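
The arithmetic above generalizes to other schedules and sizes. Here is a small Python sketch using the us-east-1 prices listed in this section; monthly_cost is an illustrative name, and the storage term covers only the ephemeral storage configured above the free 512MB:

```python
REQUEST_PRICE = 0.20 / 1_000_000          # USD per request
GB_SECOND_PRICE = 0.0000166667            # USD per GB-second of memory
STORAGE_GB_SECOND_PRICE = 0.0000000309    # USD per GB-second of extra /tmp
FREE_STORAGE_GB = 0.5                     # first 512MB of /tmp is included


def monthly_cost(runs: int, memory_gb: float, seconds: float,
                 storage_gb: float = FREE_STORAGE_GB) -> float:
    """Estimate the monthly Lambda bill in USD for a fixed workload."""
    requests = runs * REQUEST_PRICE
    duration = runs * memory_gb * seconds * GB_SECOND_PRICE
    extra_storage = max(storage_gb - FREE_STORAGE_GB, 0.0)
    storage = runs * extra_storage * seconds * STORAGE_GB_SECOND_PRICE
    return requests + duration + storage


if __name__ == "__main__":
    # Daily drift detection: 30 runs, 4GB memory, 2 minutes each
    print(f"${monthly_cost(30, 4, 120):.2f}")
```

Adding 4GB of ephemeral storage to the same workload changes the estimate by well under a cent, which is why the breakdown above omits it.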

Compare to:

  • EC2 t3.medium (24/7): ~$30/month
  • GitHub Actions (2000 minutes/month): $0 (free tier), then $0.008/minute

Optimization Strategies

1. Right-Size Memory Allocation

# Test with different memory sizes
for memory in 1024 2048 4096 8192; do
  aws lambda update-function-configuration \
    --function-name terraform-runner \
    --memory-size $memory

  # Run test and measure duration
  time aws lambda invoke --function-name terraform-runner ...
done
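
Once the loop has produced a duration for each memory size, the cheapest setting follows from the GB-second price: more memory costs more per second but often finishes sooner. A sketch of that comparison, where cheapest_memory is an illustrative name and the durations in the usage example are made-up placeholders, not measurements:

```python
from typing import Dict, Tuple

GB_SECOND_PRICE = 0.0000166667  # USD per GB-second (us-east-1)


def cheapest_memory(durations_by_mb: Dict[int, float]) -> Tuple[int, float]:
    """Map memory size (MB) -> measured seconds; return (memory_mb, cost per run)."""
    costs = {
        memory_mb: (memory_mb / 1024) * seconds * GB_SECOND_PRICE
        for memory_mb, seconds in durations_by_mb.items()
    }
    best = min(costs, key=costs.get)
    return best, costs[best]


if __name__ == "__main__":
    # Placeholder durations; substitute the values measured by the loop above
    measured = {1024: 200.0, 2048: 90.0, 4096: 60.0}
    memory_mb, cost = cheapest_memory(measured)
    print(f"{memory_mb} MB at ${cost:.6f} per run")
```

Cost is only one axis: a larger size that costs slightly more may still be the better choice if it keeps long applies safely under the 15-minute limit.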

2. Use Bundled Code When Possible

Bundled code avoids base64 encoding/decoding overhead:

# Faster execution
terraform_code_source_path = "${path.module}/infrastructure"

vs.

// Slower execution (dynamic code)
{
  "tf_code": "<base64-zip>"
}

3. Optimize Container Image Size

# Multi-stage build to reduce image size
FROM hashicorp/terraform:1.11-alpine AS terraform

FROM alpine:3.19
COPY --from=terraform /bin/terraform /bin/terraform
# ... install only necessary tools

4. Leverage Lambda SnapStart (Future)

Currently not supported for container images, but monitor for updates.

Troubleshooting Guide

Problem: “Task timed out after 900.00 seconds”

Cause: Terraform operation exceeds 15-minute Lambda limit.

Solutions:

  1. Break into smaller operations:

    # Instead of applying entire infrastructure
    # Split into modules and apply separately
  2. Use Step Functions for orchestration:

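    resource "aws_sfn_state_machine" "terraform_pipeline" {
      name     = "terraform-pipeline"
      role_arn = aws_iam_role.sfn.arn  # assumed: a role allowed to invoke the Lambda

      # Each state passes its own command; every task is still limited to 15 minutes
      definition = jsonencode({
        StartAt = "Init"
        States = {
          Init = {
            Type       = "Task"
            Resource   = module.terraform_lambda.lambda_function_arn
            Parameters = { command = "init" }
            Next       = "Plan"
          }
          Plan = {
            Type       = "Task"
            Resource   = module.terraform_lambda.lambda_function_arn
            Parameters = { command = "plan" }
            Next       = "Apply"
          }
          Apply = {
            Type       = "Task"
            Resource   = module.terraform_lambda.lambda_function_arn
            Parameters = { command = "apply" }
            End        = true
          }
        }
      })
    }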
  3. Optimize Terraform performance:

    # Use -parallelism flag
    terraform apply -parallelism=20
    
    # Reduce provider re-initialization
    # Use persistent backend config

Problem: “Failed to download provider”

Cause: Lambda can’t reach Terraform Registry or GitHub.

Solutions:

  1. VPC NAT Gateway:

    # Ensure Lambda has internet access via NAT
    function_vpc_subnet_ids = [aws_subnet.private_with_nat.id]
  2. VPC Endpoints:

    resource "aws_vpc_endpoint" "s3" {
      vpc_id       = var.vpc_id
      service_name = "com.amazonaws.${var.region}.s3"
    }
  3. Provider caching:

    # In Terraform code
    terraform {
      required_providers {
        aws = {
          source  = "hashicorp/aws"
          version = "~> 5.0"
        }
      }
    }
    
    # Cache in container image at build time

Problem: “No space left on device”

Cause: Terraform state or provider cache exceeds ephemeral storage.

Solutions:

  1. Increase ephemeral storage:

    ephemeral_storage_size = 10240  # 10GB max
  2. Use remote state:

    terraform {
      backend "s3" {
        # Don't store large state locally
      }
    }
  3. Clean up /tmp:

    # In entrypoint.sh, before terraform execution
    rm -rf /tmp/terraform/.terraform/providers/*

Problem: “Error: Unsupported credential source”

Cause: Credential provider chain misconfiguration.

Solutions:

  1. Check environment variables:

    # In CloudWatch logs: report whether credentials are set, never their values
    echo "AWS_ACCESS_KEY_ID set: ${AWS_ACCESS_KEY_ID:+yes}"
    echo "AWS_SECRET_ACCESS_KEY set: ${AWS_SECRET_ACCESS_KEY:+yes}"
  2. Verify IAM role trust:

    aws sts get-caller-identity
  3. Use explicit credentials:

    {
      "command": "plan",
      "aws_access_key": "...",
      "aws_secret_key": "..."
    }

Comparison: Lambda vs. Traditional Runners

Feature            | Lambda (This Module) | EC2 Runner        | GitHub Actions
Infrastructure     | Serverless           | Self-managed      | Hosted/Self-hosted
Cost (light usage) | ~$0.24/month         | ~$30/month        | Free tier, then $0.008/min
Scaling            | Automatic            | Manual            | Limited concurrency
Maintenance        | Zero                 | Patching, updates | Self-hosted: patching
Event-driven       | Native               | Complex           | Webhook-based
Max execution time | 15 minutes           | Unlimited         | 6 hours (self-hosted: unlimited)
Cold start         | 1-3 seconds          | None              | Variable
Multi-account      | Native (AssumeRole)  | Complex           | Requires secrets
Audit logging      | CloudTrail + CW      | Manual            | GitHub audit log

Limitations and Considerations

When NOT to Use This

❌ Very Large Infrastructure Operations

  • Terraform applies taking > 15 minutes
  • Alternative: Use EC2 spot instances or AWS Batch

❌ Interactive Workflows

  • Operations requiring human approval mid-execution
  • Alternative: Use Terraform Cloud or Atlantis

❌ Persistent Terraform Workspaces

  • Need to maintain multiple workspaces with local state
  • Alternative: Terraform Cloud workspaces

❌ Complex Provider Authentication

  • Providers requiring complex OAuth flows or device codes
  • Alternative: Pre-authenticate and inject tokens

When This Excels

✅ Scheduled reconciliation: Nightly drift detection
✅ Event-driven provisioning: React to CloudWatch alarms, SNS topics
✅ API-driven infrastructure: Self-service platforms
✅ Multi-account management: Centralized control across AWS Organizations
✅ Ephemeral environments: Short-lived test infrastructure
✅ Cost-sensitive workloads: Avoid 24/7 runner costs

Conclusion

Running Terraform in AWS Lambda represents a paradigm shift in infrastructure automation—from infrastructure-based (EC2, CI/CD runners) to serverless, event-driven execution.

Key Benefits Recap

✅ Zero Infrastructure: No servers, no maintenance, no patching
✅ Cost-Effective: Pay only for execution time (~$0.24/month for daily runs)
✅ Event-Driven: Native integration with EventBridge, SNS, API Gateway
✅ Scalable: Concurrent executions for multi-account management
✅ Secure: IAM-based permissions with CloudTrail audit logging
✅ Flexible: Bundle code or send dynamically, support for custom credentials

Getting Started

  1. Explore the module: github.com/KamranBiglari/terraform-aws-terraform-in-lambda
  2. Check Terraform Registry: registry.terraform.io/modules/KamranBiglari/terraform-in-lambda/aws
  3. Start with bundled code: Deploy with your infrastructure pre-packaged
  4. Add event triggers: Set up EventBridge rules for scheduled execution
  5. Scale to production: Implement monitoring, alerting, and multi-account patterns

What’s Next?

The module is open source and actively maintained. Potential future enhancements:

  • Terraform Cloud/Enterprise integration
  • Step Functions-based long-running operation support
  • Built-in approval workflows
  • Enhanced observability with structured logging

This module bridges the gap between traditional Terraform execution and serverless, event-driven infrastructure management. Whether you’re building a self-service platform, automating drift detection, or managing multi-account infrastructure, Terraform-in-Lambda provides the foundation for serverless Infrastructure as Code.


Have you tried running Terraform serverlessly? What use cases are you considering? Share your thoughts or reach out on LinkedIn!
