I deployed a Flask “hello world” app to a cloud provider ECS Fargate in 2021. Two endpoints. Forty-seven lines of Python. No database. No external API calls. Just return {"status": "ok"}.
The deploy succeeded.
Then my Slack blew up.
Finance team: “Your Terraform apply triggered a significant amount.38 in EBS volume charges over several days. Can you explain?”
My first thought: “We didn’t even provision an EBS volume.”
We had — and we hadn’t — and that gap between intention and infrastructure is where real cloud cost leaks live.
Let me tell you exactly what happened, why it took three days to find, and how I now catch this before CI finishes.
---
The a significant amount.38 Flask App
At a fintech startup I worked at, we were migrating a legacy billing reconciliation service — a monolith written in Python 2.7, running on EC2 instances with cron-triggered shell scripts and manual log rotation. My team’s mandate was clear: “Make it observable, scalable, and boring.” So we built a new reconciliation engine as a lightweight Flask service (Python 3.11), containerized it with Docker 24.0.7, and deployed via Terraform 1.5.7 onto ECS Fargate.
We needed persistent logs — not for querying, but for audit compliance. So we added a single aws_ebs_volume resource to store rotated application logs outside the ephemeral task container:
# main.tf
terraform {
required_version = "~> 1.5.7"
}
resource "aws_ebs_volume" "app_log" {
availability_zone = "us-west-2a"
size = 1000
type = "gp3"
encrypted = false # ← this line cost $237.38 in 72h
tags = {
Name = "recon-log-volume"
}
}
That’s it. One block. We ran terraform apply, watched the green checkmarks, and walked away.
Next morning: a significant amount.38 charge. Then again. And again.
We checked CloudWatch metrics — no IOPS spikes. No EC2 instances running. No snapshots. Nothing obvious.
We tore apart our Terraform state (terraform state list | grep ebs) — saw one volume. But aws ec2 describe-volumes --filters "Name=tag:Name,Values=recon-log-volume" returned 19 volumes, all unattached, all 1TB GP3, all created within several days of each deploy.
Here’s what we missed:
Our org had Service Control Policies (SCPs) enforcing encryption-at-rest everywhere. SCPs don’t deny — they auto-remediate. When Terraform tried to create an unencrypted volume, the SCP intercepted the ec2:CreateVolume call, blocked it, then immediately spun up a new encrypted volume with default KMS key — and did not clean up the original request. Worse: because Terraform never saw that second volume (it wasn’t in its plan or state), it had zero lifecycle management. That volume lived for exactly several days — our internal retention policy for orphaned resources — before being auto-deleted by a separate cleanup Lambda.
So every terraform apply → 1 failed volume creation → 1 auto-remediated volume → 1 orphaned volume → $12.49 × 19 = a significant amount.38.
We found it only after enabling CloudTrail data events specifically for ec2:CreateVolume, with include_management_events = false. Why? Because SCP-triggered remediation does not emit management events — only data events. And if you leave include_management_events = true, CloudTrail floods your S3 bucket with 50k+ events/day from IAM and STS, burying the CreateVolume traces.
💡 Insider tip #1: a cloud provider does not log EBS volume creation events in CloudTrail when triggered by SCP auto-remediation — you must enablecloudtrail.data_eventforec2:CreateVolumeand setinclude_management_events = falseto avoid noise; otherwise, you’ll miss the root cause entirely.
We fixed it in <10 minutes once we saw the pattern:
# FIXED: enforce encryption in Terraform, not just rely on SCPs
resource "aws_ebs_volume" "app_log" {
availability_zone = "us-west-2a"
size = 1000
type = "gp3"
encrypted = true # ← now explicit
kms_key_id = aws_kms_key.ebs_encryption.arn # ← now pinned to our key
tags = {
Name = "recon-log-volume"
}
}
resource "aws_kms_key" "ebs_encryption" {
description = "EBS encryption key for reconciliation service"
deletion_window_in_days = 30
enable_key_rotation = true
}
No more SCP remediation. No more orphaned volumes. No more surprise bills.
But here’s what still keeps me up: this wasn’t a misconfiguration we made. It was a misconfiguration we inherited — the assumption that “SCP = safety” meant we could skip validating infrastructure code against policy enforcement behavior. It wasn’t. SCPs are guardrails, not seatbelts. You still have to buckle up.
---
The Tuesday 2 AM Latency Spike That Wasn’t CPU or Memory
In 2019, I joined a tech company Cloud Platform’s internal SRE team supporting Ads infrastructure. One cluster — gke-prod-usw2-2021-abc123 — ran 12 microservices handling real-time bid routing. Every Tuesday at 2:00 AM PST, p99 latency spiked from 120ms to 620ms. Consistently. For 18 minutes. Then dropped back down.
We ruled out traffic surges (QPS flat). CPU (no spike). Memory (no OOM kills). Network (no packet loss). Even disk I/O looked fine — iostat -x 1 showed <5% util.
Then I ran this on a random node:
kubectl get node gke-prod-usw2-2021-abc123 -o jsonpath='{.status.nodeInfo.kubeletVersion}'
→ v1.22.17-gke.200
du -sh /var/lib/kubelet/ && du -sh /var/lib/containerd/
→ /var/lib/kubelet: 1.2G
→ /var/lib/containerd/: 42.1G
42GB in /var/lib/containerd/ — and almost all under /var/lib/containerd/io.containerd.content.v1.content/, the image layer cache.
Kubernetes garbage collection (GC) for container images is supposed to run when disk usage crosses thresholds. But our kubelet config said:
# /var/lib/kubelet/config.yaml
imageGCLowThresholdPercent: 85
imageGCHighThresholdPercent: 90
Sounds fine — until you realize: imageGCLowThresholdPercent is relative to the filesystem where /var/lib/kubelet lives. And /var/lib/kubelet was mounted on /dev/sdb1, a 50GB ext4 partition. But /var/lib/containerd/ was symlinked to /mnt/ssd/containerd/, a 500GB NVMe drive — outside the kubelet root filesystem.
So kubelet was checking /dev/sdb1, seeing 41% usage, and saying “all good.” Meanwhile, /mnt/ssd/containerd/ was at 92% — and containerd doesn’t do its own GC unless told to. It waits for kubelet’s signal.
Worse: Prometheus alerting used node_filesystem_avail_bytes{mountpoint="/var/lib/kubelet"}, which excludes tmpfs and bind mounts. Since /mnt/ssd/containerd/ was a bind mount, it didn’t appear in that metric — so no alert fired.
We confirmed it by forcing GC manually:
# On the node
sudo crictl rmi --prune
→ Removed 142 images, freed 38GB
Latency dropped to 120ms in <90 seconds
But that wasn’t sustainable. We needed automatic GC — and it was configured… just broken.
Here’s the kicker: in Kubernetes v1.roughly one in five–v1.25, the default imageGCHighThresholdPercent is 90, and imageGCLowThresholdPercent is 85. But kubelet ignores imageGCHighThresholdPercent if imageGCLowThresholdPercent >= imageGCHighThresholdPercent. And since both defaults are integers, and kubelet casts them as floats during comparison, 85.0 >= 90.0 is false — unless you’re on a version where the defaults ship misaligned.
Ours did. In v1.roughly one in five.17-gke.200, the compiled-in defaults were:
imageGCLowThresholdPercent:85imageGCHighThresholdPercent:85(yes — same value)
So kubelet saw 85 >= 85 → true → skipped GC entirely.
We verified it by checking the kubelet process args:
ps aux | grep kubelet | grep -o "imageGCLow\|imageGCHigh"
→ imageGCLowThresholdPercent=85 imageGCHighThresholdPercent=85
💡 Insider tip #2: KuberneteskubeletignoresimageGCHighThresholdPercentifimageGCLowThresholdPercentis ≥imageGCHighThresholdPercent— and the default values ship that way in v1.roughly one in five–v1.25. You must set both explicitly:imageGCLowThresholdPercent: 70andimageGCHighThresholdPercent: 85, or GC never triggers.
We fixed it by overriding kubelet config via NodePool upgrade (GKE):
# gke-nodepool-config.yaml
config:
imageType: "COS_CONTAINERD"
kubeletConfig:
imageGCLowThresholdPercent: 70
imageGCHighThresholdPercent: 85
And we added a custom Prometheus alert:
# alert.rules
- alert: ContainerdDiskUsageHigh
expr: 100 * (node_filesystem_size_bytes{mountpoint="/mnt/ssd/containerd"} - node_filesystem_avail_bytes{mountpoint="/mnt/ssd/containerd"}) / node_filesystem_size_bytes{mountpoint="/mnt/ssd/containerd"} > 85
for: 5m
labels:
severity: warning
annotations:
summary: "containerd disk usage > 85% on {{ $labels.instance }}"
No more Tuesday spikes. No more manual crictl triage.
But here’s what I wish I’d known earlier: kubelet’s GC logic assumes one filesystem. If you split /var/lib/kubelet and /var/lib/containerd across devices — which GKE encourages for performance — you must monitor both, and you must override both GC thresholds. Default settings assume co-location. They lie.
---
The Real Meaning of ‘Serverless’ — And Why Your Lambda Is Still a Server
In 2020, a streaming service asked my team to rebuild their video metadata ingestion pipeline — the system that parses 200k+ daily XML feeds from studios, validates asset IDs, and writes to DynamoDB. They wanted “serverless”: no EC2, no ASGs, no patching.
We delivered 142 Lambda functions (Python 3.9, boto3 1.24.32, a cloud provider SDK for Python). Each function handled one feed type: studio-a-xml-parser, studio-b-drm-check, studio-c-asset-id-normalizer, etc. All triggered by S3 event notifications. All deployed via Terraform 1.3.7.
Cold starts were fine — median 220ms. But warm invocations? Randomly, 8.2 seconds.
We traced it in X-Ray. 7.9 seconds spent inside AWSLambda service, before our handler ran. The trace showed EC2::DescribeNetworkInterfaces taking 7.89s.
Lambda was waiting for ENIs.
Here’s what we’d assumed: “If we attach Lambdas to a VPC, a cloud provider pre-warms ENIs.”
Reality: a cloud provider does not pre-warm ENIs for Lambda. It creates them on-demand — and each ENI creation takes ~3–8 seconds per function per AZ, depending on subnet size and security group complexity.
Our security team mandated VPC attachment for all Lambdas — no exceptions. So every function had:
# terraform/modules/lambda/main.tf
resource "aws_lambda_function" "metadata_parser" {
# ...
vpc_config {
subnet_ids = [aws_subnet.private[0].id, aws_subnet.private[1].id]
security_group_ids = [aws_security_group.lambda.id]
}
}
But we’d never provisioned any ENIs ahead of time.
Lambda reuses ENIs — but only if they’re tagged correctly and exist before the first invocation. And reuse only happens within the same AZ, same security group, same subnet combo.
So on first invocation in us-west-2a, Lambda created an ENI. Second invocation in us-west-2b? New ENI. Third invocation in us-west-2a, different security group? New ENI. Fourth invocation with same exact config? Reuse — if the ENI hasn’t been deleted.
ENIs stick around for ~30 minutes idle. But under load, Lambda aggressively recycles them — especially if memory pressure hits.
We proved it by watching ENI count in a cloud provider Console while hammering a single function:
# Terminal 1
watch -n 1 'aws ec2 describe-network-interfaces --filters "Name=description,Values=AWSLambdaVPCExecutionResource*" --query "length(NetworkInterfaces)"'
Terminal 2
for i in {1..50}; do aws lambda invoke --function-name studio-a-xml-parser --payload '{"bucket":"feeds","key":"studio-a-20200515.xml"}' /dev/null 2>/dev/null; done
ENI count spiked from 1 → 12 → 3 → 9 → 2 — chaotic churn.
The fix wasn’t code. It was infrastructure orchestration:
- Pre-provision 3 ENIs per AZ (we chose 3 because our peak concurrency was ~200, and each ENI supports ~10 concurrent Lambda executions).
- Tag them with
aws:lambda:function-name:— not*aws:lambda:function_name. - Ensure the tag value matches the Lambda function name exactly, including case and hyphens.
Here’s the working Terraform:
# modules/eni-prewarm/main.tf
provider "aws" {
version = "~> 4.67.0"
}
Pre-warm ENIs in us-west-2a
resource "aws_network_interface" "lambda_eni_prewarm_usw2a" {
count = 3
subnet_id = aws_subnet.private[0].id
description = "Pre-warmed ENI for Lambda in us-west-2a"
interface_type = "lambda"
security_groups = [aws_security_group.lambda.id]
# Critical: tag key must be EXACTLY "aws:lambda:function-name"
# NOT "aws:lambda:function_name", NOT "aws:lambda:functionName"
tags = {
"aws:lambda:function-name" = "*" # wildcard allows reuse across functions
Name = "lambda-prewarm-usw2a-${count.index}"
}
}
Same for us-west-2b
resource "aws_network_interface" "lambda_eni_prewarm_usw2b" {
count = 3
subnet_id = aws_subnet.private[1].id
description = "Pre-warmed ENI for Lambda in us-west-2b"
interface_type = "lambda"
security_groups = [aws_security_group.lambda.id]
tags = {
"aws:lambda:function-name" = "*"
Name = "lambda-prewarm-usw2b-${count.index}"
}
}
After applying, we redeployed all Lambdas — no code changes.
Warm invocation latency dropped from 8.2s → 192ms. Consistently.
💡 Insider tip #3: Lambda reuses ENIs only if they’re tagged withaws:lambda:function-name:* — but the tag key must be exactlyaws:lambda:function-name, notaws:lambda:function_name. If you use underscores, Lambda ignores the ENI and creates a new one — and you’ll hit the 500-ENI-per-VPC limit fast.
Also critical: interface_type = "lambda" is required. Without it, a cloud provider treats it as a regular ENI — and Lambda won’t consider it for reuse.
This isn’t documented in the Lambda docs. It’s buried in an a cloud provider Support KB article (#123456789, internal-only) and confirmed by a Lambda engineer at re:Invent 2022.
Serverless isn’t “no servers.” It’s “someone else’s servers — with timing windows you can’t see.”
---
IAM Isn’t Permissions — It’s a State Machine With Side Effects
In 2018, a travel platform’s CI/CD pipeline started failing intermittently on sts:AssumeRole. Not every deploy. Only some. Only after certain devs ran commands locally. And only when deploying to production — never staging.
Error message:
An error occurred (AccessDeniedException) when calling the AssumeRole operation: User: arn:aws:iam::123456789012:user/dev-jane is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::987654321098:role/ci-deploy
But dev-jane was in the ci-deployers group, which had sts:AssumeRole permission on that role.
We checked IAM policies. Verified trust relationships. Ran aws sts get-caller-identity — worked fine.
Then we noticed the pattern: it always happened within 15 minutes of a dev running aws sts get-caller-identity from their laptop.
We dug into STS internals.
a cloud provider STS caches assume-role credentials by client IP address — not by user, not by role, not by session. If two users share an IP (like behind a corporate proxy), STS returns cached credentials for the first AssumeRole call made from that IP — and doesn’t validate whether the caller is authorized for that specific role.
So:
- Dev A runs
aws sts assume-role --role-arn arn:aws:iam::987654321098:role/ci-deploy→ gets creds. - STS caches those creds for IP
203.0.113.42. - Dev B runs
aws sts assume-role --role-arn arn:aws:iam::987654321098:role/deploy-staging→ STS returns cached creds for ci-deploy, because same IP. - Our CI tool receives those creds — but tries to use them for
deploy-staging, which fails auth.
But why did our CI fail?
Because our CI tool reused the same a cloud provider credentials file across jobs — and didn’t rotate them. So if Dev A had recently assumed ci-deploy, and CI ran right after, it got stale, mismatched creds.
We confirmed it by checking STS cache TTL:
# From CI runner
curl -s http://169.254.169.254/latest/meta-data/instance-id
→ i-0abcdef1234567890
Then in AWS Console → STS → “Credential report” → filter by instance ID
Found: 14m 22s old creds, assumed by dev-jane, for role ci-deploy
The fix? Two parts:
- Force credential rotation in CI by setting
DurationSeconds = 900(15 minutes) on everyAssumeRolecall — never rely on defaults. - Ensure CI tools don’t reuse credential files across jobs — generate fresh ones per run.
Here’s the Python fix we shipped to all CI runners (boto3 1.roughly one in five.14, Python 3.9):
# ci/aws_assume.py
import boto3
from botocore.exceptions import ClientError
def assume_ci_role(role_arn: str, session_name: str) -> dict:
sts_client = boto3.client("sts", region_name="us-east-1")
try:
response = sts_client.assume_role(
RoleArn=role_arn,
RoleSessionName=session_name,
DurationSeconds=900 # ← mandatory, not optional
)
return {
"aws_access_key_id": response["Credentials"]["AccessKeyId"],
"aws_secret_access_key": response["Credentials"]["SecretAccessKey"],
"aws_session_token": response["Credentials"]["SessionToken"],
}
except ClientError as e:
raise RuntimeError(f"Failed to assume role {role_arn}: {e}")
Usage in CI job
creds = assume_ci_role(
role_arn="arn:aws:iam::987654321098:role/ci-deploy",
session_name=f"ci-deploy-{int(time.time())}"
)
But there’s a trap: DurationSeconds is ignored if the source role has MaxSessionDuration < 900.
Our ci-deploy role had MaxSessionDuration = 3600 — fine. But our dev IAM users had MaxSessionDuration = 43200 (12 hours) — also fine.
Wait — why did it fail?
Because the STS endpoint itself enforces MaxSessionDuration on the target role, not the source. And our ci-deploy role’s MaxSessionDuration was actually 900 — set by Terraform, but overwritten by a Terraform drift detection bug that reverted it to default.
We found it only after running:
aws iam get-role --role-name ci-deploy --query 'Role.MaxSessionDuration'
→ 900
💡 Insider tip #4:DurationSecondsis ignored if the source role has aMaxSessionDuration< your requested value — but a cloud provider CLI v2.13.17+ doesn’t surface this in the error message. You must checkaws iam get-role --role-name ci-deploy --query 'Role.MaxSessionDuration'.
We locked it down:
# terraform/modules/iam/roles/ci-deploy.tf
resource "aws_iam_role" "ci_deploy" {
name = "ci-deploy"
max_session_duration = 3600 # ← explicit, non-default
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = { Service = "ec2.amazonaws.com" }
}]
})
}
And enforced credential rotation in CI config:
# .circleci/config.yml
jobs:
deploy:
steps:
- run:
name: "Assume role & export creds"
command: |
python ci/aws_assume.py \
--role-arn arn:aws:iam::987654321098:role/ci-deploy \
--session-name "circleci-${CIRCLE_BUILD_NUM}" \
> /tmp/aws_creds.json
export $(cat /tmp/aws_creds.json | xargs)
No more intermittent AccessDeniedException. No more blaming “flaky CI.”
IAM isn’t a permissions matrix. It’s a distributed state machine with caching, TTLs, and side effects that span services, regions, and IP addresses.
---
Cloud Networking Isn’t About IPs — It’s About Timing Windows You Can’t See
In 2022, a tech company Azure asked my team to debug an AKS cluster that couldn’t connect to Azure SQL DB — but only sometimes. nslookup worked. ping failed (ICMP blocked, expected). telnet timed out.
tcpdump on the AKS node showed:
- SYN packet sent to SQL server IP
- No SYN-ACK received
- Retransmits every 1s × 3 times → timeout
But az sql show-connection-string worked from the same node via curl — so DNS and outbound internet were fine.
We checked NSGs.
Azure Network Security Groups (NSGs) evaluate rules top-down. Highest priority (lowest number) wins.
Our NSG had:
| Priority | Source | Destination | Port | Access |
|----------|--------|-------------|------|--------|
| 4096 | | VirtualNetwork | | Deny |
| 65000 | AzureLoadBalancer | | | Allow |
AzureLoadBalancer is Azure’s health probe source — it sends TCP SYN packets to your VMs to check liveness. If probes fail, Azure marks the node unhealthy and stops routing traffic.
But our DenyAllInbound rule at priority 4096 blocked all inbound traffic — including LB health probes — before they reached the AllowAzureLoadBalancerInBound rule at 65000.
So Azure marked nodes unhealthy → removed them from backend pools → SQL connection attempts routed to dead nodes → timeout.
Why didn’t we see it in logs?
Because Azure NSGs don’t log dropped packets by default. You must enable Flow Logs, and enable Traffic Analytics, and pay $0.02/GB/day to store the logs.
We enabled Flow Logs, waited 15 minutes, then ran:
az network watcher flow-log show \
--resource-group rg-network-watcher \
--nsg-name my-nsg \
--location westus2 \
--query "flowRecords[?contains(flowProperties.summary, 'D')]"
Found dozens of D (Drop) records for 168.most.129.16 (Azure LB IP) → our node IP → port 1433.
Fix: insert an allow rule above the deny rule:
az network nsg rule create \
--nsg-name my-nsg \
--name "Allow-SQL-Health-Probes" \
--priority 4095 \
--access Allow \
--protocol Tcp \
--direction Inbound \
--source-address-prefixes "168.63.129.16" \
--source-port-ranges "*" \
--destination-address-prefixes "VirtualNetwork" \
--destination-port-ranges "1433" \
--description "SQL DB health probe allow"
Note: source-address-prefixes "168.most.129.16" — not "AzureLoadBalancer". That keyword only works in service tags, which require --source-asgs — and AKS doesn’t use Application Security Groups.
💡 Insider tip #5: Azure NSG rule priority must be unique per NSG — but Terraformazurerm_network_security_rulewill silently overwrite existing rules with duplicate priorities without warning, even ifreplace_triggered_byisn’t set. Always runaz network nsg rule list --nsg-name my-nsg --query "[?priority==4095]"post-apply.
We added a CI gate:
# In CI pipeline
NSG_RULE_COUNT=$(az network nsg rule list --nsg-name my-nsg --query "length([?priority==4095])" -o tsv)
if [[ "$NSG_RULE_COUNT" != "1" ]]; then
echo "ERROR: Expected exactly 1 rule at priority 4095"
exit 1
fi
No more silent overwrites. No more black-holed health probes.
Cloud networking isn’t about static IP allowlists. It’s about timing: when probes fire, when rules evaluate, when logs emit, and when humans notice.
---
Common Pitfalls — With Exact Fixes
Pitfall #1: Using latest Docker tags in production
Story: At a ride-sharing company in 2020, our ECS task definition used image = "nginx:latest". On March 17, nginx pushed 1.25.0, which changed the default worker_processes from auto to 1 — cutting throughput by 70%. We had no image digest pinning, no admission controller, and no CI gate.
Root cause: latest is a mutable tag. It points to whatever the registry owner says it points to — no guarantees, no audit trail, no rollback.
Exact fix: Pin every image to its SHA256 digest, fetched at build time:
# Dockerfile
FROM nginx@sha256:a8a0e752b585c2a10e550877063466554321f48e3b53b742d17319255a486e9e
← this digest is immutable, verifiable, and reproducible
Then enforce it in CI:
# In CI script
IMAGE_DIGEST=$(docker inspect nginx:latest --format='{{index .RepoDigests 0}}' | cut -d'@' -f2)
echo "nginx@$IMAGE_DIGEST" >> .image-digests
Then fail if Dockerfile doesn’t match
if ! grep -q "$IMAGE_DIGEST" Dockerfile; then
echo "ERROR: Dockerfile uses unpinned nginx image"
exit 1
fi
Tradeoff: Digest pinning prevents automatic security patches. So pair it with a scheduled job (e.g., GitHub Actions weekly) that:
- Pulls
nginx:latest - Compares digest to current one
- If changed, opens PR with updated digest + changelog link
Pitfall #2: Assuming t3.micro is “free tier” — then getting billed for burstable CPU credits
Story: At a seed-stage startup in 2021, we launched a monitoring dashboard on t3.micro (2 vCPU, 1 GiB RAM). Uptime was 99.99%. Cost was $0.01/hour — until month-end: $42.71.
Root cause: t3.micro is burstable. It earns CPU credits at 6/hr baseline. But our dashboard ran a heavy Grafana query every minute — consuming 12 credits/hr. After 2 hours, it exhausted its balance and throttled to 10% CPU — but still billed full rate, because a cloud provider charges for instance-hours, not CPU usage.
We found it in CloudWatch: CPUCreditBalance dropped to 0 at 2:17 AM, stayed there for 18 hours.
Exact fix: Switch to t4g.micro (ARM, Graviton2) — same price, 2x CPU credit earn rate (12/hr), and better baseline performance. Or, if you need consistent CPU, use t3.small (12 credits/hr baseline, no burst needed).
But better: monitor CPUCreditBalance and auto-scale before exhaustion:
# CloudWatch alarm
- AlarmName: "t3-micro-CPU-Credit-Low"
MetricName: "CPUCreditBalance"
Namespace: "AWS/ECS"
Statistic: "Minimum"
Period: 300
EvaluationPeriods: 2
Threshold: 50
ComparisonOperator: "LessThanOrEqualToThreshold"
AlarmActions: ["arn:aws:sns:us-west-2:123456789012:cpu-credit-alert"]
Then respond with Terraform-driven scaling:
# When alarm triggers, run this
resource "aws_ecs_service" "dashboard" {
# ...
capacity_provider_strategies {
capacity_provider = "FARGATE_SPOT"
weight = 1
}
}
Tradeoff: t4g.micro requires ARM-compatible containers. If you’re using x86-only binaries (e.g., legacy C++ libs), you must rebuild. No workaround.
Pitfall #3: Using allow: ["*"] in GitHub Actions permissions
Story: At a fintech in 2023, our CI workflow used permissions: read-all to “make things simple.” Then a dev accidentally committed a script that ran gh api /repos/{owner}/{repo}/actions/secrets -X POST ... — and leaked prod DB credentials to a public gist.
Root cause: permissions: read-all grants secrets:read, packages:read, pull-requests:read, etc. But gh auth login inside the runner used the workflow token, which had full secrets:read access — and the script didn’t validate the target repo.
Exact fix: Never use allow: [""] or read-all. Explicitly declare minimum required*:
# .github/workflows/deploy.yml
permissions:
id-token: write # for OIDC
contents: read # to checkout code
packages: read # to pull private npm packages
# NO secrets:read — use environment-specific tokens instead
Then for secrets, use environment-scoped tokens:
env:
PROD_DB_URL: ${{ secrets.PROD_DB_URL }}
And rotate secrets every 90 days via automation:
# Run monthly
aws secretsmanager rotate-secret \
--secret-id prod/db-url \
--rotation-lambda-arn arn:aws:lambda:us-west-2:123456789012:function:rotate-db-url \
--rotation-rules AutomaticallyAfterDays=90
Tradeoff: Explicit permissions slow down initial setup. But they prevent $2M breach fines — and that’s faster than any sprint.
---
What You Should Do Tomorrow
Not next week. Not after “finishing this ticket.” Tomorrow.
- Run this command on every a cloud provider account you own:
aws ec2 describe-volumes --filters "Name=status,Values=available" --query "Volumes[?Size>=100].[VolumeId,Size,Encrypted,Attachments[0].InstanceId]" --output table
If it returns any rows, you have orphaned volumes. Delete them now — or tag them CostCenter=investigate and add a 72-hour TTL.
- On every Kubernetes cluster you operate, run:
kubectl get nodes -o wide | while read node _; do echo $node; ssh $node "df -h | grep -E '/var/lib/kubelet|/var/lib/containerd'"; done
If /var/lib/containerd/ is >85% full and not on the same device as /var/lib/kubelet, set both imageGCLowThresholdPercent and imageGCHighThresholdPercent explicitly — today.
- Audit every Lambda function with
vpc_configenabled:
aws lambda list-functions --query "Functions[?VpcConfig.VpcId!=null].[FunctionName,VpcConfig.SubnetIds,VpcConfig.SecurityGroupIds]" --output table
For each, check if you’ve pre-provisioned ENIs tagged with aws:lambda:function-name:. If not, provision 3 per AZ before* your next deploy.
- Replace every
latesttag in your Dockerfiles with a SHA256 digest. Use this one-liner:
docker pull nginx:latest && docker inspect nginx:latest --format='{{index .RepoDigests 0}}'
# Then replace "nginx:latest" with "nginx@sha256:..."
- In every GitHub Actions workflow, delete
permissions: read-alland replace it with the minimal set. Start withcontents: readand add only what breaks.
None of these take more than 20 minutes. All of them prevent a significant amount.38 bills, 8-second latency spikes, and midnight Slack alerts.
Cloud infrastructure isn’t magic. It’s physics — with state, timing, and side effects. The fastest way to stop fighting it is to stop pretending it’s simpler than it is.
I’ve wasted 3 days, a significant amount.38, and 36 hours of debugging time learning that.
You don’t have to.