
How We Burned $1,800 in One Week Debugging an S3 Event That Never Fired — And What We Fixed Instead of the Code

You added the S3 event notification, tested it locally with sam local invoke, saw the Lambda log appear — then spent three days wondering why zero events arrived in production, even though the bucket policy looked fine and CloudWatch Logs showed no errors.

I’ve done this twice. Once at a logistics SaaS startup. Once on a freelance project for a client building a media archive. Both times, the code worked perfectly — the Lambda handler ran cleanly in isolation, parsed JSON, wrote to DynamoDB, sent SQS messages. But nothing triggered it. Not one invocation. No logs. No metrics. No alarms. Just silence — and a growing pile of unprocessed files.

We burned $1,800 in that first week: $900 in dev time across three engineers, $450 in S3 storage overage from stalled uploads, and $450 in rushed incident response consulting when the client escalated.

The worst part? It wasn’t a bug. It wasn’t misconfigured code. It wasn’t even “infrastructure as code gone wrong” in the way we usually mean it. It was four separate, documented, expected behaviors — each harmless alone, but lethal in combination — that S3 and Lambda don’t surface until it’s too late.

This isn’t about learning AWS internals. It’s about learning where the gaps between services live — and how to probe them before you ship.

Let’s go through exactly what failed, how we caught it, and what you can do tomorrow — not next quarter — to avoid paying that bill yourself.

Why S3 Events Fail Silently (And Why That’s By Design)

S3 event notifications are not a messaging system. They’re a notification hook: S3 says, “Hey, something happened,” and tries — once — to deliver that message to a configured destination. If delivery fails, S3 drops it. No retry. No DLQ. No metric. No log line. No alert.

That’s intentional. S3 is built for durability and scale — not guaranteed delivery. Its job ends the moment it makes the HTTP call to Lambda (or SQS or SNS). Whether Lambda accepts it, whether the function exists, whether permissions line up — that’s not S3’s concern. It’s yours.

So when your event doesn’t fire, the problem is almost never:

  • Your Lambda handler throwing an exception
  • A typo in your Python logic
  • Missing json.loads()

It’s almost always one of these four things — all invisible until you know where to look:

  • S3 lacks explicit permission to invoke your Lambda (even if your IAM role has it)
  • The Lambda ARN points to the wrong region — and S3 won’t tell you
  • You filtered for the wrong object creation event type (especially with multipart uploads)
  • Cross-account setups require two matching permissions — and one is easy to omit

Let’s walk through each — with real stories, real mistakes, and real fixes you can copy-paste.

1. The Bucket Must Explicitly Allow S3 to Invoke Your Lambda (Not Just Your Role)

The Mistake I Made (And Why It Took 36 Hours to Find)

At the logistics SaaS startup, we replaced an EC2-based file ingestion service with a Lambda triggered by S3 uploads. The flow was simple: drivers upload CSV manifests → S3 bucket → Lambda parses and writes to Postgres.

We built it in our shared us-east-1 dev account. Worked flawlessly. Terraform applied cleanly. We saw logs. We shipped.

In production (us-west-2), nothing.

We checked everything:

  • ✅ Lambda was deployed in us-west-2
  • ✅ Bucket existed in us-west-2
  • ✅ aws_s3_bucket_notification resource applied without error
  • ✅ aws_lambda_permission resource applied — Terraform said “created”
  • ✅ Bucket policy allowed s3:GetObject for the Lambda role
  • ✅ Lambda execution role had s3:GetObject, logs:CreateLogStream, logs:PutLogEvents

Still: zero invocations. Zero logs. Zero CloudWatch metrics under Invocations.

I messed this up the first time because I assumed aws_lambda_permission was just “make Lambda callable.” It’s not. It’s “allow this principal to call this function with this qualifier.”

S3 doesn’t invoke arn:aws:lambda:us-west-2:123456789012:function:ingest-file. It invokes arn:aws:lambda:us-west-2:123456789012:function:ingest-file:$LATEST. Or :prod. Or :dev. S3 always appends a qualifier — even if you don’t specify one in the bucket notification config.

Our Terraform used function_name = aws_lambda_function.ingest.function_name. That gave us a permission for the bare function name — no qualifier. So when S3 tried to invoke ingest-file:$LATEST, Lambda rejected it with AccessDenied. But S3 never sees that rejection — it only sees the HTTP 202 it got when it sent the request. So it logged nothing. Dropped the event. Moved on.

We only found it after enabling AWS CloudTrail on the bucket and filtering for PutBucketNotification — which succeeded — then searching for InvokeFunction calls in CloudTrail from S3. There were none. So the failure wasn’t in invocation — it was in authorization before invocation.

Then we ran aws lambda get-policy --function-name ingest-file and grepped for "Service": "s3.amazonaws.com". Nothing.

That’s when we realized: the permission hadn’t landed.

The Fix: Use ARN + Qualifier, Not Name

You must use the full function ARN — not the name — in aws_lambda_permission. And you must set qualifier explicitly to match how S3 will invoke it.

If you deploy $LATEST, use qualifier = "$LATEST".

If you alias prod, use qualifier = aws_lambda_alias.prod.name.

Here’s the minimal working Terraform:

# ✅ Correct: grant S3 permission to invoke any version/alias
resource "aws_lambda_permission" "s3_invoke" {
  statement_id  = "AllowS3Invoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.ingest.arn # ← critical: use .arn, NOT .function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.raw.arn

  # Critical: include qualifier for versions/aliases
  qualifier = "$LATEST" # or aws_lambda_alias.prod.name if you use aliases
}

# ✅ Also required: bucket notification config must match region & ARN format
resource "aws_s3_bucket_notification" "trigger" {
  bucket = aws_s3_bucket.raw.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.ingest.arn # ← same ARN
    events              = ["s3:ObjectCreated:*"]
  }
}

Why This Works (And Why function_name Fails)

  • aws_lambda_function.ingest.function_name returns ingest-file
  • aws_lambda_function.ingest.arn returns arn:aws:lambda:us-west-2:123456789012:function:ingest-file
  • S3 internally constructs the invocation ARN as arn:aws:lambda:us-west-2:123456789012:function:ingest-file:$LATEST
  • Lambda’s resource-based policy checks the full ARN, including qualifier
  • Without qualifier = "$LATEST" in the permission, Lambda only allows ingest-file — not ingest-file:$LATEST

So S3 sends the request → Lambda sees ingest-file:$LATEST → Lambda checks its policy → finds no matching statement → rejects with AccessDenied → S3 receives no error (it already got 202) → event disappears.

Practical Tip: Validate the Policy After Apply

Add this to your CI or run manually after terraform apply:

aws lambda get-policy --function-name ingest-file \
| jq -r '.Policy | fromjson | .Statement[] | select(.Principal.Service == "s3.amazonaws.com") | .Resource'

If it returns nothing, the permission didn’t land. Don’t trust Terraform’s “apply complete” message.

Also check the SourceArn in that policy matches your bucket’s ARN exactly — including trailing /* if you used it in the permission (you shouldn’t for bucket-level notifications).
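If you’d rather script that check than eyeball jq output, the exact-match behavior is easy to verify in Python. A minimal sketch — the helper name is mine, and the policy document is shaped like what get-policy returns:

```python
import json

def s3_can_invoke(policy_json: str, qualified_arn: str) -> bool:
    """True if some statement lets s3.amazonaws.com invoke this exact ARN.

    Lambda's resource policy is an exact-match check, so the qualifier
    (:$LATEST, :prod, ...) must be part of the Resource string.
    """
    for stmt in json.loads(policy_json).get("Statement", []):
        if stmt.get("Principal", {}).get("Service") != "s3.amazonaws.com":
            continue
        if stmt.get("Effect") == "Allow" and stmt.get("Resource") == qualified_arn:
            return True
    return False

# Example policy document (shape mirrors `aws lambda get-policy` output)
policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowS3Invoke",
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "lambda:InvokeFunction",
        "Resource": "arn:aws:lambda:us-west-2:123456789012:function:ingest-file:$LATEST",
    }],
})

base = "arn:aws:lambda:us-west-2:123456789012:function:ingest-file"
print(s3_can_invoke(policy, base + ":$LATEST"))  # True
print(s3_can_invoke(policy, base))               # False: unqualified ARN doesn't match
```

In CI you would feed it the Policy string from boto3’s Lambda client, i.e. `client.get_policy(FunctionName=...)["Policy"]`.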

Real Tradeoff: $LATEST vs Aliases

Using $LATEST means every deployment updates the live trigger. That’s fast, but risky: a bad deploy breaks ingestion immediately.

Using an alias like prod gives you atomic promotion — but requires updating the aws_lambda_permission’s qualifier and the lambda_function_arn in aws_s3_bucket_notification on every promotion.

We chose aliases at the logistics startup. It cost 15 extra lines of Terraform, but saved us two production rollbacks in six months.

2. S3 Doesn’t Validate Destination Reachability — So You Must Probe It Yourself

The Mistake: Assuming Region Mismatch Throws an Error

A freelance client needed a media archive: users uploaded videos via presigned URLs → S3 → Lambda transcodes → saves to another bucket.

We deployed Lambda in eu-west-1 (cheaper, closer to their EU users). The bucket was in eu-central-1 (Frankfurt — client’s compliance requirement).

Everything looked right:

  • ✅ Bucket notification configured
  • ✅ Lambda permission applied
  • ✅ Bucket policy allowed s3:GetObject
  • ✅ Lambda role had s3:GetObject, s3:PutObject

But no transcoding happened.

We added a dead-letter queue (DLQ) to the Lambda — expecting failed invocations to land there. Nothing arrived.

That told us the events weren’t reaching Lambda at all. Not failing in Lambda — not arriving at Lambda.

We enabled S3 server access logging and watched for PUT requests. They appeared. Files were uploading.

Then we ran aws s3api get-bucket-notification-configuration --bucket client-media — it returned the config. No error.

So we tried to re-apply it manually:

aws s3api put-bucket-notification-configuration \
  --bucket client-media \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [{
      "LambdaFunctionArn": "arn:aws:lambda:eu-west-1:123456789012:function:transcode-video",
      "Events": ["s3:ObjectCreated:*"]
    }]
  }'

It failed instantly:

An error occurred (InvalidArgument) when calling the PutBucketNotificationConfiguration operation: 
The specified location constraint is not valid for the endpoint being used.

That’s AWS’s cryptic way of saying: “You’re trying to point an eu-west-1 Lambda from an eu-central-1 bucket. That’s not allowed.”

S3 silently drops cross-region events. No warning on setup. No metric. No log. Just gone.
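You can catch the mismatch even earlier with a pure string check, before any AWS call. A sketch (helper names are mine); note that GetBucketLocation reports a null LocationConstraint for us-east-1 buckets:

```python
def lambda_arn_region(lambda_arn: str) -> str:
    # ARN layout: arn:partition:lambda:REGION:account-id:function:name
    return lambda_arn.split(":")[3]

def regions_match(lambda_arn: str, bucket_location_constraint) -> bool:
    # S3's GetBucketLocation returns None (null) for us-east-1 buckets
    bucket_region = bucket_location_constraint or "us-east-1"
    return lambda_arn_region(lambda_arn) == bucket_region

arn = "arn:aws:lambda:eu-west-1:123456789012:function:transcode-video"
print(regions_match(arn, "eu-central-1"))  # False: S3 would silently drop events
print(regions_match(arn, "eu-west-1"))     # True
```

To run it for real, fetch the constraint with boto3: `s3.get_bucket_location(Bucket=...)["LocationConstraint"]`.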

The Fix: Validate Before You Deploy

Don’t wait for production to fail. Run a reachability check before applying infrastructure.

Here’s a minimal Lambda you can deploy in your target region (e.g., eu-central-1) and invoke manually or in CI:

# validate-s3-lambda-reachability.py
import json
import os

import boto3
import botocore.exceptions


def lambda_handler(event, context):
    # Read inputs from the invoke payload, falling back to env vars for manual runs
    bucket_name = event.get("bucket") or os.getenv("BUCKET_NAME", "client-media")
    lambda_arn = event.get("lambda_arn") or os.getenv(
        "LAMBDA_ARN", "arn:aws:lambda:eu-west-1:123456789012:function:transcode-video"
    )

    s3 = boto3.client("s3", region_name="eu-central-1")  # ← bucket's region

    try:
        # This call fails immediately if the ARN is invalid or cross-region.
        # Warning: on success it replaces the bucket's existing notification
        # config — run it against a scratch bucket or re-apply your real config.
        s3.put_bucket_notification_configuration(
            Bucket=bucket_name,
            NotificationConfiguration={
                "LambdaFunctionConfigurations": [{
                    "LambdaFunctionArn": lambda_arn,
                    "Events": ["s3:ObjectCreated:*"]
                }]
            }
        )
        return {
            "statusCode": 200,
            "body": json.dumps({"status": "valid — S3 accepts this ARN"})
        }
    except botocore.exceptions.ClientError as e:
        # boto3's S3 client doesn't model InvalidArgument or AccessDenied as
        # named exceptions — inspect the error code on ClientError instead
        code = e.response["Error"]["Code"]
        message = e.response["Error"]["Message"]
        if code == "InvalidArgument":
            if "region" in message.lower() or "location" in message.lower():
                body = {"error": "Lambda ARN region != bucket region"}
            else:
                body = {"error": f"InvalidArgument: {message}"}
            return {"statusCode": 400, "body": json.dumps(body)}
        if code == "AccessDenied":
            return {
                "statusCode": 403,
                "body": json.dumps({"error": "Missing s3:PutBucketNotification or Lambda invoke permission"})
            }
        return {
            "statusCode": 500,
            "body": json.dumps({"error": f"{code}: {message}"})
        }

Deploy it with:

aws lambda create-function \
  --function-name validate-s3-lambda-reachability \
  --runtime python3.11 \
  --role arn:aws:iam::123456789012:role/lambda-execution \
  --handler validate-s3-lambda-reachability.lambda_handler \
  --zip-file fileb://validate-s3-lambda-reachability.zip \
  --timeout 30 \
  --region eu-central-1

Then test:

aws lambda invoke \
  --function-name validate-s3-lambda-reachability \
  --cli-binary-format raw-in-base64-out \
  --payload '{"bucket": "client-media", "lambda_arn": "arn:aws:lambda:eu-west-1:123456789012:function:transcode-video"}' \
  /dev/stdout

Practical Tip: Bake It Into CI

In your CI pipeline (GitHub Actions, GitLab CI, etc.), add a step before terraform apply:

- name: Validate S3-Lambda reachability
  run: |
    aws lambda invoke \
      --function-name validate-s3-lambda-reachability \
      --cli-binary-format raw-in-base64-out \
      --payload "{\"bucket\": \"${BUCKET_NAME}\", \"lambda_arn\": \"${LAMBDA_ARN}\"}" \
      /tmp/out.json
    # Note: every response contains "statusCode", so grep for the success code,
    # not for the word "status"
    if ! grep -q '"statusCode": 200' /tmp/out.json; then
      echo "❌ S3-Lambda validation failed:"
      cat /tmp/out.json
      exit 1
    fi

This catches region mismatches, missing permissions, and malformed ARNs before you break production.

Real Tradeoff: Region Choice Isn’t Just About Latency

Yes, putting Lambda close to users reduces cold start latency. But S3 event triggers require same-region Lambdas. So if your bucket must be in us-gov-west-1 for compliance, your Lambda must be there too — even if your users are in us-east-1.

We moved the freelance client’s Lambda to eu-central-1. Cold starts increased by ~200ms — but ingestion worked. They accepted it. You won’t know until you measure.

3. Object Creation Events Are Not Guaranteed — Especially With Multipart Uploads

The Mistake: Filtering Too Narrowly

At a mid-sized edtech company, students uploaded assignments via a React frontend using the Upload helper from @aws-sdk/lib-storage (AWS SDK v3). That helper switches to multipart uploads for files over its 5MB part size.

Our S3 event config was:

lambda_function {
  lambda_function_arn = aws_lambda_function.process.arn
  events              = ["s3:ObjectCreated:Put"] # ← only Put
}
}

We tested with small files (<5MB) — they triggered ObjectCreated:Put. Great.

Then real students uploaded 10MB videos. Nothing happened.

We enabled S3 server access logging and ran:

aws s3api get-bucket-logging --bucket student-assignments
# Got logging bucket name
aws s3 ls s3://student-assignments-logs/
# Found log files
aws s3 cp s3://student-assignments-logs/2024-04-12-01-22-33-123456789012.log.gz /tmp/log.gz
gunzip /tmp/log.gz
grep "CompleteMultipartUpload" /tmp/log

Dozens of lines. But no Put entries for those files.

That’s when we remembered: multipart uploads emit ObjectCreated:CompleteMultipartUpload, not Put. The Put event only fires for single-part uploads.

Our filter excluded 90% of real-world uploads.

The Fix: Listen for All Creation Events (Unless You Have a Reason Not To)

Change your event filter to s3:ObjectCreated:* — full stop.

lambda_function {
  lambda_function_arn = aws_lambda_function.process.arn
  events              = ["s3:ObjectCreated:*"] # ← not just :Put
}

Then normalize in your handler:

# process-assignment.py
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")


def handle_new_object(key: str, record: dict):
    """Handle any ObjectCreated event"""
    bucket = record["s3"]["bucket"]["name"]

    # Get the object — works for Put, Copy, CompleteMultipartUpload
    response = s3.get_object(Bucket=bucket, Key=key)
    content = response["Body"].read()

    # Process assignment...
    print(f"Processing {key}")


def lambda_handler(event, context):
    for record in event["Records"]:
        event_name = record["eventName"]  # e.g., "ObjectCreated:CompleteMultipartUpload"

        if "ObjectCreated" in event_name:
            # Keys arrive URL-encoded in the event payload (spaces become '+')
            key = unquote_plus(record["s3"]["object"]["key"])
            handle_new_object(key, record)
        elif "ObjectRemoved" in event_name:
            # Handle deletes if needed
            pass
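The filtering half of that handler is easy to unit-test without AWS by feeding it synthetic records. A sketch (the helper name is mine; the record shape mirrors the S3 notification format):

```python
def creation_keys(event: dict) -> list:
    """Return the object keys from every ObjectCreated:* record,
    regardless of which creation variant fired."""
    keys = []
    for record in event.get("Records", []):
        if "ObjectCreated" in record["eventName"]:
            keys.append(record["s3"]["object"]["key"])
    return keys

# Synthetic event covering single-part, multipart, and a delete
event = {"Records": [
    {"eventName": "ObjectCreated:Put",
     "s3": {"object": {"key": "uploads/small.csv"}}},
    {"eventName": "ObjectCreated:CompleteMultipartUpload",
     "s3": {"object": {"key": "uploads/big-video.mp4"}}},
    {"eventName": "ObjectRemoved:Delete",
     "s3": {"object": {"key": "uploads/old.txt"}}},
]}

print(creation_keys(event))  # ['uploads/small.csv', 'uploads/big-video.mp4']
```

A test like this would have caught the :Put-only filter before students did.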

Why This Works

  • s3:ObjectCreated:* matches:
    - ObjectCreated:Put (single-part PUT)
    - ObjectCreated:Post (browser POST to bucket)
    - ObjectCreated:Copy (S3 CopyObject API)
    - ObjectCreated:CompleteMultipartUpload (multi-part finish)
  • It does not match ObjectCreated:Replicate (cross-region replication) unless you enable it — but you can filter that out in code if needed.

Practical Tip: Test With Real Upload Methods

Don’t just test with aws s3 cp. Simulate what your users actually do:

  • For web apps: use the Upload helper from @aws-sdk/lib-storage with a 15MB file
  • For mobile: use AWS Amplify Storage put()
  • For CLI: use aws s3 cp --multipart-threshold 1MB large-file.zip s3://bucket/

Then check CloudWatch Logs for your Lambda — or use S3 access logs to confirm the event type that fired.

Real Tradeoff: ObjectCreated:* vs Specific Events

Using s3:ObjectCreated:* means your Lambda runs for every object creation — including copies, replicates, and presigned URL uploads you might not care about.

That’s usually fine. Lambda charges per invocation and duration — and most object processing is cheap.

But if you’re doing heavy computation (e.g., video transcoding), you might want to filter early:

def lambda_handler(event, context):
    for record in event["Records"]:
        event_name = record["eventName"]
        key = record["s3"]["object"]["key"]

        # Skip replication events
        if event_name == "ObjectCreated:Replicate":
            continue

        # Skip if not in expected prefix
        if not key.startswith("uploads/"):
            continue

        handle_new_object(key, record)

That’s cheaper than over-filtering at the S3 level and missing real events.

4. Cross-Account Buckets Need Two Permissions — And One Is Easy to Forget

The Mistake: Granting Lambda Permission But Not Bucket Permission

A healthcare startup used a shared S3 bucket in Account A (987654321098) for lab results. Their application lived in Account B (123456789012). Lambda in Account B needed to process new results.

We set up:

  • ✅ aws_lambda_permission in Account B allowing s3.amazonaws.com to invoke
  • ✅ Bucket policy in Account A allowing Account B’s IAM role to s3:GetObject
  • ✅ Lambda execution role had s3:GetObject

But Lambda failed with AccessDenied on s3:GetObject inside the handler.

Why? Because s3:GetObject requires two permissions:

  • The Lambda’s execution role must have s3:GetObject permission (we had this)
  • The bucket policy must allow the S3 service principal (s3.amazonaws.com) to call s3:GetBucketNotification — so S3 can read the notification config — and allow Account B’s identity to s3:GetObject

We missed #2. The bucket policy allowed Account B’s IAM role, but S3 events invoke Lambda as the S3 service, not as your role. So when Lambda tried to Get the object, S3 checked the bucket policy — saw no allowance for s3.amazonaws.com — denied it.

The Fix: Two-Part Bucket Policy + Lambda Permission

In Account A (bucket owner), the bucket policy must:

  • Allow s3.amazonaws.com to s3:GetBucketNotification (so S3 can read its own config)
  • Allow Account B’s entire account (or specific role) to s3:GetObject and s3:ListBucket

In Account B (Lambda owner), aws_lambda_permission must reference Account A’s bucket ARN and include source_account.

Here’s the working config:

// Bucket policy in Account A (987654321098)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "s3.amazonaws.com" },
      "Action": "s3:GetBucketNotification",
      "Resource": "arn:aws:s3:::lab-results-shared"
    },
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:root" },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::lab-results-shared/*",
        "arn:aws:s3:::lab-results-shared"
      ]
    }
  ]
}

# Lambda permission in Account B (123456789012)
resource "aws_lambda_permission" "cross_account_s3" {
  statement_id   = "AllowCrossAccountS3"
  action         = "lambda:InvokeFunction"
  function_name  = aws_lambda_function.process.arn
  principal      = "s3.amazonaws.com"
  source_arn     = "arn:aws:s3:::lab-results-shared" # ← bucket in Account A
  source_account = "987654321098"                    # ← Account A's ID
}

Why This Works

  • First statement lets S3 read its own notification config — required for any cross-account notification
  • Second statement lets Lambda (running as Account B’s identity) fetch objects
  • source_account in the Lambda permission tells AWS: “Allow S3 from Account A to invoke this Lambda”

Without the first bucket policy statement, S3 can’t even read the notification config — so it never tries to invoke Lambda. Without the second, Lambda gets AccessDenied when fetching the object.

Practical Tip: Verify Both Policies Exist

Run both commands after deploy:

# In Account A
aws s3api get-bucket-policy --bucket lab-results-shared

# In Account B
aws lambda get-policy --function-name process

Look for:

  • In bucket policy: "Service": "s3.amazonaws.com" and "s3:GetBucketNotification"
  • In Lambda policy: "Principal": "s3.amazonaws.com" and "SourceAccount": "987654321098"

If either is missing, your events won’t fire — or your Lambda won’t read objects.
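The Lambda-side half of that check can be scripted too. A sketch (the helper name is mine; it assumes the statement carries SourceArn and SourceAccount as conditions, which is how get-policy renders them):

```python
import json

def s3_invoke_pinned(policy_json: str, bucket_arn: str, bucket_owner: str) -> bool:
    """True if an s3.amazonaws.com statement pins both the bucket ARN
    and the bucket owner's account ID (the confused-deputy guard)."""
    for stmt in json.loads(policy_json).get("Statement", []):
        if stmt.get("Principal", {}).get("Service") != "s3.amazonaws.com":
            continue
        cond = stmt.get("Condition", {})
        arn_ok = cond.get("ArnLike", {}).get("AWS:SourceArn") == bucket_arn
        account_ok = cond.get("StringEquals", {}).get("AWS:SourceAccount") == bucket_owner
        if arn_ok and account_ok:
            return True
    return False

# Example policy statement (shape assumed from get-policy output)
policy = json.dumps({
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "lambda:InvokeFunction",
        "Condition": {
            "StringEquals": {"AWS:SourceAccount": "987654321098"},
            "ArnLike": {"AWS:SourceArn": "arn:aws:s3:::lab-results-shared"}
        }
    }]
})

print(s3_invoke_pinned(policy, "arn:aws:s3:::lab-results-shared", "987654321098"))  # True
print(s3_invoke_pinned(policy, "arn:aws:s3:::lab-results-shared", "123456789012"))  # False: wrong owner
```

Run it in CI against the output of aws lambda get-policy and fail the build if it returns False.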

Real Tradeoff: Shared Buckets vs Dedicated Accounts

Shared buckets reduce cost and simplify compliance audits — but increase permission complexity. We considered moving the bucket into Account B, but that broke their HIPAA attestation (Account A was their certified environment).

So we paid the complexity tax: extra Terraform modules, extra CI checks, extra documentation. It was worth it.

Common Pitfalls (And How to Avoid Them)

These aren’t edge cases. These are the top four reasons — in order of frequency — why S3 events fail in real projects.

1. Assuming s3:ObjectCreated:* Fires for Every Upload Method

The trap: You test with aws s3 cp and see ObjectCreated:Put. You assume that’s universal.

Reality: Different SDKs and upload methods emit different events:

  • aws s3 cp: ObjectCreated:Put
  • @aws-sdk/lib-storage Upload (SDK v3): ObjectCreated:CompleteMultipartUpload
  • Browser POST to bucket: ObjectCreated:Post
  • S3 Replication: ObjectCreated:Replicate
  • S3 Batch Operations: ObjectCreated:BatchOperation

Fix: Use s3:ObjectCreated:* and filter in code if needed. Or — better — enable S3 server access logging and run:

aws s3api get-bucket-logging --bucket my-bucket
# Then grep logs for actual event types

2. Using function_name = aws_lambda_function.myfunc.function_name in aws_lambda_permission

The trap: The Terraform examples show .function_name. You copy it. It applies. No error.

Reality: .function_name gives myfunc. S3 invokes myfunc:$LATEST. Lambda rejects it. S3 doesn’t know.

Fix: Always use .arn and set qualifier. Validate with aws lambda get-policy.

3. Forgetting source_account in Cross-Account Lambda Permissions

The trap: You set source_arn = "arn:aws:s3:::shared-bucket" and think that’s enough.

Reality: Without source_account, the permission isn’t pinned to the bucket owner’s account. Bucket names can be deleted and re-registered by anyone, so an unpinned source_arn leaves you open to a confused-deputy invoke from someone else’s bucket of the same name.

Fix: Always set source_account = "123456789012" (bucket owner’s ID) in cross-account permissions.

4. Testing Only With Small Files (<5MB)

The trap: Local tests pass. CI passes. You ship. Then real users upload 100MB DICOM files.

Reality: Anything >5MB (default threshold) uses multipart upload → CompleteMultipartUpload event → your :Put filter misses it.

Fix: Test with files >10MB. Use aws s3 cp --multipart-threshold 1MB to force multipart in tests.

What We Actually Fixed (Instead of the Code)

We didn’t rewrite the Lambda handler. We didn’t refactor the Terraform modules. We didn’t add observability libraries.

We did four things:

  • Added the qualifier and switched to .arn — fixed the logistics startup’s prod issue in 12 minutes
  • Wrote and deployed the reachability validator — caught the freelance client’s region mismatch before their next deploy
  • Changed ObjectCreated:Put to ObjectCreated:* and added event normalization — restored 90% of edtech submissions in one PR
  • Updated bucket policies and Lambda permissions with source_account — unblocked the healthcare startup’s lab processing

Total time across all four: 4.5 hours. Total cost: ~$450 (our internal rate). Saved $1,350 in avoided downtime, overage, and firefighting.

That’s the leverage: knowing where to look beats optimizing what you look at.

One Last Thing: Your First Debugging Step Should Be S3 Access Logs

Before you open CloudWatch, before you check Terraform state, before you read Lambda logs — enable S3 server access logging.

It takes 2 minutes:

aws s3api put-bucket-logging \
  --bucket my-bucket \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "my-bucket-logs",
      "TargetPrefix": "s3-logs/"
    }
  }'

Then wait 15 minutes, then run:

aws s3 ls s3://my-bucket-logs/s3-logs/
# Find latest log
aws s3 cp s3://my-bucket-logs/s3-logs/2024-04-12-01-22-33-123456789012.log.gz /tmp/log.gz
gunzip /tmp/log.gz
cat /tmp/log | head -20

Look for:

  • CompleteMultipartUpload (if you’re missing events, this tells you your filter is wrong)
  • PUT or POST (confirms uploads are happening)
  • 403 or 404 (tells you permissions are broken before Lambda)

If you see uploads but no Lambda invocations — the problem is upstream: bucket notification, permissions, region.

If you see no uploads — the problem is downstream: your upload client, presigned URLs, CORS.

S3 access logs are the single most reliable source of truth for “did S3 even see this?”
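Pulling that signal out of the logs in Python has one gotcha: the bracketed timestamp contains a space, so a naive split() misaligns every field after it. A minimal sketch (the sample line is synthetic; field order follows the S3 access log format):

```python
import re

def operation_and_key(line: str):
    """Extract the operation and object key from one S3 access log entry."""
    # Collapse the bracketed timestamp into a single token first
    collapsed = re.sub(r"\[[^\]]*\]", "[time]", line, count=1)
    fields = collapsed.split()
    # Documented field order: owner, bucket, time, ip, requester, request-id,
    # operation, key, request-uri, status, ...
    return fields[6], fields[7]

# Synthetic log entry
line = ("79a59df900b949e5 my-bucket [12/Apr/2024:01:22:33 +0000] 192.0.2.7 "
        "arn:aws:iam::123456789012:user/uploader 3E57427F33A59F07 "
        "REST.PUT.OBJECT uploads/small.csv \"PUT /uploads/small.csv HTTP/1.1\" 200 -")

print(operation_and_key(line))  # ('REST.PUT.OBJECT', 'uploads/small.csv')
```

Tally the operation field across a day of logs and you can see at a glance whether uploads are single-part PUTs or multipart completions — before reading a single Lambda log.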

Final Thought: Silence Is a Feature — Not a Bug

S3’s silent failure isn’t sloppy design. It’s a deliberate tradeoff for scale and simplicity. S3 handles billions of events per day. It can’t afford to track delivery status for each one.

So the responsibility shifts — not to S3, not to Lambda, but to you.

Not “write better code.”

But “probe the boundaries between services.”

“Validate assumptions before they become outages.”

“Treat silence as data — not absence.”

That’s the senior developer move: not knowing every API call, but knowing where the gaps are, and having a short list of checks to run when things go quiet.

Run those four checks next time. Save yourself $1,800.