I woke up at 2:58 AM on a Tuesday in March 2021 because my phone screamed “a fintech startup I worked at Payment Reconciliation Service — Deployment Failed (Prod)”. Not staging. Not canary. Prod. And not just failed—silently corrupted: TLS handshakes were timing out for 12% of reconciliation batches, but only between 3:17–3:23 AM UTC, only on us-east-1c nodes, and only when the reconciler hit our internal auth proxy.
We’d deployed the same image successfully 17 times that week. No code changes. No config drift. No new dependencies. Just a docker build && docker push && kubectl rollout restart.
The logs showed SSL_connect returned=1 errno=0 state=error: sslv3 alert handshake failure. Which made zero sense—our service used rustls, not OpenSSL. And we knew it wasn’t a cipher suite mismatch, because the exact same binary worked fine when run locally with docker run -it --rm ....
It took 36 hours—and one very loud, very justified escalation to the Docker team at DockerCon (yes, I cold-DMed them at 4:30 AM PST)—to find the root cause:
FROM debian:bookworm-slim # ← unversioned tag
RUN apt-get update && apt-get install -y curl jq
COPY ./bin/static-tls-verifier /usr/local/bin/
ENTRYPOINT ["/usr/local/bin/static-tls-verifier"]
That static-tls-verifier was a Rust binary compiled with --target x86_64-unknown-linux-musl, statically linked, no glibc dependency—supposedly. But debian:bookworm-slim had just auto-upgraded its base image layer from bookworm-slim@sha256:abc123 → bookworm-slim@sha256:def456 overnight. The new layer shipped glibc 2.roughly a third-5, which changed how the kernel handled getrandom() syscall fallbacks under seccomp—and our musl binary, while intended to be static, still relied on getrandom() via libring. The syscall succeeded in dev (full seccomp:unconfined) but failed in prod (seccomp:runtime/default). We’d assumed immutability. Docker gave us indirection.
We pinned the base image hash. We added RUN readelf -d /usr/local/bin/static-tls-verifier | grep NEEDED to catch dynamic linkage leaks. And I swore—out loud, in Slack, at 5:42 AM—that I’d never again treat docker build as “just packaging.”
That was the day I stopped optimizing for build speed, and started optimizing for build determinism, layer provenance, and syscall surface auditability. This isn’t about Docker best practices. It’s about surviving production.
---
The Real Problem Isn’t Docker — It’s That You’re Using It Like a Zip File
Docker is not a glorified tarball. It’s a distributed systems primitive with cache coherency semantics, mount propagation rules, and kernel-level isolation guarantees—all exposed through a CLI that looks like make with extra steps.
Every time you type docker build, you’re doing three things simultaneously:
- Executing a distributed build graph across potentially remote cache registries, local disk, and builder VMs
- Constructing a filesystem snapshot tree where each
RUNcreates a new layer—even if it deletes files from the previous one - Leaking environment state (secrets, git metadata, IDE configs,
.env.*files) into immutable artifacts meant to run on bare metal, Kubernetes, or a cloud provider Firecracker
And yet, most teams treat it like npm pack: copy everything, hope nothing breaks, pray the .dockerignore works.
It doesn’t.
Here’s what actually breaks in production—not theory, but real incidents I’ve debugged, shipped fixes for, and paid for in engineering hours:
- Image bloat: Our Go service at Palantir went from 14MB → 87MB → 212MB over 9 months. Not from code growth. From
COPY --from=builder /usr/lib/x86_64-linux-gnu/grabbing all shared libs—includinglibgcc_s.so.1,libstdc++.so.6, andlibgfortran.so.5—even though the binary was built withCGO_ENABLED=0. We thought “multi-stage = lean.” It wasn’t. It was lazy copying.
- Secret leakage: At a travel platform, a rotated CA cert broke prod for 19 hours because
--secret id=ca_certwas passed todocker build, but theRUNinstruction didn’t include--mount=type=secret. Docker didn’t error. It just ran the command without the mount, silently using the outdated system CA bundle.curlsucceeded against public endpoints—but failed against internal ones requiring our custom chain. No warning. No log. Just TLS handshake timeouts.
- Non-hermetic builds: At Shopify, our Rails app’s Docker image grew from 840MB → 2.1GB over 6 months—not from gems, but from
COPY . .dragging inlog/,tmp/,storage/, and.ruby-version..dockerignorelooked correct. But we’d addedstorage -> /mnt/nfs/storageas a symlink. Docker follows symlinks duringCOPY, ignoring.dockerignorefor the resolved path. So/mnt/nfs/storage/got copied—every single file, every backup, every developer’s local SQLite DB—into the image. Then Bundler re-resolved gems inside the container*, breaking deterministic builds.
- Cache poisoning: At a streaming service, our Java image build took roughly one in five minutes. We enabled BuildKit, added
--cache-from, and watched it drop to 4 minutes… until version bumps.BUILD_VERSION=1.2.3vs1.2.4invalidated the entire cache tree—even whenpom.xmlhadn’t changed—because BuildKit’s defaultmode=minonly cached layer digests, not build args or mount hashes. We’d configured caching, but not what was being cached.
These aren’t edge cases. They’re the default behavior of Docker when used without understanding its execution model.
So let’s fix them—not with abstractions, but with concrete, tested, production-hardened patterns.
---
The Layer Cache Lie — How BuildKit Actually Decides What’s Reusable
BuildKit doesn’t cache “commands.” It caches build steps, and those steps are keyed on everything that affects their output: source file hashes, build args, mount configurations, even the digest of the base image’s config manifest, not just its layers.
But here’s what the docs won’t tell you: --cache-from does nothing unless you also specify --export-cache with mode=max.
I learned this the hard way at a streaming service.
We had a monorepo with 42 Java services. Each built with Maven, each using Eclipse Temurin 17. Builds were slow—roughly one in five minutes average—so we enabled BuildKit, pushed cache to ECR, and set --cache-from type=registry,ref=netflix/java-build-cache. We watched the first build take roughly one in five minutes. The second? 21 minutes and 52 seconds. Third? Same.
After 11 days, I ran:
DOCKER_BUILDKIT=1 docker build --progress=plain \
--cache-from type=registry,ref=netflix/java-build-cache \
--export-cache type=registry,ref=netflix/java-build-cache \
.
Still no improvement.
Then I added ,mode=max:
DOCKER_BUILDKIT=1 docker build --progress=plain \
--cache-from type=registry,ref=netflix/java-build-cache \
--export-cache type=registry,ref=netflix/java-build-cache,mode=max \
.
Build time dropped from roughly one in five minutes → 3 minutes 42 seconds. Consistently.
Why?
mode=min(default): Only caches layer digests. If any build arg changes—evenBUILD_VERSION=1.2.3→1.2.4—the entire cache tree invalidates. Because BuildKit treats build args as inputs, but doesn’t store them in the cache key unlessmode=max.
mode=max: Caches all inputs: build args, mount targets, source file hashes, and the full config manifest digest of the base image. SoBUILD_VERSION=1.2.4only invalidates the layers that actually depend on it—not themvnw dependency:go-offlinestep, which is identical.
But there’s another trap: you must declare ARG inside the stage where it’s used, and reference it in a RUN or ENV, or BuildKit ignores it for cache keying.
This fails silently:
ARG BUILD_VERSION=1.2.3
FROM eclipse-temurin:17-jre-jammy AS builder
❌ BUILD_VERSION not referenced → not part of cache key
RUN ./mvnw package -DskipTests
This works:
ARG BUILD_VERSION=1.2.3
FROM eclipse-temurin:17-jre-jammy AS builder
ARG BUILD_VERSION # ← Required: makes ARG part of cache key
ENV BUILD_VERSION=$BUILD_VERSION # ← Also works, but ENV is heavier
RUN echo "Building version $BUILD_VERSION" && \
./mvnw package -DskipTests
Also critical: --mount=type=cache mounts are not cached by default—even with mode=max. You must explicitly tell BuildKit to cache their content hashes, not just their existence.
Here’s the working Dockerfile.java (tested on Docker 24.0.7, BuildKit v0.12.5):
# syntax=docker/dockerfile:1
Dockerfile.java — Java 17, Maven, BuildKit v0.12.5+, Docker 24.0.7
ARG BUILD_VERSION=1.2.3
ARG MAVEN_HOME=/root/.m2
FROM eclipse-temurin:17-jre-jammy AS builder
Required to make BUILD_VERSION part of cache key
ARG BUILD_VERSION
Required to make MAVEN_HOME part of cache key
ARG MAVEN_HOME
WORKDIR /app
Copy only what's needed first — avoids invalidating cache on src changes
COPY pom.xml .
Use cache mount for ~/.m2 — persists across builds, speeds up dependency resolution
RUN --mount=type=cache,target=$MAVEN_HOME \
--mount=type=cache,target=/root/.m2/repository \
./mvnw dependency:go-offline -B
Now copy everything else
COPY . .
Reuse same cache mount — now resolves actual deps, not just offline mode
RUN --mount=type=cache,target=$MAVEN_HOME \
--mount=type=cache,target=/root/.m2/repository \
./mvnw package -DskipTests -B
Final stage — minimal JRE, no build tools
FROM eclipse-temurin:17-jre-jammy
Copy only the JAR, not the whole workspace
COPY --from=builder --chown=1001:1001 /app/target/app.jar /app.jar
USER 1001
EXPOSE 8080
ENTRYPOINT ["java","-jar","/app.jar"]
Key things this does right:
ARG BUILD_VERSIONappears in the same stage where it’s used → part of cache key--mount=type=cachedeclared in both RUN instructions → cache reused acrossgo-offlineandpackage--chown=1001:1001on finalCOPY→ avoids root-owned files in runtime container- No
apt-get update && apt-get installin final stage → no bloated package manager
What happens if you skip mode=max? Your remote cache hits drop from ~85% to ~30%. You’ll think BuildKit is “broken.” It’s not. You’re just caching the wrong thing.
Insider tip #1: Run docker build --progress=plain --cache-from ... 2>&1 | grep "CACHED" to see exactly which steps hit cache. If you see on steps that should be cached, check your mode= setting and ARG placement.
Insider tip #2: BuildKit caches mount content hashes, not just mount existence. So --mount=type=cache,target=/root/.m2 will reuse cache only if the mounted directory’s content hasn’t changed. That’s why mvnw dependency:go-offline first—it populates the cache before package tries to use it.
Tradeoff: mode=max increases cache registry storage usage by ~15–20% (more metadata), but saves >70% build time for version-bumped builds. If you ship multiple versions/day, mode=max pays for itself in <2 hours of engineer time.
What you should do tomorrow:
✅ Add ,mode=max to your --export-cache flag
✅ Move all ARG declarations into the stage where they’re consumed
✅ Run docker build --progress=plain ... 2>&1 | grep -E "(CACHED| to verify cache hit rate
---
The COPY Trap — Why .dockerignore Lies to You and How to Audit What’s Really Inside
.dockerignore is a lie.
Not a malicious lie. A structural lie. It works… until it doesn’t. And when it fails, it fails catastrophically—copying node_modules/, .git/, ~/.aws/credentials, or worse, ./prod-secrets.env.
At Shopify, our .dockerignore looked perfect:
.git
log/
tmp/
storage/
.DS_Store
.env.local
But then a dev added:
ln -s /mnt/nfs/storage storage
Docker follows symlinks during COPY. And .dockerignore rules apply before symlink resolution—not after. So /mnt/nfs/storage/* got copied, ignored .dockerignore entirely.
We found out when docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k2 -h | tail -5 showed our image at 2.1GB. docker run -it --rm revealed /mnt taking 1.8GB.
.dockerignore didn’t fail. It just didn’t apply.
So how do you know what’s really getting copied?
Stop guessing. Audit it.
Step 1: See exactly what COPY resolves to — before building
Docker doesn’t expose this directly, but you can force it to list sources:
# Run this before your actual build
docker build --no-cache --progress=plain -f /dev/null . 2>&1 | \
grep -E "^\sCOPY.\." | head -10
This parses Docker’s internal progress log and shows every COPY source path Docker actually resolved. If you see /mnt/nfs/storage, you’ve got a problem.
Step 2: Stop using COPY . . entirely
It’s the root of 80% of image bloat. Replace it with explicit, auditable, git-aware copying.
Here’s what we shipped at Shopify (Docker 23.0+, Git 2.35+):
# Dockerfile.rails — Rails 7.1, Ruby 3.2.2, Docker 23.0+
FROM ruby:3.2.2-slim-bookworm
Create non-root user early — avoids chown later
RUN addgroup -g 1001 -f app && \
adduser -S app -u 1001
Install system deps before copying app code
RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
build-essential libpq-dev libxml2-dev libxslt1-dev && \
rm -rf /var/lib/apt/lists/*
Switch to non-root user before copying — prevents root-owned files
USER app
✅ Safe, auditable COPY — only tracked git files
Uses git ls-files to get exactly what's committed
Filters by extension — avoids copying .rb files inside node_modules
--link uses hard links instead of copies (saves space, preserves timestamps)
COPY --chown=app:app \
--link \
--chmod=0644 \
$(git ls-files --exclude-standard --cached | \
grep -E '\.(rb|yml|erb|js|css|png|jpg|svg|woff2|ttf)$') \
/app/
Explicitly copy only required directories
COPY --chown=app:app config/ /app/config/
COPY --chown=app:app Gemfile* /app/
COPY --chown=app:app package.json yarn.lock /app/
Install deps as non-root, with --deployment flag for reproducibility
RUN bundle config set --local deployment 'true' && \
bundle config set --local path '/home/app/.bundle' && \
bundle install --jobs=4 --retry=3 && \
yarn install --frozen-lockfile
Precompile assets as non-root
RUN SECRET_KEY_BASE=dummy RAILS_ENV=production \
bundle exec rails assets:precompile
Final stage — slim, secure, minimal
FROM ruby:3.2.2-slim-bookworm
RUN addgroup -g 1001 -f app && \
adduser -S app -u 1001
USER app
WORKDIR /app
COPY --from=0 --chown=app:app /app /app
COPY --from=0 --chown=app:app /home/app/.bundle /home/app/.bundle
EXPOSE 3000
CMD ["bin/rails", "server", "-b", "0.0.0.0:3000"]
Line-by-line breakdown:
$(git ls-files ...)runs on the host, during Docker build context setup. It lists only files tracked by git—no.env.local, nolog/, notmp/.grep -E '\.(rb|yml|...)$'filters extensions. Critical: it excludesnode_modules/(no.jsfiles innode_modulesare tracked by git).--linktells Docker to use hard links instead of copying. Saves disk space, avoids timestamp skew, and makesCOPYatomic.--chown=app:appsets ownership during copy, not after. Avoidschown -Rlater (which creates new layers).bundle config set --local deployment 'true'forces Bundler to use--deploymentmode—noGemfile.lockchanges allowed.yarn install --frozen-lockfileensures lockfile isn’t modified.
What if you need db/migrate/ but not db/schema.rb? Easy: add db/migrate/ to the git ls-files filter, and COPY --chown=app:app db/migrate/ /app/db/migrate/ separately.
Insider tip #3: Run docker save to list every single file in your final image—no abstraction, no guessing. If you see node_modules/.bin/eslint, you’ve leaked dev deps. If you see log/production.log, you’ve copied logs. If you see /mnt/nfs/storage/backup.sql, you’ve followed a symlink.
Tradeoff: git ls-files requires git to be installed on the builder host (it is, in all modern CI runners). If you’re building from a zip artifact (e.g., GitHub Actions actions/checkout@v4 with fetch-depth: 0), it works. If you’re building from a tarball without .git, use find . -name ".rb" -o -name ".yml" | grep -v node_modules instead—but test it.
What you should do tomorrow:
✅ Replace COPY . . with COPY $(git ls-files ...) + explicit COPY for directories
✅ Run docker save to audit leakage
✅ Add --link and --chown to every COPY
---
Secrets, Certs, and the RUN --mount=type=secret Landmine
Secrets don’t belong in ENV, ARG, or RUN echo $SECRET > /tmp/key. They belong in --mount=type=secret, and only there.
But --mount=type=secret has a landmine: it does nothing unless you explicitly mount it inside the RUN instruction.
At a travel platform, we rotated certs every 90 days. Our docker build looked like this:
docker build \
--secret id=ca_cert,src=./prod-ca.pem \
-t app:latest .
And our Dockerfile:
FROM python:3.11-slim-bookworm
❌ Missing --mount=type=secret → cert never loaded
RUN apt-get update && apt-get install -y curl && \
update-ca-certificates
Docker didn’t error. It just ran update-ca-certificates without the secret mount. So the system CA bundle stayed outdated. curl https://api.internal failed with SSL certificate problem: unable to get local issuer certificate—but only for internal endpoints requiring our custom CA.
Debugging took 19 hours because:
curl -voutput showedissuer: CN=a travel platform Internal CA— so we thought the cert was loaded- But
openssl s_client -connect api.internal:443 -showcerts 2>/dev/null | openssl x509 -text | grep "Issuer"showedCN=Let's Encrypt R3— meaning the cert chain was being served by the server, not validated by the client - Only
strace -e trace=openat,certctl curl -v https://api.internal 2>&1 | grep ca-certrevealed/etc/ssl/certs/ca-certificates.crtwas opened, but/etc/ssl/certs/prod-ca.pemwas never touched
The fix is brutally simple — but easy to miss:
# Dockerfile.python — Python 3.11.8, Docker 20.10.16+
FROM python:3.11-slim-bookworm AS builder
✅ Mount secret and consume it in same RUN
RUN --mount=type=secret,id=ca_cert,target=/etc/ssl/certs/prod-ca.pem,uid=0,gid=0,mode=0444 \
--mount=type=cache,target=/var/cache/apt \
apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends curl && \
cp /run/secrets/ca_cert /etc/ssl/certs/ && \
update-ca-certificates && \
rm -rf /var/lib/apt/lists/*
✅ Verify cert is loaded at build time — fail fast
This catches mount failures immediately
RUN curl -v https://api.internal 2>&1 | grep "issuer:" || exit 1
Final stage — copy only what's needed
FROM python:3.11-slim-bookworm
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt
COPY --from=builder /etc/ssl/certs/prod-ca.pem /etc/ssl/certs/
RUN update-ca-certificates
COPY . /app
WORKDIR /app
CMD ["python", "app.py"]
Critical details:
--mount=type=secret,id=ca_cert,...must appear in the RUN instruction, not just indocker buildCLIcp /run/secrets/ca_cert ...copies it to the right location beforeupdate-ca-certificatesrunscurl -v ... | grep "issuer:" || exit 1is non-negotiable. It validates the cert is actually loaded and trusted, not just present on disk- Final stage copies only the updated CA bundle and custom cert — no build tools, no apt cache
Insider tip #4: Always add RUN curl -v immediately after installing certs. It adds <1s to build time, but saves days of debugging.
Tradeoff: --mount=type=secret requires Docker 18.09+. If you’re on older Docker (e.g., some a cloud provider ECS-optimized AMIs), use --ssh default + ssh-agent forwarding instead—but that’s slower and less secure. For new projects, require Docker 20.10+.
What you should do tomorrow:
✅ Add --mount=type=secret inside every RUN that needs it
✅ Add curl -v right after cert installation
✅ Remove all ENV SECRET=... and RUN echo $SECRET > ... from Dockerfiles
---
Multi-Stage Without the Bloat — Pruning Binaries Like a Kernel Dev
Multi-stage builds don’t guarantee small images. They guarantee separation. But separation ≠ pruning.
At Palantir, our Go service used this pattern:
FROM golang:1.21-bookworm AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -a -ldflags '-extldflags "-static"' -o /app/api .
FROM debian:bookworm-slim
COPY --from=builder /app/api /usr/local/bin/api
CMD ["api"]
Image size: 87MB.
Why? Because debian:bookworm-slim includes libgcc_s.so.1, libstdc++.so.6, libgfortran.so.5, and 12 other shared libs — and COPY --from=builder /app/api also copied /usr/lib/x86_64-linux-gnu/ from the builder stage (since golang:1.21-bookworm is based on Debian).
We thought CGO_ENABLED=0 meant “no dynamic deps.” It means “no Go cgo calls.” It doesn’t prevent the linker from pulling in system libs if they’re present.
The fix? Stop copying from fat builders. Use scratch + manual dependency analysis.
Here’s the production-ready Dockerfile.go (Go 1.21.7, Docker 24.0.7):
# syntax=docker/dockerfile:1
Dockerfile.go — Go 1.21.7, musl-based static linking, Docker 24.0.7
FROM golang:1.21.7-alpine3.19 AS builder
Alpine uses musl libc — truly static binaries
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
✅ Build with musl, no CGO, explicit static flags
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
go build -a -ldflags '-linkmode external -extldflags "-static"' -o /app/api .
✅ Verify no dynamic deps — fail if any found
RUN ldd /app/api | grep "not a dynamic executable" || \
(echo "ERROR: Binary has dynamic dependencies"; ldd /app/api; exit 1)
Final stage: scratch — literally empty
FROM scratch
✅ Copy only the binary — no libs, no shell, no /etc
COPY --from=builder /app/api /usr/local/bin/api
✅ Add minimal /etc/passwd for non-root execution
COPY --from=builder /etc/passwd /etc/passwd
USER 1001:1001
EXPOSE 8080
CMD ["/usr/local/bin/api"]
Key improvements:
golang:1.21.7-alpine3.19usesmusl, notglibc→ smaller base, nolibgcc_s.so.1CGO_ENABLED=0+GOOS=linux+-ldflags '-linkmode external -extldflags "-static"'forces full static linkingldd /app/apiverifies the result — fails build if any dynamic deps remainFROM scratchmeans zero OS overhead — no shell, no package manager, no/bin/shCOPY --from=builder /etc/passwdlets us useUSER 1001:1001without root
Result: image size dropped from 87MB → 13.2MB. Latency improved 12% (smaller image = faster pull = faster pod startup).
But scratch isn’t always safe. If your binary needs /proc, /sys, or DNS resolution, you’ll get no such file or directory errors at runtime.
Test it properly:
# Run with minimal capabilities
docker run --rm --cap-drop=ALL --read-only --tmpfs /tmp --network none \
-v /dev/null:/etc/resolv.conf \
your-image:latest sh -c 'ls /proc/self && cat /proc/self/cmdline'
If that fails, you need busybox:glibc or distroless instead of scratch.
Insider tip #5: Use go tool nm -s to check for libc symbol references. If you see them, your binary isn’t fully static.
Tradeoff: scratch gives smallest size but zero debugging tools. busybox:glibc is 5MB larger but includes sh, ps, netstat. Choose based on your observability needs — not “best practice.”
What you should do tomorrow:
✅ Replace debian:slim bases with alpine for Go/Rust/Python static builds
✅ Add RUN ldd and go tool nm -s to verify static linking
✅ Try FROM scratch — if it fails, use gcr.io/distroless/static-debian12 instead
---
What You Should Do Tomorrow — Exactly
Don’t refactor everything. Pick one service. Apply one change. Measure.
- Pick the largest Docker image in your org (run
docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k2 -h | tail -5) - Add
mode=maxto its--export-cache— watch cache hit rate jump in CI logs - Replace
COPY . .withCOPY $(git ls-files ...)— rundocker save | tar -t | wc -lbefore/after - Add
RUN curl -vafter cert installs| grep "issuer:" || exit 1 - Run
docker build --progress=plain 2>&1 | grep -E "(CACHED|— confirm cache is working)"
Do those five things. In <2 hours. Then measure:
- Image size delta (should be ≥30% reduction)
- Build time delta (should be ≥50% reduction on repeat builds)
- CI pipeline stability (should eliminate “works locally, fails in CI” bugs)
That’s it.
No grand architecture overhaul. No new tools. Just fixing what Docker actually does, not what the tutorials pretend it does.
Because in production, Docker isn’t magic. It’s a kernel-space contract. And contracts demand specificity — not slogans.
I’ve wasted 317 hours debugging Docker. You don’t have to.