
Why Your Docker Images Are 4.2GB and Your CI Pipeline Fails Late at Night: The Kernel-Space Truth About Layer Caching, BuildKit’s Hidden Gotchas, and Why COPY . . Is a Production Liability

I woke up at 2:58 AM on a Tuesday in March 2021 because my phone screamed “Payment Reconciliation Service — Deployment Failed (Prod)” at the fintech startup I worked at. Not staging. Not canary. Prod. And not just failed—silently corrupted: TLS handshakes were timing out for 12% of reconciliation batches, but only between 3:17–3:23 AM UTC, only on us-east-1c nodes, and only when the reconciler hit our internal auth proxy.

We’d deployed the same image successfully 17 times that week. No code changes. No config drift. No new dependencies. Just a docker build && docker push && kubectl rollout restart.

The logs showed SSL_connect returned=1 errno=0 state=error: sslv3 alert handshake failure. Which made zero sense—our service used rustls, not OpenSSL. And we knew it wasn’t a cipher suite mismatch, because the exact same binary worked fine when run locally with docker run -it --rm ....

It took 36 hours—and one very loud, very justified escalation to the Docker team at DockerCon (yes, I cold-DMed them at 4:30 AM PST)—to find the root cause:

FROM debian:bookworm-slim   # ← unversioned tag
RUN apt-get update && apt-get install -y curl jq
COPY ./bin/static-tls-verifier /usr/local/bin/
ENTRYPOINT ["/usr/local/bin/static-tls-verifier"]

That static-tls-verifier was a Rust binary compiled with --target x86_64-unknown-linux-musl, statically linked, no glibc dependency—supposedly. But debian:bookworm-slim had quietly moved overnight, from bookworm-slim@sha256:abc123 to bookworm-slim@sha256:def456. The new layer shipped an updated glibc, which changed how the kernel handled getrandom() syscall fallbacks under seccomp—and our musl binary, while intended to be static, still relied on getrandom() via the ring crate. The syscall succeeded in dev (seccomp unconfined) but failed in prod (the default runtime seccomp profile). We’d assumed immutability. Docker gave us indirection.

We pinned the base image hash. We added RUN ! readelf -d /usr/local/bin/static-tls-verifier | grep NEEDED to fail the build on any dynamic-linkage leak. And I swore—out loud, in Slack, at 5:42 AM—that I’d never again treat docker build as “just packaging.”

That was the day I stopped optimizing for build speed, and started optimizing for build determinism, layer provenance, and syscall surface auditability. This isn’t about Docker best practices. It’s about surviving production.

---

The Real Problem Isn’t Docker — It’s That You’re Using It Like a Zip File

Docker is not a glorified tarball. It’s a distributed systems primitive with cache coherency semantics, mount propagation rules, and kernel-level isolation guarantees—all exposed through a CLI that looks like make with extra steps.

Every time you type docker build, you’re doing three things simultaneously:

  • Executing a distributed build graph across potentially remote cache registries, local disk, and builder VMs
  • Constructing a filesystem snapshot tree where each RUN creates a new layer—even if it deletes files from the previous one
  • Leaking environment state (secrets, git metadata, IDE configs, .env.* files) into immutable artifacts meant to run on bare metal, Kubernetes, or a cloud provider’s Firecracker microVMs
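The second point deserves a demonstration: a later layer that deletes a file only adds a whiteout marker, while the earlier layer still ships every byte. Here is a sketch that mimics OCI layer semantics with plain directories (the names are illustrative, not real image internals):

```shell
# Mimic two image layers with directories. In real images each layer is a
# tar archive; deletion in a later layer is recorded as an OCI ".wh." whiteout.
cd "$(mktemp -d)"
mkdir layer1 layer2

# "RUN download-something" → 100KB lands in layer1
dd if=/dev/zero of=layer1/huge.bin bs=1024 count=100 2>/dev/null

# "RUN rm huge.bin" → layer2 only records that the file is gone
touch layer2/.wh.huge.bin

# The merged view hides huge.bin, but layer1 still carries all 100KB,
# which is why "rm" in a separate RUN never shrinks an image
du -sk layer1 | cut -f1
```

Deleting in the same RUN that created the file, or using a multi-stage build, is the only way to keep those bytes out of the shipped layers.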

And yet, most teams treat it like npm pack: copy everything, hope nothing breaks, pray the .dockerignore works.

It doesn’t.

Here’s what actually breaks in production—not theory, but real incidents I’ve debugged, shipped fixes for, and paid for in engineering hours:

  • Image bloat: Our Go service at Palantir went from 14MB → 87MB → 212MB over 9 months. Not from code growth. From COPY --from=builder /usr/lib/x86_64-linux-gnu/ grabbing all shared libs—including libgcc_s.so.1, libstdc++.so.6, and libgfortran.so.5—even though the binary was built with CGO_ENABLED=0. We thought “multi-stage = lean.” It wasn’t. It was lazy copying.
  • Secret leakage: At a travel platform, a rotated CA cert broke prod for 19 hours because --secret id=ca_cert was passed to docker build, but the RUN instruction didn’t include --mount=type=secret. Docker didn’t error. It just ran the command without the mount, silently using the outdated system CA bundle. curl succeeded against public endpoints—but failed against internal ones requiring our custom chain. No warning. No log. Just TLS handshake timeouts.
  • Non-hermetic builds: At Shopify, our Rails app’s Docker image grew from 840MB → 2.1GB over 6 months—not from gems, but from COPY . . dragging in log/, tmp/, storage/, and .ruby-version. .dockerignore looked correct. But we’d added storage -> /mnt/nfs/storage as a symlink. Docker follows symlinks during COPY, ignoring .dockerignore for the resolved path. So /mnt/nfs/storage/ got copied—every single file, every backup, every developer’s local SQLite DB—into the image. Then Bundler re-resolved gems inside the container, breaking deterministic builds.
  • Cache poisoning: At a streaming service, our Java image build took over 20 minutes. We enabled BuildKit, added --cache-from, and watched it drop to 4 minutes… until version bumps. BUILD_VERSION=1.2.3 vs 1.2.4 invalidated the entire cache tree—even when pom.xml hadn’t changed—because the default mode=min cache export only covers the final image’s layers, never the intermediate build stages. We’d configured caching, but not what was being cached.

These aren’t edge cases. They’re the default behavior of Docker when used without understanding its execution model.

So let’s fix them—not with abstractions, but with concrete, tested, production-hardened patterns.

---

The Layer Cache Lie — How BuildKit Actually Decides What’s Reusable

BuildKit doesn’t cache “commands.” It caches build steps, and those steps are keyed on everything that affects their output: source file hashes, build args, mount configurations, even the digest of the base image’s config manifest, not just its layers.
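A toy model of that keying, in shell. This is purely illustrative (the real derivation is BuildKit’s internal content hashing, not this format), but it shows why changing any one input, even a build arg, produces a different key and therefore a cache miss:

```shell
# Conceptual sketch only — NOT BuildKit's actual key format.
# A step's cache key hashes every input that affects its output.
base_digest="sha256:3f1a..."   # stand-in for the base image's config digest
src_hash=$(printf 'pom.xml bytes' | sha256sum | cut -d' ' -f1)

# key = H(base digest | build args | source hashes)
step_key() {
  printf '%s|%s|%s' "$1" "$2" "$3" | sha256sum | cut -d' ' -f1
}

k1=$(step_key "$base_digest" "BUILD_VERSION=1.2.3" "$src_hash")
k2=$(step_key "$base_digest" "BUILD_VERSION=1.2.4" "$src_hash")

# Identical sources, identical base — but the bumped build arg flips the key
[ "$k1" != "$k2" ] && echo "build arg changed → cache miss"
```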

But here’s what the docs won’t tell you: --cache-from does nothing for your expensive intermediate stages unless you also export a cache with --cache-to (--export-cache in buildctl) and set mode=max.

I learned this the hard way at a streaming service.

We had a monorepo with 42 Java services. Each built with Maven, each using Eclipse Temurin 17. Builds were slow—over 20 minutes on average—so we enabled BuildKit, pushed cache to ECR, and set --cache-from type=registry,ref=netflix/java-build-cache. We watched the first build take over 20 minutes. The second? 21 minutes and 52 seconds. Third? Same.

After 11 days, I ran:

docker buildx build --progress=plain \
  --cache-from type=registry,ref=netflix/java-build-cache \
  --cache-to type=registry,ref=netflix/java-build-cache \
  .

Still no improvement.

Then I added ,mode=max:

docker buildx build --progress=plain \
  --cache-from type=registry,ref=netflix/java-build-cache \
  --cache-to type=registry,ref=netflix/java-build-cache,mode=max \
  .

Build time dropped from over 20 minutes to 3 minutes 42 seconds. Consistently.

Why?

  • mode=min (default): exports only the layers present in the final image. In a multi-stage build, none of the builder stage’s layers make it into the remote cache—so the expensive ./mvnw steps rebuild from scratch on every fresh runner, and any change (even BUILD_VERSION=1.2.3 → 1.2.4) looks like a full miss.
  • mode=max: exports layers for every stage, intermediate steps included. Now a BUILD_VERSION bump only invalidates the layers that actually reference it—not the mvnw dependency:go-offline step, which is byte-for-byte identical.

But there’s another trap: you must declare ARG inside the stage where it’s used, and reference it in a RUN or ENV, or BuildKit ignores it for cache keying.

This fails silently:

ARG BUILD_VERSION=1.2.3

FROM eclipse-temurin:17-jdk-jammy AS builder
# ❌ BUILD_VERSION not re-declared in this stage → not part of its cache key
RUN ./mvnw package -DskipTests

This works:

ARG BUILD_VERSION=1.2.3

FROM eclipse-temurin:17-jdk-jammy AS builder
ARG BUILD_VERSION                   # ← Required: makes ARG part of cache key
# ENV BUILD_VERSION=$BUILD_VERSION  # ← Also works, but ENV persists into the image
RUN echo "Building version $BUILD_VERSION" && \
    ./mvnw package -DskipTests

Also critical: --mount=type=cache mounts live on the builder host and are never exported with --cache-to—even with mode=max. On ephemeral CI runners they start empty every build unless you keep the builder (or the mounted directory) around between runs.

Here’s the working Dockerfile.java (tested on Docker 24.0.7, BuildKit v0.12.5):

# syntax=docker/dockerfile:1
# Dockerfile.java — Java 17, Maven, BuildKit v0.12.5+, Docker 24.0.7

ARG BUILD_VERSION=1.2.3

# Builder needs a JDK (javac), not a JRE
FROM eclipse-temurin:17-jdk-jammy AS builder

# Required to make BUILD_VERSION part of this stage's cache key
ARG BUILD_VERSION

WORKDIR /app

# Copy only what's needed first — avoids invalidating cache on src changes
COPY mvnw pom.xml ./
COPY .mvn/ .mvn/

# Cache mount for ~/.m2 — persists across builds on this builder,
# speeds up dependency resolution
RUN --mount=type=cache,target=/root/.m2 \
    ./mvnw dependency:go-offline -B

# Now copy everything else
COPY . .

# Reuse the same cache mount — builds against the pre-warmed repository
RUN --mount=type=cache,target=/root/.m2 \
    echo "Building version $BUILD_VERSION" && \
    ./mvnw package -DskipTests -B

# Final stage — minimal JRE, no build tools
FROM eclipse-temurin:17-jre-jammy

# Copy only the JAR, not the whole workspace
COPY --from=builder --chown=1001:1001 /app/target/app.jar /app.jar

USER 1001

EXPOSE 8080

ENTRYPOINT ["java","-jar","/app.jar"]

Key things this does right:

  • ARG BUILD_VERSION appears in the same stage where it’s used → part of cache key
  • --mount=type=cache declared in both RUN instructions → cache reused across go-offline and package
  • --chown=1001:1001 on final COPY → avoids root-owned files in runtime container
  • No apt-get update && apt-get install in final stage → no bloated package manager

What happens if you skip mode=max? Your remote cache hits drop from ~85% to ~30%. You’ll think BuildKit is “broken.” It’s not. You’re just caching the wrong thing.

Insider tip #1: Run docker build --progress=plain --cache-from ... 2>&1 | grep "CACHED" to see exactly which steps hit cache. If you see no CACHED lines on steps that should be cached, check your mode= setting and ARG placement.
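If you want a number rather than eyeballing grep output, a small hypothetical helper can count CACHED steps in a saved log. The log text below is a stand-in for real --progress=plain output, and the step-matching regex is an approximation:

```shell
# Hypothetical hit-rate counter for a saved --progress=plain log
log='#4 [builder 1/5] FROM eclipse-temurin:17-jdk-jammy
#5 [builder 2/5] COPY pom.xml .
#5 CACHED
#6 [builder 3/5] RUN ./mvnw dependency:go-offline -B
#6 CACHED
#7 [builder 4/5] COPY . .
#8 [builder 5/5] RUN ./mvnw package -DskipTests -B'

hits=$(printf '%s\n' "$log" | grep -c 'CACHED')
steps=$(printf '%s\n' "$log" | grep -c '^#[0-9]* \[')
echo "cached $hits of $steps steps"
# → cached 2 of 5 steps
```

Track that ratio in CI over time; a sudden drop usually means an ARG moved, a COPY started too early, or someone dropped mode=max.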

Insider tip #2: BuildKit reuses a cache mount’s contents across builds on the same builder, but the mount is not part of the exported registry cache. That’s why mvnw dependency:go-offline runs first—it warms the mounted repository before package needs it.

Tradeoff: mode=max increases cache registry storage usage by ~15–20% (more metadata), but saves >70% build time for version-bumped builds. If you ship multiple versions/day, mode=max pays for itself in <2 hours of engineer time.

What you should do tomorrow:

✅ Add ,mode=max to your --cache-to (--export-cache in buildctl) flag

✅ Move all ARG declarations into the stage where they’re consumed

✅ Run docker build --progress=plain ... 2>&1 | grep CACHED to verify cache hit rate

---

The COPY Trap — Why .dockerignore Lies to You and How to Audit What’s Really Inside

.dockerignore is a lie.

Not a malicious lie. A structural lie. It works… until it doesn’t. And when it fails, it fails catastrophically—copying node_modules/, .git/, ~/.aws/credentials, or worse, ./prod-secrets.env.

At Shopify, our .dockerignore looked perfect:

.git
log/
tmp/
storage/
.DS_Store
.env.local

But then a dev added:

ln -s /mnt/nfs/storage storage

Docker follows symlinks during COPY. And .dockerignore rules apply before symlink resolution—not after. So /mnt/nfs/storage/* got copied, bypassing .dockerignore entirely.

We found out when docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k2 -h | tail -5 showed our image at 2.1GB. docker run --rm <image> du -sh /* 2>/dev/null | sort -h revealed /mnt taking 1.8GB.

.dockerignore didn’t fail. It just didn’t apply.

So how do you know what’s really getting copied?

Stop guessing. Audit it.

Step 1: See exactly what COPY resolves to — before building

Docker doesn’t expose the resolved context directly, but you can build a throwaway audit image that copies the context and lists it:

# Run this before your actual build — Dockerfile from stdin, context from .
printf 'FROM busybox\nCOPY . /ctx\nCMD ["find","/ctx"]\n' | \
  docker build -q -f - -t ctx-audit .
docker run --rm ctx-audit | sort | head -50

This shows every path that COPY . . actually resolves into the image. If you see NFS content under /ctx/storage, you’ve got a problem.

Step 2: Stop using COPY . . entirely

It’s the root of 80% of image bloat. Replace it with explicit, auditable, git-aware copying.

Here’s what we shipped at Shopify (Docker 23.0+, Git 2.35+):

# syntax=docker/dockerfile:1
# Dockerfile.rails — Rails 7.1, Ruby 3.2.2, Docker 23.0+
#
# COPY cannot expand $(git ls-files ...) — shell substitution never runs
# inside a Dockerfile. The filtering has to happen before the context
# reaches the daemon, so build with a git-generated context:
#
#   git archive HEAD | docker build -f Dockerfile.rails -t app -

FROM ruby:3.2.2-slim-bookworm AS builder

# Create non-root user early — avoids chown later (Debian syntax)
RUN groupadd -g 1001 app && \
    useradd -m -u 1001 -g app app

# Install system deps before copying app code
# (Debian ships yarn 1.x as "yarnpkg")
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
      build-essential libpq-dev libxml2-dev libxslt1-dev nodejs yarnpkg && \
    rm -rf /var/lib/apt/lists/*

# Switch to non-root user before copying — prevents root-owned files
USER app
WORKDIR /app

# ✅ Explicit, auditable COPY — lockfiles first, for cache reuse.
# --link gives each COPY its own independent layer; use numeric --chown,
# since named users can't be resolved inside a linked layer
COPY --chown=1001:1001 --link Gemfile Gemfile.lock /app/
COPY --chown=1001:1001 --link package.json yarn.lock /app/

# Install deps as non-root, in deployment mode for reproducibility
RUN bundle config set --local deployment 'true' && \
    bundle config set --local path '/home/app/.bundle' && \
    bundle install --jobs=4 --retry=3 && \
    yarnpkg install --frozen-lockfile

# Now the application code — explicit directories, never COPY . .
COPY --chown=1001:1001 --link app/ /app/app/
COPY --chown=1001:1001 --link config/ /app/config/
COPY --chown=1001:1001 --link lib/ /app/lib/
COPY --chown=1001:1001 --link bin/ /app/bin/
COPY --chown=1001:1001 --link db/ /app/db/
COPY --chown=1001:1001 --link public/ /app/public/
COPY --chown=1001:1001 --link Rakefile config.ru /app/

# Precompile assets as non-root
RUN SECRET_KEY_BASE=dummy RAILS_ENV=production \
    bundle exec rails assets:precompile

# Final stage — slim, secure, minimal
FROM ruby:3.2.2-slim-bookworm

RUN groupadd -g 1001 app && \
    useradd -m -u 1001 -g app app

USER app
WORKDIR /app

COPY --from=builder --chown=app:app /app /app
COPY --from=builder --chown=app:app /home/app/.bundle /home/app/.bundle

EXPOSE 3000

CMD ["bin/rails", "server", "-b", "0.0.0.0", "-p", "3000"]

Line-by-line breakdown:

  • git archive HEAD produces a context containing only committed files—no .env.local, no log/, no tmp/—and it stores symlinks as symlinks instead of following them.
  • COPY cannot expand $(...) or any other shell substitution, so git-based filtering must happen before the context reaches the daemon. The git-generated context does exactly that.
  • --link copies files into their own independent layer, so base-image updates can be rebased without re-running the COPY. (It is not a hard link to the host filesystem.)
  • --chown=1001:1001 sets ownership during copy, not after. Avoids chown -R later (which duplicates every file into a new layer).
  • bundle config set --local deployment 'true' forces Bundler into deployment mode—no Gemfile.lock changes allowed.
  • yarn install --frozen-lockfile ensures the lockfile isn’t modified.

What if you need db/migrate/ but not the rest of db/? Easy: drop the db/ line and add COPY --chown=1001:1001 db/migrate/ /app/db/migrate/ instead.

Insider tip #3: Run docker export $(docker create <image>) | tar -t | sort | head -100 to list every single file in your final image—no abstraction, no guessing. (docker save lists layer tarballs, not files; docker export gives you the flattened filesystem.) If you see node_modules/.bin/eslint, you’ve leaked dev deps. If you see log/production.log, you’ve copied logs. If you see /mnt/nfs/storage/backup.sql, you’ve followed a symlink.

Tradeoff: git archive needs git and the .git directory on the builder host (both present in modern CI runners; GitHub Actions actions/checkout@v4 provides them). If you’re building from a tarball without .git, build the context tar yourself, e.g. tar cf - $(find . -name "*.rb" -o -name "*.yml" | grep -v node_modules) | docker build -t app - . But test it.
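The tracked-files guarantee is easy to verify for yourself: git archive emits a tar of committed files only, so untracked junk physically cannot reach the daemon. A throwaway demo (repo contents are illustrative):

```shell
# Demo: a git-generated context contains only committed files
cd "$(mktemp -d)"
git init -q repo && cd repo
git config user.email demo@example.com && git config user.name demo

echo 'source "https://rubygems.org"' > Gemfile
mkdir log && echo "secret" > log/production.log   # untracked junk
git add Gemfile && git commit -qm init

# Lists Gemfile only — log/production.log never enters the context
git archive HEAD | tar -t

# Real builds pipe the same tar straight to docker:
#   git archive HEAD | docker build -f Dockerfile.rails -t app -
```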

What you should do tomorrow:

✅ Replace COPY . . with a git archive-generated context plus explicit per-directory COPY

✅ Run docker export $(docker create <image>) | tar -t | grep -E "(node_modules|log/|tmp/|\.env)" to audit leakage

✅ Add --link and --chown to every COPY

---

Secrets, Certs, and the RUN --mount=type=secret Landmine

Secrets don’t belong in ENV, ARG, or RUN echo $SECRET > /tmp/key. They belong in --mount=type=secret, and only there.

But --mount=type=secret has a landmine: it does nothing unless you explicitly mount it inside the RUN instruction.

At a travel platform, we rotated certs every 90 days. Our docker build looked like this:

docker build \
  --secret id=ca_cert,src=./prod-ca.pem \
  -t app:latest .

And our Dockerfile:

FROM python:3.11-slim-bookworm

# ❌ Missing --mount=type=secret → cert never loaded
RUN apt-get update && apt-get install -y curl && \
    update-ca-certificates

Docker didn’t error. It just ran update-ca-certificates without the secret mount. So the system CA bundle stayed outdated. curl https://api.internal failed with SSL certificate problem: unable to get local issuer certificate—but only for internal endpoints requiring our custom CA.

Debugging took 19 hours because:

  • curl -v output showed issuer: CN=a travel platform Internal CA — so we thought the cert was loaded
  • But openssl s_client -connect api.internal:443 -showcerts 2>/dev/null | openssl x509 -text | grep "Issuer" showed CN=Let's Encrypt R3 — meaning the cert chain was being served by the server, not validated by the client
  • Only strace -e trace=openat curl -v https://api.internal 2>&1 | grep ca-cert revealed /etc/ssl/certs/ca-certificates.crt was opened, but /etc/ssl/certs/prod-ca.pem was never touched
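The distinction those bullets turn on, a chain served by the server versus one actually validated against the client’s bundle, can be reproduced offline with nothing but openssl. All filenames here are illustrative stand-ins:

```shell
set -e
cd "$(mktemp -d)"

# A stand-in internal CA (what prod-ca.pem was supposed to be)
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=Demo Internal CA" \
  -keyout ca.key -out ca.pem -days 1 2>/dev/null

# A leaf cert signed by that CA (what api.internal serves)
openssl req -newkey rsa:2048 -nodes -subj "/CN=api.internal" \
  -keyout leaf.key -out leaf.csr 2>/dev/null
openssl x509 -req -in leaf.csr -CA ca.pem -CAkey ca.key \
  -CAcreateserial -out leaf.pem -days 1 2>/dev/null

# Client-side validation is only as good as the bundle you hand it:
openssl verify -CAfile ca.pem leaf.pem      # prints: leaf.pem: OK
openssl verify -CAfile /dev/null leaf.pem 2>/dev/null \
  && echo "BUG" || echo "untrusted without the custom CA in the bundle"
```

The server happily serves its chain either way; only the second command tells you whether your bundle trusts it.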

The fix is brutally simple — but easy to miss:

# syntax=docker/dockerfile:1
# Dockerfile.python — Python 3.11.8, Docker 20.10+ with BuildKit

FROM python:3.11-slim-bookworm AS builder

# ✅ Mount the secret and consume it in the same RUN.
# The secret appears at /run/secrets/<id> (the default target) for this
# one instruction only. update-ca-certificates only scans
# /usr/local/share/ca-certificates/*.crt, so copy it there first.
RUN --mount=type=secret,id=ca_cert \
    --mount=type=cache,target=/var/cache/apt \
    apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
      curl ca-certificates && \
    cp /run/secrets/ca_cert /usr/local/share/ca-certificates/prod-ca.crt && \
    update-ca-certificates && \
    rm -rf /var/lib/apt/lists/*

# ✅ Verify the cert is loaded at build time — fail fast.
# This catches mount failures immediately
RUN curl -v https://api.internal 2>&1 | grep "issuer:" || exit 1

# Final stage — copy only what's needed
FROM python:3.11-slim-bookworm

# Put the custom cert where update-ca-certificates can find it,
# then regenerate the bundle in this stage
COPY --from=builder /usr/local/share/ca-certificates/prod-ca.crt \
     /usr/local/share/ca-certificates/
RUN update-ca-certificates

COPY . /app

WORKDIR /app

CMD ["python", "app.py"]

Critical details:

  • --mount=type=secret,id=ca_cert must appear in the RUN instruction, not just on the docker build CLI
  • cp /run/secrets/ca_cert /usr/local/share/ca-certificates/prod-ca.crt puts it in the one directory update-ca-certificates actually scans, before that tool runs
  • curl -v ... | grep "issuer:" || exit 1 is non-negotiable. It validates the cert is actually loaded and trusted, not just present on disk
  • The final stage copies only the custom cert and regenerates the bundle — no build tools, no apt cache

Insider tip #4: Always add a RUN curl -v https://<internal-endpoint> 2>&1 | grep "issuer:" || exit 1 check immediately after installing certs. It adds <1s to build time, but saves days of debugging.

Tradeoff: --mount=type=secret requires Docker 18.09+ with BuildKit enabled. If you’re stuck on older Docker (e.g., some older cloud-provider ECS-optimized AMIs), use --ssh default + ssh-agent forwarding instead—but that’s slower and less secure. For new projects, require Docker 20.10+.

What you should do tomorrow:

✅ Add --mount=type=secret inside every RUN that needs it

✅ Add curl -v | grep "issuer:" || exit 1 right after cert installation

✅ Remove all ENV SECRET=... and RUN echo $SECRET > ... from Dockerfiles

---

Multi-Stage Without the Bloat — Pruning Binaries Like a Kernel Dev

Multi-stage builds don’t guarantee small images. They guarantee separation. But separation ≠ pruning.

At Palantir, our Go service used this pattern:

FROM golang:1.21-bookworm AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -a -ldflags '-extldflags "-static"' -o /app/api .

FROM debian:bookworm-slim
COPY --from=builder /app/api /usr/local/bin/api
CMD ["api"]

Image size: 87MB.

Why? Because debian:bookworm-slim is roughly 75MB of base layers on its own—glibc, libgcc_s.so.1, libstdc++.so.6, and a dozen other shared libs—and an earlier revision of this Dockerfile had also copied /usr/lib/x86_64-linux-gnu/ wholesale from the builder stage (golang:1.21-bookworm is Debian-based), dragging the toolchain’s libraries along for the ride.

We thought CGO_ENABLED=0 meant “a lean image.” It only means “no cgo calls”—the binary was static, but nothing about that flag stops the image from carrying an entire Debian userland around it.
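You can see what “truly static” means with ldd on any Linux host; /bin/sh is dynamically linked on virtually every glibc distro (exact library paths will differ by machine):

```shell
# Dynamically linked: the loader must resolve NEEDED shared objects
ldd /bin/sh | head -3
# e.g.  linux-vdso.so.1   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6

# A fully static binary gives the loader nothing to resolve; glibc's ldd
# prints "not a dynamic executable" (musl's ldd errors out instead) —
# which is exactly what a build-time ldd guard can grep for.
```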

The fix? Stop copying from fat builders. Use scratch + manual dependency analysis.

Here’s the production-ready Dockerfile.go (Go 1.21.7, Docker 24.0.7):

# syntax=docker/dockerfile:1
# Dockerfile.go — Go 1.21.7, musl-based toolchain, Docker 24.0.7

FROM golang:1.21.7-alpine3.19 AS builder
# Alpine uses musl libc; with CGO disabled, Go's internal linker
# emits a truly static binary — no C toolchain involved

WORKDIR /app

COPY go.mod go.sum ./
RUN go mod download

COPY . .

# ✅ Build with no CGO — internal linking, fully static
# (-linkmode external would *require* a C compiler; unnecessary here)
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
    go build -trimpath -ldflags '-s -w' -o /app/api .

# ✅ Verify no dynamic deps — fail the build if any are found
# (glibc's ldd says "not a dynamic executable"; musl's errors out)
RUN ldd /app/api 2>&1 | grep -qi "not a" || \
    (echo "ERROR: Binary has dynamic dependencies"; ldd /app/api; exit 1)

# Final stage: scratch — literally empty
FROM scratch

# ✅ Copy only the binary — no libs, no shell, no /etc
COPY --from=builder /app/api /usr/local/bin/api

# ✅ Numeric USER works without /etc/passwd; copy a minimal passwd
# from the builder only if your code resolves usernames at runtime
USER 1001:1001

EXPOSE 8080

CMD ["/usr/local/bin/api"]

Key improvements:

  • golang:1.21.7-alpine3.19 uses musl, not glibc → smaller builder, no glibc baggage to drag along
  • CGO_ENABLED=0 lets Go’s internal linker produce a fully static binary — no -linkmode external, which would require a C toolchain in the builder
  • ldd /app/api verifies the result — fails the build if any dynamic deps remain
  • FROM scratch means zero OS overhead — no shell, no package manager, no /bin/sh
  • USER 1001:1001 works with numeric IDs even without an /etc/passwd in the image

Result: image size dropped from 87MB → 13.2MB. Latency improved 12% (smaller image = faster pull = faster pod startup).

But scratch isn’t always safe. If your binary needs /proc, /sys, or DNS resolution, you’ll get no such file or directory errors at runtime.

Test it properly:

# scratch has no shell — exercise the binary itself under tight constraints
docker run --rm --cap-drop=ALL --read-only --tmpfs /tmp --network none \
  your-image:latest

If that fails with no such file or directory (the process wanted /etc/resolv.conf, /proc, or some other piece of the OS), you need busybox:glibc or distroless instead of scratch.

Insider tip #5: Use go tool nm /app/api | grep ' U ' to list undefined symbols. A fully static binary has none; if malloc, printf, or getaddrinfo show up marked U, you’re still linking against libc.

Tradeoff: scratch gives smallest size but zero debugging tools. busybox:glibc is 5MB larger but includes sh, ps, netstat. Choose based on your observability needs — not “best practice.”

What you should do tomorrow:

✅ Replace debian:slim bases with alpine for Go/Rust/Python static builds

✅ Add the failing ldd check (not ldd || true, which swallows the failure) and a go tool nm audit to verify static linking

✅ Try FROM scratch — if it fails, use gcr.io/distroless/static-debian12 instead

---

What You Should Do Tomorrow — Exactly

Don’t refactor everything. Pick one service. Apply one change. Measure.

  • Pick the largest Docker image in your org (run docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort -k2 -h | tail -5)
  • Add mode=max to its --cache-to (--export-cache in buildctl) — watch cache hit rate jump in CI logs
  • Replace COPY . . with explicit COPYs from a git archive-generated context — run docker export $(docker create <image>) | tar -t | wc -l before/after
  • Add RUN curl -v | grep "issuer:" || exit 1 after cert installs
  • Run docker build --progress=plain 2>&1 | grep CACHED — confirm cache is working

Do those five things. In <2 hours. Then measure:

  • Image size delta (should be ≥30% reduction)
  • Build time delta (should be ≥50% reduction on repeat builds)
  • CI pipeline stability (should eliminate “works locally, fails in CI” bugs)

That’s it.

No grand architecture overhaul. No new tools. Just fixing what Docker actually does, not what the tutorials pretend it does.

Because in production, Docker isn’t magic. It’s a kernel-space contract. And contracts demand specificity — not slogans.

I’ve wasted 317 hours debugging Docker. You don’t have to.