Docker Image Builds Keep Breaking Your CI? Here’s How I Fixed That

The Problem: Your CI Pipeline Takes 12 Minutes to Build a 2GB Image

A 2GB image that takes 12 minutes to build on every push isn’t just annoying; it’s a compounding tax on your entire team. Every feature branch, every hotfix, every “just a one-liner” commit burns 12 minutes of CI time. With 10 developers pushing 5 times a day, that’s 50 builds and 600 CI minutes, ten hours of runner time, every single day. The money is one thing. The context-switching cost of waiting is worse.

The failure modes I see repeatedly across codebases look like this: a COPY . . on line 4 that nukes every downstream cache layer on every commit, an npm install that re-downloads the entire node_modules tree because nobody copied the lockfile before the rest of the source, and a final image that’s 1.9GB because someone apt-get installed build tools and never cleaned up or switched to a multi-stage build. Then there’s the classic production break: the image builds fine locally on an M2 Mac but silently ships linux/arm64 binaries into an x86_64 ECS cluster. No error during build, pure chaos at runtime.

This guide covers the practical patterns I’ve landed on after debugging CI pipelines across GitHub Actions, GitLab CI, and Jenkins. Specifically: layer ordering for maximum cache reuse, registry-based caching strategies that actually survive across runners, multi-stage builds that cut image size by 60–80%, BuildKit flags that make a measurable difference, and platform targeting that prevents the ARM/AMD64 mismatch problem from biting you. I’ll include real Dockerfile snippets and CI config fragments — not pseudocode.

One thing I’m not covering: how Docker works, what a layer is, or why you should use .dockerignore. If you need that, there are better resources. I’m assuming you’re already running Docker builds in CI and the pain is real enough that you’re here looking for specific fixes, not a primer. I’m also assuming you have some control over your Dockerfile — if you’re building from a locked-down base image someone else owns, some of this won’t apply directly, but the caching and registry sections still will.

Always Enable BuildKit — It’s Not Default on Older Setups

The thing that caught me off guard early on was how many CI environments quietly use the legacy builder. You push a “fast” multi-stage Dockerfile, watch your pipeline take 4 minutes, and assume that’s just how Docker works. It isn’t. The legacy builder executes stages sequentially, ignores parallelism, and its cache invalidation is genuinely bad. BuildKit fixes all of that, but on runners provisioned with Ubuntu 18.04 or 20.04, the Docker package from the default apt repo is often stuck at 19.x or early 20.x — and BuildKit isn’t the default until Docker 23.0.

The fastest fix is setting the env var inline or in your CI environment config:

# Legacy builder — sequential stages, no secret mounts, slow cache
docker build -t myapp:latest .

# BuildKit enabled — parallel stages, inline cache, --mount=type=secret works
DOCKER_BUILDKIT=1 docker build -t myapp:latest .

For a permanent fix on the runner host itself, drop this into /etc/docker/daemon.json and restart the daemon:

{
  "features": {
    "buildkit": true
  }
}

sudo systemctl restart docker

What BuildKit actually gives you that matters in CI: parallel execution of independent FROM stages (so a multi-stage build with a test stage and a build stage running simultaneously is real, not a promise), smarter layer caching that doesn’t blow up the entire cache when an unrelated file changes, and --mount=type=secret which lets you pass credentials like NPM_TOKEN or GITHUB_TOKEN into the build without baking them into a layer. That last one is a security necessity, not a nice-to-have. Without BuildKit, your only option is ARG, which writes the value into the image history.

The verification trick is dead simple — just read the first line of build output. If you see something like #1 [internal] load build definition from Dockerfile, BuildKit is active. If you see Sending build context to Docker daemon as the very first line, you’re on the legacy builder. I’ve added a quick assertion to several CI scripts just to fail fast if someone provisions a new runner without it:

build_output=$(DOCKER_BUILDKIT=1 docker build -t myapp:latest . 2>&1)
echo "$build_output" | grep -q "\[internal\] load build definition" \
  || { echo "BuildKit not active — check Docker version or daemon config"; exit 1; }

One more gotcha: GitHub-hosted ubuntu-22.04 runners ship a recent Docker Engine (24.x when I last checked) with BuildKit on by default, so you may never hit this. But self-hosted runners, especially ones your platform team provisioned 18 months ago and nobody updated, are the danger zone. Run docker version on your runner and if the Engine version is below 23.0, either upgrade Docker or set DOCKER_BUILDKIT=1 explicitly in every pipeline that matters. Don’t assume.
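
If you’d rather script that version check than eyeball it, here is a minimal sketch. The function name buildkit_is_default is my own invention, not a Docker command, and it assumes the usual MAJOR.MINOR.PATCH version string that docker version prints for the Engine.

```shell
#!/bin/sh
# Sketch: decide whether a Docker Engine version has BuildKit on by default.
# buildkit_is_default is a hypothetical helper name, not part of Docker.
buildkit_is_default() {
  # Keep everything before the first dot ("24.0.7" -> "24").
  major=${1%%.*}
  case "$major" in
    ''|*[!0-9]*) return 2 ;;   # not a number: cannot decide
  esac
  # BuildKit became the default builder in Docker Engine 23.0.
  [ "$major" -ge 23 ]
}
```

On a runner with a reachable daemon, you would feed it the output of docker version --format '{{.Server.Version}}' and export DOCKER_BUILDKIT=1 whenever the function returns non-zero.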

Layer Order Is the First Thing I Check When a Build Is Slow

The first thing I do when someone complains their CI builds take 8 minutes is open their Dockerfile and scroll to see where the COPY . . sits. It’s almost always in the wrong place — above the package install step — and fixing it drops build times to under 90 seconds without touching anything else.

Docker’s layer cache is invalidated the moment a layer changes, and every layer that comes after it gets rebuilt from scratch. So if you COPY . . first and then run npm ci, you’re installing every dependency from the network on every single commit — even if you only changed a comment in index.ts. The fix is to copy only the files that control your dependencies, install them, and then copy the rest of your source:

# Copy only what npm needs to resolve dependencies
COPY package.json package-lock.json ./

# This layer is cached until package.json or package-lock.json changes.
# No --omit=dev here: npm run build below usually needs devDependencies.
RUN npm ci

# Now bring in source — this invalidates on every commit, but npm ci doesn't re-run
COPY . .

RUN npm run build

With this order, npm ci only re-executes when a dependency actually changes. On a project with 200 packages, that’s the difference between a 3-minute install and a 2-second cache hit. The COPY . . step still invalidates on every push, but copying files is fast — it’s the install step that kills you.
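
Since this is the first thing I check by hand, it can be worth automating. The sketch below flags a Dockerfile whose first COPY . . comes before its first dependency install. The helper name copy_before_install and the list of install commands are assumptions for illustration; it also ignores multi-stage subtleties, so treat a hit as a prompt to look, not a verdict.

```shell
# Sketch: flag a Dockerfile where "COPY . ." comes before the dependency install.
# The install patterns (npm ci, npm install, pip install, go mod download)
# are an illustrative assumption; extend them for your stack.
copy_before_install() {
  awk '
    # Ignore full-line comments.
    /^[ \t]*#/ { next }
    # Remember the first line that copies the whole build context.
    /^[ \t]*COPY[ \t]+\.[ \t]+\.[ \t]*$/ && !copy { copy = NR }
    # Remember the first line that installs dependencies.
    /(npm (ci|install)|pip install|go mod download)/ && !inst { inst = NR }
    END {
      if (copy && inst && copy < inst) {
        printf "COPY . . on line %d precedes install on line %d\n", copy, inst
        exit 1
      }
    }
  ' "$1"
}
```

Run it as copy_before_install Dockerfile in a lint step; it exits non-zero and prints both line numbers when the order is wrong.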

The exact same pattern applies across ecosystems. Python: COPY requirements.txt ./ then RUN pip install --no-cache-dir -r requirements.txt, then COPY . .. Go is slightly trickier because the module cache involves two files:

COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN go build -o /app ./cmd/server

Rust with Cargo needs both Cargo.toml and Cargo.lock, and there’s a well-known trick of creating a dummy src/main.rs to force a dependency compile without your real source, since cargo build needs something to compile against. It’s ugly but it works for large dependency trees where compile time dominates:

COPY Cargo.toml Cargo.lock ./
# Compile dependencies only — src/main.rs is a stub so this layer caches
RUN mkdir src && echo "fn main() {}" > src/main.rs && cargo build --release && rm -rf src

# Now copy real source — only your code recompiles, not the 80 crates you depend on
COPY src ./src
RUN touch src/main.rs && cargo build --release

The touch src/main.rs is necessary because Cargo checks file modification times, not just content. Without it, Cargo may skip recompiling your actual code because the timestamp predates the copy. That’s the kind of thing that bites you once at 11pm and you never forget it.

Multi-Stage Builds: Stop Shipping Your Build Tools to Production

The single biggest mistake I see in production Dockerfiles is treating the build environment and the runtime environment as the same thing. They’re not. Your TypeScript compiler, webpack config, test runner, and 400MB of node_modules devDependencies have no business being in the image that runs on your servers. Multi-stage builds are how you fix that, and the size difference is genuinely shocking the first time you see it.

Here’s a real annotated Dockerfile for a Node.js app. The comments explain the decisions, not just the syntax:

# --- Stage 1: Install deps and compile TypeScript ---
# We use the full node:18 image here because we need build tools,
# native module compilers, and the full npm ecosystem available.
FROM node:18 AS builder

WORKDIR /app

# Copy lockfile first — this layer only invalidates when deps change,
# not every time you touch source code. Critical for CI cache hits.
COPY package.json package-lock.json ./
RUN npm ci --include=dev

COPY . .
RUN npm run build
# /app/dist now contains compiled JS, nothing else needed at runtime

# --- Stage 2: Run tests against the built output ---
# In CI you can stop here with --target=test. Never reaches production.
FROM builder AS test
RUN npm run test

# --- Stage 3: The actual production image ---
# node:18-alpine is ~130MB vs ~950MB for node:18 (Debian-based).
# We copy ONLY what we need from builder — no devDeps, no source.
FROM node:18-alpine AS production

WORKDIR /app

# Only prod dependencies in the final image
COPY package.json package-lock.json ./
RUN npm ci --omit=dev

# Pull compiled output from the builder stage, not from your local filesystem
COPY --from=builder /app/dist ./dist

# Don't run as root. Ever.
USER node

EXPOSE 3000
CMD ["node", "dist/index.js"]

The size story is real: a typical Node.js 18 app starts at 1.1GB+ when you base it on node:18 and install everything into one stage. Switch to this pattern and your production image lands around 180MB — sometimes lower if you’re disciplined about what you copy in. That’s not just a vanity metric. Smaller images push faster over the wire, reduce your attack surface, and make your ECR/Docker Hub storage bill noticeably cheaper over time.

In CI, the --target flag is where this pattern pays off beyond just image size. You can build and run tests without ever constructing the production image:

# In your CI pipeline: build only the test stage. The tests execute during the
# build (RUN npm run test), so a failing test fails this command and stops the job.
docker build --target=test -t myapp:test .

# Only if tests pass, build the production artifact
docker build --target=production -t myapp:${GIT_SHA} .

This means a failing test suite never burns time building and pushing a production image. It also means your test environment is reproducible — the exact same built output that got tested is what you ship, not a fresh build with slightly different timing or environment state.

Naming your stages with AS builder, AS test, AS production isn’t just aesthetic. The names are how you reference outputs in later COPY --from= instructions, and they’re what you pass to --target. If you leave stages unnamed, you can only reference them by index (COPY --from=0), and the moment you insert a new stage, every index after it shifts. I’ve broken builds this way. Name your stages.
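
A quick way to enforce the naming rule is to list every FROM line that lacks an AS clause. This is a rough sketch; unnamed_stages is a hypothetical helper name, and it deliberately reports the final stage too, which you may decide to allow.

```shell
# Sketch: list FROM lines that do not name their stage with AS.
# unnamed_stages is a hypothetical helper; it prints "line: text" per hit.
unnamed_stages() {
  awk '
    toupper($1) == "FROM" {
      named = 0
      # Look for an AS keyword anywhere after FROM (handles --platform=... too).
      for (i = 2; i <= NF; i++) if (toupper($i) == "AS") named = 1
      if (!named) printf "%d: %s\n", NR, $0
    }
  ' "$1"
}
```

Calling unnamed_stages Dockerfile prints nothing when every stage is named, which makes it easy to wire into a pre-merge check.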

The Alpine gotcha is the one that will bite you in production after everything looks fine locally. Alpine uses musl libc instead of glibc, and some npm packages that include native bindings — bcrypt, sharp, certain database drivers — either fail to install or behave differently against musl. The error you get is usually cryptic: a segfault or a missing symbol at runtime rather than a clean failure at build time. My rule: if your app uses any native addons, build a test matrix that includes the Alpine-based production image explicitly, don’t just trust that “it works in builder.” Run node dist/index.js as a smoke test in CI against the production stage before pushing to a registry. Ten seconds of container startup validation has saved me several midnight pages.

Cache Mounts for Package Managers: The Flag Most Tutorials Skip

The flag that changed how I think about CI build times isn’t in most Docker tutorials: --mount=type=cache. Most guides stop at layer caching — cache the node_modules layer, restore it if package-lock.json didn’t change. That works until you add a dependency, bust the cache, and wait four minutes for pip to download the world again. Cache mounts solve a different problem: they keep the package manager’s own download cache warm, separate from the image layers entirely.

Here’s what the Node version looks like:

# syntax=docker/dockerfile:1
FROM node:20-alpine

WORKDIR /app
COPY package*.json ./

# npm's download cache lives in /root/.npm — mount it as a persistent cache
# npm ci still does a clean node_modules install, but skips re-downloading tarballs
RUN --mount=type=cache,target=/root/.npm \
    npm ci --prefer-offline

COPY . .
RUN npm run build

The key insight: npm ci still blows away and rebuilds node_modules every time, so your image stays reproducible. But the tarballs themselves (lodash, react, whatever) stay in a BuildKit-managed cache on the host, which is mounted at /root/.npm only for the duration of that RUN. Next build, npm finds them locally and skips the network round-trip. Your image layer cache is about skipping work entirely. Cache mounts are about making the work fast when you can’t skip it.

Same pattern works for pip, and the gains are even more dramatic on Python projects with heavy scientific stacks:

# syntax=docker/dockerfile:1
FROM python:3.12-slim

WORKDIR /app
COPY requirements.txt .

# pip's HTTP cache + wheel cache both live under /root/.cache/pip
# wheels for compiled packages (numpy, psycopg2) get reused without recompilation
# Note: no --no-cache-dir here, or pip would bypass the mounted cache
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt

COPY . .

On a Django project I maintain — Django 4.2, DRF, celery, psycopg2-binary, boto3, a dozen smaller packages — cold pip install was consistently around 4 minutes. After switching to cache mounts, warm builds drop to about 45 seconds. The big wins come from compiled packages: psycopg2 and anything with C extensions gets pre-built into wheels on the first run and reused from cache on every subsequent build. That’s the part layer caching alone can never help with, because it only kicks in when requirements.txt is unchanged.

Two caveats you’ll hit in real CI that tutorials don’t warn you about:

  • BuildKit is required. If you run a plain docker build without it, you’ll get a cryptic syntax error about --mount being unknown. Set DOCKER_BUILDKIT=1 as an env var, or add export DOCKER_BUILDKIT=1 to your CI environment config. Docker 23+ enables BuildKit by default, but plenty of CI runners are still on 20.x.
  • The cache is local to the runner. It lives on the host filesystem under Docker’s BuildKit cache directory (/var/lib/docker/buildkit). If your CI spins up a fresh VM per job, which GitHub Actions hosted runners and GitLab.com’s hosted runners both do, you get zero benefit from cache mounts alone. They shine on persistent, self-hosted runners where the same machine handles multiple builds. Self-managed GitLab runners on a long-lived host, Buildkite agents you control, Jenkins agents: those all keep the cache between runs. Check whether your runner is ephemeral before investing time in this optimization.

If you’re on ephemeral runners, you can combine both strategies: use cache mounts for warm-cache speedups when the runner happens to be reused, and layer caching (via --cache-from with a registry) as the guaranteed fallback. They’re not mutually exclusive. But if you run self-hosted runners and you haven’t added --mount=type=cache to your Dockerfiles yet, you’re leaving the easiest build-time win on the table.
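
If you’re not sure whether your runners are ephemeral, you can find out empirically by leaving a marker file behind and checking for it on the next job. This is only a sketch: runner_seen_before is a made-up helper, and the marker directory is an assumption. Pick any path outside the workspace that your CI does not clean between jobs.

```shell
# Sketch: detect whether this runner's disk survives between jobs.
# The marker directory is an assumption; choose one your CI does not wipe.
runner_seen_before() {
  marker_dir=$1
  marker="$marker_dir/.ci-runner-marker"
  if [ -f "$marker" ]; then
    echo "reused"      # a previous job left the marker: persistent host
  else
    mkdir -p "$marker_dir" && date -u +%Y-%m-%dT%H:%M:%SZ > "$marker"
    echo "fresh"       # first job on this host, or an ephemeral VM
  fi
}
```

Log the result for a week of builds. If you only ever see "fresh", cache mounts alone won’t help you and the registry cache is where your effort should go.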

Registry Cache: How to Persist Cache Across Different CI Runners

The thing that bites people hardest when they first set up Docker builds in CI is expecting the cache to behave like it does on a developer laptop. On GitHub Actions hosted runners, every job starts from a completely clean VM. There’s no shared /var/lib/docker, no lingering layers from yesterday’s build. Each push triggers a full rebuild from scratch unless you explicitly tell Docker where to fetch cache from.

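The fix is pushing your build cache to a registry as a manifest and pulling it back on the next run. BuildKit handles this natively with --cache-from and --cache-to. One prerequisite: the registry cache exporter needs a Buildx builder on the docker-container driver (for example, one created with docker buildx create --use), because the default docker driver only supports it when the containerd image store is enabled. The full command looks like this: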

# Build, pull cache from registry, and push updated cache back
docker buildx build \
  --cache-from type=registry,ref=ghcr.io/myorg/myapp:buildcache \
  --cache-to type=registry,ref=ghcr.io/myorg/myapp:buildcache,mode=max \
  --tag ghcr.io/myorg/myapp:latest \
  --push \
  .

The mode=max vs mode=min decision matters more than it looks. mode=min only exports the final stage’s layers — cheap to store but only useful if nothing in your multi-stage build changed before the final stage. mode=max exports every intermediate layer from every stage, so if your RUN npm install in a builder stage is cache-warm, the next run skips it entirely even when the final stage changes. I default to mode=max for Node.js and Go builds where dependency installs are expensive; the cache blob is bigger (sometimes 500MB+) but the time savings are real. For small images with fast builds, mode=min keeps your registry storage bill reasonable.
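
To keep the min/max choice in one place instead of scattered across pipeline files, a tiny wrapper can generate both flags from a ref and a mode. The name cache_flags is hypothetical, and the sketch assumes you always read from and write to the same cache ref.

```shell
# Sketch: build the registry cache flags for docker buildx build.
# cache_flags is a hypothetical helper; ref and mode are whatever you pass in.
cache_flags() {
  ref=$1
  mode=${2:-max}       # default to max, matching the recommendation above
  case "$mode" in
    min|max) ;;
    *) echo "mode must be min or max, got: $mode" >&2; return 1 ;;
  esac
  printf '%s\n' \
    "--cache-from=type=registry,ref=$ref" \
    "--cache-to=type=registry,ref=$ref,mode=$mode"
}
```

You would splice the output into the build with an unquoted substitution, for example docker buildx build $(cache_flags ghcr.io/myorg/myapp:buildcache max) followed by your usual --tag and --push arguments.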

GitHub Actions with docker/build-push-action wires this up cleanly. Here’s a working workflow:

name: Build and Push

on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write  # required to push to GHCR

    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ghcr.io/myorg/myapp:latest
          cache-from: type=registry,ref=ghcr.io/myorg/myapp:buildcache
          cache-to: type=registry,ref=ghcr.io/myorg/myapp:buildcache,mode=max

One gotcha: the buildcache tag is a fake image tag used purely for cache storage. It’s a BuildKit cache manifest, not a runnable image. Don’t let that confuse you if you see it sitting in your registry — it’s supposed to be there. Also, GHCR is free for public repos but counts against your storage quota for private ones. If you’re on ECR, swap the ref for something like 123456789.dkr.ecr.us-east-1.amazonaws.com/myapp:buildcache and authenticate with the AWS credentials action instead.

On GitLab CI, you have two realistic paths. If you’re using the Kaniko executor (common in Kubernetes-based runners), Kaniko has its own cache flag:

build:
  image:
    name: gcr.io/kaniko-project/executor:v1.23.0-debug
    entrypoint: [""]
  script:
    - /kaniko/executor
      --context $CI_PROJECT_DIR
      --dockerfile $CI_PROJECT_DIR/Dockerfile
      --destination $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA
      --cache=true
      --cache-repo $CI_REGISTRY_IMAGE/cache
      --cache-ttl 168h  # 7 days; stale cache gets rebuilt automatically

If you’re running docker buildx directly on GitLab with a Docker-in-Docker or shell executor, the command is almost identical to the GitHub Actions version. Just swap the registry refs to $CI_REGISTRY_IMAGE and authenticate before building with echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY", which keeps the password out of the process list (passing it with -p does not). The main operational difference is that GitLab’s container registry doesn’t distinguish cache manifests from real images in the UI, so I recommend using a dedicated sub-path like $CI_REGISTRY_IMAGE/buildcache to keep things tidy.

Secrets in Docker Builds: Don’t Use ARG for Sensitive Values

The pattern I see most often in leaked credentials incidents isn’t someone committing a .env file — it’s someone doing this in their Dockerfile:

# DON'T DO THIS — the token gets baked into the image layer permanently
ARG NPM_TOKEN
RUN echo "//registry.npmjs.org/:_authToken=${NPM_TOKEN}" > /root/.npmrc && \
    npm install && \
    rm /root/.npmrc

That rm at the end gives a false sense of security. The file is gone from the final layer, but the RUN command — including every environment variable and ARG value passed to it — is frozen in the layer history forever. Run this on any image built with that pattern and watch what comes out:

docker history myimage:latest --no-trunc
# IMAGE          CREATED BY
# sha256:abc...  RUN |1 NPM_TOKEN=npm_abc123secret... /bin/sh -c echo "//registry.npmjs.org/:_authToken=${NPM_TOKEN}" > ...

The full token shows up in plain text. Anyone who can pull the image — including from a public registry if you accidentally push it there — can read your credentials. This isn’t theoretical. I’ve seen production registries with tokens sitting in layer history for months before anyone noticed.
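
You can turn that manual inspection into a CI check by piping the history output through a pattern match. The sketch below is deliberately crude: history_leaks_secret is a hypothetical helper, and the patterns are illustrative assumptions, nowhere near an exhaustive secret detector. It skips values that start with $, since an unexpanded ${NPM_TOKEN} reference is not itself a leak.

```shell
# Sketch: read `docker history --no-trunc` output on stdin and succeed
# (exit 0) if something that looks like a credential value is present.
# The patterns are illustrative assumptions, not a complete detector.
history_leaks_secret() {
  grep -E -i -q '(_authToken|TOKEN|PASSWORD|SECRET)=[^$[:space:]]'
}
```

In a pipeline you would run docker history --no-trunc myimage:latest | history_leaks_secret and fail the job when it returns zero.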

BuildKit solves this cleanly with secret mounts. The secret never touches a layer — it’s mounted as a tmpfs file for the duration of that single RUN command and disappears completely when it exits. Here’s the correct pattern:

# syntax=docker/dockerfile:1.4
FROM node:20-alpine

WORKDIR /app
COPY package*.json ./

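# Secret is mounted at /run/secrets/npm_token only during this RUN.
# The .npmrc is written and removed inside the same RUN, so no layer keeps it.
RUN --mount=type=secret,id=npm_token \
    echo "//registry.npmjs.org/:_authToken=$(cat /run/secrets/npm_token)" > /root/.npmrc && \
    npm ci && \
    rm /root/.npmrc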

COPY . .
RUN npm run build

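Or even cleaner, if the secret you pass is a complete .npmrc file (the src=.npmrc form of --secret) rather than a bare token: skip the copy entirely and mount it directly where npm expects the config: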

RUN --mount=type=secret,id=npm_token,dst=/root/.npmrc npm ci

To pass the secret during a local build or in CI, use the --secret flag. You’re pulling from an environment variable, not a file:

# Local build — reads NPM_TOKEN from your shell environment
docker build \
  --secret id=npm_token,env=NPM_TOKEN \
  -t myimage:latest .

# Or from a file if you prefer
docker build \
  --secret id=npm_token,src=.npmrc \
  -t myimage:latest .

In GitHub Actions, wire it up like this. The key thing is passing the secret as an environment variable to the build step, not as a build arg:

- name: Build Docker image
  env:
    NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
  run: |
    docker build \
      --secret id=npm_token,env=NPM_TOKEN \
      -t myimage:${{ github.sha }} .

If you’re using docker/build-push-action instead of raw docker build, the secrets input handles this — but the syntax is slightly different and the docs bury it. You pass a multiline string where each line is key=value:

- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myimage:${{ github.sha }}
    secrets: |
      npm_token=${{ secrets.NPM_TOKEN }}

One gotcha: BuildKit must be enabled, which it is by default on Docker 23.0+ and on GitHub-hosted runners, since they ship a recent Docker version. But if you’re on an older self-hosted runner, or your build runs inside a container image that bundles its own older Docker CLI, set DOCKER_BUILDKIT=1 before your build command. Without it, the legacy builder rejects the --mount=type=secret syntax with a parse error.

Tagging Strategy That Doesn’t Make You Want to Quit

The thing that breaks most teams isn’t their Dockerfile — it’s their tagging. I’ve debugged incidents where three different images all claimed to be latest and nobody could tell which one was actually running in production. You end up in a rollback situation where the tag you want to redeploy points to something completely different than what you think it does. latest in CI is a footgun. Stop using it as your primary tag.

Here’s the actual system I use, and it’s not complicated. Three tag types cover every scenario:

  • Git SHA (short): immutable, traceable, maps directly to a commit. This is your source of truth.
  • Branch name: mutable, always points to the latest build on that branch. Useful for staging environments that auto-deploy.
  • Semver tag: for releases only. v1.4.2 pushed when you cut a release tag in Git.

The SHA tag is the one you actually pin in deployment manifests. The branch tag is what lets your staging CD system say “give me the latest from main” without hardcoding anything. They serve different purposes and you need both. Locally or in any CI runner, the basic version is just:

# --short gives at least 7 chars; Git lengthens it if needed to stay unique
TAG=$(git rev-parse --short HEAD)
# Note: on a detached HEAD (common in CI checkouts) --abbrev-ref prints "HEAD"
docker build -t myapp:$TAG -t myapp:$(git rev-parse --abbrev-ref HEAD) .
docker push myapp:$TAG
docker push myapp:$(git rev-parse --abbrev-ref HEAD)

In GitHub Actions, you get the same information through environment variables, which is cleaner than shelling out to git. github.sha gives you the full 40-char SHA (slice it yourself if you want short), and github.ref_name gives you the branch or tag name depending on what triggered the workflow. A real job step looks like this:

- name: Build and push image
  env:
    # Pass context values through env instead of pasting them into the script
    GIT_SHA: ${{ github.sha }}
    REF_NAME: ${{ github.ref_name }}
  run: |
    # Slice the 40-char SHA down to 7 characters in the shell
    SHORT_SHA="${GIT_SHA:0:7}"
    IMAGE="ghcr.io/myorg/myapp"

    docker build \
      -t "${IMAGE}:${SHORT_SHA}" \
      -t "${IMAGE}:${REF_NAME}" \
      .

    docker push "${IMAGE}:${SHORT_SHA}"
    docker push "${IMAGE}:${REF_NAME}"

For release pipelines triggered by a semver tag like v2.1.0, github.ref_name will actually equal v2.1.0 — so the same logic naturally produces a version tag when you push a Git tag. You can add a condition to also push a latest tag on semver releases if your deployment tooling genuinely needs it, but I’d make that an explicit opt-in, not the default. The SHA tag and the version tag together are enough: the SHA gives your ops team traceability back to the exact commit, and the version tag gives your Helm chart or Flux config something human-readable to pin to.

One gotcha with branch names: slashes break Docker tag syntax. A slash is not a valid tag character, so a branch named feature/auth-refactor makes docker build -t or docker tag fail with an “invalid reference format” error. Sanitize it before tagging:

# Replace slashes with dashes, lowercase everything
BRANCH_TAG=$(echo "${{ github.ref_name }}" | tr '/' '-' | tr '[:upper:]' '[:lower:]')
docker build -t "myapp:${BRANCH_TAG}" .

The other thing I’ve gotten burned by: forgetting that pushing a branch tag from two parallel CI runs for the same branch creates a race condition. If two PRs merge to main within seconds of each other, whichever push finishes last “wins” the main tag, and the other image is now only reachable by SHA. That’s fine — it’s actually correct behavior — but your team needs to understand that the branch tag is volatile and the SHA tag is permanent. Deploy with SHA in anything that matters.
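
Slashes are the common case, but branch names can contain other characters Docker rejects too. Here is a slightly more defensive version of the sanitizing step as a reusable function. The name sanitize_tag is my own; the rules it encodes are Docker’s documented ones: a tag may contain letters, digits, underscores, periods and dashes, may not start with a period or dash, and is capped at 128 characters.

```shell
# Sketch: turn an arbitrary Git ref name into a valid Docker tag.
# sanitize_tag is a hypothetical helper name.
sanitize_tag() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9_.-]/-/g' -e 's/^[.-]*//' \
    | cut -c1-128
}
```

Used in a pipeline, BRANCH_TAG=$(sanitize_tag "$REF_NAME") gives you something safe to pass to docker build -t, whatever your teammates decided to name their branch.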

Scanning Images in the Pipeline Before They Hit Production

The thing nobody tells you upfront: the moment you add a scanner to CI and gate the build on it, your pipeline will start failing within 24 hours. Not because your code is bad — because the base image you’re pulling from Docker Hub has unpatched CVEs in its OpenSSL or glibc packages that have existed for months. Understanding that distinction (base image noise vs. real application risk) is what separates teams that actually use scanning from teams that disable it in frustration after a week.

I’ve run three scanners in real pipelines. Trivy (from Aqua Security) is what I reach for by default — it’s free, open source, updates its vulnerability database on every run, and the CLI is predictable. Grype from Anchore is a solid alternative if you’re already in their ecosystem or want SBOMs as a first-class artifact. Docker Scout is convenient if your team is already on Docker Hub, but the free tier limits how many scans you can run per month, and the policy engine is gated behind paid plans. For CI pipelines doing dozens of builds a day, Trivy wins on cost alone.

Here’s the command that actually gates your build:

# Fails the build if any HIGH or CRITICAL CVEs are found
# --exit-code 1 is the critical flag — without it, Trivy reports but never blocks
trivy image \
  --exit-code 1 \
  --severity HIGH,CRITICAL \
  --no-progress \
  myapp:latest

Drop the --no-progress if you want verbose output locally. In CI it just clutters the logs. The --exit-code 1 flag is what turns this from a reporting tool into an actual gate — without it you get a nice report and a green build regardless of what was found. Wire this after your docker build step but before any push to a registry or deployment step. If it fails, the image never moves forward.

The .trivyignore file is where teams either get disciplined or start making excuses. The right way to use it:

# .trivyignore
# CVE-2023-4911 (glibc LOONEY TUNABLES) - affects dynamic linker setuid
# Our containers run as non-root with no setuid binaries — confirmed mitigated 2024-01-15
# Revisit when debian:bookworm-slim patches base image
CVE-2023-4911

# CVE-2024-0567 (GnuTLS) - only triggered by malformed cert chains
# We terminate TLS at the load balancer, not inside the container
# Accepted risk, reviewed 2024-03-02
CVE-2024-0567

Every ignored CVE needs a reason and a date. Future you — or your security audit — will thank you. The pattern I’ve seen fail teams is ignoring CVEs without context and then never revisiting them. The ignore file ends up with 40 entries nobody understands, and the scanner stops being a meaningful signal.
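
The “reason and a date” rule is easy to enforce mechanically. The sketch below fails when an entry in .trivyignore is not directly preceded by a comment block containing a YYYY-MM-DD date. The helper name undocumented_ignores and the date convention are my assumptions; adapt the pattern to whatever review format your team uses.

```shell
# Sketch: every entry in .trivyignore must be directly preceded by a comment
# block that contains a date (YYYY-MM-DD). Prints offending lines, exits 1.
undocumented_ignores() {
  awk '
    # A blank line ends the current comment block.
    /^[ \t]*$/ { has_date = 0; next }
    # Comment line: note whether it carries a review date.
    /^[ \t]*#/ {
      if ($0 ~ /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/) has_date = 1
      next
    }
    # Anything else is an ignore entry.
    {
      if (!has_date) { print NR ": " $0; bad = 1 }
      has_date = 0
    }
    END { exit bad }
  ' "$1"
}
```

Run undocumented_ignores .trivyignore before the scan step so an undocumented suppression blocks the pipeline just like a real finding would.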

For GitHub Actions, the aquasecurity/trivy-action is genuinely well-maintained and handles the SARIF output format that GitHub’s Security tab can consume directly:

- name: Scan image for vulnerabilities
  uses: aquasecurity/trivy-action@<version>  # pin to a release tag or commit SHA
  with:
    image-ref: myapp:${{ github.sha }}
    format: sarif
    output: trivy-results.sarif
    severity: HIGH,CRITICAL
    exit-code: '1'
    ignore-unfixed: true   # skips CVEs with no available fix — cuts noise significantly

- name: Upload SARIF to GitHub Security
  uses: github/codeql-action/upload-sarif@v3
  if: always()   # upload even on failure so you can see what blocked the build
  with:
    sarif_file: trivy-results.sarif

The ignore-unfixed: true flag is one I now consider default-on for most pipelines. If there’s no patched version available yet, failing your build over it doesn’t give you any actionable path forward; it just blocks deploys until upstream fixes something you can’t control. Be aware that the flag filters those findings out of the results entirely, SARIF report included, so if you still want them for visibility, run a second, non-gating scan without it. CVEs with available fixes? Those should absolutely block the build, because you have a concrete next step: update the package.

.dockerignore: The File You Set Once and Forget Until It Burns You

The thing that catches most people off guard isn’t a broken RUN command or a misconfigured COPY — it’s a 3-minute CI build that should take 20 seconds. Nine times out of ten, the culprit is a bloated build context being shoved over the wire before a single layer gets built. The .dockerignore file sits quietly in your repo, saving you from that fate, right up until you forget to update it and watch your pipeline grind.

Docker’s build process works like this: before the daemon processes a single instruction in your Dockerfile, the client packs up everything in the build directory that .dockerignore doesn’t exclude and sends it across. On a remote Docker host or a CI runner connecting to a daemon socket, that means your entire node_modules folder, potentially 600–900MB of nested packages, crosses the wire on every single build. I’ve personally watched a context transfer balloon to 800MB on a Node project because nobody added node_modules to .dockerignore. After one line, it dropped to 2MB. Same image, same result, over 99% less waste.

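On the legacy builder, watch for this line at the top of docker build output (BuildKit reports the same figure as “transferring context” under its [internal] load build context step):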

Sending build context to Docker daemon  2.048kB

If that number is in the hundreds of megabytes, stop and fix your .dockerignore before anything else. Here’s a baseline that covers most projects:

# VCS metadata — never needed inside an image
.git
.gitignore

# Dependency directories: these get rebuilt inside the image.
# The **/ prefix also catches copies nested in subdirectories.
**/node_modules
vendor
**/__pycache__
**/*.pyc
**/*.pyo

# Environment and secrets — should never be baked in anyway
.env
.env.*
*.pem

# Test and coverage output — irrelevant to the final artifact
tests/
test/
coverage/
.coverage
htmlcov/
*.test.js

# Build artifacts you generate locally
dist/
build/
*.log

# IDE and OS noise
.idea/
.vscode/
.DS_Store
Thumbs.db
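Before tuning patterns, it helps to know which top-level entries dominate the context. The following is a rough sketch, not part of Docker itself: context_hogs is a hypothetical helper name, and it deliberately ignores .dockerignore, so it shows what would be sent if nothing were excluded.

```shell
# List the largest top-level entries of a build context directory.
# Sizes are in KiB as reported by du. This does NOT read .dockerignore;
# it only shows where the bulk of the directory lives.
context_hogs() {
  dir="${1:-.}"     # build context directory, defaults to the current one
  count="${2:-10}"  # how many entries to print
  # -mindepth/-maxdepth 1 walks only direct children, including dotfiles like .git
  find "$dir" -mindepth 1 -maxdepth 1 -exec du -sk {} + | sort -rn | head -n "$count"
}
```

Running context_hogs . 5 in the repo root prints the five biggest directories or files; anything large that the image does not need is a candidate for .dockerignore.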

Here’s the gotcha that’s bitten me when I got lazy with formatting: .dockerignore does not support inline comments. The shell-style convention of trailing comments does not work here. Write this:

# This will silently break your pattern
node_modules # local dependency cache

…and Docker will try to match a directory literally named node_modules # local dependency cache, which of course doesn’t exist, so your actual node_modules gets included in full. Comments must go on their own line, full stop. The parser is not forgiving about this and won’t warn you — it just quietly does the wrong thing.
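Since the parser never warns about this, a small lint step can catch it in CI. This is a heuristic sketch under two assumptions: find_inline_comments is a made-up helper name, and only lines that start with # in the first column are treated as real comments.

```shell
# Print .dockerignore lines that look like "pattern # comment".
# Output format is "LINE_NUMBER:line". Exit status is 0 when something
# suspicious was found and non-zero when the file looks clean.
find_inline_comments() {
  file="${1:-.dockerignore}"
  # First grep drops whole-line comments (a '#' in column 1).
  # Second grep keeps the remaining lines that contain whitespace followed by '#'.
  grep -n -v '^#' "$file" | grep -E '[[:space:]]#'
}
```

In a pipeline step this could be wired up as: if find_inline_comments; then echo "inline comments in .dockerignore"; exit 1; fi. A pattern that legitimately contains a space followed by # would be a false positive, which is rare enough that I would accept it.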

One more thing that trips people up: .dockerignore uses its own matching rules, not the same as .gitignore. Every pattern is matched relative to the context root, so a bare .env only matches at the root, while **/.env matches at any depth. If you have environment files nested under subdirectories (monorepos, I’m looking at you), you need the **/ form. A leading slash also doesn’t anchor anything the way it does in .gitignore: Docker treats /.git and .git as the same root-relative pattern. Worth checking with docker build --no-cache . and watching that context size number to confirm your patterns are actually firing.
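To find out whether a repo has environment files that a root-only .env pattern would miss, a quick search below the root is enough. A minimal sketch, assuming env files follow the .env or .env.* naming; nested_env_files is a hypothetical helper name.

```shell
# List env files that live below the context root, where a bare ".env"
# pattern in .dockerignore would not match them.
nested_env_files() {
  dir="${1:-.}"
  # -mindepth 2 skips files directly in the root; node_modules and .git are
  # skipped because they should already be excluded wholesale.
  find "$dir" -mindepth 2 -type f \( -name '.env' -o -name '.env.*' \) \
    ! -path '*/node_modules/*' ! -path '*/.git/*' | sort
}
```

Any path this prints needs a **/.env style pattern (or an explicit per-directory entry) to stay out of the build context.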

Full GitHub Actions Example: What My Actual Workflow File Looks Like

The Workflow File I Actually Ship With

Most tutorials show you a 20-line workflow that technically builds an image but falls apart the moment you hit a real project. The file below is closer to what I actually run in production — it handles caching properly, doesn’t spam your registry with every PR build, and blocks the push if the scan finds critical CVEs. I’ll walk through why each section is there, not just what it does.

name: Docker Build

on:
  push:
    branches: [main]
    tags: ["v*.*.*"]
  pull_request:
    branches: [main]

# Prevent multiple workflow runs from fighting over the same cache
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build:
    runs-on: ubuntu-22.04
    permissions:
      contents: read
      packages: write        # needed to push to ghcr.io
      security-events: write # needed to upload SARIF scan results

    outputs:
      image-digest: ${{ steps.build.outputs.digest }}

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      # Buildx gives you BuildKit features: inline cache, multi-platform,
      # faster layer resolution. Without this you're on legacy builder.
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      # Login runs unconditionally — even PRs need read access to pull
      # cached layers from ghcr.io. Skipping this breaks your cache hits.
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
            type=sha,prefix=sha-,format=short

      - name: Build image
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          # Don't push yet — we want to scan first
          push: false
          load: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache
          # mode=max caches every intermediate layer, not just the final image.
          # Costs more storage but cuts rebuild time dramatically on large images.
          cache-to: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache,mode=max

  scan:
    runs-on: ubuntu-22.04
    needs: build
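    permissions:
      contents: read         # needed by actions/checkout
      packages: read         # needed to pull the build cache from ghcr.io
      security-events: write # needed to upload SARIF scan results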

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      # Rebuild from cache — this hits the registry cache we wrote above,
      # so it's fast. We need the image loaded locally for Trivy to scan it.
      - name: Rebuild from cache for scanning
        uses: docker/build-push-action@v5
        with:
          context: .
          push: false
          load: true
          tags: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:scan-target
          cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache

      - name: Run Trivy vulnerability scan
        uses: aquasecurity/trivy-action@<version>  # pin to a release tag or commit SHA
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:scan-target
          format: sarif
          output: trivy-results.sarif
          # CRITICAL severity gates the push. HIGH is reported but not blocking.
          severity: CRITICAL
          exit-code: "1"

      - name: Upload scan results to Security tab
        uses: github/codeql-action/upload-sarif@v3
        if: always()   # upload even if scan failed so you can see what broke
        with:
          sarif_file: trivy-results.sarif

  push:
    runs-on: ubuntu-22.04
    # Only runs after scan passes AND only on main/tags — never on PRs
    needs: [build, scan]
    if: |
      github.event_name != 'pull_request' &&
      (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/tags/v'))
    permissions:
      contents: read
      packages: write

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=ref,event=branch
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
            type=sha,prefix=sha-,format=short

      # This push hits the registry cache and only uploads the final image layers.
      # Because cache-from points at the buildcache tag, this is usually under 30s.
      - name: Push to registry
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache
          provenance: true
          sbom: true

Why Three Separate Jobs Instead of One

The split between build, scan, and push is intentional. The needs chain means the push literally cannot run if Trivy exits with code 1. If you put all of this in a single job with an if: condition on the push step, a mistake in your condition logic can let the push slip through. Separate jobs with needs is a hard gate — the job doesn’t start, it doesn’t have credentials, it can’t push. The failure is also clearer in the Actions UI: you see “push” as skipped rather than “build” as partially failed.

The Cache Setup That Actually Speeds Things Up

The mode=max on cache-to is the thing I got wrong for the first three months. Default cache mode (mode=min) only stores the final stage’s layers. That’s fine for single-stage builds but kills you with multi-stage Dockerfiles — your builder stage gets rebuilt from scratch every time. Switch to mode=max and every intermediate stage gets cached. You’ll see your buildcache tag balloon in size (a Go application’s builder stage with all its modules is easily 800MB), but the time savings on a CI runner are real. One project went from 4-minute builds to under 60 seconds after this change.

The Login-on-PRs Gotcha

I call out the login step running unconditionally because this trips people up. If you add if: github.event_name != 'pull_request' to the login step, your PR builds can’t pull cached layers from ghcr.io because they’re not authenticated. The result is that every PR build is a cold build — painfully slow and expensive on paid runners. The GITHUB_TOKEN on a PR from a fork has read-only access to packages by default, which is safe. You’re not exposing write credentials; you’re just letting the runner authenticate for cache reads.
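A read-only token cuts both ways: it can pull the cache but it cannot write it, so a cache export on a fork PR is likely to fail. One option is to skip the export for pull requests. The sketch below is for pipelines that call docker buildx directly; cache_flags is a hypothetical helper, GITHUB_EVENT_NAME is the variable GitHub Actions sets on its runners, and skipping the export for every PR (not only forks) is a simplifying assumption.

```shell
# Emit buildx cache flags, one per line.
# Pull requests get --cache-from only; everything else also exports the cache.
cache_flags() {
  ref="$1"                                  # e.g. ghcr.io/myorg/myapp:buildcache
  event="${2:-${GITHUB_EVENT_NAME:-push}}"  # optional override, mainly for testing
  echo "--cache-from=type=registry,ref=${ref}"
  if [ "$event" != "pull_request" ]; then
    echo "--cache-to=type=registry,ref=${ref},mode=max"
  fi
}
```

Usage would look like docker buildx build $(cache_flags "$CACHE_REF") -t "$IMAGE" . with the command substitution left unquoted on purpose so each flag becomes its own argument.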

Tag Strategy and the Rebuild in the Scan Job

You’ll notice the scan job does a second docker/build-push-action call with load: true instead of receiving the image as an artifact from the build job. GitHub Actions doesn’t have a native way to pass a Docker image between jobs: you’d need to save it to a tarball with docker save, upload it as an artifact (gigabytes, slow), and reload it. Rebuilding from the warm registry cache is almost always faster. The scan runner still has to download the cached layers, but it skips every build step and avoids the extra artifact upload and download. The scan-target tag is ephemeral and gets overwritten on every run; it’s just a handle for Trivy to reference.

Quick Checklist Before You Merge That Dockerfile

Most Dockerfile mistakes I’ve seen in code review aren’t exotic — they’re the same handful of issues that slow down builds, bloat images, or quietly ship credentials to a registry. Run through this before you hit merge.

BuildKit enabled in your CI environment config

If you’re on Docker 23+, BuildKit is default for local builds but your CI runner might still be invoking the legacy builder depending on how the Docker daemon is configured. Verify it explicitly:

# In your CI environment variables or runner config
DOCKER_BUILDKIT=1

# Or in your pipeline step — GitHub Actions example
- name: Build image
  env:
    DOCKER_BUILDKIT: "1"
  run: docker build -t myapp:${{ github.sha }} .

Without BuildKit you lose parallel stage execution, cache mounts, and the --secret flag. A plain Dockerfile will still build, just slower. A Dockerfile that relies on RUN --mount will fail outright on the legacy builder.

Layer order optimized: dependencies before source code

The rule is simple: put the things that change least at the top, things that change most at the bottom. Your package.json and lockfile don’t change on every commit, but your source code does. Copy them separately so Docker can cache the install step:

# Wrong — busts cache on every code change
COPY . .
RUN npm ci

# Right: npm ci only reruns when package.json or package-lock.json changes
COPY package.json package-lock.json ./
RUN npm ci
COPY . .

This ordering gives you both behaviors you want. Add a new dependency and the changed manifest busts the install layer, even if no source file changed. Change source files without touching dependencies and the install layer gets reused.
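This ordering mistake is mechanical enough to check for automatically. The following is a heuristic sketch for single-stage Node Dockerfiles only; copy_before_install is a hypothetical helper name, and it compares the first match of each instruction without understanding stage boundaries.

```shell
# Exit 0 when the first "COPY . ." appears before the first npm/yarn/pnpm
# install line (the cache-busting order), non-zero otherwise.
copy_before_install() {
  file="${1:-Dockerfile}"
  # Line number of the first instruction that copies the whole context
  copy_line=$(grep -n -E '^[[:space:]]*COPY[[:space:]]+\.[[:space:]]+\.[[:space:]]*$' "$file" | head -n 1 | cut -d: -f1)
  # Line number of the first dependency install
  install_line=$(grep -n -E '^[[:space:]]*RUN[[:space:]].*(npm (ci|install)|yarn install|pnpm install)' "$file" | head -n 1 | cut -d: -f1)
  [ -n "$copy_line" ] && [ -n "$install_line" ] && [ "$copy_line" -lt "$install_line" ]
}
```

A review bot or pre-merge step could run if copy_before_install Dockerfile; then echo "COPY . . precedes dependency install"; fi and leave the final call to a human.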

Multi-stage build used for compiled languages and front-end apps

Single-stage Dockerfiles for Go, Rust, Java, or any front-end bundler ship the entire toolchain to production. A Go binary needs the Go compiler to build but not to run. A React app needs Node and webpack to compile but the final artifact is just static files. Multi-stage keeps the runtime image lean:

FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /bin/server ./cmd/server

FROM gcr.io/distroless/static-debian12
COPY --from=builder /bin/server /server
ENTRYPOINT ["/server"]

The distroless base has no shell, no package manager, no nothing — just the binary and its runtime deps. That final image is typically under 10MB for a Go service versus 300MB+ if you kept the builder stage. Smaller image = smaller attack surface and faster pulls in your CD pipeline.

.dockerignore in place — check the context size in build logs

BuildKit prints the context transfer size at the start of every build. If you see something like => transferring context: 847.3MB, you have a problem. Common offenders: node_modules, .git, local .env files, and build output directories. A minimal .dockerignore:

.git
node_modules
dist
.env*
*.log
.DS_Store
coverage/
__pycache__

The context gets sent to the Docker daemon before a single layer is evaluated. A large context on every CI run adds seconds to every build regardless of caching, and it can accidentally pull secrets into the image if a COPY . . catches a .env.local file.

No secrets in ARG/ENV — use secret mounts or runtime env vars

An ARG or ENV baked into a build will appear in docker history and in the image layers — even if you unset it in a later RUN command. The correct pattern for secrets needed only at build time (private npm registry tokens, pip credentials, etc.) is BuildKit’s --secret flag:

# Dockerfile
RUN --mount=type=secret,id=npm_token \
    NPM_TOKEN=$(cat /run/secrets/npm_token) npm ci

# CI invocation
docker build \
  --secret id=npm_token,env=NPM_TOKEN \
  -t myapp:${GIT_SHA} .

The secret is available as a tmpfs mount during that RUN step only. It never hits a layer. For runtime secrets (database passwords, API keys), don’t bake them at all — inject them via your orchestrator’s environment or secrets manager at runtime.
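A cheap guard against regressions is to flag ARG and ENV names that look like credentials. This is a naming heuristic, not a secret scanner: suspect_secret_args is a hypothetical helper, and the keyword list is my assumption about common naming conventions.

```shell
# Print Dockerfile lines whose ARG/ENV name contains a credential-like keyword.
# Output format is "LINE_NUMBER:line"; exit status is non-zero when nothing matches.
suspect_secret_args() {
  file="${1:-Dockerfile}"
  # -i makes both the instruction and the keyword match case-insensitively
  grep -n -i -E '^[[:space:]]*(ARG|ENV)[[:space:]]+[A-Za-z0-9_]*(TOKEN|SECRET|PASSWORD|PASSWD|API_?KEY)' "$file"
}
```

Expect the occasional false positive (an ENV like TOKEN_TTL, for example), so this fits better as a warning in review than as a hard gate.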

Image scan step present and blocking on HIGH/CRITICAL

Trivy is my default here — it’s fast, free, and the false positive rate is low enough to actually block on. Add it as a CI step after the build, before the push:

- name: Scan image for vulnerabilities
  run: |
    docker run --rm \
      -v /var/run/docker.sock:/var/run/docker.sock \
      aquasec/trivy:0.51.1 image \
      --exit-code 1 \
      --severity HIGH,CRITICAL \
      --ignore-unfixed \
      myapp:${{ github.sha }}

--ignore-unfixed is important — without it you’ll block on CVEs that have no available fix yet, which just trains your team to ignore scan failures. --exit-code 1 makes the CI step fail, stopping the push. Run this before your registry push step, not after.

Tags include git SHA, not just ‘latest’

latest is not a version. If a deploy goes wrong and you need to roll back, “re-pull latest” is meaningless because latest already got overwritten. Tag with the git SHA so every image is traceable to a specific commit:

# GitHub Actions
- name: Build and push
  run: |
    IMAGE=ghcr.io/myorg/myapp
    SHA=${{ github.sha }}
    
    docker build -t ${IMAGE}:${SHA} -t ${IMAGE}:latest .
    docker push ${IMAGE}:${SHA}
    docker push ${IMAGE}:latest  # optional convenience tag — but SHA is authoritative

I keep latest as a convenience tag for local dev, but deployments always reference the SHA tag. Your Kubernetes manifests or ECS task definitions should pin to the SHA. That way rollback is just changing one string back to the previous commit hash, and you can see exactly which commit is running with git show <sha>.
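When the image name is derived from the repository name, one detail is worth handling up front: registry repository names must be lowercase, while a GitHub owner or repo name may not be. A minimal sketch, assuming ghcr.io as the registry; image_ref is a hypothetical helper name.

```shell
# Build a SHA-pinned image reference for ghcr.io from an owner/repo string.
image_ref() {
  repo="$1"  # e.g. the value of GITHUB_REPOSITORY
  sha="$2"   # e.g. the value of GITHUB_SHA
  # Lowercase the repository part so the reference is valid for the registry
  repo_lc=$(printf '%s' "$repo" | tr '[:upper:]' '[:lower:]')
  printf 'ghcr.io/%s:%s\n' "$repo_lc" "$sha"
}
```

In a GitHub Actions run step this could be used as IMAGE=$(image_ref "$GITHUB_REPOSITORY" "$GITHUB_SHA"), and the same string is what goes into the deployment manifest.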





Written by Eric Woo

Lead AI Engineer & SaaS Strategist

Eric is a seasoned software architect specializing in LLM orchestration and autonomous agent systems. With over 15 years in Silicon Valley, he now focuses on scaling AI-first applications.
