Self-Hosting GitLab Runners: What the Docs Don’t Tell You

Why I Stopped Using Shared Runners (And What That Cost Me)

The breaking point for me wasn’t gradual — it was a Monday morning deploy at 9:15 AM where the pipeline sat in “pending” for 47 minutes before a single job ran. Not failing. Not running. Just waiting in a queue behind every other team on GitLab.com who also decided Monday morning was deploy time. That’s when shared runners stopped being a minor annoyance and became an actual business problem.

The minute limits are the other part of this. GitLab’s free tier on GitLab.com gives you 400 CI/CD minutes per month — that evaporates in about a week if you have more than two developers actively pushing. Their paid tiers (check gitlab.com/pricing directly because these numbers have shifted multiple times) give you more, but you’re paying per 1,000 additional minutes once you blow past the included allotment. A mid-sized team running integration test suites that take 8–12 minutes per pipeline run can burn through thousands of minutes before the month is halfway over. The math stops working fast.

The security angle is less obvious but it matters more in certain contexts. When a shared runner picks up your job, your code — including any secrets injected via CI variables — executes on infrastructure you don’t control, don’t audit, and can’t inspect. For most open-source projects or internal tools, that’s fine. The moment you’re dealing with PCI-DSS scope, HIPAA data flows, or a SOC 2 audit, “your code ran on GitLab’s shared infrastructure” becomes a conversation you don’t want to have with a compliance reviewer. Self-hosted runners let you put those jobs on machines you own, in your network, with your logging stack watching them.

Here’s the trade-off nobody says out loud: the shared runner bill is partly a payment for someone else’s ops burden. The moment you go self-hosted, you own:

  • Runner version upgrades (GitLab moves fast — mismatched versions cause weird failures)
  • The underlying machine’s patching and uptime
  • Debugging why a job passes locally but hangs on the runner
  • Capacity planning when three pipelines want to run concurrently and you only registered one runner

None of that is insurmountable, but going in thinking “I’ll just spin up a runner and never think about it again” is how you end up with a stale runner running GitLab Runner 15.x while your .gitlab-ci.yml uses features from 16.x. I’ve been there. The flip side: once the runner is tuned, you get consistent sub-2-minute queue times, unlimited minutes at the cost of compute, and full control over the execution environment.

What You Actually Need Before You Start

The spec question trips people up more than anything else. A 2-core/4GB RAM VM handles basic pipelines fine — linting, unit tests, lightweight builds. I run exactly that configuration on Hetzner CX22 instances (€3.79/month) for non-critical projects and it holds up. The moment you introduce Docker-in-Docker (dind), that math falls apart fast. A single dind job can spike to 3GB RAM on its own just pulling layers and spinning up the daemon. Run two concurrent dind jobs on a 4GB machine and you’re watching the OOM killer make decisions for you. My rule: if any job in your pipeline uses services: [docker:dind], start at 4-core/8GB minimum, and set concurrent = 2 in your runner config rather than the default.
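To make the dind sizing concrete, here is a minimal config.toml sketch for a 4-core/8GB dind host. The numbers are starting points to tune against your own memory graphs, not gospel, and the runner name is a placeholder:

# /etc/gitlab-runner/config.toml (sketch for a dind-capable 8GB host)
concurrent = 2                  # two dind jobs is the practical ceiling on 8GB

[[runners]]
  name = "dind-8gb"
  executor = "docker"
  [runners.docker]
    privileged = true           # required for services: [docker:dind]
    memory = "3g"               # cap each job before the OOM killer chooses for you
    memory_swap = "3g"          # equal to memory, so no swap thrashing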

Ubuntu 22.04 LTS is what I’d recommend without hesitation. The GitLab Runner apt repo works first try, systemd integration is clean, and when something breaks I can usually find a Stack Overflow answer from 2023 that still applies. RHEL 8/9 and Rocky Linux work, but the repo setup requires adding the runner’s RPM repo manually and SELinux will silently block socket mounts if you’re running Docker executor without the right labels. I’ve spent hours debugging pipelines that “just failed” only to find avc: denied in audit logs. If you’re on RHEL for compliance reasons, fine — just run sudo setenforce 0 temporarily during setup to confirm SELinux is your problem before you spend three hours elsewhere.
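A quick SELinux triage sequence for the RHEL/Rocky case, assuming auditd is running (it is by default):

# Is SELinux enforcing right now?
getenforce

# Any recent denials touching docker or the runner?
sudo ausearch -m avc -ts recent | grep -i docker

# Flip to permissive temporarily; if the pipeline suddenly passes,
# SELinux labeling is your problem
sudo setenforce 0

# Don't leave it that way: re-enable and fix the labels instead
sudo setenforce 1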

Network is simpler than people think but the one gotcha bites constantly: your runner machine needs outbound TCP on port 443 to reach your GitLab instance, whether that’s gitlab.com or your internal GitLab at gitlab.company.internal. That’s it for the control plane. Where people get surprised is object storage — if your GitLab instance uses S3-compatible storage for artifacts and caches, the runner downloads those directly, not through GitLab. So if your runner is in a different VPC or behind a restrictive egress firewall, you need a separate outbound rule for your S3 endpoint. I’ve seen pipelines pass registration and then fail every single job because artifacts couldn’t be uploaded, and the error message is not helpful.
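A rough smoke test for both egress paths, run from the runner host itself; the hostnames are placeholders for your GitLab instance and S3 endpoint:

# Any HTTP status back (even 401) proves TCP 443 and TLS work; a timeout means firewall
curl -sS -o /dev/null -w "gitlab: %{http_code}\n" https://gitlab.company.internal
curl -sS -o /dev/null -w "s3:     %{http_code}\n" https://s3.company.internal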

The token situation is genuinely confusing right now because GitLab made a breaking change in 16.x and the documentation is in a halfway state. The old flow used a registration token — a static secret you’d grab from Admin → Runners and pass to gitlab-runner register. GitLab deprecated that in 16.0 and removed support entirely in 17.0. The new flow uses an authentication token that you generate by creating a runner in the UI first (Admin → CI/CD → Runners → New instance runner), which gives you a token starting with glrt-. Then you register with that token. The practical difference:

# OLD (broken in 17.x — don't use)
sudo gitlab-runner register \
  --url https://gitlab.com \
  --registration-token YOUR_OLD_TOKEN \
  --executor docker \
  --docker-image alpine:3.19

# NEW (16.x+ with glrt- token)
sudo gitlab-runner register \
  --url https://gitlab.com \
  --token glrt-xxxxxxxxxxxxxxxxxxxx \
  --executor docker \
  --docker-image alpine:3.19

The two commands look nearly identical (--token vs --registration-token is the only visible difference), which is why people waste 30 minutes wondering why their “working” command suddenly returns a 401. Check your GitLab instance version first: https://gitlab.yourinstance.com/help shows it in the bottom right. If you’re on 17.x and your docs or internal runbook still show --registration-token, throw them out and use the UI-first flow.
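If you’d rather script the check than eyeball the /help page, the version API works too. A sketch; it needs any valid personal access token, and the URL is a placeholder:

curl -s --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  https://gitlab.yourinstance.com/api/v4/version
# {"version":"17.1.0","revision":"abc123de"}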

Installing the GitLab Runner Binary

The thing that bites most people early is reaching for whatever shell script they find on a random gist or blog post. GitLab publishes a dedicated package repository — use that. It means you get GPG-verified packages, proper systemd integration, and a clean upgrade path. The official install script just configures that repo and nothing else, which is the right amount of magic.

Run these two commands in sequence on any Debian/Ubuntu host:

# Adds the official GitLab runner apt repo and its GPG key
curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh | sudo bash

# Then install — but pin the version (explained below)
sudo apt install gitlab-runner=16.11.0

On RHEL/Rocky/Alma you’d swap in the script.rpm.sh variant and use dnf install gitlab-runner-16.11.0. The repo script handles both, but it won’t install the package itself — that’s intentional.

Pin the version. I cannot stress this enough. I got burned during a production deploy when apt upgrade pulled in a minor GitLab Runner update mid-pipeline. The runner restarted, orphaned a job, and our deploy webhook fired twice. Pinning to something like 16.11.0 means unattended upgrades won’t touch it. To prevent accidental upgrades going forward:

# Hold the package at whatever version you installed
sudo apt-mark hold gitlab-runner

# Confirm it's held
apt-mark showhold
# Output: gitlab-runner

After install, verify it actually worked:

gitlab-runner --version

You should see something like this — the exact build metadata matters if you’re ever filing a bug or matching against a GitLab instance compatibility matrix:

Version:      16.11.0
Git revision: ac8e767a
Git branch:   16-11-stable
GO version:   go1.21.9
Built:        2024-05-16T13:42:12+0000
OS/Arch:      linux/amd64

One thing the docs mention but don’t emphasize enough: the installer automatically creates a gitlab-runner system user and group. This isn’t cosmetic — it’s the user your CI jobs run as by default when using the shell executor. If your pipeline touches files owned by root or another service user, you’ll hit permission errors that feel mysterious until you remember this. I’ve seen people spend an hour debugging a Permission denied on a build artifact that was written by a previous root-owned cron job. Check your file ownership before registering the runner.
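Two checks worth running before you register, sketched against the default /home/gitlab-runner layout:

# Confirm the user exists and note its UID/GID
id gitlab-runner

# List anything in the working directory the runner doesn't own
sudo find /home/gitlab-runner -not -user gitlab-runner -ls

# Reclaim ownership if a root-owned process left files behind
sudo chown -R gitlab-runner:gitlab-runner /home/gitlab-runner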

Registering Your Runner: The New Way (GitLab 16.x+)

GitLab 16.0 moved the runner registration UI again. The old path through the project’s Settings > CI/CD > Runners still technically exists, but the actual token generation now lives at a different spot depending on whether you’re registering a project-scoped, group-scoped, or instance-scoped runner. For project-level runners: Settings > CI/CD > Runners > New project runner. You’ll get a one-time authentication token that starts with glrt- — that prefix matters, because 16.x switched from the old registration token format to this new “runner authentication token” system. The legacy registration-token flow is the one deprecated in 16.0 and removed in 17.0, as covered above. Use the new tokens.

Here’s the actual command you want for non-interactive registration with Docker executor:

sudo gitlab-runner register \
  --non-interactive \
  --url https://gitlab.com \
  --token glrt-xxxxxxxxxxxxxxxxxxxx \
  --executor docker \
  --docker-image alpine:3.19 \
  --description "my-build-server" \
  --tag-list "docker,linux" \
  --docker-volumes /var/run/docker.sock:/var/run/docker.sock

The --docker-volumes flag for the Docker socket is optional but you’ll almost certainly need it if your pipelines build or push Docker images. Drop it if you’re running pure code builds. The alpine:3.19 default image is just a fallback — your .gitlab-ci.yml image: key overrides it per job. I set it to something lightweight like alpine rather than ubuntu because forgetting to specify an image in a job should fail fast, not silently pull a 70MB layer you didn’t want.

The reason you absolutely must use --non-interactive in any automation — Ansible playbook, cloud-init script, Dockerfile — is that without it, the register command drops into a TTY prompt and your script hangs indefinitely waiting for input it will never get. I’ve seen this kill EC2 user-data scripts in ways that are genuinely hard to debug because the instance looks “healthy” but the runner never comes up. Pass every option as a flag and use --non-interactive unconditionally in scripts.
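Here is the shape this takes in a cloud-init user-data script. It’s a sketch: RUNNER_TOKEN is a placeholder you’d inject from your secrets store, and it assumes the apt install from earlier already ran:

#!/bin/bash
set -euo pipefail

gitlab-runner register \
  --non-interactive \
  --url https://gitlab.com \
  --token "${RUNNER_TOKEN}" \
  --executor docker \
  --docker-image alpine:3.19 \
  --description "$(hostname)" \
  --tag-list "docker,linux"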

After registration, everything lands in /etc/gitlab-runner/config.toml. A minimal two-runner config looks like this:

concurrent = 4  # max jobs across ALL runners on this host

[[runners]]
  name = "my-build-server"
  url = "https://gitlab.com"
  token = "glrt-xxxxxxxxxxxxxxxxxxxx"
  executor = "docker"
  [runners.docker]
    image = "alpine:3.19"
    privileged = false          # set true only if you need Docker-in-Docker
    volumes = ["/cache"]
    shm_size = 0

Version-control this file. Seriously — strip the token before committing (or use sops/age encryption), but the rest of the config — concurrency, volumes, resource limits — should live in your repo alongside your Ansible roles or Terraform. Machines get rebuilt. I’ve been burned by losing a runner config with carefully tuned concurrent and pull_policy settings that took an afternoon to rediscover.

The self-signed cert gotcha is real and the failure mode is terrible. If you’re registering against a self-hosted GitLab with an internal CA, the command just returns something like couldn't execute POST against https://gitlab.internal/api/v4/runners with no indication that the issue is certificate verification. The fix:

sudo gitlab-runner register \
  --non-interactive \
  --url https://gitlab.internal \
  --tls-ca-file /etc/ssl/certs/my-internal-ca.pem \
  --token glrt-xxxxxxxxxxxxxxxxxxxx \
  --executor docker \
  --docker-image alpine:3.19

That CA file path gets written into config.toml under tls-ca-file, so future runner-to-GitLab communication also works. Alternatively, add your CA to the system trust store (update-ca-certificates on Debian/Ubuntu, update-ca-trust on RHEL) before registering, and you won’t need the flag at all. I prefer the explicit tls-ca-file in the config because it survives a system CA store reset and makes the dependency obvious to anyone reading the TOML.
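For the trust-store route on Debian/Ubuntu, the sequence is short; note the forced .crt extension, which update-ca-certificates requires:

# CA certs must land here with a .crt extension to be picked up
sudo cp my-internal-ca.pem /usr/local/share/ca-certificates/my-internal-ca.crt
sudo update-ca-certificates
# Look for "1 added" in the output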

Choosing Your Executor: Docker Is Usually Right, But Not Always

The executor choice shapes everything downstream — your cache strategy, your security posture, your on-call pain at 2am. I’ve run all four of these in production at different points, and the “just use Docker” advice you see everywhere is correct maybe 80% of the time. The other 20% will bite you if you don’t think it through upfront.

Shell Executor: Dangerous by Default

The shell executor runs jobs directly on the host as the gitlab-runner user with zero containerization. I use it exclusively for tasks that require bare metal access — BIOS flashing scripts, firmware provisioning, anything that needs direct PCI passthrough or specific kernel modules. The moment a shell executor job touches the internet, you’ve got a problem: a compromised build script can read /etc/passwd, sniff environment variables from sibling processes, and pivot through your internal network. If you inherited a setup running web app tests on shell executors, fix that this week, not next sprint.

Docker Executor: The Correct Default

Fresh container per job, predictable environment, no state leakage between runs. This is what most teams should run. The one thing that catches people off guard is docker pull latency — if every job on a cold runner pulls a 2GB image, your pipeline times balloon fast. The fix is configuring a pull policy in config.toml:

[[runners]]
  name = "docker-runner-01"
  executor = "docker"
  [runners.docker]
    image = "node:20-alpine"
    # use the local image if present; otherwise pull from the registry
    pull_policy = ["if-not-present", "always"]
    volumes = ["/cache", "/var/run/docker.sock:/var/run/docker.sock"]

Using if-not-present as the first policy means pinned image tags (like node:20.11.0-alpine instead of node:20-alpine) will be cached on disk and only pulled once per runner host. This alone can cut job startup time from 45 seconds to under 5 on a warm runner.

Docker Machine Executor: Stop Using It

GitLab formally deprecated Docker Machine executor in GitLab 17.x. If you’re still running docker+machine in your config.toml, you’re on borrowed time — it won’t receive security fixes and the underlying Docker Machine project itself has been archived. GitLab’s own replacement path is the fleeting plugin system with provider-specific autoscalers (AWS, GCP). Migration isn’t trivial, but the longer you wait the messier it gets. I’ve seen teams sit on deprecated executors for 18 months and then scramble when a CVE forced an emergency runner upgrade that broke half their pipelines.

Kubernetes Executor: Earn It First

If you’re already running k8s and your team knows how to operate it, the Kubernetes executor is genuinely excellent — each job becomes a pod, resource limits are enforced at the platform level, and you get horizontal scaling for free via the cluster autoscaler. The config is significantly heavier than Docker though:

[[runners]]
  name = "k8s-runner"
  executor = "kubernetes"
  [runners.kubernetes]
    namespace = "gitlab-runners"
    image = "alpine:3.19"
    # these limits apply per job pod
    cpu_request = "500m"
    cpu_limit = "2"
    memory_request = "512Mi"
    memory_limit = "2Gi"
    # pull secrets for private registries
    image_pull_secrets = ["registry-credentials"]
    [[runners.kubernetes.volumes.empty_dir]]
      name = "repo"
      mount_path = "/builds"
      medium = "Memory"  # faster than disk for small repos

The trap here is networking. Your pods need to reach your GitLab instance, your artifact stores, and your Docker registry. Get your NetworkPolicy wrong and you’ll spend hours debugging why git clone hangs inside a job pod but works from your laptop. Don’t adopt the k8s executor just because it sounds more scalable — if you’re running three Docker executor runners and they handle your load, they’re the right answer.
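If you do adopt it, a minimal egress NetworkPolicy is where I’d start from. A sketch assuming the gitlab-runners namespace from the config above; tighten the destinations once you know your endpoints:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: runner-egress
  namespace: gitlab-runners
spec:
  podSelector: {}              # every job pod in the namespace
  policyTypes: ["Egress"]
  egress:
    - ports:
        - port: 443            # GitLab, registry, artifact store
          protocol: TCP
        - port: 53             # DNS; forget this and git clone hangs
          protocol: UDP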

The Autoscaling Argument (Where Docker+Machine Was Useful)

The reason Docker Machine stuck around so long was autoscaling for bursty workloads — you push a tag, 40 jobs fire at once, you want 40 ephemeral VMs to spin up and terminate when done. Plain Docker executor on fixed hosts can’t do that. With Docker Machine gone, your options are the fleeting-based autoscaler or the Kubernetes executor with cluster autoscaling. For teams on AWS, the fleeting AWS plugin with an autoscaling group gets you the same behavior: idle capacity of 2 runners, max burst to 20, scale-down after 10 minutes idle. If your pipeline load is consistent (same jobs, same frequency, no Monday morning deploy spikes), fixed Docker executor runners are simpler to operate and you should prefer them.
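For reference, the replacement config is the docker-autoscaler executor with a fleeting plugin block. This is a sketch from memory with the idle/burst numbers above plugged in; the plugin name and install mechanism have changed between runner versions, so verify against the current runner docs before trusting it:

[[runners]]
  name = "burst-runner"
  executor = "docker-autoscaler"
  [runners.docker]
    image = "alpine:3.19"
  [runners.autoscaler]
    plugin = "fleeting-plugin-aws"    # installed separately; name varies by version
    capacity_per_instance = 1
    max_instances = 20
    [runners.autoscaler.plugin_config]
      name = "runner-asg"             # your AWS autoscaling group
    [[runners.autoscaler.policy]]
      idle_count = 2
      idle_time = "10m0s"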

Configuring config.toml: The Settings That Actually Matter

Most teams copy a config.toml from a blog post, change the registration token, and move on. Then they hit mysterious cache misses six weeks later, or a runaway build takes down the host machine at 2am. The defaults are not production-ready — here’s every knob you should actually understand.

# /etc/gitlab-runner/config.toml

# Total jobs that can run simultaneously across ALL runners on this machine.
# This is a global ceiling — individual runner concurrency can't exceed this.
concurrent = 4

# How often (seconds) the runner polls GitLab for new jobs.
# 3 is the default. Lower = more API calls. Don't go below 3 on shared infra.
check_interval = 3

[[runners]]
  name = "my-docker-runner"
  url = "https://gitlab.example.com"
  token = "YOUR_RUNNER_TOKEN"
  executor = "docker"

  [runners.docker]
    # Pull order matters — see explanation below
    pull_policy = ["if-not-present", "always"]
    image = "alpine:3.19"

    # Named Docker volume for cache, not a bind mount
    volumes = ["/cache", "/var/run/docker.sock:/var/run/docker.sock"]

    # Hard limits — skip these and a bad build owns your host
    memory = "2g"
    memory_swap = "2g"   # same as memory = no swap allowed

    # Keep this false unless you explicitly need DinD
    privileged = false

    # Keep the runner's automatic cache volumes enabled (the default)
    disable_cache = false
    shm_size = 0

The concurrent = 4 value looks arbitrary but it maps directly to your hardware. A rough rule I use: set it to your CPU core count if your jobs are CPU-bound (compilation, tests), or 2× core count if jobs are mostly I/O-bound (linting, artifact uploads). On a 2-core VPS with 4GB RAM running Docker jobs, 4 is already aggressive — each Docker executor spins up a container with its own filesystem layer writes. I run htop during a full pipeline and watch memory before committing to a number. If you set concurrent = 8 on a 2GB machine without memory limits, you’ll find out about it the hard way when the OOM killer wakes up mid-deploy.

The pull_policy = ["if-not-present", "always"] array is processed left-to-right as a fallback chain. With this config, GitLab Runner first checks if the image exists locally — if yes, skip the pull. If the image isn’t cached locally at all, it falls back to pulling from the registry. The trap people fall into is using ["always"] in prod because “it feels safer.” Sure, you’ll always get fresh images, but you’re now hammering Docker Hub on every single job, which means rate limits bite you at exactly the wrong moment. The smarter move: use if-not-present for your internal/pinned images and handle freshness through explicit image tags in your .gitlab-ci.yml. If security is the concern — that someone could slip a malicious layer into a cached image — pin your image SHAs instead of relying on pull policy to save you.
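Digest pinning in .gitlab-ci.yml looks like the sketch below. The digest itself is a placeholder; docker images --digests or your registry UI gives you the real value:

build:
  # the tag stays for readability, but the digest is what's enforced
  image: node:20.11.0-alpine@sha256:<digest-from-your-registry>
  script:
    - npm ci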

The volumes line deserves more attention than it gets. Writing "/cache" as a bare path tells the Docker executor to create a named Docker volume for that path, not a directory on the host filesystem. This is the difference between cache that persists correctly across jobs and cache that silently misses. Bind mounts like /tmp/runner-cache:/cache look equivalent but they’re not — permissions issues between the container user and host directory cause silent write failures where your cache seems to populate but doesn’t persist. The named volume approach lets Docker manage ownership, and it survives runner container restarts. I’ve debugged two separate “why is our npm cache not working” incidents that both traced back to this exact bind mount permission issue.

Memory limits are the most skipped production safety feature. Without them, a build that spins up a webpack compilation or a Java test suite can allocate unbounded RAM. Setting memory = "2g" and memory_swap = "2g" (both equal) means the container gets 2GB of RAM and zero swap — the job gets killed cleanly if it hits the limit rather than thrashing swap and slowing your entire host. The job fails fast, you get a clear OOM error in the logs, and your other running jobs are unaffected. If you set memory_swap higher than memory, the difference becomes available as swap — useful if you have jobs that occasionally spike but you don’t want hard OOM kills. Just don’t leave both unset.

The privileged = true flag exists almost exclusively for Docker-in-Docker (DinD) — building Docker images inside a CI job. It works, but it means the container has full host access, including the ability to mount arbitrary host paths and escape the container namespace. For most pipelines that just need to build Docker images, the socket mount approach is safer and actually simpler to set up:

  [runners.docker]
    privileged = false
    volumes = ["/var/run/docker.sock:/var/run/docker.sock", "/cache"]

With this config, your CI job uses the host’s Docker daemon directly — no nested Docker, no privilege escalation. The trade-off is real: jobs can see and manipulate other containers on the host, so don’t use this pattern on a runner shared with sensitive production workloads. If you’re on a dedicated build machine, it’s a solid middle ground. Reserve privileged = true for isolated runner hosts where the blast radius of a compromised job is contained — and if you do enable it, make sure that runner has a specific tag and your pipelines explicitly opt in to using it.
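The opt-in mechanics are just tags on both sides. A sketch, assuming a runner registered with a hypothetical privileged-dind tag and run_untagged = false:

# .gitlab-ci.yml: only jobs that ask for the tag land on the privileged runner
build-image:
  tags:
    - privileged-dind
  script:
    - docker build -t myimage .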

Docker-in-Docker (DinD): Making It Work Without Shooting Yourself

The thing that bit me hardest when I first set up DinD wasn’t the configuration — it was not understanding why there are two completely different approaches with completely different security profiles. Privileged mode gives your container full access to the host kernel. You’re basically handing it root on the machine. The Docker socket bind-mount approach (/var/run/docker.sock) is slightly less scary in theory, but any container that can talk to the Docker socket can escape to the host just as easily — so don’t let anyone tell you socket binding is “safe.” Pick your poison based on your threat model: privileged mode for isolated VMs where you don’t care about container breakout (your runner is already the blast radius), socket binding for bare-metal runners where all jobs share one Docker daemon and you accept the shared-state risk.

The DOCKER_TLS_CERTDIR variable is what destroyed two hours of my life. Since docker:19.03, DinD runs with TLS enabled by default. If you leave DOCKER_TLS_CERTDIR unset, the DinD service generates certs and waits for the client to pick them up — but your job container doesn’t know where to look, so every docker build fails with a “Cannot connect to the Docker daemon” error that tells you nothing useful. You have two choices: set it to an empty string to disable TLS entirely (fine for internal runners behind a firewall), or set it to /certs and mount that as a shared volume between the service and the job. I went with the empty-string approach on isolated runner VMs. Here’s why the cert-sharing approach is fragile: the service has to write the certs before the job container starts, and there’s a race condition if your runner storage is slow.

# .gitlab-ci.yml — the combination that actually works for me
variables:
  DOCKER_HOST: tcp://docker:2375          # 2375 = no TLS, 2376 = TLS
  DOCKER_TLS_CERTDIR: ""                  # disable TLS on isolated runners
  DOCKER_DRIVER: overlay2                 # always set this; vfs is painfully slow

build-image:
  image: docker:24-cli                    # only the CLI — no daemon bloat
  services:
    - name: docker:24-dind
      alias: docker                       # this alias must match the host in DOCKER_HOST
  before_script:
    - docker info                         # fails fast if daemon isn't up yet
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA .
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA
    - docker tag $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA $CI_REGISTRY_IMAGE:latest
    - docker push $CI_REGISTRY_IMAGE:latest

The reason I settled on docker:24-dind as the service and docker:24-cli as the job image is purely about separation of concerns and image size. The dind image runs dockerd as a service in the background — it’s the daemon. The cli image is ~40MB and only ships the docker binary. Using docker:24 (the combined image) as your job image means you’re pulling a heavier image every job for no reason, and you get a second daemon attempting to start and immediately failing. Keep them separate, pin to 24 not latest, and your pipeline stops randomly breaking when Docker ships a major version bump on a Tuesday morning.

One non-obvious gotcha: the alias: docker on the service block must match the hostname in DOCKER_HOST. If you forget the alias and just declare - docker:24-dind, GitLab registers the service under docker anyway — but only if it’s the first service. Add a second service and the hostname resolution breaks in ways that are genuinely hard to debug. Always be explicit with the alias. Also, your runner’s config.toml needs privileged = true under the [runners.docker] section for DinD to work at all:

[[runners]]
  name = "dind-runner"
  url = "https://gitlab.example.com"
  executor = "docker"
  [runners.docker]
    tls_verify = false
    image = "docker:24-cli"
    privileged = true                     # required for DinD; not optional
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0

One last thing people miss: docker info in your before_script isn’t just a sanity check — it’s a synchronization point. The DinD service starts asynchronously, and on a busy runner the daemon might not be ready by the time your first docker build fires. The docker info call blocks until the socket responds or times out, which surfaces the problem immediately instead of giving you a cryptic mid-build failure. If you want something more solid, wrap it in a retry loop: for i in $(seq 1 10); do docker info && break || sleep 3; done. Saved me from flaky pipeline failures on a runner with slow disk I/O.

Caching That Actually Works

The single biggest gotcha I hit when scaling from one runner to two: local filesystem cache silently stops working. Each runner has its own /cache volume, so Runner A builds and populates the cache, then Job B lands on Runner B — cold miss, full reinstall, no error, no warning. You just wonder why your cache hit rate dropped to zero. Local cache is fine when you have exactly one runner and you’re not autoscaling. The moment you register a second runner or spin up an ephemeral one in Kubernetes, you need shared storage or you’re wasting everyone’s time.

I run MinIO in the same docker-compose.yml as my runners. It’s an S3-compatible object store that you own, runs on your hardware, and costs nothing except disk. Here’s the exact setup I use:

version: "3.8"
services:
  minio:
    image: minio/minio:RELEASE.2024-03-15T01-07-19Z
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: gitlab_cache
      MINIO_ROOT_PASSWORD: changeme_use_a_real_secret
    volumes:
      - minio_data:/data
    ports:
      - "9000:9000"
      - "9001:9001"  # web console, handy for debugging
    restart: unless-stopped

  gitlab-runner:
    image: gitlab/gitlab-runner:v17.0.1
    volumes:
      - ./config:/etc/gitlab-runner
      - /var/run/docker.sock:/var/run/docker.sock
    restart: unless-stopped
    depends_on:
      - minio

volumes:
  minio_data:

Before the runner can use it, create the bucket manually once — either via the MinIO console at http://localhost:9001 or with the mc CLI:

mc alias set local http://localhost:9000 gitlab_cache changeme_use_a_real_secret
mc mb local/gitlab-runner-cache

MinIO doesn’t auto-create buckets when the runner tries to write to them — you’ll get a silent cache miss and a confusing log line about a 403 or NoSuchBucket error.

The [runners.cache] block in config.toml is where most guides give you placeholder values that don’t actually work. Here’s a real one:

[runners.cache]
  Type = "s3"
  Shared = true  # critical — lets all registered runners share the same cache

  [runners.cache.s3]
    ServerAddress = "minio:9000"      # docker network hostname, not localhost
    AccessKey = "gitlab_cache"
    SecretKey = "changeme_use_a_real_secret"
    BucketName = "gitlab-runner-cache"
    BucketLocation = "us-east-1"      # MinIO ignores this but the runner requires it
    Insecure = true                   # set to false if you front MinIO with TLS

Shared = true is the flag that most people miss. Without it, each runner treats the cache as private to itself even though they’re all pointed at the same bucket. Also note ServerAddress uses the Docker Compose service name minio, not localhost — the runner container can’t reach localhost:9000 since that resolves inside its own network namespace.

Cache key strategy matters more than most people realize. The default key: ${CI_COMMIT_REF_SLUG} keys per branch, which means every new branch starts cold. That’s often fine. What’s not fine is keying off nothing or off $CI_COMMIT_SHA — the latter gives you zero reuse across commits. For dependency caches, key off the lockfile hash:

cache:
  key:
    files:
      - package-lock.json   # cache busts only when lockfile changes
  paths:
    - .npm/                 # NOT node_modules — read below

# If you want per-branch isolation on top of lockfile keying:
cache:
  key:
    prefix: "$CI_COMMIT_REF_SLUG"
    files:
      - package-lock.json
  paths:
    - .npm/

The .npm vs node_modules distinction is what burns everyone at least once. node_modules is not safely cacheable across different machines or even across minor OS image changes — symlinks break, native modules compiled for one glibc version silently fail on another, and the directory can be gigabytes. The correct approach is to cache npm’s content-addressable cache (the .npm folder) and let npm ci reinstall from it. Set npm config set cache .npm --global in your job before running npm ci, or use npm ci --cache .npm/ directly. The install still runs, but it reads tarballs from local disk instead of the network — typically 10–20x faster than a cold pull from the registry, and the restored artifact is maybe 200–400MB instead of 2GB.
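Put together, a job sketch using the lockfile-keyed cache from above:

install-and-test:
  image: node:20.11.0-alpine
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - .npm/
  script:
    # npm ci reads tarballs from the restored .npm cache instead of the network
    - npm ci --cache .npm --prefer-offline
    - npm test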

Securing Your Runner: Non-Negotiable Steps

The default install puts the gitlab-runner user in the docker group, which — if you’re running the Docker executor — is functionally equivalent to root on the host. I got burned by this on an internal project when I realized any job could mount /etc/shadow via a volume flag in the before_script. The fix is using Docker-in-Docker or the socket proxy pattern instead of bind-mounting the Docker socket directly. If you absolutely must use the socket, layer on a tool like Tecnativa’s docker-socket-proxy that filters which API calls are allowed. At minimum, verify the runner user can’t write to anything outside its home directory:

# Check what the gitlab-runner user can actually reach
sudo -u gitlab-runner ls /root          # should fail
sudo -u gitlab-runner cat /etc/shadow   # should fail

# The runner user's home should be locked down
stat /home/gitlab-runner
# drwx------ should be the result, not drwxr-xr-x

Runner tags are your first line of defense against tag-squatting, where a developer’s side project pipeline accidentally (or intentionally) picks up compute from your production runner. Set run_untagged = false in your config.toml and you’ve cut off every untagged pipeline immediately. The configuration lives at /etc/gitlab-runner/config.toml and the relevant block looks like this:

[[runners]]
  name = "prod-builder"
  url = "https://gitlab.example.com"
  token = "glrt-xxxxxxxxxxxx"
  run_untagged = false       # only pick up jobs that explicitly request this runner
  tag_list = ["prod", "docker", "x86_64"]
  executor = "docker"

Protected runners are a feature people enable without thinking through the consequences. When you mark a runner as protected in the GitLab UI, it will only run jobs on protected branches and protected tags. That means merge request pipelines from forks — which run on a detached ref, not a protected branch — will never reach this runner. That’s usually what you want for a runner that has production deploy credentials. The gotcha: if your branching strategy uses feature branches that aren’t protected, those pipelines will hang silently waiting for a non-protected runner that doesn’t exist. Make sure you have a separate runner for feature branch work before enabling this.

Your firewall rules should be aggressively minimal. The runner itself only needs outbound 443 to your GitLab instance to poll for jobs and POST results back. Whatever the jobs need (pulling from Docker Hub, pushing to S3, hitting an internal service) should be explicitly opened per-runner based on what that runner actually builds. I use a dedicated network namespace or a restrictive egress security group per runner host in AWS. Here’s the ufw equivalent for a runner that only talks to GitLab SaaS and Docker Hub:

# Outbound only — no inbound rules needed for the runner daemon itself
ufw default deny incoming
ufw default deny outgoing

# GitLab
ufw allow out 443/tcp to 34.74.90.64   # gitlab.com — verify current IPs

# Docker Hub pulls
ufw allow out 443/tcp to 54.236.113.205

# DNS — easy to forget this one
ufw allow out 53/tcp
ufw allow out 53/udp

ufw enable

Masked CI/CD variables will hide the value from job logs, but they don’t prevent a job from exfiltrating the secret to an external endpoint via curl. The masking is cosmetic protection against accidental log leakage, not a security boundary. For anything that controls production access, use HashiCorp Vault or AWS Secrets Manager and fetch secrets at runtime with short-lived credentials rather than baking a long-lived token into GitLab’s variable store. The other hard limit: masked variables can’t contain newlines, which means PEM-encoded private keys can’t be masked at all — you’ll need to base64 encode them and decode inside the job, which is its own footgun to document carefully.
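For the record, the base64 workaround for PEM keys looks like this; SSH_KEY_B64 is a hypothetical variable name for the encoded value you store in GitLab:

# Encode once, locally; paste the output into a CI/CD variable
base64 -w0 deploy_key.pem

# .gitlab-ci.yml (job excerpt): decode inside the job
before_script:
  - echo "$SSH_KEY_B64" | base64 -d > /tmp/deploy_key.pem
  - chmod 600 /tmp/deploy_key.pem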

Stale runners silently accumulate and each one is a potential access vector if the token gets compromised. I run this cleanup as a monthly cron on the runner host:

# Check which runners are actually reachable from the host
sudo gitlab-runner verify

# Example output:
# Verifying runner... is alive                  runner=abc12345
# Verifying runner... is alive                  runner=def67890
# Verifying runner... DELETE NOT CONNECTED      runner=zzz99999

# Remove the stale one
sudo gitlab-runner unregister --name "old-staging-runner"

# Or nuke everything and re-register from scratch (destructive — CI will pause)
sudo gitlab-runner unregister --all-runners

The verify command doesn’t just confirm network connectivity — it tells you if the registration token is still valid on the GitLab side. If a runner was deleted in the UI but the process is still running on the host, it’ll show up as not connected here, and that orphaned process is still consuming system resources. Cross-reference gitlab-runner verify output with what’s listed under Settings → CI/CD → Runners in your project or group, and treat any discrepancy as a thing to fix immediately, not eventually.

Running the Runner as a System Service

The part that surprised me when I first set this up: gitlab-runner install doesn’t just drop a binary somewhere — it writes a systemd unit file to /etc/systemd/system/gitlab-runner.service and calls systemctl daemon-reload automatically. So by the time you run gitlab-runner start, you’re already in systemd territory whether you realize it or not.

# Run these as root or with sudo
sudo gitlab-runner install --user=gitlab-runner --working-directory=/home/gitlab-runner
sudo gitlab-runner start

# Verify systemd picked it up
sudo systemctl status gitlab-runner

The --user flag matters here. I’ve seen setups where the runner was installed as root and it caused permission nightmares with Docker socket access and file artifacts. Use a dedicated gitlab-runner system user. The gitlab-runner install command will create it if it doesn’t exist.

What healthy vs. broken output looks like

After systemctl status gitlab-runner, a healthy runner shows Active: active (running) with a recent timestamp and you’ll see log lines about the runner polling for jobs. An unhealthy one shows Active: failed (Result: exit-code) or activating (auto-restart) — the second one means it crashed and systemd is retrying. The most common cause I’ve hit is a corrupted /etc/gitlab-runner/config.toml after a botched manual edit.

# Healthy
● gitlab-runner.service - GitLab Runner
   Loaded: loaded (/etc/systemd/system/gitlab-runner.service; enabled)
   Active: active (running) since Mon 2025-01-13 09:14:22 UTC; 3h 22min ago
 Main PID: 1423 (gitlab-runner)

# Unhealthy — crashed and stuck
   Active: failed (Result: exit-code) since Mon 2025-01-13 12:31:07 UTC; 4s ago
  Process: 8821 ExecStart=/usr/bin/gitlab-runner run (code=exited, status=1/FAILURE)

Following logs in real time

Forget grepping through log files. Since the runner runs under systemd, all output goes to the journal:

# Tail live logs — essential when debugging a stuck pipeline
sudo journalctl -u gitlab-runner -f

# Last 100 lines, then follow
sudo journalctl -u gitlab-runner -n 100 -f

# Filter by time — useful for post-mortem on a 3am failure
sudo journalctl -u gitlab-runner --since "2025-01-13 03:00:00" --until "2025-01-13 04:00:00"

The logs will show job pickup, executor spin-up, and any fatal errors. If a job hangs with no output, the runner log will usually show the executor forked but never completed — that’s almost always a Docker network or registry timeout issue, not a code problem.

Auto-restart is already configured — here’s how to confirm it

The generated unit file includes Restart=always out of the box, but I always verify this on any machine I didn’t personally provision:

sudo cat /etc/systemd/system/gitlab-runner.service | grep -i restart
# Should output:
# Restart=always
# RestartSec=42s

The 42 second delay on RestartSec is intentional — it prevents a tight crash loop from hammering your GitLab instance with reconnection attempts. If you’re on a high-traffic runner and want faster recovery, you can drop it to RestartSec=10s. If you edit the unit file directly, run sudo systemctl daemon-reload && sudo systemctl restart gitlab-runner afterward or the change won’t take effect.
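The drop-in route via systemctl edit is the cleaner way to make that change, because it reloads units for you and survives package upgrades that rewrite the main unit file:

sudo systemctl edit gitlab-runner
# In the editor that opens, add:
#   [Service]
#   RestartSec=10s
sudo systemctl restart gitlab-runner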

Updating the runner without nuking active jobs

Updating during a deploy pipeline or a long test suite is a bad time. The runner process gets replaced mid-execution and GitLab marks the job as failed with a confusing “runner disconnected” error. I do updates during off-hours — midnight maintenance windows for anything serving production CI.

# On Debian/Ubuntu — runner was installed via the official apt repo
sudo apt-get update && sudo apt-get install --only-upgrade gitlab-runner

# Check what version you're upgrading to before committing
apt-cache policy gitlab-runner

# After upgrade, verify the service recovered
sudo systemctl status gitlab-runner
sudo gitlab-runner --version

Before upgrading, check the runner changelog for the version you’re jumping to. There have been breaking changes in executor config between minor versions — particularly around the Docker executor’s privileged mode and certificate handling. Jumping multiple minor versions at once has burned me before. Upgrade one minor version at a time if you’re more than two behind.
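With the hold from earlier in place, a deliberate stepwise upgrade looks like this; the version number is an example:

# Release the hold, step one version, re-hold
sudo apt-mark unhold gitlab-runner
sudo apt install gitlab-runner=17.0.1
sudo apt-mark hold gitlab-runner

# Confirm the service recovered and the version took
sudo systemctl status gitlab-runner
gitlab-runner --version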

Gotchas I Hit in Production

The one that burned me hardest wasn’t even a code problem — it was timezones. My GitLab instance was configured to display in America/New_York but the runner host was running in UTC. Scheduled pipelines were firing 5 hours late. The UI showed “midnight” but the runner saw 5am. Fix it by pinning the runner host timezone explicitly:

# On Ubuntu/Debian runner host
sudo timedatectl set-timezone UTC

# Then in GitLab UI: open each pipeline schedule and check its cron timezone selector
# GitLab stores schedules in UTC internally — your display offset is the trap

The safest pattern is to keep the runner host in UTC and remember that every cron-style schedule you type in the GitLab UI is interpreted in the server’s local time. If your self-hosted GitLab server has a non-UTC timezone set, your schedule times in the UI don’t map 1:1 to what UTC midnight means on the runner. I now keep both server and runner in UTC and mentally translate at schedule entry time. It’s annoying but it’s unambiguous.

Docker layer cache will silently eat your disk alive. A moderately busy runner doing 50–100 builds per day can fill 100GB in under two weeks if you’re building Node or Python images with fat node_modules layers. The fix is brutally simple but you have to actually schedule it:

# /etc/cron.d/docker-prune
# Runs at 3am daily — tune the --filter if you want to keep recent cache
0 3 * * * root docker system prune -f --volumes >> /var/log/docker-prune.log 2>&1

Don’t use docker system prune -a unless you want to nuke every cached layer including base images. The plain -f version only removes dangling images and stopped containers — that’s usually enough. Add --filter "until=24h" if you want to be smarter about it. Either way, monitor /var/lib/docker with something like du -sh /var/lib/docker/overlay2 before you trust it.

The concurrent limit vs. CPU starvation problem is subtle. Setting concurrent = 8 in /etc/gitlab-runner/config.toml on a 4-core box sounds fine — you’re thinking “maybe they won’t all be CPU-bound at once.” But without container-level CPU limits, every job competes for all 4 cores equally and the scheduler thrashes. Jobs that normally finish in 3 minutes start taking 12. The fix is to set CPU limits in your runner config:

concurrent = 4  # global, top-level setting — match actual core count or go slightly over with the limits below

[[runners]]
  name = "my-docker-runner"
  [runners.docker]
    cpus = "1.0"       # hard cap per container
    memory = "2g"      # prevent one job from eating all RAM too
    memory_swap = "2g" # disable swap for jobs — fail fast instead of thrash

A job wedged in the stuck state almost always means a zombie container is holding it open. GitLab’s UI timeout doesn’t clean it up — it just marks the job stuck and leaves the container running on the host. Your first two commands should be:

# See what's still running or exited but not removed
docker ps -a --filter "label=com.gitlab.gitlab-runner.job.id"

# Then clean up dead runners from the coordinator's perspective
gitlab-runner verify --delete

# If a specific container is the zombie, force-remove it
docker rm -f <container-id>

The GIT_STRATEGY choice is genuinely situational. fetch is 3–10x faster on large repos because it only pulls new commits instead of re-cloning, but it preserves workspace state between jobs — which is exactly what breaks submodule workflows. If your pipeline uses git submodule update --init, a dirty fetch cache can leave you with stale submodule pointers that don’t match the current commit. I use fetch everywhere except repos with submodules, where I drop back to clone and accept the speed hit:

# In your .gitlab-ci.yml — set per-job, not globally, so you can be surgical
variables:
  GIT_STRATEGY: fetch       # fast path for most jobs

build-with-submodules:
  variables:
    GIT_STRATEGY: clone     # clean slate — slower but safe
    GIT_SUBMODULE_STRATEGY: recursive
  script:
    - git submodule update --init --recursive

Scaling Beyond One Runner

The first sign you need more runners isn’t a dashboard alert — it’s a developer complaining that their merge request has been sitting in “pending” for eight minutes. By the time you check the GitLab UI under Admin → CI/CD → Runners, you’ll see the queue depth climbing and jobs timestamped well before any runner picked them up. GitLab shows “waiting for runner” in the pipeline view, and if that number regularly exceeds 2-3 minutes during peak hours, you’re already behind. Don’t wait for it to become a 20-minute problem.

The cheap first move is bumping concurrent in your existing config.toml. If your machine has cores to spare, a runner sitting at concurrent = 1 is leaving capacity on the table:

# /etc/gitlab-runner/config.toml
concurrent = 8  # was 1 — this alone can unblock small teams

[[runners]]
  name = "primary-runner"
  executor = "docker"
  # ...each job still gets its own container, concurrency is safe here

That buys time. But once your build machine’s CPU or RAM is the bottleneck, more concurrency just makes everything slower. At that point you’re choosing between adding runners to the same host (bad idea past maybe 2-3 extra processes on a 4-core box) or spinning up separate VMs. Separate VMs give you real isolation — a runaway job doesn’t starve other builds, you can size machines differently per workload, and you can tear one down without touching CI availability. The downside is operational overhead: each VM needs the runner binary installed, registered, and monitored. For most teams that means a quick Ansible playbook or a Packer image rather than clicking around.

The pattern I use for multi-machine deployments is a single config.toml template committed to a private repo, with the registration token pulled from an environment variable at deploy time. This avoids token sprawl across machine snapshots:

# config.toml.template — committed to repo, no secrets here
concurrent = 4
check_interval = 0

[[runners]]
  name = "${RUNNER_HOSTNAME}"
  url = "https://gitlab.example.com/"
  token = "${RUNNER_TOKEN}"   # injected at provision time, never hardcoded
  executor = "docker"
  [runners.docker]
    image = "alpine:3.19"
    privileged = false
    volumes = ["/cache"]

# provision.sh — runs on each new VM
export RUNNER_TOKEN=$(vault kv get -field=token secret/gitlab/runner)
export RUNNER_HOSTNAME=$(hostname)
envsubst < config.toml.template > /etc/gitlab-runner/config.toml
systemctl restart gitlab-runner

This keeps every machine config identical except the name and token, which makes debugging much less painful when you have six runners and something breaks.

The Kubernetes executor is a different category entirely. Instead of pre-provisioning VMs, it spins up a pod per job using your cluster’s existing node pool, runs the job, then terminates the pod. The jump makes sense when you’re already running Kubernetes and want runners to scale to zero when idle (saving real money on cloud VMs), or when you have wildly different job resource requirements that don’t map well to a fixed VM size. The setup involves deploying the GitLab Runner Helm chart and configuring a config.toml inside a Kubernetes secret — the runner itself runs as a deployment, but job execution happens in ephemeral pods. It’s more moving parts, so I’d resist it until you’re actually hitting VM management pain or need dynamic scaling beyond what concurrent handles.

Runner health monitoring is something most teams skip until something breaks silently. The GitLab Runner binary ships an embedded Prometheus metrics server, but it is not enabled by default: set listen_address = ":9252" at the top level of config.toml (or pass --listen-address to gitlab-runner run) and restart the service. Once it’s up:

# verify it's alive
curl http://localhost:9252/metrics | grep gitlab_runner

# useful metrics to alert on:
# gitlab_runner_jobs{state="running"}  — current active jobs
# gitlab_runner_failed_jobs_total      — cumulative failures
# gitlab_runner_request_concurrency    — concurrent new-job requests in flight (pegged at its limit = overload)

Pipe that into your existing Prometheus scrape config with a static_configs target, then build a Grafana dashboard with two panels: running jobs over time and failed jobs rate. If you’re not running Prometheus yet, even curl-ing that endpoint from a cron job and alerting on gitlab_runner_failed_jobs_total increasing unexpectedly is better than nothing. The runner also logs to systemd journal — journalctl -u gitlab-runner -f during a deploy will surface registration errors, Docker socket permission issues, and network timeouts that the GitLab UI never shows you.
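The scrape-config side is small. A sketch with placeholder runner hostnames:

# prometheus.yml
scrape_configs:
  - job_name: "gitlab-runners"
    static_configs:
      - targets:
          - "runner-01.internal:9252"
          - "runner-02.internal:9252"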

