Zero-Downtime Database Migrations on Kubernetes: How I Stopped Fearing Deploy Day

The Problem: Why Database Migrations Are Still the Scariest Part of a Deploy

The Classic Failure Mode Nobody Warns You About

Here’s the scenario that’s burned me — and almost every backend dev I know — at least once: you push a migration that renames a column or adds a NOT NULL constraint. Your CI pipeline goes green. You kick off the deploy. Kubernetes starts rolling your pods, the migration job runs… and then the old pods, which are still serving traffic during the rollout, start throwing 500s because they’re trying to write to a column that no longer exists. Half your fleet is on the new code expecting user_email, half is on the old code writing to email. Your database is the referee in a fight it didn’t sign up for. Everything is on fire and your on-call rotation is about to have a very bad night.

The real damage isn’t just downtime. It’s in-flight requests that were mid-transaction when the schema changed under them. Think a user who just hit “Submit Order” — their request started on an old pod, got routed mid-way, and now the write fails with a constraint violation. You don’t get a clean error. You get partial data, silent failures, or worse, a corrupted record that looks fine until someone runs a report three weeks later. I’ve seen that exact scenario cause a refund nightmare at a company I consulted for.

Kubernetes makes this worse in ways that aren’t obvious until you’ve already been burned. Rolling updates sound safe — they are, for stateless config changes. But they’re actively dangerous with schema migrations because you now have a window, sometimes several minutes long, where multiple versions of your app are simultaneously live against the same database. Init containers feel like the solution — run your migration before the app starts — but they fire on every pod, which means if you have 6 replicas, your migration can run 6 times unless you’ve written it to be truly idempotent. And “truly idempotent” is harder than it sounds when you’re dealing with index creation or column renames.

# What you think init containers do:
# 1. Run migration once
# 2. Start app
# 3. Done

# What they actually do with 6 replicas:
# Pod 1 init container: starts migration
# Pod 2 init container: also starts migration (race condition)
# Pod 3 init container: also starts migration
# Your DB: receiving 3 concurrent ALTER TABLE statements
# You: refreshing Datadog at 2am

There’s also the terminationGracePeriodSeconds trap. By default it’s 30 seconds. If your migration takes 45 seconds and something triggers a pod restart mid-run, Kubernetes will SIGKILL the process at the 30-second mark, leaving your schema in a half-applied state. I switched my migration jobs to use dedicated Kubernetes Job resources with their own resource limits and restart policies specifically because of this — the app deployment and the migration need to be decoupled at the Kubernetes level, not just logically in your head.
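
Here’s roughly what that decoupling looks like: a minimal Job sketch (names and numbers illustrative) where the grace period is sized to the migration instead of inherited from the app’s defaults:

apiVersion: batch/v1
kind: Job
metadata:
  name: schema-migrate
spec:
  backoffLimit: 0
  activeDeadlineSeconds: 600            # hard cap on total runtime
  template:
    spec:
      restartPolicy: Never
      terminationGracePeriodSeconds: 120   # outlive your slowest migration, not the 30s default
      containers:
        - name: migrate
          image: flyway/flyway:10-alpine
          args: ["migrate"]
          resources:
            limits:
              cpu: 500m
              memory: 256Mi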

What we’re actually solving here is a sequencing and compatibility problem. The goal isn’t “run migrations faster” — it’s making your schema changes and your application code changes independent of each other so there’s no moment where the two are incompatible. That means writing migrations in phases (expand, then contract), managing deploy order explicitly with Kubernetes hooks or separate Jobs, and making sure every schema state your database can be in during a rollout is one your app can handle without crashing. The pattern has a name — expand/contract migrations — and once you build your deploy pipeline around it, zero-downtime schema changes stop being a heroic effort and start being the default.

The Mental Model You Need Before Writing a Single YAML File

Most database disasters I’ve seen — and a few I’ve personally caused — come down to one thing: treating a schema migration like a code deploy. You write a migration, you push the code, they go out together, done. That works fine until you’re running more than one pod, have any meaningful traffic, or care about rollbacks. The expand/contract pattern (sometimes called parallel change) exists specifically to break that assumption. The core idea: your database and your application code are two separate things that change at different speeds, and your migration strategy has to respect that.

Here’s how the three steps actually work in practice:

  1. Expand — Add the new column, table, or index. Old code keeps running. New column is nullable or has a default so existing INSERT statements don’t break. Deploy this as its own release. Nothing else changes.
  2. Migrate data — Backfill existing rows in batches. Not UPDATE users SET new_col = old_col run as a single transaction against 4 million rows. That’s how you lock a table for 8 minutes at 11am on a Tuesday.
  3. Contract — Deploy the new application code that reads/writes the new column. Once that’s been stable for at least one full release cycle, drop the old column in a separate, later migration.
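
In SQL terms, the expand step is deliberately boring. A sketch, using the display_name column from the backfill example below:

-- Expand: additive only, old code keeps working, new column starts out nullable
ALTER TABLE users ADD COLUMN IF NOT EXISTS display_name TEXT;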

The backfill step is where I see people cut corners the most, so let me be specific. A safe batch backfill in PostgreSQL looks like this — you process rows in chunks using the primary key, sleep between batches, and keep each transaction small enough that your lock wait timeout doesn’t blow up:

DO $$
DECLARE
  batch_size INT := 1000;
  last_id    BIGINT := 0;
  max_id     BIGINT;
BEGIN
  SELECT MAX(id) INTO max_id FROM users;
  WHILE last_id < max_id LOOP
    UPDATE users
      SET display_name = first_name || ' ' || last_name
    WHERE id > last_id
      AND id <= last_id + batch_size
      AND display_name IS NULL;
    last_id := last_id + batch_size;
    COMMIT;  -- PostgreSQL 11+: ends the batch's transaction and releases its locks; run this block outside an explicit transaction
    PERFORM pg_sleep(0.05);
  END LOOP;
END $$;

That 50ms sleep is real. Without it, even batched updates will saturate I/O on a busy RDS instance. I found this out after a “quick backfill” spiked read latency for every other query running on the same host. The thing that caught me off guard the first time: the CPU and I/O hit from a backfill doesn’t show up on the migrating table alone — it degrades query performance across the whole database instance. Run it during low-traffic hours, and if your table is above a few million rows, consider running it as a Kubernetes Job so you can monitor it, restart it on failure, and limit its resource consumption via resources.limits.

Skipping the contract step — specifically, dropping the old column in the same deploy where you start using the new one — is the exact failure mode that generates 3am pages. Here’s why: Kubernetes rolling updates mean you will, for some window of time, have pods running the old code and pods running the new code simultaneously. If the old pods expect a column that you just dropped, they crash. If the new pods write to a column the old pods don’t know about yet, you get constraint violations or silent data loss depending on your schema. Neither is acceptable. The multi-release cadence isn’t bureaucracy — it’s the only way to guarantee that any version of your application code can talk to any version of your schema during a deploy.

One more sharp edge worth knowing before you write any YAML: this pattern assumes your migrations are idempotent. If your migration tool re-runs ALTER TABLE ADD COLUMN without IF NOT EXISTS, a pod restart mid-migration will explode. In Flyway, that means versioned migrations only — never edit a migration that’s already been applied. In Liquibase, the runOnChange flag is a trap unless you know exactly what you’re doing with it. With raw psql or golang-migrate, wrap every ALTER in a check. Migrations that can safely run twice are migrations that don’t page you at 3am.
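
What “wrap every ALTER in a check” means concretely, as a PostgreSQL sketch (table and constraint names illustrative):

-- Safe to re-run after a mid-migration pod restart
ALTER TABLE orders ADD COLUMN IF NOT EXISTS shipped_at TIMESTAMPTZ;

-- For operations with no IF NOT EXISTS variant, guard via the catalog
DO $$
BEGIN
  IF NOT EXISTS (SELECT 1 FROM pg_constraint WHERE conname = 'orders_total_check') THEN
    ALTER TABLE orders ADD CONSTRAINT orders_total_check CHECK (total >= 0);
  END IF;
END $$;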

Tooling Choice: Flyway vs Liquibase in a Kubernetes World

Flyway vs Liquibase: Pick One and Commit

The honest starting point: both tools do the job. The real question is how much config overhead your team will tolerate before someone disables the migration step in CI because it “kept breaking.” I’ve watched that happen with Liquibase on a team that didn’t need its complexity. They ended up running raw SQL by hand in production for three months. Pick the simpler tool unless you have a concrete reason not to.

Flyway: The One I Actually Reach For

Flyway’s mental model is dead simple — numbered SQL files in a directory, applied in order, never touched again. That’s it. Your migration directory looks like this:

db/migration/
  V1__create_users_table.sql
  V2__add_email_index.sql
  V3__backfill_user_roles.sql

And you run it from a Kubernetes init container like this:

initContainers:
  - name: flyway-migrate
    image: flyway/flyway:9.22-alpine
    args:
      - -url=jdbc:postgresql://$(DB_HOST):5432/$(DB_NAME)
      - -user=$(DB_USER)
      - -password=$(DB_PASSWORD)
      - -connectRetries=10
      - migrate
    envFrom:
      - secretRef:
          name: db-credentials

The -connectRetries=10 flag is one you’ll need in Kubernetes — without it, Flyway fires the migration before the database pod is ready and fails instantly. That’s the kind of thing that isn’t in the quick-start docs but bites you the first time you deploy on a cold cluster. Flyway Community 9.x is completely free and covers everything most teams need: versioned migrations, repeatable migrations (prefixed R__ for views and stored procedures), and baseline support for databases that already have data. Flyway Teams adds dry-run mode (outputs SQL without executing it) and undo migrations, and it’s priced per engine — check their site for current pricing, but expect it to run in the hundreds per year. Useful if you’re deploying to regulated environments where you need to preview exactly what hits production before it runs.

Liquibase: More Power, More Pain

Liquibase gives you XML or YAML changelogs, and its rollback support is genuinely better than Flyway’s out of the box. You can define an explicit rollback block per changeset, which matters when you’re managing schema changes across a heavily audited system. The tradeoff is that the config is verbose in a way that makes my eyes glaze over:

<databaseChangeLog
  xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
    http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-3.8.xsd">

  <changeSet id="1" author="dan">
    <addColumn tableName="users">
      <column name="last_login" type="TIMESTAMP"/>
    </addColumn>
    <rollback>
      <dropColumn tableName="users" columnName="last_login"/>
    </rollback>
  </changeSet>
</databaseChangeLog>

That’s a lot of markup to add one column. The thing that caught me off guard the first time was Liquibase’s error messages — they’re frequently cryptic stack traces that bury the actual problem four lines down. A checksum mismatch (which happens if you edit a changeset after it’s been applied) throws a wall of Java exception text when a two-sentence error would do. Budget time for debugging the tooling itself, not just your migrations.

When Liquibase Actually Makes Sense

If your organization runs Oracle in one product, MySQL in another, and Postgres everywhere else, Liquibase’s database-agnostic changelog format pays off. Writing one changelog that deploys correctly across all three engines without manual adaptation is something Flyway can’t match at that level. That’s the specific scenario where I’d choose it. Another legitimate case: if your ops team requires formal rollback scripts attached to every schema change for compliance reasons, Liquibase’s explicit rollback blocks give auditors something concrete to review. For everyone else — startups, product teams, any org running a single database engine — the added ceremony doesn’t buy you enough to justify it.

  • Choose Flyway if your team writes SQL fluently, you’re running one database engine, and you want migrations to just work without a dedicated config file ecosystem.
  • Choose Liquibase if you’re managing schema changes across multiple database vendors in the same organization, or if compliance requires explicit, attached rollback scripts per change.
  • Flyway Community 9.x is free and covers versioned migrations, repeatable migrations, and baseline — no license needed for production.
  • Flyway Teams adds dry-run and undo migrations — worth evaluating if you’re deploying to environments where “preview before apply” is a hard requirement.

Pattern 1: Running Migrations as a Kubernetes Job Before Deployment

The cleanest migration setup I’ve shipped uses a Kubernetes Job that runs before the new Deployment ever starts. Flyway runs, applies every pending migration, exits zero — then and only then does your app roll out. If Flyway exits non-zero, the whole release halts. Your old pods keep serving traffic. Nobody gets paged at 2am because a column doesn’t exist yet.

Here’s the actual Job spec. Drop this in your Helm chart under templates/migration-job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "myapp.fullname" . }}-migrate
  annotations:
    "helm.sh/hook": pre-upgrade,pre-install
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  ttlSecondsAfterFinished: 120
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: flyway-migrate
          image: flyway/flyway:10-alpine
          args: ["migrate"]
          env:
            - name: FLYWAY_URL
              value: "jdbc:postgresql://postgres-svc:5432/myapp"
            - name: FLYWAY_USER
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: username
            - name: FLYWAY_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
            - name: FLYWAY_LOCATIONS
              value: "filesystem:/flyway/sql"
          volumeMounts:
            - name: migrations
              mountPath: /flyway/sql
      volumes:
        - name: migrations
          configMap:
            name: {{ include "myapp.fullname" . }}-migrations

The helm.sh/hook: pre-upgrade,pre-install annotation is what enforces ordering. Helm sees that annotation and runs the Job to completion before touching your Deployment. The hook-weight: "-5" matters if you have multiple pre-upgrade hooks — lower numbers run first, so you can sequence a schema migration before, say, a cache-warming job. The thing that caught me off guard the first time: you need both pre-install and pre-upgrade unless you only care about upgrades. First deploy on a fresh namespace will skip a hook that’s only tagged pre-upgrade.

The gotcha that will quietly destroy your namespace if you ignore it: Jobs don’t clean themselves up. Every helm upgrade creates a new Job object. Without ttlSecondsAfterFinished: 120, those completed Jobs (and their pods) accumulate indefinitely. I’ve seen namespaces with 200+ dead migration pods because someone forgot this. Set ttlSecondsAfterFinished: 120 — two minutes is plenty of time to grab logs if something went wrong — and the Job plus its pod disappear automatically. The before-hook-creation delete policy handles the edge case where the previous Job is still hanging around when you run another upgrade.

Set backoffLimit: 0. This is non-negotiable. Flyway migrations are not safe to retry blindly — some operations are partially applied before a failure. You want the Job to fail hard and fast on the first error, not Kubernetes optimistically rerunning it three times and making your schema state ambiguous. When the Job exits non-zero, Helm marks the release as failed and rolls back to the previous revision automatically. Your old Deployment — with the schema it was built against — keeps running. You get time to look at kubectl logs job/myapp-migrate, fix the migration file, and redeploy cleanly.

One honest trade-off: this pattern requires your migrations to be backward compatible. The old app version will run against the new schema for however long the rollout takes. Adding a nullable column? Fine. Dropping a column the old code still reads? That’s a three-release process — add the column, stop using it in code, then drop it. The Job-before-Deployment pattern enforces correctness, but it doesn’t remove the discipline required to write safe migrations. If your team isn’t writing backward-compatible SQL, you’ll still have incidents — they’ll just be different ones.

Pattern 2: Init Containers — When You Want Migrations Tightly Coupled to the Pod

The init container runs first, then your app — that’s the whole mechanic, and it’s deceptively useful

An init container completes before Kubernetes starts your main application container. Full stop. If the init container exits with a non-zero code, the pod restarts. If it exits cleanly, your app boots. For database migrations, this sounds ideal: migrations run, then your app starts with a guaranteed schema. I used this pattern on three projects before I understood exactly where it falls apart at scale.

The problem surfaces the moment you run more than one replica. Say you have 3 replicas of your API deployment. Kubernetes spins up all 3 pods roughly simultaneously. All 3 init containers pull the Flyway image and immediately run flyway migrate against the same database. Flyway serializes migrations with a lock (a session-level advisory lock on PostgreSQL, a lock around its flyway_schema_history table on other databases), so you won’t corrupt your schema — but you’ll still get two of those init containers sitting there blocked, waiting for the lock, logging confusing output, and delaying your pod startup. In a Deployment rollout with maxUnavailable: 1 and maxSurge: 1, that blockage can cascade into what looks like a deployment hang. The thing that caught me off guard was how silent it was — the pod just sat in Init:0/1 and Kubernetes didn’t surface any obvious reason why.
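
If you suspect you’re in that state, the init container’s logs tell you immediately (pod label illustrative; container name matches the spec below):

kubectl get pods -l app=api-server          # pods stuck in Init:0/1
kubectl logs <pod-name> -c flyway-migrate   # shows Flyway blocked waiting on the lock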

Here’s what the spec actually looks like. This is a real Flyway init container config I’ve shipped:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  template:
    spec:
      volumes:
        - name: migration-scripts
          configMap:
            name: flyway-migrations
      initContainers:
        - name: flyway-migrate
          image: flyway/flyway:10.10-alpine
          command:
            - flyway
            - -url=jdbc:postgresql://$(DB_HOST):5432/$(DB_NAME)
            - -user=$(DB_USER)
            - -password=$(DB_PASSWORD)
            - -locations=filesystem:/flyway/sql
            - migrate
          env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: host
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: username
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
            - name: DB_NAME
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: dbname
          volumeMounts:
            - name: migration-scripts
              mountPath: /flyway/sql
      containers:
        - name: api
          image: your-org/api-server:latest
          # ... rest of your container spec

Notice the migration scripts come in via a ConfigMap mounted as a volume. You could also bake them into a custom image derived from flyway/flyway, which is cleaner for CI/CD — but the ConfigMap approach means you can update migrations without rebuilding every image, which is useful during development. I’d go custom image in production for reproducibility, ConfigMap mount in local dev for iteration speed.
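
For the record, building that ConfigMap from your migration directory is one command (namespace illustrative, directory layout from the Flyway section earlier):

kubectl create configmap flyway-migrations --from-file=db/migration/ -n dev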

When this pattern actually makes sense

  • Single-replica services: One pod means one init container. No lock contention, no ambiguity. This is the sweet spot for the pattern.
  • Local dev with kind or minikube: You almost always run 1 replica locally, and the tight coupling means you never have to remember to run migrations separately before testing.
  • When you control the migration tool’s locking: Flyway’s lock behavior is well-documented. Liquibase has the same. If you’re rolling your own SQL migration runner that doesn’t lock, skip init containers entirely — you will get double-applied migrations eventually.
  • Stateful services that already enforce single-instance: If your deployment is intentionally replicas: 1 for other reasons (say, it writes to a local PVC), init containers are a natural fit.

I’d actively avoid this pattern for any service that autoscales. When HPA fires up new replicas during a traffic spike, those new pods all run init containers — which means they all try to acquire the Flyway lock while your app is under load. Most of the time nothing breaks, but your new pods take longer to become ready exactly when you need them fastest. That’s a trade-off I stopped accepting once I had a cleaner alternative available.

Pattern 3: Argo CD + Sync Waves for Multi-Step Migration Pipelines

If your team is already running Argo CD, you’re sitting on a sequencing primitive that most people ignore: sync waves. The idea is simple — annotate resources with a wave number, and Argo CD processes them in ascending order, waiting for each wave to reach a healthy state before touching the next. For database migrations, this maps perfectly: your migration Job goes in wave 0, your Deployment goes in wave 1. The application won’t roll out until the schema changes are done.

Here’s what the actual manifests look like. On your migration Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    argocd.argoproj.io/sync-wave: "0"
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: myapp:v2.3.1
          command: ["python", "manage.py", "migrate"]
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: url

And on your Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  replicas: 3
  # ... rest of your spec

When you push both to git and Argo CD syncs, it applies the Job first, watches it, and only moves to the Deployment once that Job hits Succeeded. Not Running, Succeeded. This distinction is the thing that caught me off guard the first time I set this up.

The backoffLimit setting is where people get burned badly. The default is 6. If your migration crashes on startup — bad env var, wrong image tag, whatever — Kubernetes retries the Job six times before marking it as Failed. Argo CD will sit there waiting through all six attempts, which depending on your activeDeadlineSeconds could be ten or fifteen minutes of total blockage. Set backoffLimit: 2 at most. I usually go with 1 for migration jobs specifically, because a migration that fails twice probably has a real problem that retrying won’t fix, and I’d rather the sync fail fast so the engineer on call sees the alert immediately instead of after Kubernetes has spent 20 minutes churning through retries.

You can also layer in a wave -1 for pre-flight checks — a Job that validates the database connection or checks the current schema version before the migration even starts. That looks like this:

metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1"

Argo CD handles negative wave numbers fine. The full sequence becomes: wave -1 (preflight) → wave 0 (migration) → wave 1 (application) → wave 2 (smoke test Job if you want one). Each step has to go healthy before the next one touches your cluster. Compare this to init containers or Helm hooks: sync waves operate at the application level across multiple resources, which makes them much easier to visualize in the Argo CD UI — you can literally watch each wave go green in sequence. The honest trade-off is that this only works if you’re fully committed to Argo CD as your delivery mechanism. If you’re half-using Argo CD and half-running kubectl apply manually for some resources, the wave ordering won’t be reliable and you’ll spend a confusing afternoon figuring out why things applied out of order.

  • Use argocd app wait --health in your CI pipeline if you want to block a PR merge until the sync completes — don’t just trigger a sync and assume it worked (exact invocation after this list)
  • Set activeDeadlineSeconds on your migration Job or Argo CD will wait indefinitely if the pod gets stuck in Pending (e.g., node pressure, missing PVC)
  • If you use syncPolicy: automated, be deliberate — an automated sync on a broken migration will keep retrying on every git push until you manually intervene
  • The Argo CD UI shows wave numbers in the resource graph; check the “Sync Status” tab, not just the app-level health badge, to see which wave actually blocked
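
The wait command from the first bullet is short; in a CI step it looks like this (app name and timeout illustrative):

argocd app wait myapp --health --timeout 600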

Writing Migrations That Can’t Break a Running App

The Additive-Only Rule Is the Only Rule That Matters

Every production incident I’ve seen caused by a database migration comes down to the same mistake: the migration and the new application code deploy together, and the migration removes or renames something the old application code still depends on. In Kubernetes, during a rolling deploy, you will have pods running the old app version and pods running the new one at the same time. If your migration drops a column that the old pods are still querying, those pods start throwing 500s immediately. The fix is simple but non-negotiable: the migration that ships with a release must only add things. Never rename. Never drop. Never change a column type. Never add a NOT NULL constraint without a default. Removals and renames happen in a later deploy, after every pod has been updated and the old code is completely gone.

Adding NOT NULL Columns Without Blowing Up

The classic trap is adding a NOT NULL column in a single migration. PostgreSQL has to rewrite the entire table if there’s no default, and even with a default it will lock the table in older versions. The right sequence is three separate migrations across three separate deploys:

  1. Deploy 1: Add the column as nullable, with a DEFAULT.
  2. Deploy 2: Backfill existing rows, then add the NOT NULL constraint.
  3. Optionally drop the DEFAULT in Deploy 3 if you don’t want it persisted.

-- Migration 1: safe to run while app is live
ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMPTZ DEFAULT NULL;

-- Migration 2: after all pods are running code that writes shipped_at
UPDATE orders SET shipped_at = created_at WHERE shipped_at IS NULL;
ALTER TABLE orders ALTER COLUMN shipped_at SET NOT NULL;

The backfill in Migration 2 can be painful on a large table. I batch it instead of running one giant UPDATE — something like WHERE id BETWEEN x AND y in a loop from a one-off job — because a single UPDATE locking millions of rows for 30 seconds is its own kind of outage.

Index Creation: Flyway Won’t Save You Here

This caught me off guard the first time. Flyway executes your SQL verbatim, which means if you write CREATE INDEX, you get a full table lock for the duration. On a table with 50 million rows, that’s minutes. The fix is CREATE INDEX CONCURRENTLY in PostgreSQL, which builds the index without holding a lock — but it can’t run inside a transaction block. Flyway wraps migrations in transactions by default.

You need to disable that per-migration. Recent Flyway versions support a script config file that sits next to the migration and applies only to it:

-- V7__add_orders_user_id_index.sql
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);

# V7__add_orders_user_id_index.sql.conf
executeInTransaction=false

Test this in staging first; the mechanism for per-script transaction control has changed across Flyway major versions, so check your exact version’s docs rather than copy-pasting blindly.

Long-Running Migrations and Connection Timeouts

If your database has a statement_timeout set (and it should), a long migration will get killed partway through and leave your schema in a half-applied state. The solution is a dedicated migration database user that has a higher or no statement timeout, separate from the application user:

-- Create a migration user with no statement timeout
CREATE ROLE flyway_runner LOGIN PASSWORD 'secure_password';
ALTER ROLE flyway_runner SET statement_timeout = 0;
GRANT ALL ON DATABASE myapp TO flyway_runner;

Then in your Flyway config or Kubernetes Job spec:

flyway.url=jdbc:postgresql://db-host:5432/myapp
flyway.user=flyway_runner
flyway.password=${FLYWAY_PASSWORD}
flyway.connectRetries=10
flyway.connectRetriesInterval=5

connectRetries matters because during a Kubernetes deploy the database pod might be briefly unavailable or the service DNS hasn’t resolved yet. Without retries, the migration job fails on connect and your init container or pre-deploy hook exits with an error, blocking the rollout. Ten retries at five-second intervals gives you 50 seconds of breathing room, which covers most transient connectivity issues I’ve encountered.

Renaming a Column the Safe Way: Three Deploys, No Downtime

This is the real-world example that’s worth walking through completely. Say you have a users table with a column called full_name and you want to rename it to display_name. The wrong way: one migration with ALTER TABLE users RENAME COLUMN full_name TO display_name deployed alongside code that references display_name. Old pods immediately break. Here’s how to do it without anyone noticing:

Deploy 1 — Add the new column, dual-write in app code:

ALTER TABLE users ADD COLUMN display_name TEXT;
UPDATE users SET display_name = full_name;

Your application code in this deploy writes to both full_name and display_name, reads from full_name. Old pods still work fine.
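
At the SQL level, “writes to both” is literal. Every statement that used to set full_name now sets both columns (parameter placeholders illustrative):

-- Deploys 1 and 2: keep both columns in sync on every write
UPDATE users
   SET full_name    = $1,
       display_name = $1
 WHERE id = $2;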

Deploy 2 — Switch reads to the new column:

No migration needed. Just update the app to read from display_name and write to both. Once all pods are on this version, full_name is no longer being read anywhere.

Deploy 3 — Drop the old column, stop dual-writing:

ALTER TABLE users DROP COLUMN full_name;

App now only writes to display_name. Done.

Yes, this takes three separate deploys spanning potentially days. That’s the price. The alternative is a 3am incident where you’re manually running ALTER TABLE to put the old column back while half your users get errors. I’ve paid that price once — three deploys is cheap by comparison.

Handling Rollbacks Without Losing Your Mind

Here’s the hard truth nobody wants to say out loud: you cannot roll back a dropped column. You can roll back your application code in about 30 seconds with a Kubernetes deployment rollout, but if your migration deleted data, that data is gone. I’ve seen teams trigger an “emergency rollback” after a bad deploy, watch the app come back up on the old code, and then spend three days figuring out why everything was broken — because the old code was querying a column that no longer existed. Rollback in your head means “go back to how things were.” Rollback in reality means “go back to old code, but keep the new schema.”

This is exactly why the expand/contract pattern isn’t optional — it’s survival. Your old app version must be able to run against the new schema. Full stop. During the “expand” phase, you add the new column, the new table, whatever you need, while keeping everything the old code depends on intact. Your old pods and new pods run simultaneously during a rolling deploy, and both of them need to work. The “contract” phase — where you actually remove the old stuff — only happens in a separate migration, deployed after you’ve confirmed the new code is stable and the old code is completely gone from your cluster. If you try to compress both phases into one migration, you will have an incident.

Flyway’s undo migrations exist if you’re on a paid tier (check their pricing page for current numbers). I’ve used them. They’re genuinely useful for development environments and for carefully controlled, reversible changes like adding a nullable column. But I want to be direct: they are not a magic undo button for production. An undo script is SQL you write yourself that reverses the change — Flyway doesn’t generate it. If your V3 migration adds a column and your U3 undo drops it, you just wrote the undo yourself. For anything involving data transformation, your undo script can corrupt data if the new code already wrote rows in the new format. Treat undo migrations as a useful escape hatch for simple structural changes, not as a safety net for complex ones.
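
For the simple structural case, a versioned/undo pair is exactly what it sounds like (file names illustrative):

-- V3__add_last_login.sql
ALTER TABLE users ADD COLUMN last_login TIMESTAMPTZ;

-- U3__add_last_login.sql
ALTER TABLE users DROP COLUMN last_login;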

The practical thing I started doing that actually saved me twice: stamp your database with the app version before every migration runs. I keep a simple metadata table:

CREATE TABLE IF NOT EXISTS schema_deployment_log (
  id SERIAL PRIMARY KEY,
  app_version VARCHAR(50) NOT NULL,
  migration_version VARCHAR(20) NOT NULL,
  deployed_at TIMESTAMPTZ DEFAULT NOW(),
  deployed_by VARCHAR(100)
);

Then, before Flyway executes anything, the deploy pipeline fires off a one-shot pod to stamp the log:

kubectl run db-stamp --image=postgres:15 --restart=Never \
  --env="PGPASSWORD=$DB_PASSWORD" -- \
  psql -h $DB_HOST -U $DB_USER -d $DB_NAME \
  -c "INSERT INTO schema_deployment_log (app_version, migration_version, deployed_by) VALUES ('$APP_VERSION', '$FLYWAY_TARGET', '$DEPLOYER');"

When something breaks at 2am and you’re staring at a database that looks wrong, this table tells you exactly what migrations ran with which deploy, who triggered it, and when. Without it, you’re diffing Flyway’s history table against your git log and trying to correlate timestamps across systems. With it, you can immediately see “okay, V12 ran at 03:14 with app version 2.7.1, and that’s when alerts started firing.” That correlation is worth more than almost any other debugging tool I’ve reached for during a production incident.
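
The 2am query against that table is trivial, which is the point:

SELECT app_version, migration_version, deployed_at, deployed_by
  FROM schema_deployment_log
 ORDER BY deployed_at DESC
 LIMIT 5;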

  • Reversible changes (safe to undo): adding a nullable column, creating a new index, adding a new table — undo scripts work fine here
  • Dangerous changes (no real rollback): dropping a column, renaming a column, changing a column type, deleting rows — your only real option is a restore from backup or accepting the new state
  • The middle ground: data backfills — always do these in a separate migration from the schema change, and always keep the old column until you’re 100% sure the backfill is correct

Secrets Management: Don’t Put Your DB Password in the Job Spec

Stop Hardcoding Credentials — Here’s What to Do Instead

I’ve seen Job specs in production with the database password sitting right there in the YAML, committed to Git, indexed by GitHub’s search. The dev who did it wasn’t careless — they were just moving fast and nobody reviewed it. The fix is a ten-minute change that permanently eliminates that category of mistake. Use envFrom with a Kubernetes Secret, and the credential never touches your Job manifest at all.

Here’s what a proper Secret manifest looks like for a Flyway migration job:

apiVersion: v1
kind: Secret
metadata:
  name: flyway-db-credentials
  namespace: migrations
type: Opaque
stringData:
  FLYWAY_URL: "jdbc:postgresql://db.internal:5432/myapp"
  FLYWAY_USER: "migration_user"
  FLYWAY_PASSWORD: "your-password-here"

Apply it with kubectl apply -f secret.yaml — and then immediately remove that file from your repo or add it to .gitignore. Better yet, generate the secret with a script that pulls from your actual secret store rather than a static file. Now reference it in the Flyway Job like this:

spec:
  template:
    spec:
      containers:
        - name: flyway
          image: flyway/flyway:10-alpine
          envFrom:
            - secretRef:
                name: flyway-db-credentials
          args: ["migrate"]
      restartPolicy: Never

Flyway reads FLYWAY_URL, FLYWAY_USER, and FLYWAY_PASSWORD directly from the environment. No flags, no config file mounting, no credential visible in kubectl describe job. That’s already miles better. But the thing that caught me off guard early on: Kubernetes Secrets are base64-encoded by default, not encrypted at rest — unless you’ve explicitly configured envelope encryption with a KMS provider. Check your cluster config before assuming “it’s in a Secret so it’s safe.”
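
Seeing that for yourself takes one command; anyone with read access to the Secret gets the plaintext back:

kubectl get secret flyway-db-credentials -n migrations \
  -o jsonpath='{.data.FLYWAY_PASSWORD}' | base64 -d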

Go Further: External Secrets Operator + AWS Secrets Manager or Vault

The real unlock is External Secrets Operator. I switched to it because credential rotation was becoming a manual chore — someone rotates the DB password in AWS Secrets Manager, then someone else has to remember to update the Kubernetes Secret, and there’s always a gap. With ESO, you define an ExternalSecret resource that syncs from Secrets Manager on a schedule you control (every 1h, every 5m, whatever). When the password rotates, Kubernetes picks it up automatically without anyone touching a manifest. The operator itself is straightforward to install via Helm and the AWS provider setup takes maybe 30 minutes the first time, mostly spent on IAM policy configuration.

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: flyway-credentials
  namespace: migrations
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secretsmanager
    kind: ClusterSecretStore
  target:
    name: flyway-db-credentials
  data:
    - secretKey: FLYWAY_PASSWORD
      remoteRef:
        key: prod/myapp/db
        property: password
    - secretKey: FLYWAY_USER
      remoteRef:
        key: prod/myapp/db
        property: username

The trade-off: ESO adds a CRD dependency and you need to manage the ClusterSecretStore configuration with proper IRSA (IAM Roles for Service Accounts) permissions. The docs have improved a lot but the first time you hit an auth error from the operator, the error messages are not always obvious about whether the problem is the IAM policy, the secret path, or the property name. Budget an afternoon for first-time setup.

Least-Privilege Matters More Than You Think Here

Your migration user should not be your application user, and it definitely should not be a superuser. I create a dedicated migration_user in PostgreSQL with exactly these permissions and nothing else:

CREATE USER migration_user WITH PASSWORD 'strong-password';
GRANT CONNECT ON DATABASE myapp TO migration_user;
GRANT USAGE ON SCHEMA public TO migration_user;
GRANT CREATE ON SCHEMA public TO migration_user;
-- For ALTER TABLE on existing tables owned by another user:
ALTER TABLE existing_table OWNER TO migration_user;

The operations you need are CREATE TABLE, ALTER TABLE, CREATE INDEX, and DROP if your migrations clean up old structures. That’s it. If your migration job gets compromised or someone accidentally runs a migration against the wrong cluster, a least-privilege user limits the blast radius significantly. The gotcha with PostgreSQL specifically: ALTER TABLE requires ownership of the table, not just a privilege grant, so you either need to own all objects as the migration user or use a superuser to transfer ownership during initial setup. Plan for this before you’re staring at a permission denied error at 2am.
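
One way to handle the ownership requirement up front, run once by an admin during initial setup, is a loop over the schema (a sketch for the public schema):

DO $$
DECLARE
  t TEXT;
BEGIN
  FOR t IN SELECT tablename FROM pg_tables WHERE schemaname = 'public' LOOP
    EXECUTE format('ALTER TABLE public.%I OWNER TO migration_user', t);
  END LOOP;
END $$;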

  • Use envFrom: secretRef — never put credentials directly in the Job spec or as plain env values you’ll accidentally commit
  • Enable KMS envelope encryption for Secrets at rest — EKS supports this natively through the cluster config, GKE does it by default
  • External Secrets Operator is the right call if you’re already using AWS Secrets Manager or Vault — the rotation story alone justifies the setup cost
  • Separate migration user from app user — if you’re using the same DB credentials for migrations and the running app, you’re giving your migration process way more access than it needs
  • Audit your existing Jobs right now — run kubectl get jobs -A -o yaml | grep -i password and see what comes back

Testing Your Migration Strategy Before It Matters

The Staging Environment Lie (And How to Fix It)

Most staging environments are a lie. You’ve got 10,000 rows in your orders table, your migration runs in 1.8 seconds, you ship it, and then it locks your production table with 47 million rows for 22 minutes on a Tuesday afternoon. I’ve watched this happen. The fix isn’t complicated — it’s just uncomfortable because restoring a production-sized dataset into staging takes time and disk space nobody wants to provision. Do it anyway. A pg_dump with --schema-only plus a sample of real data gets you 80% of the way there. For anything touching indexed columns, you need the full row count.
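
The restore recipe I mean, roughly (connection-string variables and table name illustrative; scrub anything sensitive before it lands in staging):

# Schema first, then a production-sized slice of the hot table
pg_dump "$PROD_URL" --schema-only | psql "$STAGING_URL"
psql "$PROD_URL" -c "\copy (SELECT * FROM orders ORDER BY id DESC LIMIT 5000000) TO stdout" \
  | psql "$STAGING_URL" -c "\copy orders FROM stdin"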

The specific gotcha I kept hitting: ALTER TABLE ... ADD COLUMN with a non-null default rewrites the entire table in Postgres versions below 11. On a 50M row table, that’s a full table rewrite under an ACCESS EXCLUSIVE lock. Postgres 11+ avoids the rewrite for non-volatile defaults, but on an older version, or with a volatile default, your staging test will catch it immediately because the migration job will just… sit there. You want to catch that before your on-call gets paged at 2am. Here’s how I time migrations in staging:

psql $STAGING_DB_URL -c "\timing" -c "
  ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ;
"
-- Time: 847.221 ms (safe)

-- vs.

ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ NOT NULL DEFAULT now();
-- Time: 312847.003 ms (DO NOT SHIP THIS)

Flyway Validate in CI — Not Optional

The thing that caught me off guard was how easy it is to accidentally edit a committed migration file. Someone fixes a typo in V003__add_user_index.sql, the checksum changes, and Flyway refuses to run on any environment that already applied it. By the time you catch it, it’s blocking a deploy. Running flyway validate in CI costs you maybe 10 seconds and catches this immediately. Here’s what it looks like in a GitHub Actions step:

- name: Validate Flyway Migrations
  run: |
    flyway \
      -url="jdbc:postgresql://$STAGING_HOST:5432/$STAGING_DB" \
      -user=$STAGING_USER \
      -password=$STAGING_PASSWORD \
      -locations=filesystem:./db/migrations \
      validate
  env:
    STAGING_HOST: ${{ secrets.STAGING_HOST }}
    STAGING_USER: ${{ secrets.STAGING_USER }}
    STAGING_PASSWORD: ${{ secrets.STAGING_PASSWORD }}
    STAGING_DB: ${{ secrets.STAGING_DB }}

This runs against your actual staging database — not a fresh one — so it catches checksum mismatches on already-applied migrations. If you run it against a fresh DB every time, you miss the entire class of “someone edited a shipped migration” bugs. That distinction matters.

Chaos Test the Job Failure Path Before Production Does It For You

Here’s a test most teams skip entirely: manually kill your migration Job mid-run and verify your application keeps serving traffic. Your app should be running against the old schema at that point, with the new schema half-applied. If you haven’t tested this, you don’t actually know what happens — you’re just hoping. To simulate it:

# Start the migration job
kubectl apply -f k8s/migration-job.yaml

# Wait for it to be running
kubectl wait --for=condition=ready pod -l job-name=db-migration --timeout=30s

# Kill it hard
kubectl delete pod -l job-name=db-migration --grace-period=0 --force

# Now verify your app is still up
kubectl rollout status deployment/api
curl -f https://staging.yourapp.com/health

If your app crashes after the Job dies, your deployment strategy is broken regardless of how well the migration itself is written. This test also shows you exactly what your Job’s restartPolicy and backoffLimit do when a pod dies mid-run: whether you chose fail-fast (backoffLimit: 0) or a retry, you want to have watched that behavior in staging, not discovered it in production.

Assert Post-Migration State, Don’t Just Hope

After a migration runs, I want automated proof that the schema is what I think it is. pgTAP is the proper tool for this — it’s a full testing framework for PostgreSQL, and you write tests like this:

SELECT plan(3);

SELECT has_table('public', 'user_preferences', 'user_preferences table exists');
SELECT has_column('public', 'user_preferences', 'theme', 'theme column exists');
SELECT col_type_is('public', 'user_preferences', 'theme', 'text', 'theme is text type');

SELECT * FROM finish();

pgTAP is great but requires installation and setup time. If you want something you can wire up in 20 minutes, a psql schema diff script works fine for the common case:

#!/bin/bash
# post-migration-check.sh
EXPECTED_COLS="id,user_id,theme,created_at,updated_at"
ACTUAL_COLS=$(psql $DATABASE_URL -t -c "
  SELECT string_agg(column_name, ',' ORDER BY ordinal_position)
  FROM information_schema.columns
  WHERE table_name = 'user_preferences';
" | tr -d ' ')

if [ "$ACTUAL_COLS" != "$EXPECTED_COLS" ]; then
  echo "SCHEMA MISMATCH: expected $EXPECTED_COLS, got $ACTUAL_COLS"
  exit 1
fi
echo "Schema OK"

Run this as a step in your migration Job after Flyway finishes. If the schema doesn’t match expectations, the Job fails, your deployment pipeline stops, and nothing rolls forward. That’s the behavior you want — silent schema drift is how you end up with application code querying columns that don’t exist.
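
Wiring that into the Job is one shell chain (a sketch; assumes the script and migrations are baked into a custom image, name illustrative):

containers:
  - name: migrate-and-verify
    image: your-org/migrations:latest
    command: ["sh", "-c", "flyway migrate && ./post-migration-check.sh"]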

Quick Reference: What to Use When

The single rule I keep coming back to: your migration strategy should match your deployment infrastructure, not the other way around. I’ve seen teams bolt on Liquibase to a tiny three-person project because “it’s enterprise grade” and spend two weeks writing XML changelogs for a schema that Flyway would have handled in an afternoon. Pick the tool that fits the shape of your system right now, not the one you think you’ll need in two years.

Small team, Helm-based deploys → Flyway as a pre-upgrade hook Job

If you’re running Helm and you’ve got a handful of services, a Flyway Job wired up as a pre-upgrade hook is the lowest-friction path I’ve found. The migration runs, completes, and only then does Helm proceed with the actual rollout. Your templates/migration-job.yaml looks like this:

apiVersion: batch/v1
kind: Job
metadata:
  name: "{{ .Release.Name }}-flyway-migrate"
  annotations:
    "helm.sh/hook": pre-upgrade,pre-install
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: flyway
          image: flyway/flyway:10-alpine
          args:
            - -url=jdbc:postgresql://$(DB_HOST):5432/$(DB_NAME)
            - -user=$(DB_USER)
            - -password=$(DB_PASSWORD)
            - migrate
          envFrom:
            - secretRef:
                name: {{ .Release.Name }}-db-credentials

The hook-delete-policy: hook-succeeded annotation is the thing that catches people off guard — without it, completed Jobs pile up in your namespace and you’ll hit resource quota limits faster than you expect. The trade-off with this approach is that Helm will timeout if the migration takes too long (default is 300 seconds), so bump --timeout on your helm upgrade command if you’re running large backfills. I use helm upgrade --timeout 10m0s for anything non-trivial.

GitOps with Argo CD → sync waves, no exceptions

If Argo CD is managing your deployments, forget hooks — sync waves are the mechanism. The reason is that Argo CD doesn’t understand Helm hook annotations at the resource level the same way Helm does when you’re using its Application CRD. What actually works reliably is annotating your migration Job with argocd.argoproj.io/sync-wave: "-1" and your Deployment with argocd.argoproj.io/sync-wave: "0". Argo CD will reconcile resources in wave order, waiting for each wave to become healthy before proceeding.

# Migration Job
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded

# Your main Deployment
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "0"

The combination of PreSync hook and a negative sync wave is intentional — the PreSync hook ensures the Job runs before sync begins, and the wave ordering is your safety net if you ever have multiple migration jobs that need sequencing. I’ve been burned by teams that relied on wave ordering alone without the hook annotation, then hit a race condition during a sync retry after a partial failure. Use both.

Microservices with dozens of databases → one Flyway image per service, one Job per Helm chart

Shared migration containers across services sound like a good idea until service A’s new Flyway version breaks service B’s classpath expectations, or you need to pin a specific driver version for one database without affecting the others. I switched to baking a Flyway container image per service — it’s a few extra lines in each service’s Dockerfile but eliminates an entire class of dependency conflicts:

FROM flyway/flyway:10-alpine
COPY migrations/ /flyway/sql/
COPY flyway.conf /flyway/conf/flyway.conf

Each Helm chart owns its migration Job, its secrets, and its migration image. The operational overhead is real — you’re maintaining more images — but you get independent migration versioning, independent rollbacks, and no blast radius between services. Build these images in your same CI pipeline that builds the service image and tag them with the same commit SHA. That way you always know exactly which migration set corresponds to which service version.
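
In CI that’s just a second image build next to the service build (registry path and file names illustrative):

SHA=$(git rev-parse --short HEAD)
docker build -t registry.example.com/orders-api:$SHA .
docker build -t registry.example.com/orders-api-migrations:$SHA -f Dockerfile.migrations .
docker push registry.example.com/orders-api:$SHA
docker push registry.example.com/orders-api-migrations:$SHA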

Multi-vendor databases (Oracle + Postgres + MySQL) → Liquibase, and accept the verbosity

Flyway’s multi-vendor support technically exists but the docs get thin fast once you’re outside Postgres and MySQL. If you’re running Oracle in the mix — and I genuinely sympathize if you are — Liquibase is the better choice. Yes, the XML changelogs are verbose. A simple column rename that’s two lines in a Flyway SQL file turns into this in Liquibase:

<changeSet id="20240315-01" author="your-name">
  <renameColumn tableName="users"
    oldColumnName="fullname"
    newColumnName="full_name"
    columnDataType="VARCHAR(255)"/>
  <rollback>
    <renameColumn tableName="users"
      oldColumnName="full_name"
      newColumnName="fullname"
      columnDataType="VARCHAR(255)"/>
  </rollback>
</changeSet>

The real payoff is that the rollback block is built-in and the abstraction actually works across vendors. I’ve watched Liquibase run the same changeset against a Postgres dev database and an Oracle staging database without modification. That’s the only scenario where I’d recommend eating the config verbosity. If you’re running a single database vendor, stay with Flyway — the YAML format in Liquibase isn’t much cleaner than the XML, and you’re adding complexity for zero gain.

The init container trap: never use them for migrations on multi-replica services

This one gets teams into production incidents more than anything else in this space. Init containers run on every pod startup. If you have three replicas and you deploy, three migration init containers will attempt to run your migrations simultaneously. Flyway and Liquibase both use advisory locks to handle this, so usually one wins and the others wait — but “usually” is not a word you want associated with your production database migrations. The thing that actually breaks is when an init container on pod two is waiting on the lock, Kubernetes marks the pod as not ready, the readiness probe fails, and depending on your rollout strategy you can end up with all old pods terminated and all new pods stuck in init. I’ve seen this take down a service completely during a deployment that was supposed to be zero-downtime.

Use a Job. A single Job with completions: 1 and parallelism: 1 runs exactly once, in exactly one container, and your Deployment doesn’t start rolling out until the Job completes successfully. That’s the contract you want. Init containers feel convenient because they’re colocated with the app spec, but that convenience will cost you the first time you scale past one replica under load.
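
Spelled out, that contract is three fields on the Job spec (the first two are the defaults, shown explicitly because they are the point):

spec:
  completions: 1    # exactly one successful run
  parallelism: 1    # never more than one pod at a time
  backoffLimit: 0   # fail fast instead of retrying a half-applied schema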

