Crossplane + CI/CD: How I Stopped Fighting Kubernetes Config Drift and Actually Shipped Faster

The Problem: Your CI/CD Pipeline Is Doing Too Much Kubernetes Babysitting

The thing that finally broke me was watching our deploy pipeline sit at “Applying Terraform plan…” for 28 minutes straight. The actual application deploy β€” the thing the pipeline exists to do β€” took four minutes. We were spending 87% of our CI time provisioning infrastructure that, frankly, should have already been there waiting for us.

Here’s the mess most teams end up with after 18 months of organic growth: you’ve got a deploy-staging.yml workflow that runs kubectl apply -f k8s/staging/, a separate deploy-prod.yml that does something slightly different because someone patched it in a hurry last quarter, a provision-infra.yml that shells out to Terraform, and a setup-database.yml that someone wrote when the DB migration kept timing out. Four workflows, four different people touched them, zero shared state. The staging cluster has a 512MB memory limit on the API container. Prod has 1GB. Nobody remembers why. That’s environment drift, and it compounds quietly until one day your staging deploy succeeds and prod silently OOMs on startup.
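Concretely, the drift is sitting in two files that nobody ever diffs against each other β€” something like this (paths and values illustrative):

# k8s/staging/api.yaml
resources:
  limits:
    memory: "512Mi"   # set at launch, never revisited

# k8s/prod/api.yaml β€” patched in a hurry last quarter
resources:
  limits:
    memory: "1Gi"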

The deeper problem with provisioning RDS instances, S3 buckets, and IAM roles inside pipeline steps is that your pipeline now owns infrastructure state. Not Terraform Cloud, not your platform team β€” your ephemeral GitHub Actions runner. A failed run mid-provision leaves you with a half-created RDS instance and a state file that’s either locked or stale. I’ve personally spent three hours on a Friday night running terraform force-unlock 8a3b2c1d and praying, while a state file held hostage by a dead runner blocked every subsequent deploy. The blast radius of a bad Terraform run in CI isn’t just “the pipeline fails” β€” it can mean no one can deploy anything until you manually untangle the state.

# What a "simple" infra provisioning step actually looks like in CI
- name: Provision RDS
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    TF_VAR_db_password: ${{ secrets.DB_PASSWORD }}
    TF_VAR_environment: staging
  run: |
    cd infra/rds
    terraform init -backend-config="bucket=${{ secrets.TF_STATE_BUCKET }}"
    terraform plan -out=tfplan
    terraform apply tfplan
# Secrets count: 4, and growing every sprint

That secrets sprawl is a real operational tax. Every new piece of infra your pipeline touches adds more env vars to manage, rotate, and audit. You end up with 15 secrets in your repo settings, half of which are AWS credentials with way more permissions than any single pipeline step should need. Terraform in CI also has this specific failure mode where plan looks fine but apply hits a race condition β€” two branches triggered simultaneously, both passed plan, both tried to create the same security group. One wins, one corrupts your state. I’ve seen teams add terraform apply -lock-timeout=10m as a band-aid and call it done. It isn’t done.

The realization worth sitting with is that CI/CD pipelines are deployment tools, not infrastructure management systems. When your pipeline is doing both, neither job gets done well. Your developers are blocked on slow infrastructure provisioning before their code even runs, and your infrastructure is getting reconciled inconsistently depending on which branch triggered which workflow. The real fix is separating these concerns at the architectural level β€” which is exactly where Crossplane’s approach changes the calculus.

What Crossplane Actually Does (One Paragraph, Then We Move On)

Crossplane flips how you think about infrastructure provisioning: instead of a pipeline step that runs terraform apply and hopes the state file is intact, your AWS RDS instance or GCP Cloud SQL database becomes a Kubernetes object sitting in etcd, continuously reconciled by a controller. That reconciliation loop is the key shift β€” if someone manually changes a security group in the AWS console, Crossplane reverts the change. Not on the next pipeline run. Continuously. The control plane is the state, which means no terraform.tfstate in an S3 bucket, no state locking issues, no “who ran apply last and did they push the state?” conversations in Slack.

The practical implication for CI/CD is that your pipeline stops being the source of truth for infrastructure and starts being the thing that submits declarations. You push a YAML file, kubectl apply it (or Argo CD does), and Crossplane takes it from there. The pipeline’s job is done. You’re not waiting for terraform apply to finish in the middle of your pipeline β€” you can separate the concern entirely. Here’s what a Crossplane-managed RDS instance actually looks like in a repo:

apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: prod-postgres
spec:
  forProvider:
    region: us-east-1
    dbInstanceClass: db.t3.medium
    engine: postgres
    engineVersion: "16.1"
    allocatedStorage: 100
    skipFinalSnapshotBeforeDeletion: false
  providerConfigRef:
    name: aws-provider-config
  writeConnectionSecretToRef:
    namespace: app
    name: prod-postgres-conn

The version that makes this actually workable in a CI/CD context is Crossplane v1.14+. Before that, Composition revisions hadn’t fully stabilized β€” and Compositions are how you build reusable infrastructure abstractions (think: “a standard Postgres setup” that platform teams define and app teams consume). With stable Composition revisions in v1.14+, you can version your infrastructure abstractions the same way you version application APIs: roll out a new revision, keep the old one for existing claims, migrate gradually. That’s the feature that made me stop treating Crossplane as a prototype and start using it in production pipelines.

The honest distinction from Terraform isn’t that one is better β€” it’s that they optimize for different things. Terraform is better for one-shot provisioning scripts and organizations that live in HCL. Crossplane wins when your team is already Kubernetes-native and you want infrastructure to behave like a deployment: observable via kubectl get, patchable with kubectl patch, and garbage-collected when the namespace is deleted. The removal of a separate CLI and state backend is a genuine operational simplification, not a marketing bullet point β€” one less auth surface, one less thing to rotate credentials for, one less system to back up.

Setup: Getting Crossplane Running in Your Cluster

The thing that trips most people up first is the CRD validation problem. Kubernetes 1.26 and older have quirks around how they handle the structural schema validation that Crossplane’s CRDs rely on. I burned two hours on this before realizing the cluster version was the issue β€” upgrade to 1.27+ and half the weird errors disappear. Check your cluster version with kubectl version before anything else (recent kubectl removed the --short flag, so the plain command is what you want).

Your prereqs checklist before running a single command:

  • kubectl configured and pointing at the right context β€” kubectl config current-context to verify
  • Helm 3.x (not 2.x β€” the chart structure is entirely different) β€” helm version --short
  • Kubernetes 1.27+ β€” anything older and CRD admission webhooks will reject certain Crossplane schemas silently
  • cluster-admin permissions β€” Crossplane installs CRDs and admission webhooks, it needs broad access during setup
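If you’d rather gate on those checks in CI, here’s a minimal preflight sketch β€” the script name and thresholds are mine, and it assumes jq is on the runner:

#!/bin/bash
# preflight.sh β€” verify Crossplane prereqs before installing anything
set -euo pipefail

echo "Context: $(kubectl config current-context)"

# Helm must be 3.x
helm version --short | grep -q '^v3\.' || { echo "Helm 3.x required"; exit 1; }

# Server minor version must be 27+
MINOR=$(kubectl version -o json | jq -r '.serverVersion.minor' | tr -d '+')
[ "$MINOR" -ge 27 ] || { echo "Kubernetes 1.27+ required, found 1.${MINOR}"; exit 1; }

# Confirm we can actually create CRDs (the cluster-admin requirement in practice)
kubectl auth can-i create customresourcedefinitions >/dev/null || \
  { echo "Need permission to create CRDs"; exit 1; }

echo "Preflight OK"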

The actual install is straightforward once prereqs are satisfied:

# Add the stable chart repo β€” not the master branch, that's for Crossplane contributors
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update

# Pin to 1.14.0 explicitly β€” "latest" in Helm is a footgun in CI/CD pipelines
helm install crossplane crossplane-stable/crossplane \
  --namespace crossplane-system \
  --create-namespace \
  --version 1.14.0 \
  --set metrics.enabled=true  # enable Prometheus metrics while you're at it

# Confirm the pods are actually running before moving on
kubectl get pods -n crossplane-system --watch

For the AWS provider, use the Upbound official provider, not the community one. The Upbound provider-aws ships with generated SDKs that track AWS API changes faster, and the update cadence is weekly rather than whenever a volunteer gets around to it. The community provider also lacks support for some newer services entirely.

# Install the Upbound AWS provider β€” you apply a Provider object,
# not the marketplace URL itself
cat <<EOF | kubectl apply -f -
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws
spec:
  package: xpkg.upbound.io/upbound/provider-aws:v0.46.0
EOF

# Watch the provider reach healthy state β€” this is async and takes 2-4 minutes
kubectl get providers --watch
# You want: INSTALLED=True, HEALTHY=True before proceeding

Gotcha #1: the provider pod will hit CrashLoopBackOff, and it’s almost always the IRSA annotation sitting on the wrong service account. The provider gets its own service account inside crossplane-system β€” the annotation needs to go there, not on your application service account. Debug it like this:

# Find the provider's pod name
kubectl get pods -n crossplane-system

# Pull logs from the provider pod specifically
kubectl logs -n crossplane-system provider-aws-XXXX --previous

# The error you're looking for:
# "failed to retrieve credentials: failed to refresh cached credentials"
# That's IRSA misconfiguration 90% of the time

# Check which service account the provider pod is actually using
kubectl get pod provider-aws-XXXX -n crossplane-system -o jsonpath='{.spec.serviceAccountName}'
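Once you’ve confirmed it’s IRSA, the fix for providers of this vintage is a ControllerConfig that carries the annotation onto the provider’s own service account. (ControllerConfig is deprecated in favor of DeploymentRuntimeConfig in newer Crossplane releases, but it matches the v0.46 provider used here β€” the role ARN is illustrative.)

apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: aws-irsa
  annotations:
    # propagated to the provider's pod and service account
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/crossplane-provider-role
spec: {}
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws
spec:
  package: xpkg.upbound.io/upbound/provider-aws:v0.46.0
  controllerConfigRef:
    name: aws-irsa   # this is what lands the annotation in the right place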

For the ProviderConfig itself, IRSA is the only approach worth using in production. Static credentials mean rotating secrets manually and storing AWS keys as Kubernetes Secrets β€” which gets auditors nervous for good reason. Here’s both configs so you can see why IRSA wins:

# Static credentials β€” works, but don't do this in production
apiVersion: v1
kind: Secret
metadata:
  name: aws-creds
  namespace: crossplane-system
type: Opaque
stringData:
  credentials: |
    [default]
    aws_access_key_id = AKIAIOSFODNN7EXAMPLE
    aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
---
# IRSA β€” the right way
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  credentials:
    source: IRSA  # Crossplane reads the token from the pod's projected volume
  assumeRoleChain:
    - roleARN: "arn:aws:iam::123456789012:role/crossplane-provider-role"

Gotcha #2 is the one that bites people in CI pipelines hardest: CRD installation is asynchronous. When you apply a provider, Crossplane installs its CRDs in the background. If your pipeline immediately tries to apply a Composite Resource Definition or a managed resource, you’ll get no matches for kind "Bucket" in version "s3.aws.upbound.io/v1beta1" β€” the CRDs simply aren’t registered yet. The fix is a proper wait condition in your pipeline before any resource manifests get applied:

# Wait for provider to be healthy β€” poll with a timeout rather than a fixed sleep
kubectl wait provider/provider-aws \
  --for=condition=Healthy \
  --timeout=300s

# Then verify the CRDs actually landed
kubectl get crds | grep aws.upbound.io | wc -l
# Should return 900+ for the full provider-aws package

Defining Your Infrastructure as Compositions

The thing that caught me off guard when I first started with Crossplane was how much thought you need to put into your XRD design before writing a single line of YAML. Get the parameters wrong and your app teams will constantly file tickets asking you to add new fields β€” which means a new XRD version, migration headaches, and broken Compositions. I learned this the hard way on a staging environment rollout.

Writing the CompositeResourceDefinition

An XRD is the contract between your platform team and your app teams. You’re defining what levers they get to pull β€” nothing more. For a standard app environment (VPC, RDS, S3, IAM role bundled together), you want to expose exactly the parameters that vary per team and lock down everything else. Here’s a real XRD that covers the common cases:

apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xappenvironments.platform.example.com
spec:
  group: platform.example.com
  names:
    kind: XAppEnvironment
    plural: xappenvironments
  claimNames:
    kind: AppEnvironment       # this is what app teams actually create
    plural: appenvironments    # in their own namespace
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                parameters:
                  type: object
                  required:
                    - instanceClass
                    - region
                    - environment
                  properties:
                    instanceClass:
                      type: string
                      enum: ["db.t3.medium", "db.t3.large", "db.r6g.xlarge"]
                      description: "RDS instance size β€” t3.medium for staging, r6g.xlarge for prod"
                    region:
                      type: string
                      default: "us-east-1"
                    environment:
                      type: string
                      enum: ["staging", "production"]
                    deletionPolicy:
                      type: string
                      enum: ["Delete", "Orphan"]
                      default: "Delete"

The claimNames block is critical. Without it, app teams have to create Composite Resources at the cluster scope, which means they need cluster-level RBAC. Claims live in a namespace, so your devs can own them without touching cluster-wide resources. Every app environment definition should have claims unless you’re building something truly cluster-wide like a shared VPC.
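Namespaced claims also keep the RBAC handoff to plain Kubernetes primitives β€” a minimal sketch of what you might grant an app team, with illustrative names:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: appenvironment-editor
  namespace: team-payments
rules:
  - apiGroups: ["platform.example.com"]
    resources: ["appenvironments"]   # the claim plural from the XRD above
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-appenvironments
  namespace: team-payments
subjects:
  - kind: Group
    name: team-payments-devs   # illustrative β€” map to your IdP group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: appenvironment-editor
  apiGroup: rbac.authorization.k8s.io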

Wiring Parameters with Patches

The Composition is where the real work happens. Each managed resource inside it gets its field values either hardcoded or patched from the XR’s spec. The gotcha nobody warns you about: the docs show you patches that map a field straight through, but production resource naming requires transforms, and the transform syntax is more finicky than it looks. Here’s a real example that builds the RDS identifier myapp-staging-db from the claim’s metadata:

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: appenvironments.platform.example.com
  labels:
    provider: aws
spec:
  compositeTypeRef:
    apiVersion: platform.example.com/v1alpha1   # must match the served version in the XRD
    kind: XAppEnvironment
  resources:
    - name: rds-instance
      base:
        apiVersion: rds.aws.upbound.io/v1beta1
        kind: Instance
        spec:
          forProvider:
            engine: postgres
            engineVersion: "16.1"
            skipFinalSnapshot: false
            publiclyAccessible: false
      patches:
        # patch the instance class directly from XR parameter
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.instanceClass
          toFieldPath: spec.forProvider.instanceClass

        # patch the region
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.region
          toFieldPath: spec.forProvider.region

        # THIS is the one that trips everyone up β€” building a name from parts
        - type: FromCompositeFieldPath
          fromFieldPath: metadata.labels[crossplane.io/claim-name]   # the claim name, e.g. "myapp" β€” the XR's own metadata.name carries a random suffix
          toFieldPath: spec.forProvider.identifier
          transforms:
            - type: string
              string:
                type: Format
                # claim name + environment param requires a CombineFromComposite patch
                fmt: "%s-db"   # produces "myapp-db" β€” good start

        # To get "app-staging-db" you need CombineFromComposite instead:
        - type: CombineFromComposite
          combine:
            variables:
              - fromFieldPath: metadata.labels[crossplane.io/claim-name]
              - fromFieldPath: spec.parameters.environment
            strategy: string
            string:
              fmt: "%s-%s-db"   # produces "myapp-staging-db"
          toFieldPath: spec.forProvider.identifier

        # tag every resource with its environment β€” useful for cost allocation
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.environment
          toFieldPath: spec.forProvider.tags.environment

The CombineFromComposite patch type is buried in the docs and most blog posts don’t mention it. If you hand a single FromCompositeFieldPath a fmt with two %s verbs hoping to combine two fields, it won’t work β€” the transform only ever receives its one input, so the patch silently fails to produce what you want. You’ll spend an hour wondering why your RDS identifier is wrong. Use CombineFromComposite any time you need more than one XR field to produce a single output string.

Composition Revisions and Zero-Downtime Infra Updates

Composition Revisions are the feature that separates Crossplane from naive “apply this Terraform” pipelines. Every time you update a Composition, Crossplane creates a new revision but doesn’t immediately apply it to existing XRs. You control the rollout explicitly using compositionUpdatePolicy on each XR or Claim. This means you can test a Composition change on one team’s staging environment before it touches anyone’s production resources.

# Updating a Composition creates a new CompositionRevision automatically
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  # same name as before β€” you don't name or number revisions yourself;
  # each update to this Composition cuts a new revision
  name: appenvironments.platform.example.com
spec:
  compositeTypeRef:
    apiVersion: platform.example.com/v1alpha1
    kind: XAppEnvironment
  # ... resources

---
# In the Claim, opt into automatic updates OR pin to a specific revision
apiVersion: platform.example.com/v1alpha1
kind: AppEnvironment
metadata:
  name: myapp
  namespace: team-payments
spec:
  compositionUpdatePolicy: Automatic   # or "Manual" to stay on current revision
  compositionRevisionSelector:
    matchLabels:
      channel: stable                  # only pick up revisions tagged "stable"
  parameters:
    instanceClass: db.t3.medium
    region: us-east-1
    environment: staging

The workflow I actually use: set all production claims to Manual update policy with a stable label selector. When I’m ready to roll out, I label the new revision channel: stable and bump claims one namespace at a time using kubectl patch. No deleting and recreating managed resources β€” Crossplane reconciles the diff. For RDS this matters enormously because a delete/recreate cycle means downtime and potential data loss even with snapshots. Revisions give you a proper promotion path: dev β†’ staging β†’ production, each controlled independently.

# Promote a new revision to stable after testing
kubectl label compositionrevision appenvironments-abc123 channel=stable

# Check which revision each XR is currently using
kubectl get xappenvironment -o custom-columns=\
  NAME:.metadata.name,\
  REVISION:.spec.compositionRevisionRef.name,\
  SYNCED:.status.conditions[0].status
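The per-namespace bump I mentioned is a one-line kubectl patch against each claim β€” revision name and namespace here are illustrative:

# Pin one team's claim to the newly promoted revision (Manual policy)
kubectl patch appenvironment myapp -n team-payments --type=merge \
  -p '{"spec":{"compositionRevisionRef":{"name":"appenvironments-abc123"}}}'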

One real gotcha with revisions: if you change a patch that affects an immutable field on an AWS resource β€” like an RDS subnet group or a VPC CIDR β€” Crossplane will error out and the resource will get stuck in a failed state. The revision system protects you from accidental deletion, but it can’t work around AWS API constraints. Always check whether the fields you’re patching are mutable before cutting a new Composition revision, or you’ll end up manually deleting the managed resource and accepting the brief downtime anyway.

Wiring Crossplane Into Your CI/CD Pipeline

The shift I made that actually improved pipeline reliability: CI pipelines shouldn’t provision infrastructure. That sounds obvious until you realize most teams have their pipelines doing exactly that β€” running terraform apply mid-job, waiting for RDS to spin up, then deploying the app, all in one fragile 20-minute chain. With Crossplane, you split the contract cleanly. Infrastructure is declared in Git, Crossplane reconciles it continuously, and CI’s only job is to check that infra is ready and then deploy the app. Your pipeline goes from being an imperative script to being a gatekeeper.

The GitHub Actions Job That Actually Works

Here’s a real workflow that applies a Crossplane Claim and waits for it before touching the app deployment. The kubectl wait call is the entire handoff point between infrastructure and application:

name: deploy-staging

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'v1.29.0'
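        # assumes cluster credentials are already on the runner
        # (e.g. a kubeconfig written from a secret in an earlier step)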

      - name: Apply Crossplane Claim
        run: |
          kubectl apply -f infra/claims/staging-postgres.yaml
          kubectl apply -f infra/claims/staging-redis.yaml

      # This is the actual gate β€” don't move past this until infra is confirmed ready
      - name: Wait for claims to be Ready
        run: |
          kubectl wait --for=condition=Ready \
            -f infra/claims/staging-postgres.yaml \
            -f infra/claims/staging-redis.yaml \
            --namespace=staging \
            --timeout=300s

      - name: Deploy application
        run: |
          kubectl apply -f k8s/staging/
          kubectl rollout status deployment/my-app -n staging --timeout=120s

When kubectl wait times out at 300s, the pipeline fails with a non-zero exit code β€” good, that’s what you want. But the failure message alone tells you nothing. Immediately after a timeout you need to run two things: kubectl describe on the stuck claim to see its events, and kubectl get managed -l crossplane.io/composite=my-app-staging to see which individual managed resource is stuck. Nine times out of ten it’s a single managed resource with a provider error that the composite status obscures completely. The composite says “not ready” but the actual error is sitting on the RDSInstance object three layers down.
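You can bake that triage into the workflow itself so the evidence lands in the job log β€” a sketch of a failure-only step reusing the claim files from above:

      - name: Dump Crossplane state on failure
        if: failure()
        run: |
          kubectl describe -f infra/claims/staging-postgres.yaml || true
          kubectl describe -f infra/claims/staging-redis.yaml || true
          kubectl get managed -l crossplane.io/composite=my-app-staging || true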

ArgoCD: Treating Claims Like Regular Manifests

The cleanest ArgoCD setup I’ve run puts Crossplane Claims in the same GitOps repo as application manifests, in a dedicated infra/ directory alongside app/. One ArgoCD Application per environment, with a sync policy that handles both. Here’s the Application spec that gets infra and app deploying in the right order:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/my-app
    targetRevision: main
    path: deploy/staging
  destination:
    server: https://kubernetes.default.svc
    namespace: staging
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true
  # Without this, ArgoCD fights Crossplane constantly
  ignoreDifferences:
    # claims are instances of YOUR XRD kinds β€” list each claim group you use
    - group: platform.example.com
      kind: "*"
      jsonPointers:
        - /status
    - group: "*"
      kind: "*"
      managedFieldsManagers:
        - crossplane
        - crossplane-rbac-manager

The ignoreDifferences block is not optional. Crossplane writes back to status fields and sometimes annotates resources during reconciliation. Without ignoring those, ArgoCD will perpetually mark your Claims as OutOfSync because it sees a diff between what’s in Git (no status) and what’s live in the cluster (Crossplane-populated status). You’ll get a Slack notification every 3 minutes telling you something is drifted when nothing is actually wrong. I’ve seen teams spend two days debugging “drift” that was just ArgoCD vs Crossplane writing to the same object.

Sync Waves: Getting the Order Right

ArgoCD sync waves are the mechanism that solves “my app deployment started before the database existed.” Put this annotation on every Crossplane Claim:

apiVersion: database.example.io/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: my-app-db
  namespace: staging
  annotations:
    argocd.argoproj.io/sync-wave: "-1"  # Reconcile before wave 0 (default)
spec:
  parameters:
    storageGB: 20
  compositionRef:
    name: postgresql-aws
  writeConnectionSecretToRef:
    name: my-app-db-conn

Wave -1 means ArgoCD applies and waits for these resources to reach a healthy state before moving on to wave 0, where your Deployment and Service manifests sit. The catch: ArgoCD’s health check for custom resources needs to know what “healthy” means for a Crossplane Claim. You need a custom health check in your ArgoCD config, otherwise it considers the Claim healthy the moment it’s applied β€” not when Crossplane has finished provisioning the actual database:

# In the argocd-cm ConfigMap β€” the key is <group>_<kind>, and claims use
# YOUR XRD's group and kind (one entry per claim kind, matching the
# PostgreSQLInstance example above)
resource.customizations.health.database.example.io_PostgreSQLInstance: |
  hs = {}
  if obj.status ~= nil then
    if obj.status.conditions ~= nil then
      for i, condition in ipairs(obj.status.conditions) do
        if condition.type == "Ready" and condition.status == "True" then
          hs.status = "Healthy"
          hs.message = "Claim is ready"
          return hs
        end
      end
    end
  end
  hs.status = "Progressing"
  hs.message = "Waiting for claim to be ready"
  return hs

Without that Lua health check, sync waves give you false ordering guarantees. ArgoCD moves to wave 0 and your app pod crashes immediately trying to connect to a database that won’t exist for another four minutes. The health check is what makes the wave actually block. This is the piece that’s buried in ArgoCD docs under “resource health customization” and most Crossplane tutorials never mention it.

The Performance Wins You Actually Get

The 12-minute β†’ 4-minute improvement surprised me more than anything else when I first migrated an environment to Crossplane. The reason isn’t magic β€” it’s that Crossplane’s reconciler dispatches all your managed resources simultaneously. Terraform, by contrast, caps concurrency at a default parallelism of 10, serializes anything connected in its dependency graph, and still blocks on init and state locks. When you have an RDS instance, an S3 bucket, an IAM role, and a security group that don’t actually depend on each other, Crossplane just fires them all off and watches for convergence. Your CI doesn’t wait on Terraform’s execution plan β€” it’s already done provisioning while Terraform would still be initializing providers.

The pipeline shape change is where you feel the operational difference every day. Before Crossplane, a typical infra+deploy pipeline looked something like this:

# Old pipeline β€” infra steps inside CI
- terraform init -backend-config=env/prod.hcl
- terraform workspace select feature-branch-42
- terraform plan -out=plan.tfplan
- terraform apply plan.tfplan        # blocks for 8-12 mins
- kubectl apply -f k8s/deployment.yaml

# New pipeline β€” CI only owns the app
- kubectl apply -f claim.yaml        # submit the request
- kubectl wait --for=condition=Ready appenvironment/my-env --timeout=300s
- kubectl apply -f k8s/deployment.yaml

That’s not a simplification for presentation purposes β€” that’s literally what the CI job becomes. The infra reconciliation happens in the Crossplane control plane, not in your runner. Your runner is now a thin client that submits a Claim and waits. Pipeline duration drops not because provisioning got faster in isolation, but because provisioning is no longer on the critical path of your CI job if the environment already exists from a previous run.

State lock contention is the problem nobody talks about until it’s causing incidents. The moment two feature branches try to terraform apply against the same workspace or the same backend simultaneously, one of them hangs waiting for the lock to release β€” and if the first job gets killed mid-run (which happens constantly in preemptible runners), you’re manually running terraform force-unlock at 11pm. With Crossplane, each feature branch applies its own Claim YAML. These are Kubernetes objects. Kubernetes handles concurrent object updates through optimistic concurrency, not file-based locking. Two branches can both submit Claims simultaneously with zero contention.

# Branch A submits this β€” completely independent
apiVersion: platform.example.com/v1alpha1
kind: AppEnvironment
metadata:
  name: feature-payments-env
spec:
  parameters:
    size: small
    region: us-east-1
---
# Branch B submits this at the exact same time β€” no conflict
apiVersion: platform.example.com/v1alpha1
kind: AppEnvironment
metadata:
  name: feature-auth-env
spec:
  parameters:
    size: small
    region: us-east-1

Two things you get for free that you’d otherwise build yourself: drift detection and cleaner cache hit rates. The Crossplane reconciler runs on a continuous loop β€” typically every 60 seconds by default β€” comparing actual cloud state to desired state and correcting divergence automatically. That eliminates the “scheduled Terraform plan” cron job that most teams run nightly to catch manual console changes. It’s not just convenience; it means your CI assumptions about infrastructure state are actually reliable when a job starts. On the caching side, because infra is no longer managed inside the pipeline, your CI configuration simplifies dramatically. Your .github/workflows or .gitlab-ci.yml only needs to reason about Docker layer caches and dependency caches like npm or Maven β€” no Terraform provider downloads, no plugin caching, no backend initialization. Cache key strategies become straightforward: hash your package-lock.json or Dockerfile, done.
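For what it’s worth, the “hash the lockfile, done” strategy really is the whole cache config once Terraform leaves the runner β€” a minimal actions/cache sketch:

      - uses: actions/cache@v4
        with:
          path: ~/.npm
          key: npm-${{ hashFiles('package-lock.json') }}
          restore-keys: npm-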

Observability: Knowing When Things Go Wrong

The thing that trips up most teams early on is treating Crossplane like a black box β€” they push a Claim, wait, and have no idea why their RDS instance hasn’t appeared after 8 minutes. The answer is almost always sitting right there in the events section of the composite resource, and most people skip past it.

# This single command saves you 20 minutes of log-diving every time
kubectl describe composite my-app-staging

# Scroll to the Events section at the bottom β€” you'll see something like:
# Warning  CannotComposeManagedResource  2m  composite/my-app-staging
#   cannot apply composed resource "my-app-staging-rds-xzk9q":
#   cannot patch object: admission webhook denied: db.t3.micro not available in us-east-1a

# Or the more common one β€” a missing secret reference:
# Warning  ReconcileError  45s  composite/my-app-staging
#   cannot get referenced secret "prod-db-creds" in namespace "crossplane-system": not found

I switched to checking kubectl describe composite before looking at anything else because it surfaces the exact managed resource that’s stuck, not just a generic reconcile error. You don’t need to grep through controller logs for a resource hash. The events tell you which composed resource failed, what the actual error was, and when it started failing. Combined with kubectl get managed to see overall resource state, you can diagnose most issues in under two minutes.

The Two Prometheus Metrics That Actually Matter

Crossplane’s controller exposes metrics at :8080/metrics on the pod β€” you need to either scrape it directly or add a ServiceMonitor if you’re running the Prometheus Operator. Out of the ~30 metrics it exposes, two are worth building alerts on immediately:

# Add this ServiceMonitor to scrape Crossplane controller metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: crossplane-metrics
  namespace: crossplane-system
spec:
  selector:
    matchLabels:
      app: crossplane
  endpoints:
    - port: metrics
      interval: 30s

# Alert rule for managed resources stuck in non-ready state
# Fire if any managed resource is !ready for more than 10 minutes
- alert: CrossplaneManagedResourceNotReady
  expr: crossplane_managed_resource_ready{ready="False"} > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Managed resource {{ $labels.name }} stuck in not-ready state"

# Alert if a managed resource Crossplane thinks exists actually doesn't
# This usually means out-of-band deletion or a permission error
- alert: CrossplaneManagedResourceMissing
  expr: crossplane_managed_resource_exists{exists="False"} > 0
  for: 5m
  labels:
    severity: critical

crossplane_managed_resource_ready flipping to False for extended periods is the slow failure β€” the resource got created but never converged. crossplane_managed_resource_exists going to False is the fast failure β€” something got deleted out of band, permissions changed, or the provider credentials expired. I’ve seen the latter happen silently when AWS IAM keys rotate and nobody updated the ProviderConfig secret. No alert means you find out when a developer opens a ticket wondering why their S3 bucket vanished.

Grafana Dashboard for Reconciler Queue Depth

The metric you want here is workqueue_depth labeled with the Crossplane controller name. When this climbs above ~50, you’re almost always hitting cloud provider API rate limits β€” AWS and GCP both throttle describe/get calls aggressively, and Crossplane’s controller will start backing off and stacking work. The queue depth is a leading indicator: it spikes before you start seeing TooManyRequests errors in the logs.

# Grafana panel query β€” reconciler queue depth per controller
sum by (name) (
  workqueue_depth{
    namespace="crossplane-system",
    name=~"managed/.*"
  }
)

# Panel thresholds to configure:
# Green: 0-15 (normal operation)
# Yellow: 15-50 (elevated, worth watching)
# Red: 50+ (you're being rate-limited, pipeline will slow down)

# Also worth graphing alongside it:
rate(workqueue_retries_total{namespace="crossplane-system"}[5m])
# Retries spiking in sync with queue depth confirms rate limiting, not a bug

If the queue is consistently above 15 during a CI run that provisions 30+ resources in parallel, consider staggering your Claim creation with a short sleep between batches. I know that feels wrong β€” why not provision everything at once? β€” but you’ll actually get faster end-to-end times because you avoid the backoff loop. The controller’s exponential backoff on rate-limit errors means a pile-up at minute 2 can cost you 4 minutes of retry delay.
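Here’s a sketch of that staggering β€” batch size and pause length are tuning knobs, not magic numbers:

#!/bin/bash
# apply-claims-staggered.sh β€” avoid the provider rate-limit backoff pile-up
BATCH_SIZE=10
PAUSE_SECONDS=30
count=0

for f in infra/claims/*.yaml; do
  kubectl apply -f "$f"
  count=$((count + 1))
  # pause between batches so the reconciler's queue can drain
  if [ $((count % BATCH_SIZE)) -eq 0 ]; then
    echo "Applied ${count} claims, pausing ${PAUSE_SECONDS}s..."
    sleep "$PAUSE_SECONDS"
  fi
done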

Failing Fast in CI When Claims Stay Stuck

The default behavior of most CI pipelines is to wait forever or use a hard timeout that’s way too generous. A better pattern is polling against a deadline matched to your known convergence window, while accounting for the reconciler’s exponential backoff. For most managed resources (RDS, GKE node pools, etc.) the happy path is 4-7 minutes. If you’re not Synced=True, Ready=True by minute 12, something is structurally wrong and you should fail the job, not let it run to a 30-minute timeout.

#!/bin/bash
# wait-for-claim.sh β€” fail CI fast when Crossplane claims stall
# Usage: ./wait-for-claim.sh my-app-staging AppClaim default 720

CLAIM_NAME=$1
CLAIM_KIND=$2
CLAIM_NS=${3:-default}
MAX_WAIT_SECONDS=${4:-720}   # 12 minutes default
POLL_INTERVAL=15
ELAPSED=0

while [ $ELAPSED -lt $MAX_WAIT_SECONDS ]; do
  SYNCED=$(kubectl get ${CLAIM_KIND} ${CLAIM_NAME} -n ${CLAIM_NS} \
    -o jsonpath='{.status.conditions[?(@.type=="Synced")].status}')
  READY=$(kubectl get ${CLAIM_KIND} ${CLAIM_NAME} -n ${CLAIM_NS} \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')

  echo "[${ELAPSED}s] Synced=${SYNCED} Ready=${READY}"

  if [[ "$SYNCED" == "True" && "$READY" == "True" ]]; then
    echo "Claim converged in ${ELAPSED}s"
    exit 0
  fi

  # Synced=False means a hard error, not just "still working"
  # Fail immediately instead of burning the rest of the timeout
  if [[ "$SYNCED" == "False" ]]; then
    echo "FATAL: Claim entered Synced=False β€” dumping events:"
    kubectl describe ${CLAIM_KIND} ${CLAIM_NAME} -n ${CLAIM_NS} | tail -30
    exit 1
  fi

  sleep $POLL_INTERVAL
  ELAPSED=$((ELAPSED + POLL_INTERVAL))
done

echo "TIMEOUT: Claim did not converge within ${MAX_WAIT_SECONDS}s"
kubectl describe ${CLAIM_KIND} ${CLAIM_NAME} -n ${CLAIM_NS} | tail -30
exit 1

The key distinction in that script is treating Synced=False as a hard failure versus Ready=False as a transient state. Synced=False means Crossplane hit an error it couldn’t recover from on its own β€” bad composition, permission denied, invalid field value. There’s no point waiting another 10 minutes for that to resolve. The script dumps the last 30 lines of describe output directly into the CI log so whoever is looking at the failed job gets the error context without needing cluster access.

When This Setup Is Wrong for You

The most honest thing I can tell you after running Crossplane in production is that I’ve also removed it from two projects. The abstraction is genuinely powerful in the right context, but it carries real weight β€” and that weight crushes small setups.

Small Teams With One Environment

If you have five engineers and a single staging + production setup, Crossplane’s Compositions and XRDs are solving a problem you don’t have yet. I’ve watched teams spend three weeks building a CompositeResourceDefinition for RDS instances when a 40-line Terraform file would’ve shipped in an afternoon. The abstraction layer earns its keep when you’re managing dozens of environments or letting product teams self-serve infrastructure without touching cloud consoles. Below that threshold, you’re just writing YAML to generate more YAML.

Heavy Terraform Module Investment

This is the one that actually hurts. If your org has 50 well-tested Terraform modules β€” modules with real state, real test coverage, known edge cases baked in β€” migrating them to Crossplane Compositions isn’t a weekend project. It’s months of work, and the parity isn’t 1:1. Crossplane Compositions don’t have an equivalent to for_each on arbitrary objects without Composition Functions, which only reached beta in Crossplane v1.14. Before you blow up your Terraform investment, consider this instead:

# Trigger a Terraform Cloud workspace run from CI
curl -s \
  --header "Authorization: Bearer $TFC_TOKEN" \
  --header "Content-Type: application/vnd.api+json" \
  --request POST \
  --data '{"data":{"attributes":{"message":"triggered from CI","auto-apply":true},"type":"runs","relationships":{"workspace":{"data":{"type":"workspaces","id":"ws-YOURWORKSPACEID"}}}}}' \
  https://app.terraform.io/api/v2/runs

API-driven Terraform Cloud runs from your CI pipeline give you programmatic control without abandoning the module library you’ve already battle-tested. That’s often the right call.

Clusters You Don’t Actually Control

Crossplane is a set of Kubernetes operators. If your platform team has locked down the cluster β€” no cluster-admin, no ability to install CRDs, strict OPA/Gatekeeper policies on what controllers can run β€” you’re blocked before you start. I’ve seen this in regulated industries where the security team owns the cluster and app teams get namespaces with tight RBAC. You can’t negotiate Crossplane into that environment, and trying to work around it with hacky privilege escalation is worse than just using a different approach entirely.

The Debugging Ceiling Is Real

Crossplane errors are famously indirect. A misconfigured Composition won’t tell you “field X is wrong” β€” you’ll get a reconciliation failure buried three levels deep in controller logs. The debugging workflow looks like this:

# The surface-level claim looks fine
kubectl get claim my-database -n team-namespace
# STATUS: False β€” not helpful

# You have to chase the composite resource
kubectl describe composite my-database-xr | grep -A 20 "Events:"

# Then the managed resource underneath
kubectl describe rdsinstance my-database-xr-randomsuffix | grep -A 30 "Events:"

# And sometimes the provider pod logs themselves
kubectl logs -n crossplane-system \
  -l pkg.crossplane.io/revision=provider-aws-XXXX \
  --tail=100 | grep ERROR

If your team isn’t already comfortable reading Kubernetes events and controller reconciliation loops, this debugging chain adds real cognitive overhead under production pressure. That’s not a knock on the tool β€” it’s just the operational contract you’re signing. Teams that thrive with Crossplane are usually the ones who already treat kubectl describe and controller logs as natural reflexes, not unfamiliar territory.

Quick Reference: Crossplane CLI Commands You’ll Actually Use

The command that changed my debugging workflow the most was crossplane beta trace, shipped in v1.14. Before it existed, figuring out why a composite resource was stuck meant manually describing every managed resource in the tree β€” XR, XRC, Composition, each individual managed resource. Now you get the entire hierarchy in one shot:

# Install the crossplane CLI first if you haven't
curl -sL https://raw.githubusercontent.com/crossplane/crossplane/master/install.sh | sh

# Trace a composite resource claim β€” shows full tree with Ready/Synced conditions
crossplane beta trace appenvironment my-app-staging -n staging

# Output looks like:
# NAME                          SYNCED   READY   STATUS
# XRMyApp/my-app-staging-xr    True     True    Available
# β”œβ”€ RDSInstance/my-app-pg     True     True    Available
# β”œβ”€ S3Bucket/my-app-assets    True     False   Creating
# └─ IAMRole/my-app-role       True     True    Available

The S3Bucket showing False on Ready while Synced is True is a key distinction β€” Synced means Crossplane sent the API call, Ready means AWS confirmed the resource is actually usable. That difference trips up almost everyone early on. Without crossplane beta trace, spotting that mismatch across 8 managed resources in a complex composition could take 10 minutes of copy-pasting. Now it’s one command.

# Lists every cloud resource Crossplane owns across ALL providers
# The SYNCED and READY columns are your first health signal
kubectl get managed

# Filter to the resources composed by one XR if the list gets noisy
kubectl get managed -l crossplane.io/composite=my-app-staging-xr

kubectl get managed is most useful in CI β€” I pipe it into a health check script after a deployment that waits for all managed resources to reach READY=True before marking the pipeline green. The gotcha is that some AWS resources genuinely take 8–12 minutes (RDS, ElastiCache), so if your pipeline times out at 5 minutes, you’ll chase false failures. Set your timeout to at least 20 minutes for infrastructure-heavy compositions.

# This one finds the silent failures β€” resources where Crossplane
# lost its reference to what's actually in AWS (usually after manual
# console changes or credential rotation)
kubectl get events --field-selector reason=CannotObserveExternalResource

# Pair it with a watch to catch transient issues during reconciliation
kubectl get events \
  --field-selector reason=CannotObserveExternalResource \
  --watch

The CannotObserveExternalResource event fires when the provider’s observe call against the cloud API fails outright β€” usually expired or rotated credentials, revoked permissions, or an external-name annotation that no longer resolves to anything the provider can read after someone renamed the resource in the console. Crossplane just keeps logging that event in a loop and can’t reconcile the resource until you fix the underlying reference or credentials. You’ll discover this event exists the hard way after a production incident where Crossplane appeared healthy but was silently managing nothing.
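When the cause is a rename, recovery usually means re-pointing the managed resource’s crossplane.io/external-name annotation at the real AWS identifier β€” resource kind and names here are illustrative:

# Re-point Crossplane at the AWS resource's actual identifier
kubectl annotate rdsinstance my-app-pg \
  crossplane.io/external-name=my-app-pg-renamed \
  --overwrite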

# Pause reconciliation on ONE managed resource for emergency manual surgery
# This is surgical β€” it doesn't touch the rest of the composition
kubectl annotate rdsinstance aws-rds-instance \
  crossplane.io/paused=true \
  --overwrite

# Do your manual changes in AWS console here...

# Unpause when done β€” Crossplane will re-observe and reconcile state
# (the trailing dash is kubectl's remove-annotation syntax)
kubectl annotate rdsinstance aws-rds-instance \
  crossplane.io/paused-

That trailing dash on crossplane.io/paused- removes the annotation entirely β€” standard kubectl syntax, not a typo. The pause annotation is genuinely your escape hatch for emergencies. Without it, your only options are to scale down the Crossplane provider pod (affects everything) or delete the managed resource and hope the composition recreates it cleanly. Pausing at the individual resource level means your app keeps running while you fix the specific thing that’s broken in AWS, and when you unpause, Crossplane diffs the current AWS state against desired state and reconciles from there.



