
# Zero-Downtime Deploys: A Blue–Green Playbook for Lean Teams

## Why Zero-Downtime Still Feels Like Black Magic
Nothing kills demo day like a 502. Yet many startups still “ship” by SSH’ing into a lone EC2 box. Blue–green deployments remove that single point of failure: run two identical environments, flip traffic when green is healthy, and your customers never see a blip.
TL;DR: Blue = current prod, Green = new version. Cut over when the health checks sing.
## Architecture at a Glance
Works the same on Cloud Run, Fly.io, or bare-metal K8s—just replace ALB with your load balancer.
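On a plain Kubernetes cluster without ALB annotations, the flip can be as simple as swapping a Service selector. A minimal sketch of that variant (the `todo` names and ports are assumptions, not this article's manifests):

```yaml
# Two Deployments (todo-blue, todo-green) carry a "color" label;
# one Service fronts them. Cutover = patching the selector.
apiVersion: v1
kind: Service
metadata:
  name: todo
spec:
  selector:
    app: todo
    color: green   # was "blue" — patch this one label to shift all traffic
  ports:
    - port: 80
      targetPort: 8080
```

The selector swap is atomic but all-or-nothing; the ALB annotation route in the workflow below is what you'd reach for if you want weighted, gradual shifts.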
## 1 | GitHub Actions Workflow
```yaml
name: blue-green-deploy

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-qemu-action@v3
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ env.ECR_REGISTRY }}/todo:${{ github.sha }}

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
      - name: "Helm upgrade to green"
        run: |
          helm upgrade todo chart/ \
            --set image.tag=${{ github.sha }} \
            --set color=green
      - name: "Smoke test green"
        run: ./scripts/smoke.sh
      - name: "Shift traffic"
        run: |
          kubectl patch ingress todo \
            -p '{"metadata":{"annotations":{"alb.ingress.kubernetes.io/target-group-attributes":"blue:0,green:100"}}}'
```
**Key bits**

- The `color` value selects Helm's `todo-green` Deployment.
- The smoke test hits `/health` on the green ingress hostname.
- Traffic shifts only if the smoke test passes and the Prometheus SLOs are green.
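`scripts/smoke.sh` isn't shown in the article; a minimal Python sketch of the same idea — poll green's `/health` until it answers 200, tolerating connection refusals while pods start. The fetcher is injected so the retry logic stands alone; the retry count and delay are assumptions:

```python
import time
from typing import Callable


def smoke(check: Callable[[], int], retries: int = 5, delay: float = 2.0) -> bool:
    """Poll the green /health endpoint until it returns HTTP 200.

    `check` returns a status code — e.g. a urllib call against the green
    ingress hostname. Returns False if green never comes up in time.
    """
    for _ in range(retries):
        try:
            if check() == 200:
                return True
        except OSError:
            pass  # connection refused while green pods are still starting
        time.sleep(delay)
    return False
```

In CI you'd exit non-zero on `False` so the "Shift traffic" step never runs.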
## 2 | Database Schema Without Downtime
| Step | Pattern | Reason |
|---|---|---|
| 1 | Add new column nullable | Old code ignores it. |
| 2 | Deploy green reading + writing both columns | Green stays backward-compatible. |
| 3 | Migrate data in background | Cron job → 10 k rows/s. |
| 4 | Drop column from blue & flip | All traffic on green; old column cold. |
| 5 | Delete legacy column | After 1–2 weeks, post-logs confirm zero reads. |
Pro tip: wrap each migration step in an idempotent transaction; a failed run then rolls back cleanly and can simply be restarted.
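Step 3's background backfill can be sketched as a batched, idempotent job. This is a Python/sqlite3 illustration of the pattern, not the article's actual migration — the `users` table and column names are made up:

```python
import sqlite3


def backfill(conn: sqlite3.Connection, batch: int = 1000) -> int:
    """Copy legacy `name` into the new `display_name` column in batches.

    Idempotent: only rows where display_name IS NULL are touched, so a
    crashed run can be restarted without double-writing anything.
    """
    moved = 0
    while True:
        with conn:  # one transaction per batch; rolls back on failure
            cur = conn.execute(
                "UPDATE users SET display_name = name "
                "WHERE rowid IN (SELECT rowid FROM users "
                "                WHERE display_name IS NULL LIMIT ?)",
                (batch,),
            )
        if cur.rowcount == 0:
            return moved  # nothing left to migrate
        moved += cur.rowcount
```

Small batches keep each transaction short, so the backfill never holds locks long enough for blue or green to notice.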
## 3 | Feature Flags Keep You Honest
Blue–green solves infra rollbacks; feature flags solve product rollbacks.
- Use GrowthBook or LaunchDarkly; DIY with a `flags` table if budget is tight.
- Default flags OFF in both blue & green → ramp once green is live.
- Combine with `userId % 100 < 1` to expose new features to 1 % of traffic first.
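The modulo trick above works, but it always picks the same 1 % of users for every flag. Hashing the flag name into the bucket fixes that while staying deterministic; a sketch (the `new-checkout` flag name is invented):

```python
import hashlib


def in_rollout(user_id: int, percent: int, flag: str = "new-checkout") -> bool:
    """Deterministic percentage rollout.

    Each (flag, user) pair maps to a stable bucket 0–99, so a user who
    sees a feature keeps seeing it, and different flags sample
    different slices of the user base.
    """
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < percent
```

Ramping from 1 % to 100 % is then just a config change — no redeploy of blue or green.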
## 4 | Observability: The Three Lights
| Layer | Tool | SLO |
|---|---|---|
| Infra | Prometheus ➝ Grafana | Pod restart rate < 2 / h |
| App | OpenTelemetry traces | p95 latency < 250 ms |
| User | Sentry + RUM | JS error rate < 0.5 % |
Automation: GitHub Action checks Alertmanager silence before traffic shift; any open severity=page alert aborts deploy.
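The abort gate might look like this filter over the alert list that Alertmanager's v2 API serves — a sketch under the assumption that your CI step fetches `/api/v2/alerts` first and fails the job when anything comes back:

```python
from typing import Iterable


def blocking_alerts(alerts: Iterable[dict]) -> list:
    """Return firing severity=page alerts that should abort the deploy.

    `alerts` is the JSON list from Alertmanager's /api/v2/alerts;
    silenced or suppressed alerts are ignored, matching the article's
    "check Alertmanager silence before traffic shift" step.
    """
    blocking = []
    for alert in alerts:
        if alert.get("labels", {}).get("severity") != "page":
            continue
        status = alert.get("status", {})
        if status.get("state") == "active" and not status.get("silencedBy"):
            blocking.append(alert)
    return blocking
```

CI would `sys.exit(1)` when the list is non-empty, which cancels the "Shift traffic" step.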
## 5 | Cost Check (Side-Project Scale)
| Resource | Blue | Green | Total / month |
|---|---|---|---|
| EKS t3.small (2 nodes each) | $24 | $24 | $48 |
| ALB | — | — | $18 |
| ECR storage | — | — | $4 |
| **Grand total** | — | — | $70 |
Tight budget? Scale blue's Deployment to zero replicas ~5 min after cutover (a stock HorizontalPodAutoscaler won't go below one replica, so use `kubectl scale` or a scheduled job).
## 6 | When Things Go Sideways
- Smoke test fails → Action cancels, green pods scale to 0, incident Slack ping.
- Post-cutover error rate spikes → `kubectl patch ingress … blue:100,green:0` (one-line rollback).
- DB migration stuck → pause the traffic shift; blue keeps running.
Mean time to recovery in prod: < 3 min over nine incidents last year.
## Highlight Reel
| KPI | Before (classic in-place) | After blue-green |
|---|---|---|
| Deploys / week | 2 | 14 |
| User-visible errors during deploy | 7–12 504s | 0 |
| Avg. rollback time | 38 min | 1.2 min |
## Final Checklist
- Health probes on /ready & /live endpoints
- Idempotent schema migrations with feature flags
- Automated rollback cut-over script
- Alert block gate in the CI workflow
- Post-deploy smoke + synthetic user tests
Zero downtime is less about magic and more about boring, repeatable scripts. Bake them once, and shipping becomes as routine as `git push`.