
# Zero-Downtime Deploys: A Blue–Green Playbook for Lean Teams

## Why Zero-Downtime Still Feels Like Black Magic
Nothing kills demo day like a 502. Yet many startups still “ship” by SSH’ing into a lone EC2 box. Blue–green deployments remove that single point of failure: run two identical environments, flip traffic when green is healthy, and your customers never see a blip.
TL;DR: Blue = current prod, Green = new version. Cut over when the health checks sing.
## Architecture at a Glance
Works the same on Cloud Run, Fly.io, or bare-metal K8s—just replace ALB with your load balancer.
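On a plain Kubernetes cluster without ALB annotations, the flip can be as simple as swapping a Service selector. A minimal sketch of that variant (the `todo` names and ports are assumptions, not this article's manifests):

```yaml
# Two Deployments (todo-blue, todo-green) carry a "color" label;
# one Service fronts them. Cutover = patching the selector.
apiVersion: v1
kind: Service
metadata:
  name: todo
spec:
  selector:
    app: todo
    color: green   # was "blue" — patch this one label to shift all traffic
  ports:
    - port: 80
      targetPort: 8080
```

The selector swap is atomic but all-or-nothing; the ALB annotation route in the workflow below is what you'd reach for if you want weighted, gradual shifts.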
## 1 | GitHub Actions Workflow
```yaml
name: blue-green-deploy

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-qemu-action@v3
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ env.ECR_REGISTRY }}/todo:${{ github.sha }}

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
      - name: "Helm upgrade to green"
        run: |
          helm upgrade todo chart/ \
            --set image.tag=${{ github.sha }} \
            --set color=green
      - name: "Smoke test green"
        run: ./scripts/smoke.sh
      - name: "Shift traffic"
        run: |
          kubectl patch ingress todo \
            -p '{"metadata":{"annotations":{"alb.ingress.kubernetes.io/target-group-attributes":"blue:0,green:100"}}}'
```
**Key bits**

- The `color` value selects Helm's `todo-green` Deployment.
- The smoke test hits `/health` on the green ingress hostname.
- Traffic shifts only if the smoke test passes and the Prometheus SLOs are green.
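`scripts/smoke.sh` isn't shown in the article; a minimal Python sketch of the same idea — poll green's `/health` until it answers 200, tolerating connection refusals while pods start. The fetcher is injected so the retry logic stands alone; the retry count and delay are assumptions:

```python
import time
from typing import Callable


def smoke(check: Callable[[], int], retries: int = 5, delay: float = 2.0) -> bool:
    """Poll the green /health endpoint until it returns HTTP 200.

    `check` returns a status code — e.g. a urllib call against the green
    ingress hostname. Returns False if green never comes up in time.
    """
    for _ in range(retries):
        try:
            if check() == 200:
                return True
        except OSError:
            pass  # connection refused while green pods are still starting
        time.sleep(delay)
    return False
```

In CI you'd exit non-zero on `False` so the "Shift traffic" step never runs.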
## 2 | Database Schema Without Downtime
| Step | Pattern | Reason |
|---|---|---|
| 1 | Add new column nullable | Old code ignores it. |
| 2 | Deploy green reading + writing both columns | Green stays backward-compatible. |
| 3 | Migrate data in background | Cron job → 10 k rows/s. |
| 4 | Drop column from blue & flip | All traffic on green; old column cold. |
| 5 | Delete legacy column | After 1–2 weeks, post-logs confirm zero reads. |
Pro tip: wrap each migration step in an idempotent transaction; a failed run then rolls back cleanly and can simply be restarted.
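Step 3's background backfill can be sketched as a batched, idempotent job. This is a Python/sqlite3 illustration of the pattern, not the article's actual migration — the `users` table and column names are made up:

```python
import sqlite3


def backfill(conn: sqlite3.Connection, batch: int = 1000) -> int:
    """Copy legacy `name` into the new `display_name` column in batches.

    Idempotent: only rows where display_name IS NULL are touched, so a
    crashed run can be restarted without double-writing anything.
    """
    moved = 0
    while True:
        with conn:  # one transaction per batch; rolls back on failure
            cur = conn.execute(
                "UPDATE users SET display_name = name "
                "WHERE rowid IN (SELECT rowid FROM users "
                "                WHERE display_name IS NULL LIMIT ?)",
                (batch,),
            )
        if cur.rowcount == 0:
            return moved  # nothing left to migrate
        moved += cur.rowcount
```

Small batches keep each transaction short, so the backfill never holds locks long enough for blue or green to notice.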
## 3 | Feature Flags Keep You Honest
Blue–green solves infra rollbacks; feature flags solve product rollbacks.
- Use GrowthBook or LaunchDarkly; DIY with a `flags` table if budget is tight.
- Default flags OFF in both blue & green → ramp once green is live.
- Combine with `userId % 100 < 1` to expose new features to 1 % of traffic first.
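The modulo trick above works, but it always picks the same 1 % of users for every flag. Hashing the flag name into the bucket fixes that while staying deterministic; a sketch (the `new-checkout` flag name is invented):

```python
import hashlib


def in_rollout(user_id: int, percent: int, flag: str = "new-checkout") -> bool:
    """Deterministic percentage rollout.

    Each (flag, user) pair maps to a stable bucket 0–99, so a user who
    sees a feature keeps seeing it, and different flags sample
    different slices of the user base.
    """
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < percent
```

Ramping from 1 % to 100 % is then just a config change — no redeploy of blue or green.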
## 4 | Observability: The Three Lights
| Layer | Tool | SLO |
|---|---|---|
| Infra | Prometheus ➝ Grafana | Pod restart rate < 2 / h |
| App | OpenTelemetry traces | p95 latency < 250 ms |
| User | Sentry + RUM | JS error rate < 0.5 % |
Automation: GitHub Action checks Alertmanager silence before traffic shift; any open severity=page alert aborts deploy.
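The abort gate might look like this filter over the alert list that Alertmanager's v2 API serves — a sketch under the assumption that your CI step fetches `/api/v2/alerts` first and fails the job when anything comes back:

```python
from typing import Iterable


def blocking_alerts(alerts: Iterable[dict]) -> list:
    """Return firing severity=page alerts that should abort the deploy.

    `alerts` is the JSON list from Alertmanager's /api/v2/alerts;
    silenced or suppressed alerts are ignored, matching the article's
    "check Alertmanager silence before traffic shift" step.
    """
    blocking = []
    for alert in alerts:
        if alert.get("labels", {}).get("severity") != "page":
            continue
        status = alert.get("status", {})
        if status.get("state") == "active" and not status.get("silencedBy"):
            blocking.append(alert)
    return blocking
```

CI would `sys.exit(1)` when the list is non-empty, which cancels the "Shift traffic" step.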
## 5 | Cost Check (Side-Project Scale)
| Resource | Blue | Green | Total / month |
|---|---|---|---|
| EKS t3.small (2 nodes each) | $24 | $24 | $48 |
| ALB | — | — | $18 |
| ECR storage | — | — | $4 |
| **Grand total** | — | — | $70 |
Tight budget? Scale blue's Deployment to zero replicas ~5 min after cutover (a stock HorizontalPodAutoscaler won't go below one replica, so use `kubectl scale` or a scheduled job).
## 6 | When Things Go Sideways
- Smoke test fails → Action cancels, green pods scale to 0, incident Slack ping.
- Post-cutover error rate spikes → `kubectl patch ingress … blue:100,green:0` (one-line rollback).
- DB migration stuck → pause the traffic shift; blue keeps running.
Mean time to recovery in prod: < 3 min over nine incidents last year.
## Highlight Reel
| KPI | Before (classic in-place) | After blue-green |
|---|---|---|
| Deploys / week | 2 | 14 |
| User-visible errors during deploy | 7–12 504s | 0 |
| Avg. rollback time | 38 min | 1.2 min |
## Final Checklist
- Health probes on /ready & /live endpoints
- Idempotent schema migrations with feature flags
- Automated rollback cut-over script
- Alert block gate in the CI workflow
- Post-deploy smoke + synthetic user tests
Zero downtime is less about magic and more about boring, repeatable scripts. Bake them once, and shipping becomes as routine as `git push`.