Muhammet Şafak
Asked by: Emir Answered:

How do I design a self-healing deploy that auto-rolls back when the error rate spikes?


Question

The new version went live successfully via DeployerPHP/Kubernetes and the pipeline went green. But five minutes later the 5xx error rate in production exceeded 10%. Instead of having an engineer watch a screen, how do I design a self-healing setup that watches Prometheus/Grafana or New Relic metrics, catches the anomaly, and automatically rolls back to the last stable version when a threshold is breached?

Answer

Short answer: don’t make a human watch a dashboard; wire the metrics directly into the deploy. The right pattern is progressive delivery driven by automated analysis.

What you’re seeing is clear: the pipeline goes green, but “green” is only the technical success of the deploy, not real health in production. You need to tie the two together.
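To make "tie the two together" concrete, here is a minimal sketch of reading the 5xx ratio from Prometheus over its HTTP API (`/api/v1/query`). The metric name `http_requests_total`, the `app` and `status` labels, and the base URL are assumptions for illustration; substitute whatever your exporters actually emit.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def error_ratio_query(app: str, window: str = "5m") -> str:
    """Build a PromQL expression for the share of 5xx responses.

    Assumes a counter named http_requests_total with `app` and `status` labels.
    """
    sel = f'app="{app}"'
    return (
        f'sum(rate(http_requests_total{{{sel},status=~"5.."}}[{window}]))'
        f' / sum(rate(http_requests_total{{{sel}}}[{window}]))'
    )


def parse_instant_value(payload: dict) -> Optional[float]:
    """Extract the single value from a Prometheus instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return None  # no matching series, e.g. no traffic yet
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def fetch_error_ratio(base_url: str, app: str, timeout: float = 5.0) -> Optional[float]:
    """Query Prometheus once; base_url is a hypothetical address like http://prometheus:9090."""
    params = urllib.parse.urlencode({"query": error_ratio_query(app)})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}", timeout=timeout) as resp:
        return parse_instant_value(json.load(resp))
```

Returning `None` for an empty result keeps "no data" distinct from "0% errors", which matters once a gate starts making rollback decisions from this number.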

  1. Hold at a canary weight for a “bake” period after deploy. Don’t open the new version to 100% right away; keep it on partial traffic for a while and, during that window, query Prometheus/New Relic for 5xx rate, latency, and key business metrics against a threshold/baseline. The anomaly is caught in this window.
  2. Auto-roll back to the last stable version on breach. When the health gate is breached, the pipeline triggers the rollback without waiting for a human and alerts the team. The human isn’t on night watch, just informed.
  3. Keep the previous release ready so rollback is instant. Deployer keeps earlier releases on the server and Kubernetes keeps the old ReplicaSet, so rolling back isn’t a new deploy; it’s a switch to something that is already there. Tooling: on Kubernetes, Argo Rollouts or Flagger run this canary analysis and rollback automatically; for Deployer, add a post-deploy health-gate step that polls metrics and runs the rollback task (dep rollback) on breach.
  4. Set the guardrails right, or you’ll get flapping. Pick a sane threshold (so noise doesn’t trigger constant rollbacks), require a minimum sample size, and roll back only the code, not data migrations. That’s why migrations must be backward-compatible (expand/contract): the old version has to keep working against the new schema, otherwise a rollback leaves code and schema out of sync.
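The guardrails in step 4 can be sketched as a small decision function. This is an illustration under stated assumptions, not how Argo Rollouts or Flagger implement their analysis: the `Window` type, the 5% threshold, the 200-request minimum, and the "three consecutive breaches" rule are all hypothetical values to tune per service.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Window:
    """Request counts observed for the canary during one polling interval."""
    total: int
    errors_5xx: int


def evaluate_gate(
    windows: Sequence[Window],
    max_error_rate: float = 0.05,  # assumed threshold
    min_requests: int = 200,       # assumed minimum sample size per window
    breaches_required: int = 3,    # consecutive breaching windows before rollback
) -> str:
    """Return "rollback", "healthy", or "insufficient_data"."""
    consecutive = 0
    usable = 0
    for w in windows:
        if w.total < min_requests:
            # Too few requests to trust the ratio: skip it without
            # counting it as a breach or as a recovery.
            continue
        usable += 1
        if w.errors_5xx / w.total > max_error_rate:
            consecutive += 1
            if consecutive >= breaches_required:
                return "rollback"
        else:
            consecutive = 0  # a clean window resets the streak
    return "healthy" if usable > 0 else "insufficient_data"
```

Requiring several consecutive breaches is what prevents flapping on a single noisy scrape, and the separate "insufficient_data" outcome lets the pipeline extend the bake period instead of promoting a canary that nobody actually hit.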

Bottom line: I’d set up canary + automated metric analysis + auto-rollback to last-good, and make the human the alerted party, not the watcher; the deploy heals itself instead of relying on a person. Automated rollback stops the bleeding but doesn’t explain the cause, so pair it with a blameless post-mortem culture. That’s where the durable fix comes from.
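Put together, the flow might look like the sketch below: poll during the bake window, roll back and alert on breach, otherwise let the release through. The function name and its parameters are hypothetical, and the three callables are injected so you can plug in your own metric source, rollback command (for example `dep rollback` or `kubectl rollout undo deployment/<name>`), and alerting channel. For brevity it reacts to a single breaching sample; a real gate should also apply the sample-size and noise guardrails from step 4.

```python
import time
from typing import Callable


def bake_and_guard(
    fetch_error_rate: Callable[[], float],
    rollback: Callable[[], None],
    alert: Callable[[str], None],
    bake_seconds: float = 600.0,   # assumed bake window
    poll_seconds: float = 30.0,    # assumed polling interval
    max_error_rate: float = 0.05,  # assumed 5xx threshold
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Watch the release during the bake window.

    Returns True if the window elapsed without a breach, False if a
    rollback was triggered.
    """
    elapsed = 0.0
    while elapsed < bake_seconds:
        rate = fetch_error_rate()
        if rate > max_error_rate:
            # Stop the bleeding first, then tell the humans what happened.
            rollback()
            alert(f"rolled back: 5xx rate {rate:.1%} exceeded {max_error_rate:.1%}")
            return False
        sleep(poll_seconds)
        elapsed += poll_seconds
    return True
```

In a pipeline this would run as the step right after the deploy, with a non-zero exit code when it returns `False` so the job is marked failed and the alert carries the reason.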


Tags: #ci-cd #resilience #observability

