Zero-Downtime Deployments: A Complete Guide

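Zero-downtime deployments let you release new versions without interrupting service, which is essential for high availability and user satisfaction. This guide covers rolling, blue-green, and canary strategies, along with the health checks, database migrations, traffic management, monitoring, and feature flags that make them safe.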

Deployment Strategies Overview

Rolling Deployments

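Rolling updates are the default strategy for Kubernetes Deployments: old pods are replaced by new ones a few at a time, with maxUnavailable and maxSurge limiting how far the pod count may drift from the desired replica count during the rollout: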

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
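  replicas: 6
  # apps/v1 Deployments require a selector that matches the pod template labels
  selector:
    matchLabels:
      app: web-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: web-app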
    spec:
      containers:
      - name: web
        image: myapp:v2.0
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
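
The effect of maxUnavailable and maxSurge is easier to see as arithmetic. The following is a minimal sketch, not part of any Kubernetes tooling; the helper names resolveCount and rollingUpdateBounds are made up for illustration. It relies on the documented rounding rules: percentage values are rounded down for maxUnavailable and up for maxSurge.

```javascript
// Resolve an absolute count or a percentage string such as "25%" against the replica count.
// Kubernetes rounds maxUnavailable percentages down and maxSurge percentages up.
function resolveCount(value, replicas, roundUp) {
  if (typeof value === 'string' && value.endsWith('%')) {
    const fraction = (parseFloat(value) / 100) * replicas;
    return roundUp ? Math.ceil(fraction) : Math.floor(fraction);
  }
  return value;
}

// Compute the pod-count bounds a rolling update stays within.
function rollingUpdateBounds({ replicas, maxUnavailable, maxSurge }) {
  return {
    minAvailable: replicas - resolveCount(maxUnavailable, replicas, false),
    maxTotal: replicas + resolveCount(maxSurge, replicas, true),
  };
}

// Values from the Deployment above.
console.log(rollingUpdateBounds({ replicas: 6, maxUnavailable: 1, maxSurge: 1 }));
// → { minAvailable: 5, maxTotal: 7 }
```

With the manifest above, at least 5 pods keep serving and at most 7 pods exist at any point in the rollout.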

Blue-Green Deployments

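Run two identical production environments side by side: blue serves live traffic while green runs the new version. Once green is verified, traffic is switched over in one step (for example by repointing a Service selector from version: blue to version: green), and blue stays in place for a quick rollback: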

# Blue environment (current)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-blue
  labels:
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
      version: blue
  template:
    metadata:
      labels:
        app: web-app
        version: blue
    spec:
      containers:
      - name: web
        image: myapp:v1.0
---
# Green environment (new)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-green
  labels:
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
      version: green
  template:
    metadata:
      labels:
        app: web-app
        version: green
    spec:
      containers:
      - name: web
        image: myapp:v2.0
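
The two Deployments above do not move traffic by themselves; something has to flip the selector of the Service in front of them. The sketch below builds such a patch. It assumes a Service named web-app that selects on the app and version labels, which the manifests above do not define, and the helper names are hypothetical.

```javascript
// Hypothetical helper: returns the color that is currently idle.
function otherColor(color) {
  if (color !== 'blue' && color !== 'green') {
    throw new Error(`unknown color: ${color}`);
  }
  return color === 'blue' ? 'green' : 'blue';
}

// Build a patch that points the (assumed) web-app Service at one color.
function buildSelectorPatch(color) {
  return JSON.stringify({
    spec: { selector: { app: 'web-app', version: color } },
  });
}

const live = 'blue';
const target = otherColor(live);
// The patch could be applied with: kubectl patch service web-app -p '<patch>'
console.log(buildSelectorPatch(target));
// → {"spec":{"selector":{"app":"web-app","version":"green"}}}
```

Rolling back is the same operation with the colors reversed, which is why the blue Deployment is kept running until green has proven itself.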

Canary Deployments

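Gradually shift traffic to the new version, pausing at each step to observe it before continuing. The example uses an Argo Rollouts Rollout resource: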

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-app-canary
spec:
  replicas: 10
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - setWeight: 25
      - pause: {duration: 5m}
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 75
      - pause: {duration: 10m}
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web
        image: myapp:v2.0
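
To judge how long such a rollout takes end to end, the steps can be walked and the pauses summed. This is an illustrative sketch only; toSeconds and canaryTimeline are hypothetical helpers, and the duration parser handles just the s, m, and h suffixes used here.

```javascript
// Canary steps copied from the Rollout above.
const steps = [
  { setWeight: 10 }, { pause: { duration: '5m' } },
  { setWeight: 25 }, { pause: { duration: '5m' } },
  { setWeight: 50 }, { pause: { duration: '10m' } },
  { setWeight: 75 }, { pause: { duration: '10m' } },
];

// Convert durations such as "30s", "5m" or "1h" to seconds.
function toSeconds(duration) {
  const match = /^(\d+)([smh])$/.exec(duration);
  if (!match) throw new Error(`unsupported duration: ${duration}`);
  return Number(match[1]) * { s: 1, m: 60, h: 3600 }[match[2]];
}

// Walk the steps and record how long each traffic weight is held.
function canaryTimeline(canarySteps) {
  const timeline = [];
  let weight = 0;
  for (const step of canarySteps) {
    if (step.setWeight !== undefined) weight = step.setWeight;
    if (step.pause) timeline.push({ weight, seconds: toSeconds(step.pause.duration) });
  }
  return timeline;
}

const timeline = canaryTimeline(steps);
const totalSeconds = timeline.reduce((sum, entry) => sum + entry.seconds, 0);
console.log(totalSeconds / 60); // → 30 (minutes of pauses before full promotion)
```

The sum covers only the configured pauses; pod startup and readiness time come on top of it.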

Health Checks and Readiness

Comprehensive Health Checks

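Configure all three probe types so that Kubernetes restarts hung containers (liveness), sends traffic only to pods that can serve it (readiness), and gives slow-starting containers time to boot before the other probes begin (startup):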

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 30
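
The startup probe settings determine how long a container may take to boot before it is restarted. As a rough sketch (it ignores timeoutSeconds and scheduling jitter, and the function name is made up), the budget is the initial delay plus one probe period per allowed failure:

```javascript
// Approximate time a container has to pass its startup probe before being restarted.
// Defaults mirror the Kubernetes probe defaults.
function startupBudgetSeconds({ initialDelaySeconds = 0, periodSeconds = 10, failureThreshold = 3 }) {
  return initialDelaySeconds + periodSeconds * failureThreshold;
}

// Values from the startupProbe above.
console.log(startupBudgetSeconds({ initialDelaySeconds: 10, periodSeconds: 5, failureThreshold: 30 }));
// → 160
```

If the application can take longer than about 160 seconds to start under load, raising failureThreshold is usually safer than lengthening the liveness probe's delay.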

Health Check Endpoints

Implement proper health check endpoints in your application:

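// Express.js example
// Assumes `database` and `redis` are already-initialized clients that expose ping().
const express = require('express');
const app = express();
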
app.get('/health/live', (req, res) => {
  // Check if the application is alive
  res.status(200).json({ status: 'alive', timestamp: Date.now() });
});

app.get('/health/ready', async (req, res) => {
  try {
    // Check database connectivity
    await database.ping();
    // Check external dependencies
    await redis.ping();

    res.status(200).json({
      status: 'ready',
      dependencies: { database: 'ok', redis: 'ok' }
    });
  } catch (error) {
    res.status(503).json({
      status: 'not ready',
      error: error.message
    });
  }
});
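
A readiness endpoint that waits indefinitely on a slow dependency can itself exceed the probe's timeoutSeconds. One possible refinement, sketched here with hypothetical helpers withTimeout and checkDependencies, is to run the checks in parallel, bound each with a timeout, and report a status per dependency:

```javascript
// Reject if the promise does not settle within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run every check in parallel and report a per-dependency status.
// `checks` maps a dependency name to a function returning a promise.
async function checkDependencies(checks, timeoutMs = 1000) {
  const names = Object.keys(checks);
  const results = await Promise.allSettled(
    names.map((name) =>
      withTimeout(Promise.resolve().then(() => checks[name]()), timeoutMs, name)
    )
  );
  const dependencies = {};
  results.forEach((result, i) => {
    dependencies[names[i]] = result.status === 'fulfilled' ? 'ok' : result.reason.message;
  });
  const ready = results.every((result) => result.status === 'fulfilled');
  return { ready, dependencies };
}

// Possible use inside the /health/ready handler:
//   const { ready, dependencies } = await checkDependencies({
//     database: () => database.ping(),
//     redis: () => redis.ping(),
//   });
//   res.status(ready ? 200 : 503).json({ status: ready ? 'ready' : 'not ready', dependencies });
```

Keeping timeoutMs below the probe's timeoutSeconds lets the endpoint answer with a 503 and a reason instead of timing out silently.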

Database Migration Strategies

Backward-Compatible Migrations

During a rollout, old and new application versions run against the same database, so every schema change must work with both:

-- Safe: Adding a new column with a default value
ALTER TABLE users ADD COLUMN preferences JSON DEFAULT '{}';

-- Safe: Creating a new table
CREATE TABLE user_sessions (
  id SERIAL PRIMARY KEY,
  user_id INTEGER REFERENCES users(id),
  session_token VARCHAR(255),
  created_at TIMESTAMP DEFAULT NOW()
);

-- Unsafe during a rollout: dropping a column the old version still reads
-- (run this only after the new version is fully deployed)
-- ALTER TABLE users DROP COLUMN old_column;

Multi-Phase Deployments

Phase 1: Deploy schema changes
Phase 2: Deploy application code
Phase 3: Clean up old schema (if needed)

# Migration job
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-v2-0
spec:
  template:
    spec:
      containers:
      - name: migrate
        image: myapp-migrator:v2.0
        command: ["./migrate.sh"]
        env:
        - name: DB_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
      restartPolicy: Never
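
The three phases above amount to sorting migration steps into those that only add to the schema and those that remove something old code may still use. A minimal sketch of that split follows; the migration list, the expand/contract labels, and planPhases are illustrative assumptions, not part of any migration tool.

```javascript
// Hypothetical migration list; `kind` marks whether a step only adds to the schema
// ("expand") or removes something the old version may still use ("contract").
const migrations = [
  { name: 'add users.preferences', kind: 'expand' },
  { name: 'create user_sessions', kind: 'expand' },
  { name: 'drop users.old_column', kind: 'contract' },
];

// Group the steps into the three deployment phases.
function planPhases(steps) {
  const unknown = steps.filter((s) => s.kind !== 'expand' && s.kind !== 'contract');
  if (unknown.length > 0) {
    throw new Error(`unclassified migration: ${unknown[0].name}`);
  }
  const namesOf = (kind) => steps.filter((s) => s.kind === kind).map((s) => s.name);
  return [
    { phase: 1, action: 'deploy schema changes', steps: namesOf('expand') },
    { phase: 2, action: 'deploy application code', steps: ['roll out new version'] },
    { phase: 3, action: 'clean up old schema', steps: namesOf('contract') },
  ];
}

console.log(planPhases(migrations));
```

Refusing unclassified steps forces every migration to be labeled before it can run, which keeps destructive changes out of phase 1.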

Service Mesh and Traffic Management

Istio Traffic Splitting

Use Istio for sophisticated traffic management:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-app-vs
spec:
  # hosts is required: the destination host(s) these routing rules apply to
  hosts:
  - web-app
  http:
  - match:
    - headers:
        canary:
          exact: "true"
    route:
    - destination:
        host: web-app
        subset: v2
  - route:
    - destination:
        host: web-app
        subset: v1
      weight: 90
    - destination:
        host: web-app
        subset: v2
      weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-app-dr
spec:
  host: web-app
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
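
The routing rules above read as: requests carrying the header canary: true always go to v2, and everything else is split 90/10. The sketch below mimics that decision in plain JavaScript to make the precedence explicit. It is not how Istio or Envoy implement routing, and the function names are hypothetical; the random source is injectable so the behavior can be checked deterministically.

```javascript
// Weighted routes copied from the default route of the VirtualService above.
const defaultRoutes = [
  { subset: 'v1', weight: 90 },
  { subset: 'v2', weight: 10 },
];

// Pick a subset in proportion to its weight; `random` returns a number in [0, 1).
function pickWeighted(routes, random) {
  const total = routes.reduce((sum, route) => sum + route.weight, 0);
  let point = random() * total;
  for (const route of routes) {
    if (point < route.weight) return route.subset;
    point -= route.weight;
  }
  return routes[routes.length - 1].subset;
}

// The header match is evaluated first, then the weighted default route.
function chooseSubset(headers, random = Math.random) {
  if (headers.canary === 'true') return 'v2';
  return pickWeighted(defaultRoutes, random);
}

console.log(chooseSubset({ canary: 'true' })); // → v2
```

The header rule gives testers a way to reach v2 on demand while ordinary users are still mostly served by v1.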

Circuit Breakers

Implement circuit breakers for resilience:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-app-circuit-breaker
spec:
  host: web-app
  trafficPolicy:
    outlierDetection:
      # consecutiveErrors is deprecated; consecutive5xxErrors is its replacement
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        maxRequestsPerConnection: 10
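
The idea behind outlier detection, stop sending requests to something that keeps failing and try again after a cool-down, can also be shown at application level. The class below is a simplified analogue written for illustration, not Istio's mechanism: it opens after a number of consecutive failures and allows calls again once the cool-down has elapsed. The class name and options are assumptions.

```javascript
// Minimal circuit breaker: opens after `failureThreshold` consecutive failures
// and stays open for `openMs` milliseconds. `now` is injectable for testing.
class CircuitBreaker {
  constructor({ failureThreshold = 3, openMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.openMs = openMs;
    this.now = now;
    this.failures = 0;
    this.openedAt = null;
  }

  isOpen() {
    return this.openedAt !== null && this.now() - this.openedAt < this.openMs;
  }

  // Run `fn` unless the circuit is open; track consecutive failures.
  async call(fn) {
    if (this.isOpen()) throw new Error('circuit open');
    try {
      const result = await fn();
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = this.now();
        this.failures = 0;
      }
      throw error;
    }
  }
}
```

The defaults mirror the 3 errors and 30s ejection time in the DestinationRule above; a production breaker would usually add a half-open state that lets only a single trial request through.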

Monitoring During Deployments

Key Metrics to Monitor

Track these metrics during deployments:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: deployment-monitoring
spec:
  groups:
  - name: deployment
    rules:
    - alert: HighErrorRate
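      # Share of requests that failed with a 5xx status (10% threshold)
      expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1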
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "High error rate during deployment"

    - alert: HighLatency
      # Aggregate buckets by le so the p95 is computed across all instances
      expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "High latency during deployment"
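
An error-rate alert is most meaningful as a share of all requests, held above a threshold for some time. The sketch below shows that logic on plain counter snapshots; it is an illustration of the idea rather than a substitute for Prometheus, and the snapshot shape and function names are assumptions.

```javascript
// Error ratio between two counter snapshots of the form { total, errors }.
function errorRatio(previous, current) {
  const total = current.total - previous.total;
  const errors = current.errors - previous.errors;
  return total > 0 ? errors / total : 0;
}

// Fire only when each of the last `required` ratios exceeds the threshold,
// a rough stand-in for an alert rule's `for:` clause.
function shouldAlert(ratios, threshold = 0.1, required = 4) {
  if (ratios.length < required) return false;
  return ratios.slice(-required).every((ratio) => ratio > threshold);
}

console.log(errorRatio({ total: 1000, errors: 10 }, { total: 1200, errors: 50 }));
// → 0.2
```

Requiring several consecutive breaches keeps a single noisy sample during pod turnover from triggering a rollback.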

Automated Rollback

Set up automated rollback based on metrics:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-app
spec:
  strategy:
    canary:
      analysis:
        templates:
        - templateName: success-rate
        args:
        - name: service-name
          value: web-app
      steps:
      - setWeight: 20
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: success-rate
          args:
          - name: service-name
            value: web-app
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 10s
    count: 3
    successCondition: result[0] >= 0.95
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(
            http_requests_total{job="{{args.service-name}}",status!~"5.."}[5m]
          )) /
          sum(rate(
            http_requests_total{job="{{args.service-name}}"}[5m]
          ))
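
The AnalysisTemplate above takes three measurements and requires each to be at least 0.95. The decision it leads to can be sketched as follows. This is a simplified model for illustration, not Argo Rollouts code; it assumes that no failed measurement is tolerated (a failureLimit of 0), and the function name is hypothetical.

```javascript
// Evaluate success-rate measurements against a threshold.
// Returns 'Failed' as soon as more than `failureLimit` measurements miss it,
// 'Successful' once `count` measurements have passed, otherwise 'Running'.
function evaluateAnalysis(measurements, { count = 3, threshold = 0.95, failureLimit = 0 } = {}) {
  let failures = 0;
  for (const value of measurements.slice(0, count)) {
    // NaN (for example from a query with no traffic) counts as a failure.
    if (!(value >= threshold)) failures += 1;
    if (failures > failureLimit) return 'Failed';
  }
  return measurements.length >= count ? 'Successful' : 'Running';
}

console.log(evaluateAnalysis([0.99, 0.97, 0.96])); // → Successful
console.log(evaluateAnalysis([0.99, 0.9, 0.99])); // → Failed
```

A 'Failed' result is what aborts the rollout and returns traffic to the stable version.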

Feature Flags

Implementing Feature Flags

Use feature flags for safer deployments:

// Feature flag implementation
const featureFlags = {
  newCheckoutFlow: process.env.FEATURE_NEW_CHECKOUT === 'true',
  enhancedSearch: process.env.FEATURE_ENHANCED_SEARCH === 'true'
};

app.post('/checkout', (req, res) => {
  if (featureFlags.newCheckoutFlow) {
    return handleNewCheckout(req, res);
  }
  return handleLegacyCheckout(req, res);
});
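
Environment-variable flags are all-or-nothing per process. If a flag should reach only a share of users, one common approach is to hash each user into a stable bucket and compare it with a percentage. The sketch below assumes that approach; bucketFor and isEnabledFor are hypothetical helpers, and the percentage would come from configuration of your choosing.

```javascript
const crypto = require('crypto');

// Map a user to a stable bucket in [0, 100) by hashing the flag name and user id.
function bucketFor(flagName, userId) {
  const digest = crypto.createHash('sha256').update(`${flagName}:${userId}`).digest();
  return digest.readUInt32BE(0) % 100;
}

// A user sees the feature when their bucket falls below the rollout percentage.
function isEnabledFor(flagName, userId, percent) {
  return bucketFor(flagName, userId) < percent;
}

// The same user always gets the same answer for a given flag and percentage.
console.log(isEnabledFor('newCheckoutFlow', 'user-42', 25));
```

Including the flag name in the hash keeps the same users from always being first in line for every new feature.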

Dynamic Feature Flags

Use external services for runtime flag management:

const LaunchDarkly = require('launchdarkly-node-server-sdk');

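// The client initializes asynchronously; until it is ready, variation()
// returns the fallback value passed as its last argument (false below).
const ldClient = LaunchDarkly.init(process.env.LAUNCHDARKLY_SDK_KEY);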

app.post('/checkout', async (req, res) => {
  const user = { key: req.user.id, email: req.user.email };
  const useNewFlow = await ldClient.variation('new-checkout-flow', user, false);

  if (useNewFlow) {
    return handleNewCheckout(req, res);
  }
  return handleLegacyCheckout(req, res);
});

Pre-deployment Validation

Integration Tests

Run comprehensive tests before deployment:

apiVersion: batch/v1
kind: Job
metadata:
  name: pre-deployment-tests
spec:
  template:
    spec:
      containers:
      - name: test-runner
        image: myapp-tests:latest
        command: ["npm", "run", "test:integration"]
        env:
        - name: TEST_ENV
          value: "staging"
      restartPolicy: Never
  backoffLimit: 0

Smoke Tests

Implement quick smoke tests:

#!/bin/bash
# smoke-test.sh

set -e

echo "Running smoke tests..."

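# -sS hides the progress meter but still prints errors;
# --max-time keeps a hung request from stalling the pipeline.

# Test health endpoint
curl -fsS --max-time 10 http://web-app:8080/health || exit 1

# Test critical API endpoints
curl -fsS --max-time 10 http://web-app:8080/api/users || exit 1
curl -fsS --max-time 10 http://web-app:8080/api/products || exit 1

# Test database connectivity
curl -fsS --max-time 10 http://web-app:8080/health/db || exit 1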

echo "All smoke tests passed!"
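
The shell script stops at the first failing endpoint. When a full list of failures is more useful, the same checks can be run from Node. This is a sketch under assumptions: runSmokeTests is a hypothetical helper, and the default fetchFn relies on the global fetch available in Node 18 and later.

```javascript
// Request each path under baseUrl and collect every failure instead of stopping early.
// `fetchFn` is injectable so the runner can be tested without a network.
async function runSmokeTests(baseUrl, paths, fetchFn = fetch) {
  const failures = [];
  for (const path of paths) {
    try {
      const response = await fetchFn(`${baseUrl}${path}`);
      if (!response.ok) failures.push(`${path}: HTTP ${response.status}`);
    } catch (error) {
      failures.push(`${path}: ${error.message}`);
    }
  }
  return failures;
}

// Possible use with the endpoints from the script above:
//   runSmokeTests('http://web-app:8080', ['/health', '/api/users', '/api/products', '/health/db'])
//     .then((failures) => {
//       if (failures.length > 0) {
//         console.error(failures.join('\n'));
//         process.exit(1);
//       }
//     });
```

Exiting with a non-zero status on any failure lets a pipeline treat the result the same way it treats the shell script.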

Best Practices Summary

Planning Phase

Design backward-compatible APIs
Plan database migrations carefully
Set up comprehensive monitoring
Prepare rollback procedures

Implementation Phase

Use proper health checks
Implement gradual rollouts
Monitor key metrics continuously
Have automated rollback triggers

Post-deployment Phase

Verify all systems are healthy
Monitor for any anomalies
Clean up old resources
Document lessons learned

Troubleshooting Common Issues

Slow Rollouts

Check resource constraints
Verify health check configurations
Monitor pod startup times

Failed Health Checks

Review application startup sequence
Check dependency availability
Validate health check endpoints

Database Lock Issues

Use shorter migration transactions
Implement migration timeouts
Consider read replicas for zero-downtime reads

Conclusion

Zero-downtime deployments require careful planning, robust monitoring, and the right tools. Start with rolling deployments and gradually adopt more sophisticated strategies like canary releases as your operations mature.

Remember that the goal is not just zero downtime, but also maintaining system reliability and user experience throughout the deployment process. Test your deployment strategies thoroughly in staging environments before applying them to production.