Monitoring and Observability in Cloud-Native Applications

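Effective monitoring and observability are crucial for maintaining reliable cloud-native applications. This guide covers the three pillars of observability, a Prometheus and Grafana setup, custom application metrics, structured logging, distributed tracing, alerting rules, and service level objectives.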

The Three Pillars of Observability

Metrics

Quantitative measurements of your system’s behavior over time.

Logs

Discrete events that provide context about what happened in your application.

Traces

Records of the path each request takes through your distributed system.

Setting Up Prometheus and Grafana

Prometheus Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
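
The relabel rule above keeps only pods whose `prometheus.io/scrape` annotation is `true` and drops every other discovered target. The following is a minimal sketch of that `keep` logic in plain JavaScript, not Prometheus's actual implementation. It assumes the documented behavior that source label values are joined with `;` and that the regex must match the whole joined string. The pod names are hypothetical.

```javascript
// Simplified model of the `keep` relabel action (illustrative only).
// Assumption: source label values are joined with ';' and the regex is fully anchored.
function keepTarget(labels, sourceLabels, regex) {
  const joined = sourceLabels.map((name) => labels[name] ?? '').join(';');
  return new RegExp(`^(?:${regex})$`).test(joined);
}

const SCRAPE_LABEL = '__meta_kubernetes_pod_annotation_prometheus_io_scrape';

// Hypothetical discovered pods; only the first is annotated prometheus.io/scrape: "true".
const pods = [
  { name: 'web-1', labels: { [SCRAPE_LABEL]: 'true' } },
  { name: 'web-2', labels: { [SCRAPE_LABEL]: 'false' } },
  { name: 'batch-1', labels: {} }, // no annotation at all
];

const scraped = pods
  .filter((pod) => keepTarget(pod.labels, [SCRAPE_LABEL], 'true'))
  .map((pod) => pod.name);

console.log(scraped); // names of the pods that would still be scraped
```

In practice this means a pod is invisible to this scrape job until its manifest carries the annotation, which is a common reason for metrics that silently never appear.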

Grafana Dashboard

{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])"
          }
        ]
      }
    ]
  }
}
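
The panel's `rate(http_requests_total[5m])` expression turns an ever-increasing counter into a per-second request rate. As a rough sketch of the idea, the function below computes the average per-second increase across a set of samples and treats any drop in value as a counter reset. It is a simplification: the real `rate()` also extrapolates to the edges of the time window, which this version does not attempt. The sample data is hypothetical.

```javascript
// Simplified rate(): per-second increase of a counter across a list of samples.
// Assumption: no extrapolation to the window boundaries, unlike Prometheus.
function simpleRate(samples) {
  if (samples.length < 2) return null; // at least two points are needed
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].value - samples[i - 1].value;
    // A drop means the counter was reset (for example, a process restart),
    // so the new value is counted as an increase from zero.
    increase += delta >= 0 ? delta : samples[i].value;
  }
  const seconds = samples[samples.length - 1].t - samples[0].t;
  return seconds > 0 ? increase / seconds : null;
}

// Hypothetical scrapes of http_requests_total every 15s, with a restart after t=30.
const samples = [
  { t: 0, value: 100 },
  { t: 15, value: 130 },
  { t: 30, value: 160 },
  { t: 45, value: 20 },
  { t: 60, value: 50 },
];

const requestsPerSecond = simpleRate(samples);
console.log(requestsPerSecond.toFixed(2), 'requests per second');
```

Because resets are absorbed this way, a counter such as `http_requests_total` can safely restart from zero on every deploy without producing negative rates on the dashboard.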

Application Metrics

Custom Metrics in Node.js

const express = require('express');
const prometheus = require('prom-client');

const app = express();

// Create metrics
const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status']
});

const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status']
});

// Middleware to collect metrics
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route ? req.route.path : req.path;

    httpRequestDuration
      .labels(req.method, route, res.statusCode)
      .observe(duration);

    httpRequestsTotal
      .labels(req.method, route, res.statusCode)
      .inc();
  });

  next();
});
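
The middleware above only records values; Prometheus still has to scrape them over HTTP in its text exposition format. To show what that output looks like, here is a hand-rolled sketch that renders one counter family. It is for illustration only, since prom-client generates this text itself, and the helper names `escapeLabelValue` and `renderCounter` are hypothetical.

```javascript
// Escape a label value for the text format: backslash, double quote, and newline.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Render one counter family: HELP and TYPE headers, then one line per label set.
function renderCounter(name, help, series) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of series) {
    const labelText = Object.entries(labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(',');
    lines.push(`${name}{${labelText}} ${value}`);
  }
  return lines.join('\n') + '\n';
}

// Hypothetical values such as the middleware might have recorded.
const exposition = renderCounter('http_requests_total', 'Total number of HTTP requests', [
  { labels: { method: 'GET', route: '/users/:id', status: 200 }, value: 3 },
  { labels: { method: 'POST', route: '/users', status: 500 }, value: 1 },
]);

console.log(exposition);
```

With prom-client, the usual approach is to serve this text from a `/metrics` route using `prometheus.register.metrics()` (which returns a promise in recent versions) together with `prometheus.register.contentType`. Note that every distinct combination of label values becomes its own line, so falling back to `req.path` for unmatched routes can create a large number of series.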

Structured Logging

JSON Logging with Winston

const winston = require('winston');

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'app.log' })
  ]
});

// Usage (inside a request handler, where `user` and `req` are in scope)
logger.info('User logged in', {
  userId: user.id,
  email: user.email,
  ip: req.ip,
  userAgent: req.get('User-Agent')
});
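
The benefit of JSON logging is that each line can be parsed and filtered by field rather than searched as free text. The sketch below illustrates this without Winston: `formatLogLine` is a hypothetical helper that produces lines similar in shape to the logger's output, and the field values are invented for the example.

```javascript
// Build one JSON log line: a fixed envelope plus arbitrary metadata fields.
function formatLogLine(level, message, meta = {}, now = new Date()) {
  return JSON.stringify({ level, message, timestamp: now.toISOString(), ...meta });
}

// Hypothetical log output from three events.
const logLines = [
  formatLogLine('info', 'User logged in', { userId: 42, ip: '203.0.113.7' }),
  formatLogLine('error', 'Payment failed', { userId: 42, orderId: 'A-1001' }),
  formatLogLine('info', 'User logged in', { userId: 7, ip: '203.0.113.9' }),
];

// Because every line is JSON, filtering is a field comparison, not a regex over text.
const errorsForUser42 = logLines
  .map((line) => JSON.parse(line))
  .filter((entry) => entry.level === 'error' && entry.userId === 42);

console.log(errorsForUser42);
```

Log aggregation systems apply the same kind of field-level query at scale, which is why consistent field names such as `userId` matter more than the wording of the message.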

Distributed Tracing

OpenTelemetry Setup

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');

const sdk = new NodeSDK({
  traceExporter: new JaegerExporter({
    endpoint: 'http://jaeger:14268/api/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();
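
Auto-instrumentation links spans from different services by passing trace context in request headers. Assuming the W3C Trace Context format, the header is `traceparent`, laid out as `version-traceId-parentId-flags`. The sketch below shows how a service would keep the incoming trace ID while generating a new span ID for its outgoing call. It is a simplified illustration with hypothetical helper names, not the SDK's own code, and it skips parts of the specification such as rejecting all-zero IDs.

```javascript
const crypto = require('crypto');

// version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parse a traceparent header; returns null when the header is missing or malformed.
function parseTraceparent(header) {
  const match = TRACEPARENT.exec(header || '');
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  return { version, traceId, parentId, flags };
}

// Build the header to send downstream: same trace ID, freshly generated span ID.
// Without a valid incoming header, a new trace is started and marked as sampled.
function childTraceparent(incoming) {
  const parent = parseTraceparent(incoming);
  const traceId = parent ? parent.traceId : crypto.randomBytes(16).toString('hex');
  const flags = parent ? parent.flags : '01';
  const spanId = crypto.randomBytes(8).toString('hex');
  return `00-${traceId}-${spanId}-${flags}`;
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const outgoing = childTraceparent(incoming);

console.log(outgoing); // same trace ID as `incoming`, different span ID
```

The shared trace ID is what lets a tracing backend such as Jaeger assemble spans from separate services into a single request timeline.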

Alerting Rules

Critical Alerts

groups:
- name: critical-alerts
  rules:
  - alert: HighErrorRate
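    # Fires when more than 10% of all requests return a 5xx status
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1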
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate detected"

  - alert: HighLatency
    expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High latency detected"
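
The HighLatency rule relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw request durations. The sketch below shows the core idea under simplifying assumptions: buckets are sorted by upper bound `le`, the last bucket is `+Inf`, and values are spread evenly inside each bucket (linear interpolation). The bucket boundaries and counts are hypothetical, and edge cases handled by Prometheus are left out.

```javascript
// Estimate quantile q from cumulative histogram buckets sorted by upper bound `le`.
// Assumptions: at least two buckets, the last one has le = Infinity, 0 < q <= 1.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total; // how many observations lie at or below the quantile
  const i = buckets.findIndex((bucket) => bucket.count >= rank);
  // A quantile inside the +Inf bucket is reported as the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[i - 1].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  // Linear interpolation within the bucket that contains the rank.
  return lower + (buckets[i].le - lower) * ((rank - below) / (buckets[i].count - below));
}

// Hypothetical cumulative counts for http_request_duration_seconds_bucket.
const healthy = [
  { le: 0.1, count: 50 }, { le: 0.5, count: 90 }, { le: 1, count: 96 },
  { le: 2.5, count: 99 }, { le: Infinity, count: 100 },
];
const degraded = [
  { le: 0.1, count: 40 }, { le: 0.5, count: 70 }, { le: 1, count: 85 },
  { le: 2.5, count: 97 }, { le: Infinity, count: 100 },
];

const p95Healthy = histogramQuantile(0.95, healthy);
const p95Degraded = histogramQuantile(0.95, degraded);

console.log(p95Healthy > 1, p95Degraded > 1); // which data set crosses the 1s threshold
```

Since the estimate can never be more precise than the bucket boundaries, it helps to place a boundary at or near the alert threshold (here, 1 second).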

Service Level Objectives (SLOs)

Defining SLOs

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: web-app-slo
spec:
  service: "web-app"
  labels:
    team: "platform"
  slos:
  - name: "availability"
    objective: 99.9
    description: "99.9% availability"
    sli:
      events:
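        # The label is `status`, matching the metric defined in the Node.js example.
        # Sloth substitutes {{.window}} with each time window it evaluates.
        error_query: sum(rate(http_requests_total{job="web-app",status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="web-app"}[{{.window}}]))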
    alerting:
      name: WebAppAvailability
      labels:
        category: availability

Together, Prometheus metrics, structured logs, OpenTelemetry traces, alerting rules, and SLOs give you visibility into how your cloud-native applications behave, so you can detect and resolve issues early.