Effective monitoring and observability are crucial for maintaining reliable cloud-native applications. This guide covers the essential components of a comprehensive observability strategy.
The Three Pillars of Observability
Metrics
Quantitative measurements of your system’s behavior over time.
Logs
Discrete events that provide context about what happened in your application.
Traces
Records of the path each request takes through your distributed system.
Setting Up Prometheus and Grafana
Prometheus Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
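The `keep` rule in this configuration drops every discovered pod whose `prometheus.io/scrape` annotation is not exactly `true`. The following is a minimal sketch of that filtering step in plain JavaScript, not Prometheus's actual implementation; `keepTargets` and the sample pods are hypothetical, and only the single-source-label case is modeled.

```javascript
// Simplified model of a Prometheus `keep` relabel rule.
// Assumption: one source label only; real relabeling joins several source labels with ';'.
function keepTargets(targets, sourceLabel, regex) {
  // Prometheus anchors relabel regexes at both ends, so "true" must match the whole value.
  const anchored = new RegExp(`^(?:${regex})$`);
  return targets.filter((labels) => anchored.test(labels[sourceLabel] ?? ''));
}

// Hypothetical discovered pods; pod annotations appear as __meta_* labels.
const discovered = [
  { pod: 'api-1', __meta_kubernetes_pod_annotation_prometheus_io_scrape: 'true' },
  { pod: 'api-2', __meta_kubernetes_pod_annotation_prometheus_io_scrape: 'false' },
  { pod: 'batch-1' }, // no annotation: the label is missing and treated as empty
];

const kept = keepTargets(
  discovered,
  '__meta_kubernetes_pod_annotation_prometheus_io_scrape',
  'true'
);
console.log(kept.map((t) => t.pod)); // [ 'api-1' ]
```

In practice this means a pod is only scraped once it carries the annotation `prometheus.io/scrape: "true"`.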
Grafana Dashboard
{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])"
          }
        ]
      }
    ]
  }
}
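The panel's `rate(http_requests_total[5m])` expression turns an ever-growing counter into a per-second average over the window. A simplified sketch of that calculation follows; `simpleRate` and the sample values are hypothetical, and Prometheus additionally extrapolates to the window boundaries, which is left out here.

```javascript
// Simplified per-second rate over a window of counter samples.
// Assumption: samples are { t: seconds, v: counterValue } sorted by time.
function simpleRate(samples) {
  if (samples.length < 2) return null; // at least two samples are needed
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the counter reset (e.g. a process restart): count up from zero.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return elapsed > 0 ? increase / elapsed : null;
}

// Hypothetical samples scraped every 15s, with a restart before t=45.
const samples = [
  { t: 0, v: 100 },
  { t: 15, v: 130 },
  { t: 30, v: 160 },
  { t: 45, v: 20 },
  { t: 60, v: 50 },
];
console.log(simpleRate(samples)); // 110 / 60, roughly 1.83 requests per second
```

Because resets are handled, the graph stays continuous across pod restarts instead of showing a large negative spike.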
Application Metrics
Custom Metrics in Node.js
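const express = require('express');
const prometheus = require('prom-client');

const app = express();

// Create metrics
const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status']
});
const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status']
});

// Middleware to collect metrics
app.use((req, res, next) => {
  const start = Date.now();
  // 'finish' fires once the response has been sent, so the status code is final
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Prefer the matched route pattern (e.g. /users/:id) to keep label cardinality low
    const route = req.route ? req.route.path : req.path;
    httpRequestDuration
      .labels(req.method, route, res.statusCode)
      .observe(duration);
    httpRequestsTotal
      .labels(req.method, route, res.statusCode)
      .inc();
  });
  next();
});

// Expose the collected metrics for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', prometheus.register.contentType);
  res.end(await prometheus.register.metrics());
});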
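Each distinct combination of `method`, `route`, and `status` becomes its own time series when Prometheus scrapes the application. The sketch below shows roughly what the scraped text looks like for the request counter. `TinyCounter` is a hypothetical stand-in written for illustration (prom-client handles this for you), and it skips the escaping of special characters in label values.

```javascript
// Minimal in-memory counter that renders the Prometheus text exposition format.
// Assumption: TinyCounter is illustrative only and does not escape label values.
class TinyCounter {
  constructor(name, help) {
    this.name = name;
    this.help = help;
    this.series = new Map(); // serialized label set -> value
  }

  inc(labels, amount = 1) {
    // Sort label names so the same label set always maps to the same series.
    const key = Object.keys(labels)
      .sort()
      .map((k) => `${k}="${labels[k]}"`)
      .join(',');
    this.series.set(key, (this.series.get(key) ?? 0) + amount);
  }

  render() {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
    ];
    for (const [key, value] of this.series) {
      lines.push(`${this.name}{${key}} ${value}`);
    }
    return lines.join('\n') + '\n';
  }
}

const counter = new TinyCounter('http_requests_total', 'Total number of HTTP requests');
counter.inc({ method: 'GET', route: '/users/:id', status: 200 });
counter.inc({ method: 'GET', route: '/users/:id', status: 200 });
counter.inc({ method: 'POST', route: '/users', status: 500 });
console.log(counter.render());
```

Two requests to the same route pattern share one series, which is why labeling by route pattern rather than raw URL keeps the number of series bounded.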
Structured Logging
JSON Logging with Winston
const winston = require('winston');

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'app.log' })
  ]
});

// Usage (inside a request handler where `user` and `req` are in scope)
logger.info('User logged in', {
  userId: user.id,
  email: user.email,
  ip: req.ip,
  userAgent: req.get('User-Agent')
});
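To make the benefit of structured logging concrete, here is a rough sketch of the kind of single-line JSON entry such a logger emits and how a downstream tool can read it back. `formatEntry` is a hypothetical helper, not part of Winston, and the exact fields and ordering of Winston's real output may differ.

```javascript
// Build one JSON log line from a level, a message, and metadata.
// Assumption: formatEntry is illustrative; the clock is injectable for repeatable output.
function formatEntry(level, message, meta = {}, now = new Date()) {
  const entry = { level, message, timestamp: now.toISOString() };
  for (const [key, value] of Object.entries(meta)) {
    // Error objects serialize to {} with JSON.stringify, so copy message and stack explicitly.
    entry[key] = value instanceof Error
      ? { message: value.message, stack: value.stack }
      : value;
  }
  return JSON.stringify(entry);
}

const line = formatEntry(
  'info',
  'User logged in',
  { userId: 42, ip: '203.0.113.7' },
  new Date('2024-01-01T00:00:00Z')
);
console.log(line);
// {"level":"info","message":"User logged in","timestamp":"2024-01-01T00:00:00.000Z","userId":42,"ip":"203.0.113.7"}

// One JSON object per line is easy to parse and filter downstream.
console.log(JSON.parse(line).userId); // 42
```

Because every field is a named key, a log pipeline can filter on `userId` or `level` without fragile text parsing.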
Distributed Tracing
OpenTelemetry Setup
// Load this file before the rest of the application so auto-instrumentation can patch modules
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
// Note: newer OpenTelemetry releases deprecate the Jaeger exporter in favor of OTLP exporters
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');

const sdk = new NodeSDK({
  traceExporter: new JaegerExporter({
    endpoint: 'http://jaeger:14268/api/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();
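Auto-instrumentation links spans across services by propagating trace context in request headers, by default using the W3C Trace Context `traceparent` header. The sketch below shows the idea with hand-rolled helpers. The function names are hypothetical, the OpenTelemetry SDK does this for you, and some validation rules from the specification are omitted.

```javascript
const { randomBytes } = require('crypto');

// Build a `traceparent` header: version-traceId-parentId-flags.
function createTraceparent(sampled = true) {
  const traceId = randomBytes(16).toString('hex'); // 32 hex characters
  const spanId = randomBytes(8).toString('hex'); // 16 hex characters
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

// Parse an incoming header; returns null if it is malformed.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // All-zero IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// A downstream service keeps the trace ID and substitutes its own span ID as the new parent.
function childTraceparent(header) {
  const parent = parseTraceparent(header);
  if (!parent) return createTraceparent(); // no valid context: start a new trace
  const spanId = randomBytes(8).toString('hex');
  return `00-${parent.traceId}-${spanId}-${parent.sampled ? '01' : '00'}`;
}

const incoming = createTraceparent();
const outgoing = childTraceparent(incoming);
console.log(incoming);
console.log(outgoing);
```

The shared trace ID is what lets the tracing backend stitch spans from different services into one request path.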
Alerting Rules
Critical Alerts
groups:
  - name: critical-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
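The `HighLatency` rule relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw observations. Here is a simplified sketch of that estimate, assuming sorted cumulative buckets ending in `+Inf`. `estimateQuantile` and the bucket values are hypothetical, and several edge cases that Prometheus handles are omitted.

```javascript
// Simplified estimate of a quantile from cumulative histogram buckets.
// Assumption: buckets are sorted by upper bound `le`, counts are cumulative,
// and the last bucket has le = Infinity.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // If the rank falls in the +Inf bucket, report the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  // Assume observations are spread evenly inside the bucket (linear interpolation).
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Hypothetical bucket rates for http_request_duration_seconds_bucket.
const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 80 },
  { le: 1, count: 90 },
  { le: 2.5, count: 98 },
  { le: Infinity, count: 100 },
];
const p95 = estimateQuantile(0.95, buckets);
console.log(p95 > 1 ? 'HighLatency would be pending' : 'latency within threshold');
```

The result is only as precise as the bucket boundaries, so it helps to place a boundary near the alert threshold (1 second here).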
Service Level Objectives (SLOs)
Defining SLOs
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: web-app-slo
spec:
  service: "web-app"
  labels:
    team: "platform"
  slos:
    - name: "availability"
      objective: 99.9
      description: "99.9% availability"
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{job="web-app",status=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total{job="web-app"}[{{.window}}]))
      alerting:
        name: WebAppAvailability
        labels:
          category: availability
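A 99.9% objective implies an error budget of 0.1% of requests. The arithmetic behind that budget and the burn rate used for SLO alerting can be sketched as follows, assuming a 30-day window; the function names are illustrative and not part of Sloth.

```javascript
// Error-budget arithmetic for an availability SLO.
// Assumption: a 30-day window unless stated otherwise.
function errorBudget(objectivePercent, windowDays = 30) {
  const budgetRatio = 1 - objectivePercent / 100; // fraction of requests allowed to fail
  const windowMinutes = windowDays * 24 * 60;
  // Minutes of total outage that would consume the whole budget.
  return { budgetRatio, fullOutageMinutes: budgetRatio * windowMinutes };
}

// Burn rate: how fast the budget is being consumed relative to plan.
// 1 means the budget lasts exactly the window; 10 means it is gone in a tenth of it.
function burnRate(errorRatio, objectivePercent) {
  return errorRatio / (1 - objectivePercent / 100);
}

const budget = errorBudget(99.9);
console.log(budget.fullOutageMinutes.toFixed(1)); // 43.2
console.log(burnRate(0.005, 99.9).toFixed(1)); // 5.0
```

So a sustained 0.5% error ratio against a 99.9% objective burns the budget five times faster than the SLO allows.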
Together, Prometheus metrics, structured logs, distributed traces, alerting rules, and SLOs give you visibility into your cloud-native applications, so you can detect and resolve issues proactively.