Monitoring and Observability - Visualizing System Health

15 min read | 2025.12.02

What is Observability

Observability is the ability to understand a system’s internal state from its external outputs. It enables understanding not just “what is happening” but “why it’s happening.”

Difference from Monitoring: Monitoring detects known problems through predefined checks, whereas observability lets you investigate issues you did not anticipate.

The Three Pillars

| Pillar | Question | Purpose |
| --- | --- | --- |
| Metrics | “What’s happening” | Numerical data over time |
| Logs | “What details” | Event details |
| Traces | “How it flowed” | Request path tracking |

Metrics

Record numerical data over time.

Types of Metrics

| Type | Description | Example |
| --- | --- | --- |
| Counter | Only increases | Request count, error count |
| Gauge | Increases and decreases | CPU usage, memory usage |
| Histogram | Distribution | Response time |
| Summary | Statistical values | Percentiles |
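To make the Histogram type more concrete, here is a minimal sketch of how cumulative buckets can be built and how a percentile can be estimated from them by linear interpolation. The function names `toCumulativeBuckets` and `estimateQuantile` are hypothetical, the bucket bounds reuse the values from this article's Prometheus example, and the interpolation is a simplification rather than the exact algorithm of any particular backend.

```javascript
// Upper bounds in seconds (same values as the Prometheus example in this article).
const BUCKETS = [0.1, 0.5, 1, 2, 5];

// Build cumulative buckets: each bucket counts observations <= its upper bound.
// The last entry is the implicit +Inf bucket, which counts every observation.
// Assumes at least one finite bound.
function toCumulativeBuckets(observations, bounds = BUCKETS) {
  const finite = bounds.map((le) => ({
    le,
    count: observations.filter((value) => value <= le).length,
  }));
  return [...finite, { le: Infinity, count: observations.length }];
}

// Estimate quantile q (0..1) by interpolating inside the bucket that contains it.
// Simplification: the lowest bucket is assumed to start at 0.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = buckets.findIndex((bucket) => bucket.count >= rank);

  // The quantile lies beyond the highest finite bound: report that bound.
  if (buckets[i].le === Infinity) return buckets[i - 1].le;

  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const countBelow = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - countBelow;
  if (inBucket === 0) return lower;

  return lower + (buckets[i].le - lower) * ((rank - countBelow) / inBucket);
}

// Usage: six response times in seconds
const buckets = toCumulativeBuckets([0.05, 0.2, 0.3, 0.7, 1.5, 3]);
console.log(estimateQuantile(0.95, buckets));
```

The estimate is only as precise as the bucket layout: a percentile that lands in a wide bucket (here 2 to 5 seconds) can be far from the true value, so bucket bounds are worth choosing around the latencies you care about.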

Prometheus Example

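// Assumes an Express application
const express = require('express');
const prometheus = require('prom-client');

const app = express();

// Counter: total requests, labeled by method, path, and status code
const requestCounter = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status']
});

// Histogram: response time distribution (bucket upper bounds in seconds)
const responseTime = new prometheus.Histogram({
  name: 'http_response_time_seconds',
  help: 'HTTP response time in seconds',
  buckets: [0.1, 0.5, 1, 2, 5]
});

// Measurement: time each request and count it once the response has been sent
app.use((req, res, next) => {
  const end = responseTime.startTimer();
  res.on('finish', () => {
    requestCounter.inc({
      method: req.method,
      // Prefer the route pattern (e.g. /users/:id) over the raw URL path
      // so that the number of distinct label values stays bounded
      path: req.route ? req.route.path : req.path,
      status: res.statusCode
    });
    end();
  });
  next();
});

// Expose the collected metrics for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', prometheus.register.contentType);
  res.end(await prometheus.register.metrics());
});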

Important Metrics (RED/USE)

| Method | For | Metrics |
| --- | --- | --- |
| RED | Services | Rate (requests per second), Errors (error rate), Duration (response time) |
| USE | Resources | Utilization (usage rate), Saturation (queued or waiting work), Errors (error count) |
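As a rough illustration of the RED method, the sketch below derives the three values from a batch of request records collected over one time window. The record shape (`status`, `durationMs`) and the function name `computeRed` are assumptions made for this example; in practice a metrics backend would compute these from counters and histograms.

```javascript
// Compute RED metrics for one observation window.
// `requests` is a hypothetical array of { status, durationMs } records.
function computeRed(requests, windowSeconds) {
  const total = requests.length;
  const errors = requests.filter((r) => r.status >= 500).length;
  const totalDurationMs = requests.reduce((sum, r) => sum + r.durationMs, 0);

  return {
    rate: total / windowSeconds,                        // Rate: requests per second
    errorRatio: total === 0 ? 0 : errors / total,       // Errors: fraction of 5xx responses
    meanDurationMs: total === 0 ? 0 : totalDurationMs / total, // Duration: mean response time
  };
}

// Usage: four requests observed during a 2-second window
const red = computeRed(
  [
    { status: 200, durationMs: 100 },
    { status: 200, durationMs: 200 },
    { status: 500, durationMs: 300 },
    { status: 503, durationMs: 400 },
  ],
  2
);
console.log(red);
```

The mean is used here only to keep the sketch short. For Duration, percentiles such as p95 usually describe user experience better than an average.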

Logs

Record event details.

Structured Logs

// Example values
const userId = '123';
const ip = '192.168.1.1';

// Unstructured log (hard to search)
console.log(`User ${userId} logged in from ${ip}`);

// Structured log (easy to search and analyze)
const log = {
  timestamp: new Date().toISOString(),
  level: 'info',
  message: 'User logged in',
  userId,
  ip,
  userAgent: 'Mozilla/5.0...',
  requestId: 'req_abc123'
};
console.log(JSON.stringify(log));

Log Levels

| Level | Use |
| --- | --- |
| ERROR | Errors, abnormal situations |
| WARN | Warnings, potential issues |
| INFO | Important events |
| DEBUG | Debug information |
| TRACE | Detailed trace information |
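The levels above are typically used as a threshold: a logger configured at INFO emits ERROR, WARN, and INFO entries and drops the rest. The following is a minimal sketch of that idea combined with structured output. `createLogger` and its `write` parameter are hypothetical names for this example, not the API of any specific logging library.

```javascript
// Severity order, most severe first, matching the levels listed above.
const LEVELS = ['error', 'warn', 'info', 'debug', 'trace'];

// Create a logger that drops entries less severe than `minLevel`.
// `write` is injectable so the output can be captured instead of printed.
function createLogger(minLevel = 'info', write = (line) => console.log(line)) {
  const threshold = LEVELS.indexOf(minLevel);
  if (threshold === -1) throw new Error(`Unknown log level: ${minLevel}`);

  const log = (level, message, fields = {}) => {
    if (LEVELS.indexOf(level) > threshold) return; // less severe than threshold
    // Core fields come last so that extra fields cannot overwrite them.
    write(JSON.stringify({
      ...fields,
      timestamp: new Date().toISOString(),
      level,
      message,
    }));
  };

  // Expose one method per level: logger.error(...), logger.warn(...), and so on.
  return Object.fromEntries(
    LEVELS.map((level) => [level, (message, fields) => log(level, message, fields)])
  );
}

// Usage: at the 'warn' threshold, the info entry is dropped
const logger = createLogger('warn');
logger.error('Payment failed', { requestId: 'req_abc123' });
logger.info('User logged in', { userId: '123' });
```

Keeping the threshold configurable makes it possible to run production at INFO and temporarily lower it to DEBUG while investigating an incident.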

Log Management Stacks

| Stack | Flow |
| --- | --- |
| ELK | App → Filebeat → Logstash → Elasticsearch → Kibana |
| Loki | App → Promtail → Loki → Grafana |

Traces

Track the path of requests through the system.

Distributed Tracing

Trace ID: abc123

flowchart TB
    subgraph Gateway["API Gateway (50ms)"]
        Auth["Auth Service (10ms)"]
        subgraph User["User Service (30ms)"]
            DB["Database Query (15ms)"]
        end
    end
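One practical use of a trace like the one above is finding where time is actually spent. The sketch below represents the four spans as a flat list with parent links and computes each span's self time, meaning its duration minus the time spent in its direct children. The span IDs and the `selfTimes` function are hypothetical, the durations come from the diagram, and child spans are assumed to run one after another without overlap.

```javascript
// Spans from the diagram, flattened with parent links (IDs are illustrative).
const spans = [
  { id: 'gateway', parentId: null, name: 'API Gateway', durationMs: 50 },
  { id: 'auth', parentId: 'gateway', name: 'Auth Service', durationMs: 10 },
  { id: 'user', parentId: 'gateway', name: 'User Service', durationMs: 30 },
  { id: 'db', parentId: 'user', name: 'Database Query', durationMs: 15 },
];

// Self time = span duration minus the summed duration of its direct children.
// Assumes children run sequentially (no overlap).
function selfTimes(allSpans) {
  const childTotal = new Map();
  for (const span of allSpans) {
    if (span.parentId !== null) {
      const soFar = childTotal.get(span.parentId) ?? 0;
      childTotal.set(span.parentId, soFar + span.durationMs);
    }
  }
  return allSpans.map((span) => ({
    name: span.name,
    selfMs: span.durationMs - (childTotal.get(span.id) ?? 0),
  }));
}

console.log(selfTimes(spans));
```

With these numbers the gateway itself accounts for 10 ms, while the User Service and its database query account for 15 ms each, which points the investigation toward the user lookup rather than the gateway.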

OpenTelemetry Example

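const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-service');

// validateOrder and processPayment are application functions defined elsewhere
async function processOrder(orderId) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);

    try {
      await validateOrder(orderId);
      await processPayment(orderId);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      // Attach the exception to the span before marking it as failed
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      // Always end the span so it is exported, whether the call succeeded or not
      span.end();
    }
  });
}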

Major Tools

| Tool | Features |
| --- | --- |
| Jaeger | CNCF project, developed by Uber |
| Zipkin | Developed by Twitter, simple |
| AWS X-Ray | AWS integration |
| Datadog APM | SaaS, rich features |

Alerting

Send notifications based on metrics.

Alert Design

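# Prometheus alerting rules example (Alertmanager routes the resulting notifications)
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error ratio is {{ $value }} over the last 5 minutes"

      - alert: SlowResponseTime
        # p95 response time, aggregated across all series by bucket boundary
        expr: histogram_quantile(0.95, sum by (le) (rate(http_response_time_seconds_bucket[5m]))) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 response time above 2 seconds"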

Preventing Alert Fatigue

Good alert conditions:

  • ✓ Only for events requiring action
  • ✓ Appropriate thresholds (reduce noise)
  • ✓ Clear response procedures
  • ✓ Escalation policies

Alerts to avoid:

  • ✗ Information only (check dashboard instead)
  • ✗ Events that auto-recover quickly
  • ✗ Late-night alerts that can’t be addressed
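One common way to avoid alerting on events that auto-recover quickly is to require that the condition holds continuously for a minimum duration before firing, which is the idea behind the `for:` field in the rules shown earlier. The sketch below illustrates that logic on a list of samples. The function name `shouldFire` and the sample shape are assumptions for this example, and the real evaluation in Prometheus (pending and firing states per rule evaluation) is more involved.

```javascript
// Decide whether an alert should fire: the value must have stayed above the
// threshold continuously for at least `forSeconds`, up to the latest sample.
// `samples` is a hypothetical array of { timestamp (seconds), value }, oldest first.
function shouldFire(samples, threshold, forSeconds) {
  if (samples.length === 0) return false;

  const now = samples[samples.length - 1].timestamp;
  let breachedSince = null;

  // Walk backwards from the newest sample while the condition still holds.
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].value <= threshold) break;
    breachedSince = samples[i].timestamp;
  }

  return breachedSince !== null && now - breachedSince >= forSeconds;
}

// Usage: one sample per minute, error ratio threshold 0.1, hold time 5 minutes
const sustained = [0.05, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2].map((value, i) => ({
  timestamp: i * 60,
  value,
}));
console.log(shouldFire(sustained, 0.1, 300));
```

A longer hold time reduces noise but delays detection, so it is worth matching it to how quickly someone actually needs to respond.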

Dashboards

Grafana Dashboard Structure

| Section | Panels |
| --- | --- |
| Service Overview | Request Rate, Response Time (p95) |
| Health | Error Rate, Success Rate |
| Resource Utilization | CPU, Memory, Disk, Network |
| Debugging | Recent Error Logs |

Summary

Observability is essential for understanding the health of complex systems. Combining the three pillars of metrics, logs, and traces makes it possible to detect problems quickly and investigate their root causes. Well-designed alerts and dashboards then enable rapid incident response.
