Monitoring and Observability - Visualizing System Health

What is Observability

Observability is the ability to understand a system’s internal state from its external outputs. It enables understanding not just “what is happening” but “why it’s happening.”

Difference from Monitoring: Monitoring detects known problems, but observability enables investigating unknown issues.

The Three Pillars

Pillar	Question	Purpose
Metrics	“What’s happening”	Numerical data over time
Logs	“What details”	Event details
Traces	“How it flowed”	Request path tracking

Metrics

Record numerical data over time.

Types of Metrics

Type	Description	Example
Counter	Only increases	Request count, error count
Gauge	Increases and decreases	CPU usage, memory usage
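Histogram	Observations counted into configurable buckets	Response time distribution
Summary	Quantiles computed on the client, plus count and sum	Latency percentiles (p50, p99)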
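
To make the histogram row more concrete, the sketch below shows in plain JavaScript, without any metrics library, how observations land in cumulative buckets and how a percentile can later be estimated from the bucket counts by linear interpolation. The names `createHistogram`, `observe`, and `estimateQuantile` are hypothetical, and the interpolation is a simplified illustration of the idea behind Prometheus-style quantile estimation, not its actual implementation.

```javascript
// Hypothetical helper: a histogram with cumulative "less than or equal" buckets.
function createHistogram(upperBounds) {
  const bounds = [...upperBounds].sort((a, b) => a - b);
  return {
    bounds: [...bounds, Infinity], // the last bucket catches every observation
    counts: new Array(bounds.length + 1).fill(0),
    sum: 0,
    count: 0,
  };
}

function observe(histogram, value) {
  histogram.sum += value;
  histogram.count += 1;
  // Cumulative buckets: increment every bucket whose upper bound covers the value.
  histogram.bounds.forEach((bound, i) => {
    if (value <= bound) histogram.counts[i] += 1;
  });
}

// Estimate quantile q (0..1) by interpolating inside the bucket that contains it.
function estimateQuantile(histogram, q) {
  if (histogram.count === 0) return NaN;
  const rank = q * histogram.count;
  const i = histogram.counts.findIndex((c) => c >= rank);
  const upper = histogram.bounds[i];
  const lower = i === 0 ? 0 : histogram.bounds[i - 1];
  const below = i === 0 ? 0 : histogram.counts[i - 1];
  const inBucket = histogram.counts[i] - below;
  // The open-ended bucket has no upper bound to interpolate towards.
  if (upper === Infinity || inBucket === 0) return lower;
  return lower + (upper - lower) * ((rank - below) / inBucket);
}

// Same bucket boundaries (in seconds) as the Prometheus example in this article.
const h = createHistogram([0.1, 0.5, 1, 2, 5]);
[0.05, 0.2, 0.3, 0.7, 1.5].forEach((v) => observe(h, v));

console.log(h.counts); // [ 1, 3, 4, 5, 5, 5 ]
console.log(estimateQuantile(h, 0.5));
console.log(estimateQuantile(h, 0.95));
```

The estimate is only as precise as the bucket boundaries allow, which is why bucket choice matters when a percentile feeds an alert threshold.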

Prometheus Example

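const express = require('express');
const prometheus = require('prom-client');

const app = express();

// Counter: total requests, labeled by method, path, and status code
const requestCounter = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status']
});

// Histogram: response time distribution (bucket upper bounds in seconds)
const responseTime = new prometheus.Histogram({
  name: 'http_response_time_seconds',
  help: 'HTTP response time in seconds',
  buckets: [0.1, 0.5, 1, 2, 5]
});

// Measurement: time every request and count it once the response has finished
app.use((req, res, next) => {
  const end = responseTime.startTimer();
  res.on('finish', () => {
    requestCounter.inc({
      method: req.method,
      // Prefer the matched route pattern (e.g. /users/:id) over the raw URL
      // to keep the number of distinct label values low
      path: req.route ? req.route.path : req.path,
      status: res.statusCode
    });
    end();
  });
  next();
});

// Expose the collected metrics for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', prometheus.register.contentType);
  res.end(await prometheus.register.metrics());
});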

Important Metrics (RED/USE)

Method	For	Metrics
RED	Services	Rate (Request rate), Errors (Error rate), Duration (Response time)
USE	Resources	Utilization (Share of capacity in use), Saturation (Queued or waiting work), Errors
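
As a rough illustration of the RED method, the sketch below derives rate, error ratio, and a p95 duration from a list of request records. The record shape (`status`, `durationMs`), the 60-second window, and the function name `redMetrics` are assumptions made for this example. In a real setup these values would normally come from queries against the metrics backend rather than from raw records.

```javascript
// Hypothetical request records collected during one 60-second window.
const windowSeconds = 60;
const requests = [
  { status: 200, durationMs: 120 },
  { status: 200, durationMs: 80 },
  { status: 500, durationMs: 900 },
  { status: 200, durationMs: 95 },
  { status: 404, durationMs: 60 },
  { status: 503, durationMs: 1500 },
];

function redMetrics(records, windowSec) {
  // Errors: only server-side failures (5xx) count against the service here.
  const errors = records.filter((r) => r.status >= 500).length;
  const durations = records.map((r) => r.durationMs).sort((a, b) => a - b);
  // Duration: nearest-rank 95th percentile over the sorted durations.
  const p95Index = Math.max(0, Math.ceil(0.95 * durations.length) - 1);
  return {
    ratePerSecond: records.length / windowSec,
    errorRatio: records.length === 0 ? 0 : errors / records.length,
    p95DurationMs: durations.length === 0 ? null : durations[p95Index],
  };
}

const red = redMetrics(requests, windowSeconds);
console.log(red);
```

Treating only 5xx responses as errors is a design choice in this sketch: a 404 usually reflects a client mistake, not a service failure, but some teams track 4xx rates separately.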

Logs

Record event details.

Structured Logs

// Unstructured log (hard to search)
console.log(`User ${userId} logged in from ${ip}`);

// Structured log (easy to search and analyze)
const log = {
  timestamp: new Date().toISOString(),
  level: 'info',
  message: 'User logged in',
  userId: '123',
  ip: '192.168.1.1',
  userAgent: 'Mozilla/5.0...',
  requestId: 'req_abc123'
};
console.log(JSON.stringify(log));

Log Levels

Level	Use
ERROR	Errors, abnormal situations
WARN	Warnings, potential issues
INFO	Important events
DEBUG	Debug information
TRACE	Detailed trace information
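
Structured logs and log levels are usually combined in one logger. The following is a minimal sketch of that idea, assuming a hypothetical `createLogger` helper that drops entries more verbose than the configured level and writes each remaining entry as one JSON line. Production systems typically use an established logging library instead.

```javascript
// Lower number = more severe. Order follows the log level table above.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 3, trace: 4 };

// Hypothetical helper: `write` receives one JSON string per log entry.
function createLogger(minLevel, write = (line) => console.log(line)) {
  const threshold = LEVELS[minLevel];
  if (threshold === undefined) {
    throw new Error(`Unknown log level: ${minLevel}`);
  }
  const logger = {};
  for (const [level, severity] of Object.entries(LEVELS)) {
    logger[level] = (message, fields = {}) => {
      if (severity > threshold) return; // more verbose than configured: drop
      write(JSON.stringify({
        timestamp: new Date().toISOString(),
        level,
        message,
        ...fields,
      }));
    };
  }
  return logger;
}

// Collect lines in memory so the example output is easy to inspect.
const lines = [];
const logger = createLogger('info', (line) => lines.push(line));

logger.info('User logged in', { userId: '123', requestId: 'req_abc123' });
logger.debug('Session cache hit', { userId: '123' }); // dropped at level "info"
logger.error('Payment failed', { requestId: 'req_abc123' });

lines.forEach((line) => console.log(line));
```

Carrying the same `requestId` on every entry is what lets a log backend gather all lines belonging to one request.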

Log Management Stacks

Stack	Flow
ELK	App → Filebeat → Logstash → Elasticsearch → Kibana
Loki	App → Promtail → Loki → Grafana

Traces

Track the path of requests through the system.

Distributed Tracing

Trace ID: abc123

flowchart TB
    subgraph Gateway["API Gateway (50ms)"]
        Auth["Auth Service (10ms)"]
        subgraph User["User Service (30ms)"]
            DB["Database Query (15ms)"]
        end
    end
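
For the services in such a diagram to end up in the same trace, each outgoing call has to carry the trace ID along. One common way to do this is the W3C Trace Context `traceparent` header, which has the form `version-traceId-parentId-flags`. The sketch below builds and parses that header by hand for illustration. The function names are hypothetical, and in practice a tracing SDK such as OpenTelemetry handles propagation automatically.

```javascript
const crypto = require('crypto');

// Start a new trace: 16-byte trace ID and 8-byte span ID, both hex-encoded.
function startTrace() {
  return {
    traceId: crypto.randomBytes(16).toString('hex'),
    spanId: crypto.randomBytes(8).toString('hex'),
    sampled: true,
  };
}

// A child span keeps the trace ID and gets its own span ID.
function childContext(parent) {
  return {
    traceId: parent.traceId,
    spanId: crypto.randomBytes(8).toString('hex'),
    sampled: parent.sampled,
  };
}

// Serialize the context into a traceparent header value (version 00).
function toTraceparent(ctx) {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

// Parse a header received from an upstream service; null if malformed.
function parseTraceparent(header) {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  return {
    traceId: match[1],
    spanId: match[2],
    sampled: (parseInt(match[3], 16) & 1) === 1,
  };
}

// The gateway starts the trace, then calls a downstream service.
const root = startTrace();
const outgoing = childContext(root);
const header = toTraceparent(outgoing);
console.log(header);

// The downstream service reads the header and continues the same trace.
const received = parseTraceparent(header);
console.log(received.traceId === root.traceId);
```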

OpenTelemetry Example

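const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-service');

// validateOrder and processPayment are the application's own functions (not shown).
async function processOrder(orderId) {
  // startActiveSpan makes the span current, so spans created by callees become its children.
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);

    try {
      await validateOrder(orderId);
      await processPayment(orderId);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      // Attach the exception to the span before marking it as failed
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}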

Major Tools

Tool	Features
Jaeger	CNCF project, developed by Uber
Zipkin	Developed by Twitter, simple
AWS X-Ray	AWS integration
Datadog APM	SaaS, rich features

Alerting

Send notifications based on metrics.

Alert Design

# Prometheus alerting rules example (Prometheus evaluates the rules; Alertmanager routes the notifications)
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} for the last 5 minutes"

      - alert: SlowResponseTime
        expr: histogram_quantile(0.95, sum by (le) (rate(http_response_time_seconds_bucket[5m]))) > 2
        for: 10m
        labels:
          severity: warning

Preventing Alert Fatigue

Good alert conditions:

✓ Only for events requiring action
✓ Appropriate thresholds (reduce noise)
✓ Clear response procedures
✓ Escalation policies

Alerts to avoid:

✗ Purely informational events (put these on a dashboard instead)
✗ Events that auto-recover quickly
✗ Late-night alerts that no one can act on

Dashboards

Grafana Dashboard Structure

Section	Panels
Service Overview	Request Rate, Response Time (p95)
Health	Error Rate, Success Rate
Resource Utilization	CPU, Memory, Disk, Network
Debugging	Recent Error Logs

Summary

Observability is essential for understanding the health of complex systems. Combining the three pillars (metrics, logs, and traces) makes it possible to detect problems quickly and investigate their root causes. Well-designed alerts and dashboards, in turn, enable rapid incident response.
