What is Observability
Observability is the ability to understand a system’s internal state from its external outputs. It enables understanding not just “what is happening” but “why it’s happening.”
Difference from monitoring: monitoring detects problems you anticipated in advance (known failure modes), while observability lets you investigate problems you did not anticipate.
The Three Pillars
| Pillar | Question | Purpose |
|---|---|---|
| Metrics | “What is happening?” | Numerical data over time |
| Logs | “What exactly happened?” | Event details |
| Traces | “How did the request flow?” | Request path tracking |
Metrics
Record numerical data over time.
Types of Metrics
| Type | Description | Example |
|---|---|---|
| Counter | Only increases | Request count, error count |
| Gauge | Increases and decreases | CPU usage, memory usage |
| Histogram | Distribution | Response time |
| Summary | Quantiles calculated on the client side | Latency percentiles (p50, p99) |
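To make the semantics in the table concrete, here is a minimal plain-JavaScript sketch of a counter, a gauge, and a histogram. The class names (`SimpleCounter`, `SimpleGauge`, `SimpleHistogram`) are hypothetical and are not the prom-client API. The cumulative "less than or equal to" bucket layout is an assumption modeled on how Prometheus-style histograms are exposed.

```javascript
// Illustrative only: hypothetical classes, not the prom-client API.

// Counter: a value that only ever goes up.
class SimpleCounter {
  constructor() { this.value = 0; }
  inc(amount = 1) {
    if (amount < 0) throw new RangeError('a counter cannot decrease');
    this.value += amount;
  }
}

// Gauge: a value that can be set, raised, or lowered at any time.
class SimpleGauge {
  constructor() { this.value = 0; }
  set(v) { this.value = v; }
  inc(amount = 1) { this.value += amount; }
  dec(amount = 1) { this.value -= amount; }
}

// Histogram: counts observations in cumulative buckets (value <= upper bound)
// and tracks the running sum and count of all observations.
class SimpleHistogram {
  constructor(upperBounds) {
    this.upperBounds = [...upperBounds].sort((a, b) => a - b);
    this.bucketCounts = this.upperBounds.map(() => 0);
    this.sum = 0;
    this.count = 0;
  }
  observe(v) {
    this.upperBounds.forEach((bound, i) => {
      if (v <= bound) this.bucketCounts[i] += 1;
    });
    this.sum += v;
    this.count += 1;
  }
}

const responseTimes = new SimpleHistogram([0.1, 0.5, 1, 2, 5]);
[0.05, 0.3, 0.3, 1.5, 7].forEach((v) => responseTimes.observe(v));
console.log(responseTimes.bucketCounts); // [ 1, 3, 3, 4, 4 ]
console.log(responseTimes.count);        // 5
```

The 7-second observation exceeds every bound, so it appears only in `count`. Because buckets are cumulative, a percentile can later be estimated from the bucket counts alone, which is what makes histograms cheap to aggregate across instances.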
Prometheus Example
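const express = require('express');
const prometheus = require('prom-client');

const app = express();

// Counter example: total number of handled requests
const requestCounter = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status']
});

// Histogram example: distribution of response times
const responseTime = new prometheus.Histogram({
  name: 'http_response_time_seconds',
  help: 'HTTP response time in seconds',
  buckets: [0.1, 0.5, 1, 2, 5]
});

// Measurement: time every request and count it once the response has been sent
app.use((req, res, next) => {
  const end = responseTime.startTimer();
  res.on('finish', () => {
    requestCounter.inc({
      method: req.method,
      // Note: raw paths containing IDs create one time series per ID;
      // a route pattern keeps label cardinality low
      path: req.path,
      status: res.statusCode
    });
    end();
  });
  next();
});

// Expose the collected metrics for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', prometheus.register.contentType);
  res.end(await prometheus.register.metrics());
});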
Important Metrics (RED/USE)
| Method | For | Metrics |
|---|---|---|
| RED | Services | Rate (Request rate), Errors (Error rate), Duration (Response time) |
| USE | Resources | Utilization (Usage rate), Saturation (Saturation level), Errors |
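As a rough illustration of the RED method, the sketch below derives the three values from a window of request records. The record shape (`durationMs`, `status`), the `computeRed` helper, and the choice to treat only 5xx responses as errors are assumptions made for this example. In practice these values are usually computed by the metrics backend rather than in application code.

```javascript
// Hypothetical helper: derives RED values from request records in one time window.
function computeRed(requests, windowSeconds) {
  const total = requests.length;
  const errors = requests.filter((r) => r.status >= 500).length;
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  // Nearest-rank p95: the smallest value with at least 95% of samples at or below it.
  const p95Index = Math.max(0, Math.ceil(0.95 * total) - 1);
  return {
    rate: total / windowSeconds,                   // Rate: requests per second
    errorRatio: total === 0 ? 0 : errors / total,  // Errors: fraction of failed requests
    p95DurationMs: total === 0 ? null : durations[p95Index], // Duration
  };
}

// Made-up sample data covering a 2-second window.
const sample = [
  { durationMs: 120, status: 200 },
  { durationMs: 80, status: 200 },
  { durationMs: 950, status: 500 },
  { durationMs: 60, status: 200 },
];
const red = computeRed(sample, 2);
console.log(red); // { rate: 2, errorRatio: 0.25, p95DurationMs: 950 }
```

With only four samples the p95 is simply the slowest request, a reminder that percentiles are only meaningful over a reasonably large window.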
Logs
Record event details.
Structured Logs
// Sample values so the snippet runs on its own
const userId = '123';
const ip = '192.168.1.1';

// Unstructured log (hard to search)
console.log(`User ${userId} logged in from ${ip}`);

// Structured log (easy to search and analyze)
const log = {
  timestamp: new Date().toISOString(),
  level: 'info',
  message: 'User logged in',
  userId,
  ip,
  userAgent: 'Mozilla/5.0...',
  requestId: 'req_abc123'
};
console.log(JSON.stringify(log));
Log Levels
| Level | Use |
|---|---|
| ERROR | Errors, abnormal situations |
| WARN | Warnings, potential issues |
| INFO | Important events |
| DEBUG | Debug information |
| TRACE | Detailed trace information |
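Log levels become useful once a minimum level filters what is actually written. The following is a minimal sketch of a leveled, structured logger. `createLogger` is a hypothetical helper, not a library API, and the numeric severity ordering is an assumption that mirrors the table above. A production service would more likely use an established logging library.

```javascript
// Numeric severity: a lower number means more severe.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 3, trace: 4 };

// Hypothetical helper. `write` is injectable so output can be captured or redirected.
function createLogger(minLevel = 'info', write = (line) => console.log(line)) {
  if (!Object.prototype.hasOwnProperty.call(LEVELS, minLevel)) {
    throw new Error(`unknown level: ${minLevel}`);
  }
  const logger = {};
  for (const level of Object.keys(LEVELS)) {
    logger[level] = (message, fields = {}) => {
      // Drop entries that are less severe than the configured minimum.
      if (LEVELS[level] > LEVELS[minLevel]) return;
      write(JSON.stringify({
        timestamp: new Date().toISOString(),
        level,
        message,
        ...fields,
      }));
    };
  }
  return logger;
}

// Usage: capture lines in an array instead of printing them.
const lines = [];
const logger = createLogger('info', (line) => lines.push(line));
logger.debug('cache miss', { key: 'user:123' });                          // dropped
logger.warn('slow query', { durationMs: 1200, requestId: 'req_abc123' }); // written
console.log(lines.length); // 1
```

Running with `info` in production and switching to `debug` only while investigating keeps log volume and storage cost under control.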
Log Management Stacks
| Stack | Flow |
|---|---|
| ELK | App → Filebeat → Logstash → Elasticsearch → Kibana |
| Loki | App → Promtail → Loki → Grafana |
Traces
Track the path of requests through the system.
Distributed Tracing
Trace ID: abc123
flowchart TB
    subgraph Gateway["API Gateway (50ms)"]
        Auth["Auth Service (10ms)"]
        subgraph User["User Service (30ms)"]
            DB["Database Query (15ms)"]
        end
    end
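One thing a trace makes possible is seeing where the time actually went. The sketch below models the diagram above as a flat list of spans linked by parent IDs and computes each span's self time. The `spanId` values and the `selfTimes` helper are made up for illustration, and the calculation assumes child spans run one after another without overlapping, which real traces may not satisfy.

```javascript
// Spans mirroring the diagram; span IDs are hypothetical.
const spans = [
  { spanId: 'a', parentId: null, name: 'API Gateway', durationMs: 50 },
  { spanId: 'b', parentId: 'a', name: 'Auth Service', durationMs: 10 },
  { spanId: 'c', parentId: 'a', name: 'User Service', durationMs: 30 },
  { spanId: 'd', parentId: 'c', name: 'Database Query', durationMs: 15 },
];

// Self time = a span's duration minus the time spent in its direct children.
function selfTimes(allSpans) {
  const childTotal = new Map();
  for (const span of allSpans) {
    if (span.parentId !== null) {
      const soFar = childTotal.get(span.parentId) || 0;
      childTotal.set(span.parentId, soFar + span.durationMs);
    }
  }
  return allSpans.map((span) => ({
    name: span.name,
    selfMs: span.durationMs - (childTotal.get(span.spanId) || 0),
  }));
}

const selfTimeResult = selfTimes(spans);
console.log(selfTimeResult);
// API Gateway: 10, Auth Service: 10, User Service: 15, Database Query: 15
```

Here the gateway itself accounts for only 10 ms of the 50 ms request; the rest is spent downstream, split between authentication, the user service, and its database query.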
OpenTelemetry Example
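const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-service');

// validateOrder and processPayment are application functions defined elsewhere
async function processOrder(orderId) {
  // startActiveSpan makes the span current for everything awaited inside the callback
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      await validateOrder(orderId);
      await processPayment(orderId);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.recordException(error); // attach the error to the span as an event
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end(); // always end the span, otherwise it is never exported
    }
  });
}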
Major Tools
| Tool | Features |
|---|---|
| Jaeger | CNCF project, developed by Uber |
| Zipkin | Developed by Twitter, simple |
| AWS X-Ray | AWS integration |
| Datadog APM | SaaS, rich features |
Alerting
Send notifications based on metrics.
Alert Design
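# Prometheus alerting rules example (firing alerts are routed by Alertmanager)
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error ratio is {{ $value }} over the last 5 minutes"
      - alert: SlowResponseTime
        # p95 latency estimated from the histogram buckets
        expr: histogram_quantile(0.95, sum by (le) (rate(http_response_time_seconds_bucket[5m]))) > 2
        for: 10m
        labels:
          severity: warning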
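The `for:` clause in the rules above is what keeps short spikes from paging anyone: the condition must hold continuously for the whole duration before the alert fires. The sketch below imitates that inactive, pending, firing progression in plain JavaScript. `createAlertState` is a hypothetical helper written for this illustration and is not how Prometheus is implemented internally.

```javascript
// Hypothetical evaluator that mimics the semantics of a `for:` duration.
function createAlertState(forMs) {
  let pendingSince = null; // time at which the condition first became true
  return function evaluate(conditionTrue, nowMs) {
    if (!conditionTrue) {
      pendingSince = null; // any false evaluation resets the timer
      return 'inactive';
    }
    if (pendingSince === null) pendingSince = nowMs;
    return nowMs - pendingSince >= forMs ? 'firing' : 'pending';
  };
}

const minute = 60 * 1000;
const evaluate = createAlertState(5 * minute);
const states = [
  evaluate(true, 0 * minute),  // condition just became true
  evaluate(true, 3 * minute),  // still within the 5-minute window
  evaluate(false, 4 * minute), // condition cleared, timer resets
  evaluate(true, 5 * minute),  // starts over
  evaluate(true, 10 * minute), // true for a full 5 minutes
];
console.log(states);
// [ 'pending', 'pending', 'inactive', 'pending', 'firing' ]
```

A longer duration reduces noise from events that recover on their own, at the cost of a slower notification when the problem is real.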
Preventing Alert Fatigue
Good alert conditions:
- ✓ Only for events requiring action
- ✓ Appropriate thresholds (reduce noise)
- ✓ Clear response procedures
- ✓ Escalation policies
Alerts to avoid:
- ✗ Information only (check dashboard instead)
- ✗ Events that auto-recover quickly
- ✗ Late-night pages that nobody can act on until morning
Dashboards
Grafana Dashboard Structure
| Section | Panels |
|---|---|
| Service Overview | Request Rate, Response Time (p95) |
| Health | Error Rate, Success Rate |
| Resource Utilization | CPU, Memory, Disk, Network |
| Debugging | Recent Error Logs |
Summary
Observability is essential for understanding the health of complex systems. Combining the three pillars of metrics, logs, and traces makes it easier to detect problems and investigate their root causes. Well-designed alerts and dashboards, in turn, enable rapid incident response.