
# Logs, Metrics, Traces—Oh My! Observability That Matters

## Why “It Works on Prod” Isn’t Funny Anymore
When your entire company fits on a Zoom grid, the pager often routes straight to the developer who wrote the code. Good observability means you fix bugs at 2 AM without a crystal ball—or better yet, you spot them before users do.
## Three Pillars, One Budget
| Pillar | Tooling (lean stack) | When to invest first |
|---|---|---|
| Metrics | Prometheus → Grafana | Immediate: uptime & latency |
| Logs | Loki (Grafana Cloud) | Sprint 2: debug traces |
| Traces | OpenTelemetry → Tempo | After you hit micro-service count ≥ 3 |
If money is tight, start with the Prometheus + Grafana Cloud free tier—you get 10k active series and alerting out of the box.
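If you go that route, pointing a local Prometheus at Grafana Cloud is a single `remote_write` stanza. A sketch; the endpoint URL and credentials below are placeholders, your stack's real values live in the Grafana Cloud portal:

```yaml
# prometheus.yml: forward locally scraped series to Grafana Cloud
remote_write:
  - url: https://prometheus-us-central1.grafana.net/api/prom/push  # placeholder endpoint
    basic_auth:
      username: "123456"      # your stack's instance ID (placeholder)
      password: "<api-key>"   # an API token generated in the portal
```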
## 1 | Instrument the Golden Signals
Tracking “everything” is unmaintainable; track RED (for request-serving services) plus USE (for resources):
- Rate – requests per second
- Errors – 4xx / 5xx per second
- Duration – latency percentiles
- Utilization – CPU, memory
- Saturation – queue length
- Errors – hardware or disk faults
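With Spring Boot's default Micrometer instrumentation, the three RED signals fall out of a handful of PromQL queries. A sketch, assuming the standard `http_server_requests_seconds` histogram; label names vary with your instrumentation:

```promql
# Rate: requests per second, per route
sum(rate(http_server_requests_seconds_count[5m])) by (uri)

# Errors: 5xx responses per second, per route
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (uri)

# Duration: p95 latency, per route
histogram_quantile(0.95,
  sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))
```

Note the `_bucket` series only exist if you enable histogram buckets, e.g. `management.metrics.distribution.percentiles-histogram.http.server.requests: true`.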
Below is a minimal Spring Boot Actuator configuration exposing the Prometheus endpoint (assumes the `micrometer-registry-prometheus` dependency is on the classpath):

```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus
server:
  servlet:
    context-path: /api
```
The matching Prometheus scrape job:

```yaml
- job_name: "todo-api"
  metrics_path: "/api/actuator/prometheus"
  static_configs:
    - targets: ["todo-blue:8080", "todo-green:8080"]
```
## 2 | Alert Only on What Hurts Users
Bad: `High CPU usage on node-12`

Good: `SLO violation: p95 latency > 250 ms for route /tasks`
Grafana alert rule (YAML excerpt):

```yaml
expr: histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, route)) > 0.25
for: 3m
labels:
  severity: page
annotations:
  runbook: https://runbooks.todo.com/p95-latency
```
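Under the hood, `histogram_quantile` estimates the percentile by finding the bucket that contains the target rank and interpolating linearly inside it. A simplified, stdlib-only Java sketch of that estimation (it ignores the `+Inf` bucket and other edge cases the real Prometheus engine handles):

```java
// Sketch of Prometheus-style quantile estimation from a cumulative histogram.
// bounds: bucket upper limits (seconds), ascending; counts: cumulative counts.
public class HistogramQuantile {
    static double quantile(double q, double[] bounds, double[] counts) {
        double total = counts[counts.length - 1];
        double rank = q * total;                // target observation rank
        int b = 0;
        while (counts[b] < rank) b++;           // first bucket containing the rank
        double lower = b == 0 ? 0 : bounds[b - 1];
        double prevCount = b == 0 ? 0 : counts[b - 1];
        double bucketCount = counts[b] - prevCount;
        // Linear interpolation inside the chosen bucket
        return lower + (bounds[b] - lower) * (rank - prevCount) / bucketCount;
    }

    public static void main(String[] args) {
        double[] bounds = {0.1, 0.25, 0.5, 1.0};   // 100 ms, 250 ms, 500 ms, 1 s
        double[] counts = {60, 90, 98, 100};       // cumulative observations
        System.out.printf("p95 ~ %.3f s%n", quantile(0.95, bounds, counts));
        // prints: p95 ~ 0.406 s
    }
}
```

This is why percentile accuracy depends on bucket layout: the estimate can never be more precise than the bucket it lands in.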
## 3 | Add Traces When Logs Aren’t Enough
OpenTelemetry Java agent—no code change:
```shell
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.resource.attributes=service.name=todo-api \
  -Dotel.exporter.otlp.endpoint=http://tempo:4317 \
  -jar todo-api.jar
```
Now you can click a slow request in Grafana and jump straight into its trace waterfall. The first time you catch an N+1 SQL query in 15 s, you’ll wonder how you lived without it.
## 4 | Dashboards That Fit on One Screen
| Panel | Why it earns real estate |
|---|---|
| p95 Latency (per route) | Performance health |
| Error Rate (4xx, 5xx) | Immediate alarms |
| DB Connection Pool Usage | Early deadlock signal |
| Queue Lag (Kafka/SQS) | Back-pressure insights |
| Release Marker Annotations | Correlate deploys with spikes |
Rule of thumb: fewer than six graphs; if you need to squint, you need another dashboard.
## 5 | Cost Hacks
- Sample traces at 10 %, except when the request carries `X-Debug: true`.
- Use Loki log labels sparingly: `{level="error", app="todo-api"}`. Unbounded label cardinality → $$$.
- Ship metrics to the cloud, but store logs locally on a 7-day-retention EBS volume; archive to S3 Glacier after that.
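The 10 % trace sampling above maps onto two Java-agent properties; a sketch (note that the plain ratio sampler cannot honor an `X-Debug` header on its own, so that exception needs a custom or tail-based sampler):

```shell
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.traces.sampler=parentbased_traceidratio \
  -Dotel.traces.sampler.arg=0.1 \
  -jar todo-api.jar
```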
## 6 | Post-Incident Review Template
- Timeline – UTC timestamps, who did what.
- User impact – errors served, revenue lost.
- Root cause – single sentence, no blame.
- Detection gap – why alerts fired late / never.
- Next steps – patch + monitoring upgrade PR numbers.
Keep it in your repo under /postmortems/YYYY-MM-DD-incident.md so newcomers learn fast.
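The bullets above drop straight into a file skeleton; a sketch, where the incident name, timestamps, and numbers are all made up for illustration:

```markdown
# 2025-03-14: p95 latency SLO breach on /tasks

## Timeline (UTC)
- 02:03 p95 alert fires for /tasks
- 02:11 on-call identifies connection-pool exhaustion
- 02:19 pool size raised; latency recovers

## User impact
~4 % of requests slower than 1 s for 16 minutes; no data loss.

## Root cause
Connection pool sized for pre-launch traffic.

## Detection gap
No alert on DB pool saturation, only on latency.

## Next steps
- [ ] Pool-sizing patch (PR TBD)
- [ ] Pool-usage alert (PR TBD)
```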
## Key Takeaways
- Start with metrics + alerts today; add logs and traces as your service map grows.
- Alert only on user-facing SLIs; silence the noisy CPU charts.
- Keep dashboards minimal—engineers remember pictures, not 42 graphs.
- Observability is insurance for lean teams: small investment now, big payout when prod burns.
Your future self at 2 AM will thank you for every histogram bucket you add today.