
# Logs, Metrics, Traces—Oh My! Observability That Matters

## Why “It Works on Prod” Isn’t Funny Anymore
When your entire company fits on a Zoom grid, the pager often routes straight to the developer who wrote the code. Good observability means you fix bugs at 2 AM without a crystal ball—or better yet, you spot them before users do.
## Three Pillars, One Budget
| Pillar | Tooling (lean stack) | When to invest first |
|---|---|---|
| Metrics | Prometheus → Grafana | Immediate: uptime & latency |
| Logs | Loki (Grafana Cloud) | Sprint 2: debug traces |
| Traces | OpenTelemetry → Tempo | After you hit micro-service count ≥ 3 |
If money is tight, start with the Prometheus + Grafana Cloud free tier—you get 10k active series and alerting out of the box.
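If you go that route, pointing a local Prometheus at Grafana Cloud is a single `remote_write` stanza. A sketch; the endpoint URL and credentials below are placeholders, your stack's real values live in the Grafana Cloud portal:

```yaml
# prometheus.yml: forward locally scraped series to Grafana Cloud
remote_write:
  - url: https://prometheus-us-central1.grafana.net/api/prom/push  # placeholder endpoint
    basic_auth:
      username: "123456"      # your stack's instance ID (placeholder)
      password: "<api-key>"   # an API token generated in the portal
```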
## 1 | Instrument the Golden Signals
Tracking “everything” is unmaintainable; track RED (for request-serving services) plus USE (for resources):
- Rate – requests per second
- Errors – 4xx / 5xx per second
- Duration – latency percentiles
- Utilization – CPU, memory
- Saturation – queue length
- Errors – hardware or disk faults
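With Spring Boot's default Micrometer instrumentation, the three RED signals fall out of a handful of PromQL queries. A sketch, assuming the standard `http_server_requests_seconds` histogram; label names vary with your instrumentation:

```promql
# Rate: requests per second, per route
sum(rate(http_server_requests_seconds_count[5m])) by (uri)

# Errors: 5xx responses per second, per route
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (uri)

# Duration: p95 latency, per route
histogram_quantile(0.95,
  sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))
```

Note the `_bucket` series only exist if you enable histogram buckets, e.g. `management.metrics.distribution.percentiles-histogram.http.server.requests: true`.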
Below is a minimal Spring Boot Actuator configuration exposing the Prometheus endpoint (assumes the `micrometer-registry-prometheus` dependency is on the classpath):

```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus
server:
  servlet:
    context-path: /api
```
The matching Prometheus scrape job:

```yaml
- job_name: "todo-api"
  metrics_path: "/api/actuator/prometheus"
  static_configs:
    - targets: ["todo-blue:8080", "todo-green:8080"]
```
## 2 | Alert Only on What Hurts Users
Bad: `High CPU usage on node-12`

Good: `SLO violation: p95 latency > 250 ms for route /tasks`
Grafana alert rule (YAML excerpt):

```yaml
expr: histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, route)) > 0.25
for: 3m
labels:
  severity: page
annotations:
  runbook: https://runbooks.todo.com/p95-latency
```
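Under the hood, `histogram_quantile` estimates the percentile by finding the bucket that contains the target rank and interpolating linearly inside it. A simplified, stdlib-only Java sketch of that estimation (it ignores the `+Inf` bucket and other edge cases the real Prometheus engine handles):

```java
// Sketch of Prometheus-style quantile estimation from a cumulative histogram.
// bounds: bucket upper limits (seconds), ascending; counts: cumulative counts.
public class HistogramQuantile {
    static double quantile(double q, double[] bounds, double[] counts) {
        double total = counts[counts.length - 1];
        double rank = q * total;                // target observation rank
        int b = 0;
        while (counts[b] < rank) b++;           // first bucket containing the rank
        double lower = b == 0 ? 0 : bounds[b - 1];
        double prevCount = b == 0 ? 0 : counts[b - 1];
        double bucketCount = counts[b] - prevCount;
        // Linear interpolation inside the chosen bucket
        return lower + (bounds[b] - lower) * (rank - prevCount) / bucketCount;
    }

    public static void main(String[] args) {
        double[] bounds = {0.1, 0.25, 0.5, 1.0};   // 100 ms, 250 ms, 500 ms, 1 s
        double[] counts = {60, 90, 98, 100};       // cumulative observations
        System.out.printf("p95 ~ %.3f s%n", quantile(0.95, bounds, counts));
        // prints: p95 ~ 0.406 s
    }
}
```

This is why percentile accuracy depends on bucket layout: the estimate can never be more precise than the bucket it lands in.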
## 3 | Add Traces When Logs Aren’t Enough
OpenTelemetry Java agent—no code change:
```shell
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.resource.attributes=service.name=todo-api \
  -Dotel.exporter.otlp.endpoint=http://tempo:4317 \
  -jar todo-api.jar
```
Now you can click a slow request in Grafana and jump straight into its trace waterfall. The first time you catch an N+1 SQL query in 15 s, you’ll wonder how you lived without it.
## 4 | Dashboards That Fit on One Screen
| Panel | Why it earns real estate |
|---|---|
| p95 Latency (per route) | Performance health |
| Error Rate (4xx, 5xx) | Immediate alarms |
| DB Connection Pool Usage | Early deadlock signal |
| Queue Lag (Kafka/SQS) | Back-pressure insights |
| Release Marker Annotations | Correlate deploys with spikes |
Rule of thumb: fewer than six graphs; if you need to squint, you need another dashboard.
## 5 | Cost Hacks
- Sample traces at 10 %, except when the request carries `X-Debug: true`.
- Use Loki log labels sparingly: `{level="error", app="todo-api"}`. Unbounded label cardinality → $$$.
- Ship metrics to the cloud, but store logs locally on a 7-day-retention EBS volume; archive to S3 Glacier after that.
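The 10 % trace sampling above maps onto two Java-agent properties; a sketch (note that the plain ratio sampler cannot honor an `X-Debug` header on its own, so that exception needs a custom or tail-based sampler):

```shell
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.traces.sampler=parentbased_traceidratio \
  -Dotel.traces.sampler.arg=0.1 \
  -jar todo-api.jar
```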
## 6 | Post-Incident Review Template
- Timeline – UTC timestamps, who did what.
- User impact – errors served, revenue lost.
- Root cause – single sentence, no blame.
- Detection gap – why alerts fired late / never.
- Next steps – patch + monitoring upgrade PR numbers.
Keep it in your repo under /postmortems/YYYY-MM-DD-incident.md so newcomers learn fast.
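The bullets above drop straight into a file skeleton; a sketch, where the incident name, timestamps, and numbers are all made up for illustration:

```markdown
# 2025-03-14: p95 latency SLO breach on /tasks

## Timeline (UTC)
- 02:03 p95 alert fires for /tasks
- 02:11 on-call identifies connection-pool exhaustion
- 02:19 pool size raised; latency recovers

## User impact
~4 % of requests slower than 1 s for 16 minutes; no data loss.

## Root cause
Connection pool sized for pre-launch traffic.

## Detection gap
No alert on DB pool saturation, only on latency.

## Next steps
- [ ] Pool-sizing patch (PR TBD)
- [ ] Pool-usage alert (PR TBD)
```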
## Key Takeaways
- Start with metrics + alerts today; add logs and traces as your service map grows.
- Alert only on user-facing SLIs; silence the noisy CPU charts.
- Keep dashboards minimal—engineers remember pictures, not 42 graphs.
- Observability is insurance for lean teams: small investment now, big payout when prod burns.
Your future self at 2 AM will thank you for every histogram bucket you add today.