Logs, Metrics, Traces—Oh My! Observability That Matters

Parth Shah

Why “It Works on Prod” Isn’t Funny Anymore

When your entire company fits on a Zoom grid, the pager often routes straight to the developer who wrote the code. Good observability means you fix bugs at 2 AM without a crystal ball—or better yet, you spot them before users do.


Three Pillars, One Budget

| Pillar | Tooling (lean stack) | When to invest first |
| --- | --- | --- |
| Metrics | Prometheus → Grafana | Immediate: uptime & latency |
| Logs | Loki (Grafana Cloud) | Sprint 2: debug traces |
| Traces | OpenTelemetry → Tempo | Once you run ≥ 3 microservices |

If money is tight, start with Prometheus + the Grafana Cloud free tier—you get 10k series and alerts out of the box.


1 | Instrument the Golden Signals

“Everything” is unmaintainable; track RED (per service) + USE (per resource):

  • Rate – requests per second
  • Errors – 4xx / 5xx responses per second
  • Duration – latency percentiles
  • Utilization – CPU, memory
  • Saturation – queue length
  • Errors – hardware or disk faults
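
With Spring Boot's default Micrometer instrumentation, the RED signals fall straight out of the `http_server_requests` metric. A few example PromQL queries—these assume the default `uri` label; substitute your own label if you relabel routes:

```promql
# Rate – requests per second, per route
sum(rate(http_server_requests_seconds_count[5m])) by (uri)

# Errors – 5xx responses per second, per route
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (uri)

# Duration – p95 latency, per route
histogram_quantile(0.95,
  sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))
```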

Below is a minimal Spring Boot actuator scrape:

management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus
server:
  servlet:
    context-path: /api

Prometheus job:

- job_name: "todo-api"
  metrics_path: "/api/actuator/prometheus"
  static_configs:
    - targets: ["todo-blue:8080","todo-green:8080"]

2 | Alert Only on What Hurts Users

Bad:

High CPU usage on node-12

Good:

SLO violation: p95 latency > 250 ms for route /tasks

Grafana alert rule YAML:

expr: histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, route)) > 0.25
for: 3m
labels:
  severity: page
annotations:
  runbook: https://runbooks.todo.com/p95-latency

3 | Add Traces When Logs Aren’t Enough

Attach the OpenTelemetry Java agent—no code changes required:

java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.resource.attributes=service.name=todo-api \
  -Dotel.exporter.otlp.endpoint=http://tempo:4317 \
  -jar todo-api.jar

Now you can click a slow request in Grafana and jump straight into its trace waterfall. The first time you catch an N+1 SQL query in 15 s, you’ll wonder how you lived without it.


4 | Dashboards That Fit on One Screen

| Panel | Why it earns real estate |
| --- | --- |
| p95 Latency (per route) | Performance health |
| Error Rate (4xx, 5xx) | Immediate alarms |
| DB Connection Pool Usage | Early deadlock signal |
| Queue Lag (Kafka/SQS) | Back-pressure insights |
| Release Marker Annotations | Correlate deploys with spikes |

Rule of thumb: < 6 graphs; if you need to squint, you need another dashboard.
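
Those release markers don't have to be drawn by hand: a CI step can push one through Grafana's annotations HTTP API. A sketch—the host and the `$GRAFANA_TOKEN` service-account token are placeholders for your own setup:

```shell
# Post a deploy annotation; Grafana timestamps it "now" if no time is given.
curl -s -X POST "https://grafana.example.com/api/annotations" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tags": ["deploy", "todo-api"], "text": "Release v1.4.2"}'
```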


5 | Cost Hacks

  • Sample traces at 10 %—except when the request carries X-Debug: true.
  • Use Loki log labels sparingly: {level="error", app="todo-api"}. Unbounded label cardinality → $$$.
  • Ship metrics to the cloud, but store logs locally on a 7-day-retention EBS volume; archive to S3 Glacier after that.
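
The 10 % sampling itself is a one-flag change on the OTel Java agent, using its standard configuration properties (the X-Debug exception would need a custom or rule-based sampler on top—not shown here):

```shell
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.traces.sampler=parentbased_traceidratio \
  -Dotel.traces.sampler.arg=0.10 \
  -jar todo-api.jar
```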

6 | Post-Incident Review Template

  1. Timeline – UTC timestamps, who did what.
  2. User impact – errors served, revenue lost.
  3. Root cause – single sentence, no blame.
  4. Detection gap – why alerts fired late / never.
  5. Next steps – patch + monitoring upgrade PR numbers.

Keep it in your repo under /postmortems/YYYY-MM-DD-incident.md so newcomers learn fast.
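
A tiny shell helper—illustrative, matching the filename convention above—keeps every postmortem starting from the same five headings:

```shell
# Scaffold today's postmortem with the five template sections.
mkdir -p postmortems
f="postmortems/$(date -u +%F)-incident.md"
printf '# Incident %s\n\n## Timeline\n\n## User impact\n\n## Root cause\n\n## Detection gap\n\n## Next steps\n' \
  "$(date -u +%F)" > "$f"
echo "Created $f"
```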


Key Takeaways

  • Start with metrics + alerts today; add logs and traces as your service map grows.
  • Alert only on user-facing SLIs; silence the noisy CPU charts.
  • Keep dashboards minimal—engineers remember pictures, not 42 graphs.
  • Observability is insurance for lean teams: small investment now, big payout when prod burns.

Your future self at 2 AM will thank you for every histogram bucket you add today.

Observability · Monitoring · Startups