Logs, Metrics, Traces, Oh My! Observability That Matters

Parth Shah

Why “It Works on Prod” Isn’t Funny Anymore

When your entire company fits on a Zoom grid, the pager often routes straight to the developer who wrote the code. Good observability means you can fix bugs at 2 AM without a crystal ball, or better yet, spot them before users do.


Three Pillars, One Budget

Pillar  | Tooling (lean stack)  | When to invest first
Metrics | Prometheus → Grafana  | Immediate: uptime & latency
Logs    | Loki (Grafana Cloud)  | Sprint 2: debug traces
Traces  | OpenTelemetry → Tempo | After you hit micro-service count ≥ 3

If money is tight, start with Prometheus plus the Grafana Cloud free tier; you get 10k series and alerts out of the box.
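
As a rough sketch of that lean setup, a local Prometheus can forward series to the Grafana Cloud free tier via remote_write; the URL, username, and API key below are placeholders for your own stack's values:

# prometheus.yml (fragment) – push locally scraped series to Grafana Cloud
remote_write:
  - url: https://prometheus-prod-XX.grafana.net/api/prom/push
    basic_auth:
      username: "<grafana-cloud-instance-id>"
      password: "<grafana-cloud-api-key>"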


1 | Instrument the Golden Signals

“Everything” is unmaintainable; track RED + USE instead (PromQL sketch after the list):

  • Rate – requests per second
  • Errors – 4xx / 5xx per second
  • Duration – latency percentiles
  • Utilization – CPU, memory
  • Saturation – queue length
  • Errors (USE) – hardware or disk faults
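
As a rough illustration, here is what the RED half looks like in PromQL against the http_server_requests_seconds metrics Micrometer exposes for Spring Boot (it assumes histogram buckets are enabled for http.server.requests; label names may differ in your setup):

# Rate – requests per second
sum(rate(http_server_requests_seconds_count[5m]))

# Errors – 5xx per second
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))

# Duration – p95 latency
histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le))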

Below is a minimal Spring Boot Actuator config (assuming the micrometer-registry-prometheus dependency is on the classpath) that exposes the Prometheus scrape endpoint:

management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus   # expose only the endpoints monitoring needs
server:
  servlet:
    context-path: /api                       # scrape path becomes /api/actuator/prometheus

Prometheus job:

- job_name: "todo-api"
  metrics_path: "/api/actuator/prometheus"
  static_configs:
    - targets: ["todo-blue:8080","todo-green:8080"]

2 | Alert Only on What Hurts Users

Bad:

High CPU usage on node-12

Good:

SLO violation: p95 latency > 250 ms for route /tasks

Grafana alert rule (YAML):

expr: histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, route)) > 0.25
for: 3m
labels:
  severity: page
annotations:
  runbook: https://runbooks.todo.com/p95-latency
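
If you keep alerts in Prometheus rule files instead of the Grafana UI, the same expression slots into a rule group like this (group and alert names are illustrative):

groups:
  - name: todo-api-slo
    rules:
      - alert: P95LatencyHigh
        expr: histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, route)) > 0.25
        for: 3m
        labels:
          severity: page
        annotations:
          runbook: https://runbooks.todo.com/p95-latency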

3 | Add Traces When Logs Aren’t Enough

OpenTelemetry Java agent, no code changes needed:

java -javaagent:opentelemetry-javaagent.jar \
     -Dotel.resource.attributes=service.name=todo-api \
     -Dotel.exporter.otlp.endpoint=http://tempo:4317 \
     -jar todo-api.jar

Now you can click a slow request in Grafana and jump straight into its trace waterfall. The first time you catch an N+1 SQL query in 15 s, you’ll wonder how you lived without it.


4 | Dashboards That Fit on One Screen

Panel                      | Why it earns real estate
p95 Latency (per route)    | Performance health
Error Rate (4xx, 5xx)      | Immediate alarms
DB Connection Pool Usage   | Early deadlock signal
Queue Lag (Kafka/SQS)      | Back-pressure insights
Release Marker Annotations | Correlate deploys with spikes

Rule of thumb: < 6 graphs; if you need to squint, you need another dashboard.
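
The Release Marker Annotations panel only earns its spot if something actually posts deploy events; a CI step can do that with one call to Grafana's annotations API (host and token are placeholders, and the tags are just an example):

curl -X POST https://grafana.example.com/api/annotations \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tags": ["deploy", "todo-api"], "text": "deploy finished"}'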


5 | Cost Hacks

  • Sample traces at 10 %, except when X-Debug: true is set (env-var sketch after this list).
  • Use Loki log labels sparingly, e.g. {level="error", app="todo-api"}; unbounded label cardinality gets expensive fast.
  • Ship metrics to the cloud, but store logs locally on an EBS volume with 7-day retention; archive to S3 Glacier after that.
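
For the flat 10 % part of trace sampling, the standard OpenTelemetry SDK environment variables are enough (a sketch; honoring X-Debug: true needs a custom sampler or tail-based sampling in an OpenTelemetry Collector, not shown here):

# sample 10 % of new traces; child spans follow the parent's decision
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1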

6 | Post-Incident Review Template

  1. Timeline – UTC timestamps, who did what.
  2. User impact – errors served, revenue lost.
  3. Root cause – single sentence, no blame.
  4. Detection gap – why alerts fired late / never.
  5. Next steps – patch + monitoring upgrade PR numbers.

Keep it in your repo under /postmortems/YYYY-MM-DD-incident.md so newcomers learn fast.
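
A skeleton for that file, mirroring the numbered sections above, might look like:

# YYYY-MM-DD – <incident title>

## Timeline (UTC)
- HH:MM – <who did what>

## User impact
## Root cause
## Detection gap
## Next steps
- patch PR: #
- monitoring upgrade PR: #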


Key Takeaways

  • Start with metrics + alerts today; add logs and traces as your service map grows.
  • Alert only on user-facing SLIs; silence the noisy CPU charts.
  • Keep dashboards minimal; engineers remember pictures, not 42 graphs.
  • Observability is insurance for lean teams: small investment now, big payout when prod burns.

Your future self at 2 AM will thank you for every histogram bucket you add today.

Observability · Monitoring · Startups