
Observability — Logs, Metrics, and Traces Explained

Visual guide to observability for production systems. Understand the three pillars, RED/USE methods, OpenTelemetry, and modern observability stacks.

“The service is slow.” Three words that start every production investigation. But without observability, you’re guessing. Is it the database? The network? A downstream dependency? Memory pressure? You check logs — nothing obvious. You check a dashboard — CPU looks fine. You’re flying blind.

Observability isn’t about collecting data. It’s about being able to ask arbitrary questions about your system’s behavior and getting answers — without deploying new code or adding new instrumentation.

1. The Three Pillars

Logs, metrics, and traces. Each answers a different question, and no single pillar gives the complete picture. Logs without metrics mean you can’t see trends. Metrics without traces mean you can’t isolate bottlenecks. Traces without logs mean you can’t see the details of what actually went wrong.

📝 Logs
Discrete events: "User 123 logged in at 14:32:07"
Use for: debugging specific errors, audit trails, compliance
Tools: ELK, Loki, CloudWatch Logs

📊 Metrics
Numeric time-series: "request_count=1,247 at 14:32"
Use for: dashboards, alerting, capacity planning, SLO tracking
Tools: Prometheus, Datadog, CloudWatch Metrics

🔗 Traces
Request journey: "API → Auth → DB → Cache → Response (342ms)"
Use for: latency debugging, bottleneck identification, distributed systems
Tools: Jaeger, Tempo, Honeycomb, Datadog APM

Logs tell you WHAT happened. Metrics tell you HOW MUCH. Traces tell you WHERE and HOW LONG. You need all three to debug production issues effectively.

The correlation between them is what creates observability. A high error rate metric triggers an alert. The alert links to traces showing slow database calls. The traces link to logs showing connection pool exhaustion. From alert to root cause in 3 clicks — that’s the goal.
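That linkage is cheap to build. Below is a minimal sketch in Python, assuming the opentelemetry-api package, that stamps every log line with the active trace and span IDs so a log search can jump straight to the matching trace. The "checkout" logger name and the field names are illustrative.

```python
import logging

from opentelemetry import trace


class TraceContextFilter(logging.Filter):
    """Attach the current OpenTelemetry trace/span IDs to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        # Hex-encode to match what tracing backends display (32/16 chars).
        record.trace_id = format(ctx.trace_id, "032x")
        record.span_id = format(ctx.span_id, "016x")
        return True


handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("connection pool exhausted, waited 5000ms for a free connection")
```

With the IDs in place, a log backend can render trace_id as a link into the tracing UI (in Grafana, via a derived field on the Loki data source), which is what collapses the alert-to-root-cause path to a few clicks.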

2. What to Measure

“Instrument everything” is bad advice. You end up with 50,000 metrics that cost a fortune to store and nobody looks at. Instead, use frameworks: RED for your services, USE for your infrastructure. These cover 90% of debugging needs with minimal instrumentation.

RED and USE — Two Methods for What to Measure

RED (for services)
Rate — requests per second
Errors — failed requests per second
Duration — latency histogram (p50, p95, p99)
Use for: APIs, microservices, web servers

USE (for resources)
Utilization — % of resource capacity being used
Saturation — queue depth, backlog
Errors — hardware/resource error count
Use for: CPU, memory, disk, network (see the sketch after this list)
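Host-level USE metrics usually come from an off-the-shelf agent such as Prometheus’s node_exporter rather than hand-written code. Still, the method applies to any resource, and a hand-rolled collector is only a few lines. The sketch below assumes the psutil and prometheus_client packages; the metric names are illustrative, not standard.

```python
import time

import psutil
from prometheus_client import Gauge, start_http_server

# Utilization: how busy the resource is, as a percentage of capacity.
CPU_UTILIZATION = Gauge("cpu_utilization_percent", "CPU busy time, percent")
DISK_UTILIZATION = Gauge("disk_utilization_percent", "Disk space used, percent")
# Saturation: how much work is queued beyond capacity.
CPU_SATURATION = Gauge("cpu_load_per_core", "1-minute load average per core")


def collect() -> None:
    CPU_UTILIZATION.set(psutil.cpu_percent(interval=None))
    DISK_UTILIZATION.set(psutil.disk_usage("/").percent)
    load1, _, _ = psutil.getloadavg()
    CPU_SATURATION.set(load1 / psutil.cpu_count())


if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes /metrics on this port
    while True:
        collect()
        time.sleep(15)
```

Hand-rolled collectors like this earn their keep on resources the standard exporters can’t see: connection pools, worker queues, internal caches.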

The practical starting point: instrument your API gateway with RED metrics (request rate, error rate, p50/p95/p99 latency). That single data source tells you when something is wrong. Then drill into traces to find where. Then drill into logs to find why. Start broad, go narrow.
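Here is what that looks like in code: a minimal RED sketch for a single Flask service, assuming the flask and prometheus_client packages. Metric and label names are illustrative.

```python
import time

from flask import Flask, g, request
from prometheus_client import Counter, Histogram, make_wsgi_app
from werkzeug.middleware.dispatcher import DispatcherMiddleware

app = Flask(__name__)

# Rate and Errors both derive from this counter: rate() over all requests,
# and rate() filtered to status=~"5.." for the error rate.
REQUESTS = Counter("http_requests_total", "Requests served", ["path", "status"])
# Duration: histogram_quantile() over these buckets yields p50/p95/p99.
DURATION = Histogram("http_request_duration_seconds", "Request latency", ["path"])


@app.before_request
def start_timer():
    g.start = time.perf_counter()


@app.after_request
def record_red(response):
    REQUESTS.labels(request.path, str(response.status_code)).inc()
    DURATION.labels(request.path).observe(time.perf_counter() - g.start)
    return response


@app.get("/orders")
def orders():
    return {"orders": []}


# Expose /metrics alongside the app for Prometheus to scrape.
app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {"/metrics": make_wsgi_app()})
```

In PromQL, `rate(http_requests_total{status=~"5.."}[5m])` gives the error rate and `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` gives p99, so these two metric definitions cover all of R, E, and D.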

3. The Stack — Build vs Buy

The tooling landscape has consolidated. OpenTelemetry won the instrumentation war — it’s the only standard you need to learn. For the backend, you choose between self-hosted open source (Prometheus + Loki + Tempo + Grafana) or SaaS (Datadog, Honeycomb, New Relic).

The Modern Stack — 2026

Layer      | OSS Option           | SaaS Option
---------- | -------------------- | --------------------
Collection | OpenTelemetry        | OpenTelemetry
Metrics    | Prometheus + Grafana | Datadog / New Relic
Logs       | Loki + Grafana       | Datadog / Splunk
Traces     | Tempo + Grafana      | Honeycomb / Datadog
Alerting   | Alertmanager         | PagerDuty / OpsGenie
The convergence: OpenTelemetry is the universal standard for instrumentation. Regardless of backend choice, instrument with OTel. You can swap backends without re-instrumenting code.
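To make “swap backends without re-instrumenting” concrete, here is a minimal tracing setup, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages. The service name, span names, and endpoint are placeholders.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(BatchSpanProcessor(
    # The endpoint is the only backend-specific line: point it at Tempo,
    # Honeycomb, or an OpenTelemetry Collector in front of Datadog.
    OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")

with tracer.start_as_current_span("place-order") as span:
    span.set_attribute("order.id", "o-123")
    with tracer.start_as_current_span("charge-card"):
        pass  # the downstream call would go here
```

The application code touches only the opentelemetry API; the exporter and endpoint live in setup, which is why changing vendors is a configuration change rather than a re-instrumentation project.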

The honest tradeoff: SaaS is easier but expensive at scale. Datadog bills by host and log volume — at 100+ hosts, costs hit $50K+/year. The Grafana stack (LGTM: Loki, Grafana, Tempo, Mimir) is free but requires operational expertise. If you have a platform team, self-host. If you’re a small team that needs answers fast, buy SaaS and revisit when the bill becomes painful.