
observability


Observability discipline: structured logging, metrics instrumentation, distributed tracing, and signal correlation. Invoke whenever a task touches observability concerns: adding logging, designing metrics, instrumenting traces, correlating signals, reviewing instrumentation, or deciding which pillar to use.



# Observability

**If you cannot ask arbitrary questions about your system's behavior from the outside, your system is not observable —
it is merely monitored.**

---

## The Three Pillars

<pillars>

### Logs — What Happened

Logs are timestamped, discrete event records. They capture **what happened** at a specific moment: an error thrown, a
user action, a configuration loaded, a connection refused.

**Use logs when you need:**

- Rich diagnostic context for a specific event
- Debugging information with full error details and stack traces
- Audit trails of who did what and when
- Record of discrete state transitions

**Logs are poor at:**

- Showing aggregate system health (use metrics)
- Tracing request flow across services (use traces)
- High-frequency numeric trends (too expensive at volume; use metrics)

### Metrics — How Is It Doing

Metrics are numeric measurements aggregated over time. They capture **how the system is performing** as quantitative
time series: request rates, error percentages, latencies, queue depths, resource utilization.

**Use metrics when you need:**

- Real-time health signals and alerting
- Trend analysis over hours, days, weeks
- Capacity planning and saturation monitoring
- Pre-aggregated data that scales cheaply regardless of traffic

**Metrics are poor at:**

- Explaining _why_ something is broken (use logs)
- Showing the path of a single request (use traces)
- Storing per-event detail (per-event label values cause cardinality explosion)

### Traces — How Did It Flow

Traces record the causal chain of operations that make up a single request as it propagates through distributed
components. A trace is a tree of **spans**, where each span represents one unit of work (an HTTP call, a database query,
a queue publish).
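As a minimal sketch of the "tree of spans" idea, the following models a trace in plain Python and finds the span with the most self time. The `Span` class, its field names, and the sample timings are hypothetical and are not taken from any tracing library; the self-time calculation assumes child spans run sequentially without overlapping.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One unit of work inside a trace (hypothetical minimal model)."""
    name: str
    start_ms: float
    end_ms: float
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

    def self_time_ms(self) -> float:
        # Time spent in this span that is not covered by its direct children.
        # Assumes children run sequentially and do not overlap.
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def slowest_span(root: Span) -> Span:
    """Return the span with the largest self time anywhere in the tree."""
    best = root
    stack = [root]
    while stack:
        span = stack.pop()
        if span.self_time_ms() > best.self_time_ms():
            best = span
        stack.extend(span.children)
    return best


# Made-up timings for one request through a checkout flow.
trace = Span("POST /checkout", 0, 120, [
    Span("auth-service.verify", 5, 15),
    Span("db.query orders", 20, 100),
    Span("queue.publish", 102, 110),
])
bottleneck = slowest_span(trace)
print(bottleneck.name, bottleneck.duration_ms)
```

Real tracing systems also carry trace and span IDs so that spans emitted by different processes can be reassembled into one tree; that part is omitted here.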

**Use traces when you need:**

- End-to-end latency breakdown across services
- Dependency mapping and bottleneck identification
- Understanding the path a failing request took
- Correlating work across process and network boundaries

**Traces are poor at:**

- Aggregate health monitoring (use metrics)
- Detailed per-event diagnostics on a single node (use logs)
- Cheap, long-term trend storage (traces are expensive at 100% sampling)

</pillars>

### Choosing the Right Signal

- **"Is the system healthy right now?"** — Metrics
- **"Why did this specific request fail?"** — Traces + Logs
- **"What happened at 03:14 on node-7?"** — Logs
- **"Where is the bottleneck in checkout flow?"** — Traces
- **"Are error rates increasing over the last hour?"** — Metrics
- **"What was the full stack trace of that exception?"** — Logs
- **"Which downstream service is slow?"** — Traces
- **"How much headroom does the database have?"** — Metrics

---

## Structured Logging

<structured-logging>

### Always Structured

Emit logs as structured records (JSON or equivalent key-value format) with a consistent schema. Unstructured string logs
are for local development only. Structured logs are machine-parseable, indexable, and filterable at scale.

### Log Levels

Use levels consistently. Agree on what each level means across the team.

- **FATAL/CRITICAL** — Process cannot continue; about to crash. Alerting: Page immediately
- **ERROR** — Operation failed; requires investigation. Alerting: Alert / ticket
- **WARN** — Unexpected condition; system compensated. Alerting: Monitor trend
- **INFO** — Significant business or lifecycle event. Alerting: Dashboard
- **DEBUG** — Diagnostic detail for developers. Alerting: None; disabled in production by default
- **TRACE** — Extremely verbose step-by-step flow. Alerting: None; disabled in production

Rules:

- Production defaults to INFO or above. DEBUG/TRACE are off unless explicitly enabled for a bounded investigation
  window.
- WARN is not a dumping ground. If it never leads to action, it is noise — downgrade to DEBUG or remove it.
- ERROR means something is broken. Expected conditions (404 for missing resources, validation failures from bad input)
  are not errors — log at INFO with a status field.
- Log level must be configurable at runtime without restarts.
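One way to make the level adjustable at runtime is a small function that an operator-facing hook can call. This is only a sketch using Python's standard `logging` module; the `set_log_level` helper, the `LEVELS` table, and the `checkout` logger name are assumptions made for illustration.

```python
import logging

# Accepted level names (hypothetical table; WARN is kept as an alias).
LEVELS = {
    "CRITICAL": logging.CRITICAL,
    "ERROR": logging.ERROR,
    "WARN": logging.WARNING,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}


def set_log_level(logger: logging.Logger, level_name: str) -> int:
    """Change the threshold of a live logger; no restart is needed."""
    try:
        level = LEVELS[level_name.upper()]
    except KeyError:
        raise ValueError(f"unknown log level: {level_name!r}") from None
    logger.setLevel(level)
    return level


logger = logging.getLogger("checkout")  # hypothetical service logger
set_log_level(logger, "INFO")           # production default
```

In a real service this function would typically be wired to a signal handler, an admin endpoint, or a config watcher, and the level would be reverted to INFO once the investigation window closes.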

### Structured Fields

Every log record should include these baseline fields:

- `timestamp`: ISO 8601, UTC
- `level`: Severity (ERROR, WARN, INFO, ...)
- `message`: Human-readable summary of the event
- `service`: Service name emitting the log
- `version`: Service version / build / commit SHA
- `trace_id`: Distributed trace ID (if in request context)
- `span_id`: Current span ID (if in request context)

Add contextual fields relevant to the event:

- `user_id`: User-initiated actions
- `request_id`: Per-request correlation
- `duration_ms`: Timed operations
- `error.type`: Error class/name
- `error.message`: Error description
- `error.stack`: Stack trace (ERROR and FATAL levels only)
- `http.method`, `http.path`, `http.status`: HTTP request/response
- `db.operation`, `db.duration_ms`: Database calls
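The baseline and contextual fields above could be emitted with a custom formatter on Python's standard `logging` module. This is a sketch under stated assumptions: the `JsonFormatter` class, the `SERVICE` and `VERSION` constants, and the convention of passing contextual fields through `extra={"fields": {...}}` are all illustrative choices, not an established API.

```python
import io
import json
import logging
from datetime import datetime, timezone

SERVICE = "checkout"  # hypothetical service name
VERSION = "1.4.2"     # hypothetical build version


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": SERVICE,
            "version": VERSION,
        }
        # Contextual fields arrive through extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        if record.exc_info and record.exc_info[0] is not None:
            payload["error.type"] = record.exc_info[0].__name__
            payload["error.message"] = str(record.exc_info[1])
            payload["error.stack"] = self.formatException(record.exc_info)
        return json.dumps(payload)


stream = io.StringIO()  # swap for sys.stdout in a real service
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout.json")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

logger.info(
    "order placed",
    extra={"fields": {"trace_id": "abc123", "duration_ms": 42}},
)
```

Keeping contextual fields under a single `fields` key avoids collisions with the attribute names `LogRecord` already reserves, such as `message` or `name`.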

### Sensitive Data

Never log:

- Passwords, tokens, API keys, secrets
- Full credit card numbers, SSNs, or equivalent PII
- Session tokens or authentication cookies
- Request/response bodies containing user-submitted personal data

When user identifiers are needed, log opaque IDs (user_id), not email addresses or names. If regulations (GDPR, HIPAA)
apply, verify logged fields comply. When in doubt, omit the field.
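A denylist-based scrubber can act as a safety net in front of the logger. The sketch below is illustrative only: the `redact` helper and the `SENSITIVE_KEYS` set are hypothetical, matching is by exact key name, and it does not inspect free-text values.

```python
# Key names treated as sensitive (hypothetical; extend per service).
SENSITIVE_KEYS = {
    "password", "token", "api_key", "secret",
    "authorization", "cookie", "ssn", "card_number",
}


def redact(fields: dict) -> dict:
    """Return a copy with values of sensitive keys masked, recursing into dicts."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


safe = redact({
    "user_id": "u-1842",
    "password": "hunter2",
    "http": {"method": "POST", "Cookie": "session=abc"},
})
```

A scrubber like this catches accidents; it does not replace the rule of leaving sensitive fields out of log calls in the first place.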

### Logging at Boundaries

**At application startup:**

- INFO: service name, version, loaded configuration (without secrets), listen address
- WARN: degraded mode (e.g., fallback to local cache because Redis is unreachable)
- ERROR/FATAL: unrecoverable startup failures

**Per incoming request:**

- INFO: method, path (scrubbed of PII), status code, duration, request dimensions (tenant, region)
- WARN/ERROR: only for unexpected exceptions; catch at the top-level handler

**Per outgoing dependency call:**

- INFO or DEBUG: target service, operation, status, duration
- ERROR: failures in dependent services (Redis, database, queue, etc.)
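The per-request guidance can be sketched as a thin wrapper around a request handler. The `serve` function, the `http` logger name, and the `fields` convention for structured context are assumptions for illustration; a real framework would supply its own middleware hook.

```python
import logging
import time

logger = logging.getLogger("http")  # hypothetical logger name


def serve(handler, method: str, path: str, tenant: str) -> int:
    """Run `handler` and emit exactly one INFO access record for the request."""
    start = time.perf_counter()
    try:
        status = handler()
    except Exception:
        # Unexpected exception: logged once here, at the top-level handler.
        logger.exception("unhandled exception", extra={"fields": {"http.path": path}})
        status = 500
    duration_ms = round((time.perf_counter() - start) * 1000, 2)
    logger.info("request completed", extra={"fields": {
        "http.method": method,
        "http.path": path,
        "http.status": status,
        "duration_ms": duration_ms,
        "tenant": tenant,
    }})
    return status
```

A successful request produces a single INFO record; a request that raises produces one ERROR record with the stack trace followed by the same INFO access record carrying status 500.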

### Log Once, at the Right Level

Log a raised exception **once**. Do not catch-log-rethrow at every layer. Let exceptions propagate to the top-level
handler, which logs with full context. Log and rethrow only when adding context that would otherwise be lost.
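The pattern can be sketched as follows: the inner layer adds context by wrapping the exception and re-raising without logging, and only the top-level handler logs. `PaymentError`, `charge`, `handle_request`, and the `gateway` callable are hypothetical names used for illustration.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name


class PaymentError(Exception):
    """Domain error that carries the order context."""


def charge(order_id: str, gateway):
    try:
        return gateway(order_id)
    except ConnectionError as exc:
        # Add context that would otherwise be lost, then re-raise.
        # No logging here: the top-level handler owns that.
        raise PaymentError(f"charge failed for order {order_id}") from exc


def handle_request(order_id: str, gateway) -> dict:
    """Top-level handler: the single place where the exception is logged."""
    try:
        return {"status": 200, "receipt": charge(order_id, gateway)}
    except Exception:
        # logger.exception records the full chained traceback once.
        logger.exception("request failed", extra={"order_id": order_id})
        return {"status": 500}
```

Because `raise ... from exc` chains the original exception, the single log record still contains the underlying `ConnectionError` in its traceback.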

</structured-logging>

---

## Metrics

<metrics>

### Metric Types

- **Counter** — Monotonically increasing; resets on restart. Use for totals: requests, errors, bytes sent
- **Gauge** — Arbitrary value; goes up and down. Use for snapshots: queue depth, memory usage, connections
- **Histogram** — Client-side aggregation into buckets. Use for distributions: request latency, payload size
- **Summary** — Client-side quantile calculation. Use for pre-computed percentiles (less flexible than histogram)

Rules:

- Use counters for events that accumulate. Derive rates with `rate()` / `increase()` — never store pre-computed rates.
- Use gauges for current-state snapshots. Never `rate()` a gauge.
- Use histograms for latency and size distributions. Histogram bucket counts can be summed across instances before
  computing percentiles; summary quantiles are computed per instance and cannot be meaningfully aggregated.
- Export timestamps as Unix epoch seconds, not "time since" values. Compute elapsed time at query time.
- Initialize all metrics with zero at startup to avoid missing-metric problems.
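To illustrate why histograms aggregate across instances, the sketch below sums cumulative bucket counts from two instances and estimates a quantile by linear interpolation inside the target bucket. This is similar in spirit to what Prometheus's `histogram_quantile` does, but it is a simplified stand-in, not that function's implementation; the bucket bounds and counts are made up.

```python
INF = float("inf")
BOUNDS = [0.1, 0.5, 1.0, INF]  # bucket upper bounds in seconds (hypothetical)


def merge(*instances):
    """Sum cumulative bucket counts element-wise across instances."""
    return [sum(counts) for counts in zip(*instances)]


def quantile(q, bounds, cumulative):
    """Estimate the q-quantile by linear interpolation inside the target bucket."""
    total = cumulative[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper, count in zip(bounds, cumulative):
        if count >= rank:
            in_bucket = count - lower_count
            if upper == INF or in_bucket == 0:
                # Cannot interpolate into the +Inf bucket or an empty one.
                return lower_bound
            width = upper - lower_bound
            return lower_bound + width * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper, count
    return lower_bound


# Cumulative counts per bucket, one list per instance.
instance_a = [50, 90, 100, 100]
instance_b = [10, 40, 90, 100]

merged = merge(instance_a, instance_b)
p50 = quantile(0.5, BOUNDS, merged)
p90 = quantile(0.9, BOUNDS, merged)
```

The same merge is not possible with summaries: averaging two per-instance p90 values does not yield the fleet-wide p90.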

### What to Measure

#### The Four Golden Signals (Google SRE)

For every user-facing service, measure these four:

- **Latency** — Time to serve a request. Example: `http_request_duration_seconds` histogram
- **Traffic** — Demand on the system. Example: `http_requests_total` counter by method/path
- **Errors** — Rate of failed requests. Example: `http_requests_total{status=~"5.."}`
- **Saturation** — How "full" the service is. Example: CPU usage, memory, queue depth, thread pool

Distinguish **successful latency from error latency**. A fast 500 is not good latency. Track both.

#### RED Method (Request-Centric)

For every microservice:

- **R**ate — requests per second
- **E**rrors — failed requests per second
- **D**uration — distribution of request latency
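Rate and errors can be derived from two scrapes of raw counters rather than stored as pre-computed rates. The sketch below uses a hypothetical `Sample` shape and made-up numbers; duration is left out because it would come from a latency histogram rather than from counters.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Counter values scraped at one instant (hypothetical shape)."""
    ts: float            # Unix epoch seconds
    requests_total: int
    errors_total: int


def increase(prev: int, curr: int) -> int:
    # A counter that went down was reset by a restart; count from zero.
    return curr - prev if curr >= prev else curr


def red(prev: Sample, curr: Sample) -> dict:
    """Derive request rate and error rate over the window between two scrapes."""
    window = curr.ts - prev.ts
    reqs = increase(prev.requests_total, curr.requests_total)
    errs = increase(prev.errors_total, curr.errors_total)
    return {
        "rate_per_s": reqs / window,
        "errors_per_s": errs / window,
        "error_ratio": errs / reqs if reqs else 0.0,
    }


result = red(
    Sample(ts=1000.0, requests_total=5000, errors_total=50),
    Sample(ts=1060.0, requests_total=5600, errors_total=56),
)
```

Treating a decrease as a reset mirrors how counter-based rate functions typically cope with process restarts.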

RED is a focused subset of the golden signals, optimized for request-dr
