observability

Observability discipline: structured logging, metrics instrumentation, distributed tracing, and signal correlation. Invoke whenever a task touches observability concerns: adding logging, designing metrics, instrumenting traces, correlating signals, reviewing instrumentation, or deciding which pillar to use.

# Observability

**If you cannot ask arbitrary questions about your system's behavior from the outside, your system is not observable —
it is merely monitored.**

---

## The Three Pillars

<pillars>

### Logs — What Happened

Logs are timestamped, discrete event records. They capture **what happened** at a specific moment: an error thrown, a
user action, a configuration loaded, a connection refused.

**Use logs when you need:**

- Rich diagnostic context for a specific event
- Debugging information with full error details and stack traces
- Audit trails of who did what and when
- Record of discrete state transitions

**Logs are poor at:**

- Showing aggregate system health (use metrics)
- Tracing request flow across services (use traces)
- High-frequency numeric trends (too expensive at volume)

### Metrics — How Is It Doing

Metrics are numeric measurements aggregated over time. They capture **how the system is performing** as quantitative
time series: request rates, error percentages, latencies, queue depths, resource utilization.

**Use metrics when you need:**

- Real-time health signals and alerting
- Trend analysis over hours, days, weeks
- Capacity planning and saturation monitoring
- Pre-aggregated data that scales cheaply regardless of traffic

**Metrics are poor at:**

- Explaining _why_ something is broken (use logs)
- Showing the path of a single request (use traces)
- Storing per-event detail (cardinality explosion)

### Traces — How Did It Flow

Traces record the causal chain of operations that make up a single request as it propagates through distributed
components. A trace is a tree of **spans**, where each span represents one unit of work (an HTTP call, a database query,
a queue publish).

**Use traces when you need:**

- End-to-end latency breakdown across services
- Dependency mapping and bottleneck identification
- Understanding the path a failing request took
- Correlating work across process and network boundaries

**Traces are poor at:**

- Aggregate health monitoring (use metrics)
- Detailed per-event diagnostics on a single node (use logs)
- Cheap, long-term trend storage (traces are expensive at 100% sampling)

</pillars>

### Choosing the Right Signal

- **"Is the system healthy right now?"** — Metrics
- **"Why did this specific request fail?"** — Traces + Logs
- **"What happened at 03:14 on node-7?"** — Logs
- **"Where is the bottleneck in checkout flow?"** — Traces
- **"Are error rates increasing over the last hour?"** — Metrics
- **"What was the full stack trace of that exception?"** — Logs
- **"Which downstream service is slow?"** — Traces
- **"How much headroom does the database have?"** — Metrics

---

## Structured Logging

<structured-logging>

### Always Structured

Emit logs as structured records (JSON or equivalent key-value format) with a consistent schema. Unstructured string logs
are for local development only. Structured logs are machine-parseable, indexable, and filterable at scale.
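
As a minimal sketch of what "structured" can look like in practice, the formatter below renders each record from Python's standard `logging` module as one JSON object. The class name `JsonFormatter` and the choice to copy every `extra=` field into the payload are assumptions made for illustration, not part of any particular logging library.

```python
import json
import logging
from datetime import datetime, timezone

# Attribute names every LogRecord carries; anything else was supplied via `extra=`.
_STANDARD_ATTRS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (illustrative sketch)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Contextual fields such as trace_id or duration_ms arrive through `extra=`.
        payload.update(
            {k: v for k, v in record.__dict__.items() if k not in _STANDARD_ATTRS}
        )
        if record.exc_info and record.exc_info[0] is not None:
            payload["error.type"] = record.exc_info[0].__name__
            payload["error.message"] = str(record.exc_info[1])
            payload["error.stack"] = self.formatException(record.exc_info)
        # default=str keeps the formatter from failing on non-JSON values.
        return json.dumps(payload, default=str)
```

Attach it with `handler.setFormatter(JsonFormatter())`, then pass context as `logger.info("order placed", extra={"duration_ms": 12})`; each line of output can be parsed and indexed by field.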

### Log Levels

Use levels consistently. Agree on what each level means across the team.

- **FATAL/CRITICAL** — Process cannot continue; about to crash. Alerting: Page immediately
- **ERROR** — Operation failed; requires investigation. Alerting: Alert / ticket
- **WARN** — Unexpected condition; system compensated. Alerting: Monitor trend
- **INFO** — Significant business or lifecycle event. Alerting: Dashboard
- **DEBUG** — Diagnostic detail for developers. Alerting: Never in production by default
- **TRACE** — Extremely verbose step-by-step flow. Alerting: Never in production

Rules:

- Production defaults to INFO or above. DEBUG/TRACE are off unless explicitly enabled for a bounded investigation
window.
- WARN is not a dumping ground. If it never leads to action, it is noise — downgrade to DEBUG or remove it.
- ERROR means something is broken. Expected conditions (404 for missing resources, validation failures from bad input)
are not errors — log at INFO with a status field.
- Log level must be configurable at runtime without restarts.
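
One way to make a "bounded investigation window" concrete is a context manager that raises verbosity and always restores the previous level. This is a sketch using Python's standard `logging` module; the helper name `temporary_level` is hypothetical, and in a real service it would more likely be triggered from an admin endpoint or config watcher than called inline.

```python
import logging
from contextlib import contextmanager


@contextmanager
def temporary_level(logger_name: str, level_name: str):
    """Switch a logger to `level_name` for the duration of the block, then restore it."""
    logger = logging.getLogger(logger_name)
    previous = logger.level
    # setLevel accepts level names and raises ValueError for unknown ones.
    logger.setLevel(level_name.upper())
    try:
        yield logger
    finally:
        # Restore even if the investigation code raises.
        logger.setLevel(previous)
```

Because the level changes on the live logger object, no restart is needed, and the `finally` clause keeps DEBUG from staying on after the investigation ends.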

### Structured Fields

Every log record should include these baseline fields:

- `timestamp`: ISO 8601, UTC
- `level`: Severity (ERROR, WARN, INFO, ...)
- `message`: Human-readable summary of the event
- `service`: Service name emitting the log
- `version`: Service version / build / commit SHA
- `trace_id`: Distributed trace ID (if in request context)
- `span_id`: Current span ID (if in request context)

Add contextual fields relevant to the event:

- `user_id`: User-initiated actions
- `request_id`: Per-request correlation
- `duration_ms`: Timed operations
- `error.type`: Error class/name
- `error.message`: Error description
- `error.stack`: Stack trace (ERROR level only)
- `http.method`, `http.path`, `http.status`: HTTP request/response
- `db.operation`, `db.duration_ms`: Database calls
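
The baseline fields are easiest to keep consistent when they are stamped onto every record in one place rather than passed at each call site. The sketch below does this with a `logging.Filter` and `contextvars`; the names `trace_id_var`, `span_id_var`, and `BaselineFieldsFilter` are hypothetical, and in practice a tracing library would usually own the trace context.

```python
import contextvars
import logging

# Hypothetical context variables; set them when a request enters the service.
trace_id_var = contextvars.ContextVar("trace_id", default=None)
span_id_var = contextvars.ContextVar("span_id", default=None)


class BaselineFieldsFilter(logging.Filter):
    """Attach service, version, and trace context to every record."""

    def __init__(self, service: str, version: str):
        super().__init__()
        self.service = service
        self.version = version

    def filter(self, record: logging.LogRecord) -> bool:
        record.service = self.service
        record.version = self.version
        # None outside a request context, matching "if in request context".
        record.trace_id = trace_id_var.get()
        record.span_id = span_id_var.get()
        return True  # enrich only; never drop the record
```

Add the filter to a handler with `handler.addFilter(BaselineFieldsFilter("checkout", "1.4.2"))` (example values), and any formatter that serializes record attributes will pick the fields up.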

### Sensitive Data

Never log:

- Passwords, tokens, API keys, secrets
- Full credit card numbers, SSNs, or equivalent PII
- Session tokens or authentication cookies
- Request/response bodies containing user-submitted personal data

When user identifiers are needed, log opaque IDs (`user_id`), not email addresses or names. If regulations (GDPR, HIPAA)
apply, verify logged fields comply. When in doubt, omit the field.
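
A last line of defense is to scrub fields before they reach the logger. The following is a rough sketch, assuming a deny-list of key patterns; the pattern list and the `scrub` helper are illustrative, and an allow-list of known-safe fields is generally the stricter choice.

```python
import re
from typing import Any, Mapping

# Hypothetical deny-list; extend it to match your own field names.
SENSITIVE_KEY = re.compile(
    r"pass(word)?|secret|token|api[_-]?key|authorization|cookie|ssn|card",
    re.IGNORECASE,
)
REDACTED = "[REDACTED]"


def scrub(fields: Mapping[str, Any]) -> dict:
    """Return a copy of `fields` with sensitive values masked, recursing into nested mappings."""
    clean = {}
    for key, value in fields.items():
        if SENSITIVE_KEY.search(str(key)):
            clean[key] = REDACTED
        elif isinstance(value, Mapping):
            clean[key] = scrub(value)
        else:
            clean[key] = value
    return clean
```

Key-based scrubbing does not catch secrets embedded in free-text values, so it complements the rules above rather than replacing them.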

### Logging at Boundaries

**At application startup:**

- INFO: service name, version, loaded configuration (without secrets), listen address
- WARN: degraded mode (e.g., fallback to local cache because Redis is unreachable)
- ERROR/FATAL: unrecoverable startup failures

**Per incoming request:**

- INFO: method, path (scrubbed of PII), status code, duration, request dimensions (tenant, region)
- WARN/ERROR: only for unexpected exceptions; catch at the top-level handler

**Per outgoing dependency call:**

- INFO or DEBUG: target service, operation, status, duration
- ERROR: failures in dependent services (Redis, database, queue, etc.)

### Log Once, at the Right Level

Log a raised exception **once**. Do not catch-log-rethrow at every layer. Let exceptions propagate to the top-level
handler, which logs with full context. Log and rethrow only when adding context that would otherwise be lost.
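
A small sketch of the pattern, using Python's standard `logging` module: the inner layer adds context by chaining exceptions but does not log, and the top-level handler logs exactly once. The names `PaymentError`, `charge`, and `handle_request`, and the `gateway` callable, are hypothetical.

```python
import logging

logger = logging.getLogger("orders")


class PaymentError(Exception):
    """Hypothetical domain error that carries context upward."""


def charge(order_id: str, gateway):
    try:
        return gateway(order_id)
    except ConnectionError as exc:
        # Add context the top-level handler would otherwise lack; do not log here.
        raise PaymentError(f"charge failed for order {order_id}") from exc


def handle_request(order_id: str, gateway) -> dict:
    """Top-level handler: the only place the exception is logged."""
    try:
        return {"status": 200, "receipt": charge(order_id, gateway)}
    except Exception:
        # logger.exception logs at ERROR and attaches the traceback,
        # including the chained ConnectionError.
        logger.exception("request failed", extra={"order_id": order_id})
        return {"status": 500}
```

Chaining with `raise ... from exc` preserves the original cause in the single logged stack trace, which is what makes the intermediate log calls unnecessary.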

</structured-logging>

---

## Metrics

### Metric Types

- **Counter** — Monotonically increasing; resets on restart. Use for totals: requests, errors, bytes sent
- **Gauge** — Arbitrary value; goes up and down. Use for snapshots: queue depth, memory usage, connections
- **Histogram** — Client-side aggregation into buckets. Use for distributions: request latency, payload size
- **Summary** — Client-side quantile calculation. Use for pre-computed percentiles (less flexible than histogram)

Rules:

- Use counters for events that accumulate. Derive rates with `rate()` / `increase()` — never store pre-computed rates.
- Use gauges for current-state snapshots. Never `rate()` a gauge.
- Use histograms for latency and size distributions. Histograms enable percentile calculation across instances;
summaries do not aggregate.
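- Export timestamps as Unix epoch seconds, not "time since" values. A "seconds since last success" gauge goes stale if the code that updates it stops running; an absolute timestamp lets the query compute the age.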
- Initialize all metrics with zero at startup to avoid missing-metric problems.
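
To illustrate why raw counters are stored and rates are derived at query time, here is a simplified sketch of the derivation, including the reset-on-restart case. The functions `increase` and `rate` below are illustrative Python, not the Prometheus implementation, which also extrapolates to the edges of the query window.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix_seconds, counter_value)


def increase(samples: Sequence[Sample]) -> float:
    """Total growth of a counter across `samples`, tolerating resets."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at zero,
        # so the whole post-reset value counts as growth.
        total += (curr - prev) if curr >= prev else curr
    return total


def rate(samples: Sequence[Sample]) -> float:
    """Average per-second growth over the sampled interval."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive interval")
    return increase(samples) / elapsed
```

Because the raw totals are kept, the same samples can answer questions over any window after the fact; a pre-computed rate would fix the window at write time and hide the restart.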

### What to Measure

#### The Four Golden Signals (Google SRE)

For every user-facing service, measure these four:

- **Latency** — Time to serve a request. Example: `http_request_duration_seconds` histogram
- **Traffic** — Demand on the system. Example: `http_requests_total` counter by method/path
- **Errors** — Rate of failed requests. Example: `http_requests_total{status=~"5.."}`
- **Saturation** — How "full" the service is. Example: CPU usage, memory, queue depth, thread pool

Distinguish **successful latency from error latency**. A fast 500 is not good latency. Track both.
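
The sketch below shows why bucketed histograms suit the latency signal: bucket counts from several instances can be summed and a percentile estimated from the merged result. The bucket bounds and the `Histogram` class are assumptions for illustration, and the quantile uses simple linear interpolation within a bucket, so the result is an estimate.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds; the last bucket is open-ended.
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, float("inf"))


class Histogram:
    def __init__(self):
        self.counts = [0] * len(BUCKETS)  # per-bucket (non-cumulative) counts

    def observe(self, seconds: float) -> None:
        # First bucket whose upper bound is >= the observed value.
        self.counts[bisect_left(BUCKETS, seconds)] += 1

    def merge(self, other: "Histogram") -> "Histogram":
        """Combine two instances' histograms by summing matching buckets."""
        merged = Histogram()
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        return merged

    def quantile(self, q: float):
        """Estimate the q-quantile, or None if nothing was observed."""
        total = sum(self.counts)
        if total == 0:
            return None
        rank, seen, lower = q * total, 0, 0.0
        for upper, count in zip(BUCKETS, self.counts):
            if count and seen + count >= rank:
                if upper == float("inf"):
                    return lower  # cannot interpolate in the open-ended bucket
                return lower + (upper - lower) * (rank - seen) / count
            seen += count
            lower = upper
        return lower
```

To follow the success-versus-error advice, keep one histogram per outcome (for example a dict keyed by `"success"` and `"error"`) so that fast failures do not pull the success percentiles down.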

#### RED Method (Request-Centric)

For every microservice:

- **R**ate — requests per second
- **E**rrors — failed requests per second
- **D**uration — distribution of request latency

RED is a focused subset of the golden signals, optimized for request-driven services.


Source: https://github.com/xobotyi/cc-foundry/tree/main/plugins/backend/skills/observability
