Prometheus instrumentation discipline: right metric type, right name, right labels. Invoke whenever a task involves any interaction with Prometheus metrics: instrumenting application code, writing PromQL queries, defining alerting or recording rules, choosing metric types, managing label cardinality, building exporters, or reviewing monitoring configuration.

# Prometheus

Choose the right metric type, name it clearly, label it sparingly. Prometheus is a pull-based monitoring system built on
a dimensional data model — every metric is a time series identified by a name and key-value label pairs. Getting this
right at instrumentation time prevents rework later.

## References

- **Metric types** — [`${CLAUDE_SKILL_DIR}/references/metric-types.md`] Extended type comparison, histogram bucket
  tuning, summary configuration
- **Naming** — [`${CLAUDE_SKILL_DIR}/references/naming.md`] Full naming examples, base units table, character rules,
  label best practices
- **Instrumentation** — [`${CLAUDE_SKILL_DIR}/references/instrumentation.md`] Code patterns per system type, library
  instrumentation, performance tuning
- **PromQL** — [`${CLAUDE_SKILL_DIR}/references/promql.md`] Full operator catalog, vector matching, over-time
  aggregation, operator precedence
- **Alerting and rules** — [`${CLAUDE_SKILL_DIR}/references/alerting-and-rules.md`] Alert design, recording rule naming,
  aggregation patterns, anti-patterns
- **Exporters** — [`${CLAUDE_SKILL_DIR}/references/exporters.md`] Exporter architecture, collectors, help strings,
  push-based sources

## Metric Type Selection

Choose correctly at instrumentation time — changing later requires migrating dashboards, alerts, and recording rules.

| Question                                                                       | Answer | Type      |
| ------------------------------------------------------------------------------ | ------ | --------- |
| Can the value decrease?                                                        | No     | Counter   |
| Is it a snapshot of current state?                                             | Yes    | Gauge     |
| Observing a distribution needing cross-instance aggregation?                   | Yes    | Histogram |
| Need accurate quantiles from a single instance, known at instrumentation time? | Yes    | Summary   |
| None of the above                                                              | —      | Gauge     |

### Counter

Monotonically increasing value — resets to zero only on restart.

- Use for: requests served, errors occurred, bytes transferred, tasks completed
- API: `inc()`, `inc(v)` where v >= 0
- Always suffix with `_total`: `http_requests_total`
- Always apply `rate()` or `increase()` in queries — raw values are meaningless
- Prometheus handles counter resets automatically in `rate()`
- Never use a counter for values that can decrease — that is a gauge
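
To illustrate why queries work on the change in a counter rather than its raw value, and how a restart is absorbed, here is a minimal Python sketch of a reset-aware increase over a list of scraped samples. The function name `counter_increase` and the sample values are hypothetical, and the sketch deliberately omits the extrapolation to window boundaries that Prometheus's `increase()` performs, so it shows the idea rather than the exact result.

```python
def counter_increase(samples):
    """Total increase across consecutive counter samples, tolerating resets.

    `samples` is a list of raw counter values in scrape order. A drop between
    two samples is treated as a restart: the counter is assumed to have begun
    again from zero, so the whole new value counts as increase.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # reset detected: count from zero
    return total


# Hypothetical scrape values; the process restarted between 15 and 3.
print(counter_increase([10, 15, 3, 8]))  # 5 + 3 + 5 = 13.0
```

Dividing this increase by the length of the window in seconds gives the per-second figure that `rate()` approximates.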

### Gauge

Value that goes up and down arbitrarily.

- Use for: temperature, memory usage, in-progress requests, queue depth, timestamps
- API: `inc()`, `dec()`, `set(v)`, `set_to_current_time()`
- No `_total` suffix
- Never apply `rate()` to a gauge — use `deriv()` or `delta()`
- **Timestamp pattern:** store Unix epoch seconds as `myapp_last_success_timestamp_seconds`; compute elapsed time with
  `time() - metric` in PromQL
- **Info metric pattern:** `myapp_build_info{version="1.2.3", commit="abc"} 1` — metadata as labels with constant value
  1
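
The timestamp pattern can be sketched in plain Python as follows. This mirrors the `time() - metric` expression outside of PromQL; the helper names `seconds_since` and `is_stale` and the 60-second threshold are assumptions made for illustration only.

```python
import time


def seconds_since(last_success_timestamp_seconds, now=None):
    """Elapsed seconds since the stored Unix timestamp, like `time() - metric`."""
    if now is None:
        now = time.time()
    return now - last_success_timestamp_seconds


def is_stale(last_success_timestamp_seconds, max_age_seconds, now=None):
    """True when the last success is older than the allowed age."""
    return seconds_since(last_success_timestamp_seconds, now) > max_age_seconds


# Hypothetical values: last success at t=1_700_000_000, checked 90 s later.
print(seconds_since(1_700_000_000, now=1_700_000_090))  # 90
print(is_stale(1_700_000_000, max_age_seconds=60, now=1_700_000_090))  # True
```

Storing the timestamp rather than an "age" value keeps the gauge correct between scrapes, because the elapsed time is computed at query time.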

### Histogram

Samples observations into configurable buckets. Produces `_bucket{le="..."}`, `_sum`, `_count`.

- Use for: request latencies, response sizes, any distribution needing percentiles or cross-instance aggregation
- API: `observe(v)`
- Buckets are cumulative — `le="0.5"` includes all observations <= 0.5
- Must include `+Inf` bucket (equal to `_count`)
- Choose buckets matching expected value range; place more buckets near SLO boundaries
- Buckets are fixed when the metric is created; changing them later requires a code change and makes new data hard to compare with historical data
- Use `histogram_quantile()` in PromQL to calculate percentiles
- Aggregatable across instances — the primary advantage over summary
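
The cumulative bucket layout and the interpolation behind quantile estimation can be sketched in standard-library Python. This is a toy model, not a client library: `SketchHistogram` and `estimate_quantile` are hypothetical names, the bucket bounds are assumed to be positive, and the lowest bucket is assumed to start at 0. It approximates what `histogram_quantile()` does without claiming to match it in every edge case.

```python
import bisect
import math


class SketchHistogram:
    """Toy histogram with cumulative `le` buckets (not a client library)."""

    def __init__(self, upper_bounds):
        # The +Inf bucket is always present and catches every observation.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # First bucket whose upper bound is >= value ("le" means <=).
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative_buckets(self):
        """Return (le, cumulative_count) pairs, as exposed in `_bucket`."""
        running = 0
        pairs = []
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            running += n
            pairs.append((bound, running))
        return pairs


def estimate_quantile(q, cumulative_buckets):
    """Estimate a quantile by linear interpolation inside the target bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative_buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, upper_count in cumulative_buckets:
        if upper_count >= rank:
            if math.isinf(upper_bound):
                # Nothing to interpolate towards; use the last finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (upper_count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, upper_count


# Hypothetical latencies in seconds.
histogram = SketchHistogram([0.1, 0.5, 1.0])
for latency in (0.05, 0.2, 0.3, 0.4, 0.7, 2.0):
    histogram.observe(latency)
print(histogram.cumulative_buckets())
print(estimate_quantile(0.5, histogram.cumulative_buckets()))
```

Because the estimate can only interpolate between bucket bounds, its precision depends on bucket placement, which is why denser buckets near SLO boundaries pay off. Bucket counts from several instances can be added together before estimating, which is what makes histograms aggregatable.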

### Summary

Calculates streaming quantiles on the client side. Produces `{quantile="..."}`, `_sum`, `_count`.

- **Cannot be aggregated across instances** — `avg(x{quantile="0.95"})` is statistically invalid
- `_sum` and `_count` without quantiles is a valid and useful configuration

**Use summary over histogram only when ALL of these are true:**

1. You need accurate quantiles (not approximate)
2. From a single instance (no cross-instance aggregation)
3. You know the exact quantiles at instrumentation time
4. You accept that adding new quantiles requires code changes

**Default choice: histogram.** See `${CLAUDE_SKILL_DIR}/references/metric-types.md` for detailed comparison.
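
A small numeric sketch shows why averaging per-instance quantiles is invalid. The latencies below are made up, and `nearest_rank_quantile` is a hypothetical helper that computes an exact quantile from raw observations, which a real summary would not expose.

```python
import math


def nearest_rank_quantile(values, q):
    """Exact quantile of raw observations using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]


# Hypothetical latencies: instance A is fast and busy, instance B slow and idle.
instance_a = [0.1] * 900
instance_b = [2.0] * 100

p95_a = nearest_rank_quantile(instance_a, 0.95)
p95_b = nearest_rank_quantile(instance_b, 0.95)
averaged_p95 = (p95_a + p95_b) / 2
true_p95 = nearest_rank_quantile(instance_a + instance_b, 0.95)

print(averaged_p95)
print(true_p95)
```

With these inputs the averaged value lands near 1.05 seconds, while the 95th percentile over all observations is 2.0 seconds. The average ignores how many requests each instance served, so it describes neither instance nor the fleet.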

## Naming

Format: `<namespace>_<subsystem>_<name>_<unit>_<suffix>`. Not all parts required — minimum is namespace + meaningful
name + unit/suffix.

### Naming Rules

1. Use `snake_case` (lowercase with underscores). Names must also match the pattern Prometheus accepts, `[a-zA-Z_:][a-zA-Z0-9_:]*`, which is broader than the convention
2. Colons (`:`) are reserved for recording rules — never use in direct instrumentation
3. Double underscore prefix (`__`) is reserved for Prometheus internals
4. Every metric MUST have a namespace prefix identifying its origin
5. Always use base units — seconds not milliseconds, bytes not megabytes. Let visualization tools handle conversion.
6. Append unit to metric name in plural form: `http_request_duration_seconds`
7. Suffix counters with `_total`, counter-with-unit as `_<unit>_total` (e.g., `process_cpu_seconds_total`)
8. Suffix info metrics with `_info`, timestamps with `_timestamp_seconds`
9. A metric MUST represent the same logical thing across all its label dimensions — `sum()` or `avg()` across all
   dimensions should be meaningful. If nonsensical, split into separate metrics.

See `${CLAUDE_SKILL_DIR}/references/naming.md` for base units table, component ordering, and full examples.
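
Several of the naming rules above are mechanical enough to check in code. The following is a minimal sketch, assuming a hypothetical helper `check_metric_name` and a deliberately incomplete list of non-base units; it cannot verify rules that need judgment, such as whether the namespace prefix is meaningful.

```python
import re

VALID_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Hypothetical, deliberately incomplete list of non-base units to flag.
NON_BASE_UNITS = ("milliseconds", "microseconds", "minutes", "kilobytes", "megabytes")


def check_metric_name(name, metric_type):
    """Return a list of rule violations for a directly instrumented metric."""
    problems = []
    if not VALID_NAME.match(name):
        problems.append("invalid characters")
    if ":" in name:
        problems.append("colons are reserved for recording rules")
    if name.startswith("__"):
        problems.append("double underscore prefix is reserved")
    if not SNAKE_CASE.match(name):
        problems.append("not snake_case")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counter should end with _total")
    if metric_type != "counter" and name.endswith("_total"):
        problems.append("_total suffix is for counters")
    if any(unit in name.split("_") for unit in NON_BASE_UNITS):
        problems.append("use base units (seconds, bytes)")
    return problems


print(check_metric_name("http_request_duration_seconds", "histogram"))  # []
print(check_metric_name("http_request_duration_milliseconds", "histogram"))
print(check_metric_name("http_requests", "counter"))
```

A check like this fits naturally into a unit test or a pre-commit hook, so naming problems surface before dashboards and alerts depend on the name.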

## Labels

Use labels for dimensions you will filter or aggregate by: `http_requests_total{method="GET", status="200"}` — not
separate metrics per status. Do not put label names in metric names.

### When NOT to Use Labels

- Unbounded values — user IDs, email addresses, full URLs, query strings
- High cardinality — anything above ~100 unique values per metric

### Cardinality

Every unique label combination is a new time series, and each one costs RAM, CPU, disk, and network. **Cardinality math:**
total series = metric cardinality × number of targets. Guidelines for the cardinality of a single metric:

- `< 10`: Safe for most metrics
- `10-100`: Acceptable, monitor growth
- `100-1000`: Investigate alternatives
- `> 1000`: Move analysis out of Prometheus
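
The cardinality math can be worked through with a few lines of Python. The label names, value counts, and target count below are hypothetical, and the product is a worst case that assumes every combination of label values actually occurs.

```python
import math


def series_per_target(label_value_counts):
    """Worst-case label combinations for one metric on one target."""
    return math.prod(label_value_counts.values())


def total_series(label_value_counts, targets):
    """Total series = metric cardinality x number of targets."""
    return series_per_target(label_value_counts) * targets


# Hypothetical metric: 5 HTTP methods x 8 status codes, scraped from 20 targets.
labels = {"method": 5, "status": 8}
print(series_per_target(labels))         # 40
print(total_series(labels, targets=20))  # 800
```

Adding one more label with 50 values to this example would multiply both figures by 50, which is why each new label deserves a concrete use case.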

### Label Best Practices

1. Start with no labels. Add as concrete use cases emerge.
2. Keep cardinality below 10 per metric as a default target.
3. Initialize all label combinations you know upfront to avoid missing metrics — export 0 for known label sets.
4. Use stable label values. Avoid labels that change frequently.
5. Never include a "total" label value — rely on Prometheus `sum()`.
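
Pre-initializing known label combinations can be sketched as follows. The dictionary is only a stand-in for a labelled counter, not a client library API, and the label values are hypothetical; with a real client library the equivalent step is touching each known label combination once at startup.

```python
from itertools import product

# Hypothetical label values known at startup.
METHODS = ("GET", "POST")
STATUS_CLASSES = ("2xx", "4xx", "5xx")

# Stand-in for a labelled counter: one entry per label combination, starting at 0.
http_requests_total = {combo: 0 for combo in product(METHODS, STATUS_CLASSES)}


def record_request(method, status_class):
    """Increment the series for one completed request."""
    http_requests_total[(method, status_class)] += 1


record_request("GET", "2xx")
for (method, status_class), value in sorted(http_requests_total.items()):
    print(f'http_requests_total{{method="{method}",status="{status_class}"}} {value}')
```

Every combination is present from the first scrape, so a query or alert on the `5xx` series sees 0 instead of no data before the first error occurs.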

## Instrumentation Patterns

### Online-Serving Systems (HTTP servers, APIs, databases)

Key metrics: request rate (`_total`), error rate, latency (histogram), in-progress (gauge).

- Count requests at completion (not start) — aligns with error and latency stats
- Always have a total requests counter alongside error counters (for ratio calculation)

### Offline Processing (queues, pipelines, ETL)

Key metrics per stage: items in (`_total`), items out (`_total`), in progress (gauge), last processed timestamp (gauge),
processing duration (histogram).

- Export heartbeat timestamps to detect stalled processing

### Batch Jobs (cron, scheduled tasks)

Key metrics (push to Pushgateway): last success timestamp (gauge), last completion timestamp (gauge), duration (gauge —
single run, not distribution).

- Batch job durations are gauges (single event), not histograms
- Consider converting jobs that run more often than every 15 minutes into daemons

### Libraries

Instrument transparently — users get metrics without configuration. Minimum for external resource access: request count
(counter), error count (counter), latency (histogram).

### Subsystem Patterns

- **Logging:** maintain `log_messages_total{level="..."}` counter per log level
- **Failures:** always pair failure counter with total attempts counter

Source: https://github.com/xobotyi/cc-foundry/tree/main/plugins/backend/skills/prometheus