alerting
Grafana unified alerting: rule types, evaluation lifecycle, state transitions, notification policies, contact points, silences, mute timings, recording rules, and alert/notification templates. Invoke whenever a task involves any interaction with Grafana alerting: creating or editing alert rules, configuring contact points and routing, templating annotations or notifications, debugging alert state, or reviewing alerting configuration.
# Grafana Alerting
Alert on symptoms, route to the right team, suppress noise without losing signal. Grafana unified alerting splits the
job in two halves: a **rule evaluator** raises alert instances; an **Alertmanager** turns them into notifications. Get
the split right at design time to avoid notification storms, missed pages, and silent failures.
## References
- **Rule evaluation** — [`${CLAUDE_SKILL_DIR}/references/rule-evaluation.md`] Rule types (Grafana-managed vs
data-source-managed), evaluation groups, pending period, keep firing for, full state lifecycle, No Data / Error
configuration, `grafana_state_reason` annotation, worked example
- **Notification routing** — [`${CLAUDE_SKILL_DIR}/references/notification-routing.md`] Alertmanager architecture,
contact points, policy tree, routing algorithm, inheritance, label matchers, grouping/timing, silences, mute timings,
inhibition rules, worked routing example
- **Templates** — [`${CLAUDE_SKILL_DIR}/references/templates.md`] Annotation/label templates vs notification templates,
Go `text/template` essentials, available variables (`$labels`, `$values`, `.Alerts`), notification data shape,
built-in functions (collection, data, template, time), worked examples
- **Data-source-managed rules** — [`${CLAUDE_SKILL_DIR}/references/recording-rules.md`] Mimir/Loki/Prometheus alert and
recording rules: prerequisites, UI workflow, YAML rule shape, namespaces and groups, sequential evaluation pattern,
restrictions vs Grafana-managed, naming, alerting-on-recorded-metrics alignment, Loki rule files + lokitool, Mimir
managed rules
## Alert rule design
### Choose the rule type
- **Grafana-managed** (default) — multi-data-source queries, server-side expressions (reduce, math, threshold, classic),
images in notifications, No Data / Error state handling, multi-dimensional alerts
- **Data source-managed** — Prometheus-compatible (Mimir, Loki, Prometheus). Rules stored in the data source. Use when
rules must live alongside data (multi-tenant Mimir/Loki) or for Prometheus migration
### Alert on symptoms, not causes
- **Online-serving** — alert on high error rate, high latency, request availability
- **Offline processing** — alert on slow throughput, stuck queues, missing heartbeat timestamps
- **Batch jobs** — alert when job has not succeeded recently enough (≥ 2× normal cycle)
- **Capacity** — alert when resource exhaustion is imminent
Page on user-visible impact at one point in the stack. Don't page on a slow sub-component if overall user latency is
fine — use dashboards to localize causes after a page fires. Remove alerts with no actionable response.
### Set evaluation parameters intentionally
- **Evaluation interval** — match query cost and incident response speed. Common: `30s`–`5m`.
- **Pending period** — long enough to absorb transient breaches, short enough to fire before the user notices. Usually
3–10× the evaluation interval.
- **Keep firing for** — non-zero only when flapping is a real problem. Defers resolved notifications.
- **Sequential evaluation** when an alert depends on a recording rule in the same group.
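
The interaction between evaluation interval and pending period can be estimated with a small sketch. This is a simplified model, not Grafana's scheduler: it assumes evaluations run on a fixed tick, that the first evaluation to see the breach starts the pending timer, and that the instance fires at the first evaluation at least one pending period later. `firing_delay_bounds` is a hypothetical helper name.

```python
import math
from typing import Tuple


def firing_delay_bounds(interval_s: int, pending_s: int) -> Tuple[int, int]:
    """Return (min, max) seconds from the real start of a breach to Alerting.

    Simplified model: see the assumptions in the surrounding text.
    """
    if interval_s <= 0 or pending_s < 0:
        raise ValueError("interval must be positive and pending non-negative")
    # Number of evaluation ticks needed to cover the pending period.
    ticks = math.ceil(pending_s / interval_s)
    wait_after_first_seen = ticks * interval_s
    # The breach may start right before a tick (no lag) or right after one
    # (almost a full interval passes before any evaluation sees it).
    return wait_after_first_seen, wait_after_first_seen + interval_s


print(firing_delay_bounds(60, 300))  # (300, 360)
```

Under this model, a `1m` interval with a `5m` pending period fires between 5 and 6 minutes after the breach really began, which is the number to compare against how quickly users notice.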
### Configure No Data and Error explicitly
Grafana-managed rules control what happens when the query returns no data or fails. Don't accept the default silently —
pick one:
- **Set No Data / Error** (default) — creates synthetic `DatasourceNoData` / `DatasourceError` alerts with labels
`alertname`, `datasource_uid`, `rulename`. Route through dedicated notification policies.
- **Set Alerting** — treat as a fire. Use when no-data means broken instrumentation that itself is the alert.
- **Set Normal** — treat as healthy. Use only when no-data is expected.
- **Keep last state** — preserve previous state. Mitigates transient data source flakiness; risks missing real outages
if the data source stays down.
Synthetic alerts have **different labels** than the original rule's alerts. Silences, mute timings, and policies that
match the original by labels won't match synthetic alerts without explicit `alertname` matchers.
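
The label mismatch is easy to see with an equality-only matcher check. The sketch below is illustrative: `matches` is a hypothetical helper, and the label values (`HighErrorRate`, `api-1`, `prom-main`) are made up. Only the synthetic alert's label names come from the list above.

```python
from typing import Dict


def matches(matchers: Dict[str, str], labels: Dict[str, str]) -> bool:
    """Equality-only check: every matcher must equal the alert's label value."""
    return all(labels.get(name) == value for name, value in matchers.items())


# Hypothetical label sets, for illustration only.
rule_alert = {"alertname": "HighErrorRate", "instance": "api-1", "severity": "page"}
no_data_alert = {
    "alertname": "DatasourceNoData",
    "datasource_uid": "prom-main",
    "rulename": "HighErrorRate",
}

silence_on_rule = {"alertname": "HighErrorRate"}
silence_on_synthetic = {"alertname": "DatasourceNoData", "rulename": "HighErrorRate"}

print(matches(silence_on_rule, rule_alert))          # True
print(matches(silence_on_rule, no_data_alert))       # False: the synthetic alert slips past
print(matches(silence_on_synthetic, no_data_alert))  # True
```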
### Multi-dimensional alerts
One rule generates one alert instance per series returned by the query. Different instances of the same rule can be in
different states simultaneously. Plan label cardinality before deploying — high cardinality multiplies notifications.
## State lifecycle
Six states for any alert instance:
- **Normal** — no condition met
- **Pending** — condition met, pending period not elapsed
- **Alerting** — condition met past the pending period (notifications routed)
- **Recovering** — was Alerting, condition no longer met, keep-firing-for not elapsed
- **No Data** — query returned no series past the pending period (Grafana-managed only)
- **Error** — query failed past the pending period (Grafana-managed only)
Only two transitions emit notifications:
- → **Alerting** (entering firing)
- → **Normal** marked Resolved (leaving firing via Recovering or directly)
If a rule is modified (except annotations / evaluation interval / internal fields), all instances reset to Normal and
re-evaluate on the next cycle. When a templated label's value changes, the result counts as a new alert instance and the
instance carrying the old value is left behind as stale.
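
The four condition-driven states and the two notifying transitions can be sketched as a small state machine. This is a simplified model under stated assumptions, not Grafana's implementation: it ignores No Data and Error, takes time as plain seconds, and the `Instance` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    pending_for: int      # seconds the condition must hold before firing
    keep_firing_for: int  # seconds to keep firing after the condition clears
    state: str = "Normal"
    since: int = 0        # time at which the current state was entered

    def _enter(self, state: str, now: int) -> None:
        self.state, self.since = state, now

    def evaluate(self, now: int, breached: bool) -> Optional[str]:
        """Advance one evaluation; return "firing" or "resolved" if a notification is due."""
        previous = self.state
        if breached:
            if self.state == "Normal":
                self._enter("Pending", now)
            elif self.state == "Recovering":
                self._enter("Alerting", now)  # still firing, so nothing new to send
            if self.state == "Pending" and now - self.since >= self.pending_for:
                self._enter("Alerting", now)
        else:
            if self.state == "Pending":
                self._enter("Normal", now)
            elif self.state == "Alerting":
                self._enter("Recovering", now)
            if self.state == "Recovering" and now - self.since >= self.keep_firing_for:
                self._enter("Normal", now)
        if self.state == "Alerting" and previous in ("Normal", "Pending"):
            return "firing"
        if self.state == "Normal" and previous in ("Alerting", "Recovering"):
            return "resolved"
        return None


inst = Instance(pending_for=120, keep_firing_for=60)
samples = [(0, True), (60, True), (120, True), (180, True), (240, False), (300, False)]
print([inst.evaluate(t, breached) for t, breached in samples])
# [None, None, 'firing', None, None, 'resolved']
```

With a 60-second tick, the instance sits in Pending for two evaluations, notifies once on entering Alerting, passes through Recovering silently, and notifies once more when it returns to Normal.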
## Notification routing
### Contact points
Notification destinations. One contact point may have multiple receivers (integrations: `email`, `slack`, `pagerduty`,
`opsgenie`, `webhook`, etc.).
- Receiver type and settings determine the integration
- Put secrets in `secure_settings` (encrypted at rest), config in `settings`
- Use `disableResolveMessage: true` on receivers that shouldn't get resolved notifications
### Policy tree
The policy tree decides which contact point handles each alert, how alerts are grouped, and when notifications are sent.
- Single tree per Alertmanager — provisioning overwrites it entirely
- Root is the **default policy**: matches all alerts, has a contact point, no matchers, no mute timings
- Each non-root policy has zero or more **label matchers**: `=`, `!=`, `=~`, `!~` (multiple combine with AND)
- Routing is top-down, deepest-match-wins. Once a policy matches, siblings are skipped unless **Continue matching
subsequent sibling nodes** is enabled.
- Children **inherit** contact point, grouping, and timing from their parent. Override per child as needed.
- Mute timings are **not inherited** — declare on every level that needs them.
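
The routing rules above can be sketched as a recursive walk. This is a minimal model, assuming matchers are `(label, operator, value)` tuples, that a missing label compares as an empty string, and that regex matchers are fully anchored. Python's `re` stands in for the real regex engine, and `Policy`, `route`, and the contact point names are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Policy:
    contact_point: Optional[str] = None  # None means "inherit from parent"
    matchers: List[Tuple[str, str, str]] = field(default_factory=list)
    continue_matching: bool = False
    children: List["Policy"] = field(default_factory=list)


def _match_one(op: str, actual: str, expected: str) -> bool:
    if op == "=":
        return actual == expected
    if op == "!=":
        return actual != expected
    if op == "=~":
        return re.fullmatch(expected, actual) is not None
    if op == "!~":
        return re.fullmatch(expected, actual) is None
    raise ValueError(f"unknown operator: {op}")


def policy_matches(policy: Policy, labels: Dict[str, str]) -> bool:
    # Multiple matchers combine with AND; no matchers means "match everything".
    return all(_match_one(op, labels.get(name, ""), value)
               for name, op, value in policy.matchers)


def route(policy: Policy, labels: Dict[str, str],
          inherited: Optional[str] = None) -> List[Optional[str]]:
    """Contact points that receive the alert, assuming `policy` itself matched."""
    contact = policy.contact_point or inherited
    receivers: List[Optional[str]] = []
    for child in policy.children:
        if policy_matches(child, labels):
            receivers.extend(route(child, labels, contact))
            if not child.continue_matching:
                break  # siblings are skipped unless "continue" is enabled
    # If no child matched, this policy handles the alert itself.
    return receivers or [contact]


tree = Policy(contact_point="default-email", children=[
    Policy(contact_point="payments-pager", matchers=[("team", "=", "payments")], children=[
        Policy(contact_point="payments-slack", matchers=[("severity", "=", "warning")]),
    ]),
    Policy(contact_point="oncall-pager", matchers=[("severity", "=~", "critical|page")]),
])

print(route(tree, {"team": "payments", "severity": "warning"}))  # ['payments-slack']
print(route(tree, {"team": "search", "severity": "info"}))       # ['default-email']
```

An alert with `team=payments` and `severity=critical` lands on `payments-pager` only. It would also reach `oncall-pager` if the payments policy had `continue_matching=True`.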
### Grouping and timing
- `group_by` — labels that partition alerts into notification groups. `['...']` disables grouping. Default groups by
alert rule.
- `group_wait` — first-notification delay for a new group (default `30s`)
- `group_interval` — delay before sending a follow-up for the same group when it changes (default `5m`)
- `repeat_interval` — delay before re-sending an unchanged group (default `4h`)
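Keep `repeat_interval` ≥ `group_interval`, ideally a whole multiple of it. Repeats for a group are only sent when its
group interval comes around, so a shorter `repeat_interval` has no effect and a non-multiple is effectively rounded up.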
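
How `group_by` partitions alerts can be sketched as follows. This is an illustrative model, assuming a missing label contributes an empty value to the group key. `group_alerts` and the sample labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Labels = Dict[str, str]


def group_alerts(alerts: List[Labels], group_by: List[str]) -> Dict[Tuple, List[Labels]]:
    """Partition alerts into notification groups keyed by the `group_by` labels."""
    def key(labels: Labels) -> Tuple:
        if group_by == ["..."]:
            # Special value: group by every label, so each alert is its own group.
            return tuple(sorted(labels.items()))
        return tuple((name, labels.get(name, "")) for name in group_by)

    groups: Dict[Tuple, List[Labels]] = defaultdict(list)
    for labels in alerts:
        groups[key(labels)].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "HighErrorRate", "instance": "a"},
    {"alertname": "HighErrorRate", "instance": "b"},
    {"alertname": "DiskFull", "instance": "a"},
]
print(len(group_alerts(alerts, ["alertname"])))  # 2
print(len(group_alerts(alerts, ["..."])))        # 3
```

Each group gets its own `group_wait`, `group_interval`, and `repeat_interval` timers, so fewer `group_by` labels means fewer, larger notifications.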
### Silences vs mute timings
- **Silence** — fixed start/end time, label matchers. For incident-specific or maintenance-window suppression.
Auto-deleted 5 days after expiry. Cannot be deleted before expiry; only Unsilenced (ends immediately).
- **Mute timing** — recurring time intervals attached to a notification policy. For predictable schedules (weekends,
after-hours). Shape: `times`, `weekdays`, `months`, `years`, `days_of_month`, `location`.
Silences don't stop rule evaluation — only notification creation. The alert still appears in the UI and history.
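
The difference between a one-off window and a recurring schedule can be sketched in a few lines. This is a simplified model: it uses equality matchers only, treats the end of each window as exclusive, keeps all times in UTC (so `location` is left out), and covers only the `weekdays` and `times` fields of a mute timing. Both function names are hypothetical.

```python
from datetime import datetime, time, timezone
from typing import Dict, List

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def silence_active(now: datetime, starts_at: datetime, ends_at: datetime,
                   matchers: Dict[str, str], labels: Dict[str, str]) -> bool:
    """One-off window: suppress only between fixed timestamps, for matching labels."""
    in_window = starts_at <= now < ends_at
    return in_window and all(labels.get(k) == v for k, v in matchers.items())


def mute_timing_active(now: datetime, weekdays: List[str], start: time, end: time) -> bool:
    """Recurring window: suppress on the listed weekdays between `start` and `end`."""
    day_ok = not weekdays or WEEKDAYS[now.weekday()] in weekdays
    return day_ok and start <= now.time() < end


saturday = datetime(2024, 1, 6, 10, 0, tzinfo=timezone.utc)
next_saturday = datetime(2024, 1, 13, 10, 0, tzinfo=timezone.utc)

weekend = (["saturday", "sunday"], time(0, 0), time(23, 59))
print(mute_timing_active(saturday, *weekend))       # True
print(mute_timing_active(next_saturday, *weekend))  # True: the schedule recurs

window = (datetime(2024, 1, 6, 9, 0, tzinfo=timezone.utc),
          datetime(2024, 1, 6, 11, 0, tzinfo=timezone.utc))
labels = {"service": "checkout"}
print(silence_active(saturday, *window, {"service": "checkout"}, labels))       # True
print(silence_active(next_saturday, *window, {"service": "checkout"}, labels))  # False: expired
```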
## Templates
Two kinds. Don't confuse them — variables and scope differ.
### Annotation and label templates (alert rule level)
Inline Go template expressions in the rule's `annotations` and `labels` maps. Evaluated each rule evaluation, per alert
instance.
- `$labels` — alert's label set
- `$values` — query/expression values keyed by refId (e.g., `$values.A.Value`)
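- `$value` — for data source-managed rules, the numeric value of the rule expression; for Grafana-managed rules, a string summarizing the labels and values of the rule's queries and expressions. Use `$values` when a single number is needed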
- Visible in the Grafana UI, alert history, **and** notifications
### Notification templates (Alertmanager level)
Reusable `{{ define "