airflow-debugging
Investigate Mozilla Airflow DAG failures. Use when user asks about: failed DAGs, Airflow task logs, DAG run errors, bqetl failures, telemetry-airflow issues, or data pipeline debugging.
What this skill does
# Airflow DAG Failure Investigation You help users investigate and debug Mozilla Airflow DAG failures by fetching logs, identifying root causes, and suggesting fixes. ## Helper Scripts ### Discovering failures Use `get-triage-data` from the airflow-triage skill to discover failures: ```bash ../airflow-triage/scripts/get-triage-data # Last 24 hours ../airflow-triage/scripts/get-triage-data --since 3d # Last 3 days ``` ### fetch-task-log Fetch and explore task logs from GCS (`gs://airflow-remote-logs-prod-prod`). ```bash # List recent runs for a DAG scripts/fetch-task-log <dag_id> --list-runs # List tasks in a specific run scripts/fetch-task-log <dag_id> --list-tasks --run-id <run_id> # Fetch a task log scripts/fetch-task-log <dag_id> <task_id> <run_id> # Fetch only the last N lines scripts/fetch-task-log <dag_id> <task_id> <run_id> --tail 100 ``` ## Related Repositories When investigating failures, check these repos (all checked out locally): - `bigquery-etl` - Query definitions, metadata.yaml, DAG generation - `private-bigquery-etl` - Confidential ETL code - `telemetry-airflow` - DAGs, operators, GKEPodOperator - `dataservices-infra` - Infrastructure (GKE, Helm, logging config) ## Where DAGs Are Defined Most DAGs are auto-generated from bigquery-etl. The task ID tells you where to find the source. ### Task ID Pattern: `<dataset>__<table>__<version>` Example task ID: `telemetry_derived__clients_daily__v6` Source query location: ``` bigquery-etl/sql/moz-fx-data-shared-prod/<dataset>/<table>/ ├── query.sql # The SQL query ├── metadata.yaml # Scheduling config, owner, tags └── schema.yaml # Table schema ``` For the example above: ``` bigquery-etl/sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/ ``` ### DAG ID Pattern: `bqetl_<name>` DAGs starting with `bqetl_` are auto-generated. The DAG configuration is in `bigquery-etl/dags.yaml`. ### Non-bqetl DAGs DAGs not starting with `bqetl_` are manually defined in: ``` telemetry-airflow/dags/<dag_name>.py ``` ### Private/Confidential DAGs Some DAGs are in private-bigquery-etl with the same structure: ``` private-bigquery-etl/sql/<project>/<dataset>/<table>/ ``` ## GCP Projects & Namespaces Airflow runs across two GCP projects: | Project | Purpose | Namespace | |---------|---------|-----------| | `moz-fx-dataservices-high-prod` | Airflow workers, scheduler | `telemetry-airflow-prod` | | `moz-fx-data-airflow-gke-prod` | GKEPodOperator jobs (queries, scripts) | `default` | ## Cloud Logging (Fallback) Start with GCS logs via `fetch-task-log`. Fall back to Cloud Logging if you suspect infrastructure issues or if GCS logs are missing/incomplete. | Aspect | GCS (`fetch-task-log`) | Cloud Logging | |--------|------------------------|---------------| | Content | Complete Airflow task logs (same as UI) | Raw container stdout/stderr | | Retention | 360 days | 30 days | | Best for | Task failures (SQL errors, exceptions) | Pod-level issues (OOM kills, scheduling failures) | Airflow scheduler/worker logs: ```bash gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="telemetry-airflow-prod" AND textPayload=~"<DAG_ID>"' \ --project=moz-fx-dataservices-high-prod \ --limit=200 ``` GKEPodOperator job logs (query execution errors): ```bash gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="default" AND textPayload=~"<DAG_ID>"' \ --project=moz-fx-data-airflow-gke-prod \ --limit=200 ``` ## Useful Links - **Airflow UI**: `https://workflow.telemetry.mozilla.org/home` - DAG detail: `https://workflow.telemetry.mozilla.org/dags/<dag_id>/grid` - Task logs: `https://workflow.telemetry.mozilla.org/dags/<dag_id>/grid?dag_run_id=<run_id>&task_id=<task_id>` - **Grafana dashboards**: - Airflow overview: `https://earthangel-b40313e5.influxcloud.net/d/airflow-overview` - Task duration: `https://earthangel-b40313e5.influxcloud.net/d/airflow-task-duration` - **Bugzilla** (airflow-triage bugs): [buglist](https://bugzilla.mozilla.org/buglist.cgi?query_format=advanced&status_whiteboard=%5Bairflow-triage%5D) - **Runbooks**: Search Confluence for `<dag_id>` or the pipeline area (e.g. "main_summary runbook"). If a Confluence MCP tool is available, use it to search directly. Otherwise, suggest the user check: - `https://mozilla-hub.atlassian.net/wiki/search?text=<dag_id>` - `https://mozilla-hub.atlassian.net/wiki/search?text=airflow+runbook` ## Bugzilla Search Before investigating, search Bugzilla for existing tickets related to the failing DAG. This avoids duplicate work and surfaces prior context. Use `scripts/search-bugzilla` — it is pinned to `bugzilla.mozilla.org` and defaults to the `[airflow-triage]` whiteboard filter: ```bash scripts/search-bugzilla <dag_id> # Open [airflow-triage] bugs matching DAG name scripts/search-bugzilla <dag_id> --status resolved # Recently resolved matches scripts/search-bugzilla <dag_id> --status all # Both scripts/search-bugzilla "<task_id>" --limit 10 # Narrow by task name scripts/search-bugzilla --all "<keywords>" # Drop the whiteboard filter scripts/search-bugzilla <dag_id> --json # Structured output ``` To read full context (change history + comments) for a bug ID the script returned, call `mcp__plugin_mozdata_moz__get_bugzilla_bug` with the bug ID. That goes through the Mozilla-hosted MCP server — no direct network access needed from the skill. If a match is found: - Link to the existing bug: `https://bugzilla.mozilla.org/show_bug.cgi?id=<bug_id>` - Note whether it's a known/ongoing issue - Include any context from the bug summary/comments in your analysis ## GitHub PR Investigation When investigating a failure, search for recent PRs that may have introduced the issue. Check these repositories: | Repo | What it contains | |------|-----------------| | `mozilla/bigquery-etl` | Query definitions, metadata.yaml, DAG generation | | `mozilla/private-bigquery-etl` | Confidential ETL code | | `mozilla/telemetry-airflow` | DAGs, operators, GKEPodOperator | | `mozilla/lookml-generator` | LookML generation from bigquery-etl | | `mozilla/probe-scraper` | Probe/metric definitions scraping | ### How to search Use the `gh` CLI to find PRs merged around the time of the failure. Focus on files related to the failing DAG/task: ```bash # Search for recently merged PRs touching a path gh pr list --repo mozilla/bigquery-etl --state merged --limit 10 \ --search "merged:>2026-04-10" --json number,title,mergedAt,url # Search PRs mentioning a DAG or table name gh search prs --repo mozilla/bigquery-etl --merged ">2026-04-10" "<dag_id or table_name>" # View a specific PR's changed files gh pr view --repo mozilla/bigquery-etl <PR_NUMBER> --json files,title,body ``` ### What to look for - PRs merged shortly before the first failure timestamp - Changes to the failing task's query, metadata, or schema - Changes to shared dependencies (UDFs, views, upstream tables) - Changes to DAG definitions or operator configuration - Infrastructure changes (Helm, Docker, Airflow version bumps) If a suspect PR is found, include it in the investigation report with a link and a note on which changed files are relevant. ## Investigation Workflow If the user provides a DAG name, skip straight to step 2. Only discover failures when no DAG name is given. 1. Run `../airflow-triage/scripts/get-triage-data` to discover failures (skip if DAG name is already known) 2. **Search Bugzilla** for existing tickets matching the DAG/task name using `scripts/search-bugzilla`. If a promising bug ID comes back, hydrate it with `mcp__plugin_mozdata_moz__get_bugzilla_bug` for history and comments. Link it and note prior context before continuing. 3. Run `scripts/fetch-task-log <dag_id> --list-runs` to find recent runs 4. Run `scripts/fetch-task-log <dag_id> --list-tasks --run-id <run_id>` to list tasks in the fa
Related in Productivity
gitea-workflow
IncludedOrchestrate agile development workflows for Gitea repositories using the tea CLI. Use when working with Gitea-hosted repos and asking to 'run the workflow', 'continue working', 'what's next', 'complete the task cycle', 'start my day', 'end the sprint', 'implement the next task', or wanting guided step-by-step development assistance. Keywords: workflow, orchestrate, agile, task cycle, sprint, daily, implement, review, PR, standup, retrospective, gitea, tea.
microsoft-graph-gateway
IncludedRoute Microsoft Graph work in this workspace. Use when users want to read or write Outlook mail, calendar events, contacts, OneDrive or SharePoint files, Teams, Planner, To Do, users, groups, directory data, or arbitrary Microsoft Graph endpoints from VS Code. Prefer WorkIQ for common read scenarios. Use Microsoft Graph for write actions and gap-read scenarios that need exact Graph properties, filters, permissions, or endpoints.
copilotkit
IncludedUse when building with CopilotKit — setup, development, integrations, debugging, upgrading, or contributing. Routes to the appropriate specialized skill based on the task.
wordly-wisdom
IncludedProvides calibrated decision analysis using Charlie Munger-style multiple mental models, inversion, incentive mapping, circle-of-competence checks, misjudgment audits, second-order effects, and forecast updates. Use when the user asks for an oracle take, a hard call, a decision memo, a premortem, an outside view, a red-team, a sanity-check, what am I missing, think this through, or wants a strategy, hire, investment, plan, product, partnership, or major life choice analysed. Avoid for simple factual lookups or time-sensitive legal, medical, or market questions without fresh evidence.
swain-session
IncludedSession management and project status dashboard. Owns the full session lifecycle (start/work/close/resume), focus lane, bookmarks, worktree detection, and tab naming. Also serves as the project status dashboard — shows active epics, progress, actionable next steps, blocked items, tasks, GitHub issues, and recommendations. Worktree creation is deferred to swain-do task dispatch (SPEC-195). Triggers on: 'session', 'status', 'what's next', 'dashboard', 'overview', 'where are we', 'what should I work on', 'show me priorities', 'bookmark', 'focus on', 'session info'.
gandi
IncludedComprehensive Gandi domain registrar integration for domain and DNS management. Register and manage domains, create/update/delete DNS records (A, AAAA, CNAME, MX, TXT, SRV, and more), configure email forwarding and aliases, check SSL certificate status, create DNS snapshots for safe rollback, bulk update zone files, and monitor domain expiration. Supports multi-domain management, zone file import/export, and automated DNS backups. Includes both read-only and destructive operations with safety controls.