ds-plan
Phase 2 of the /ds workflow — data profiling and task breakdown. Invoked by the ds-brainstorm chain; not user-invocable.
What this skill does
Announce: "Using ds-plan (Phase 2) to profile data and create task breakdown."
## Contents
- [The Iron Law of DS Planning](#the-iron-law-of-ds-planning)
- [What Plan Does](#what-plan-does)
- [Process](#process)
- [Red Flags - STOP If You're About To](#red-flags---stop-if-youre-about-to)
- [Output](#output)
## Context Monitoring
| Level | Remaining Context | Action |
|-------|------------------|--------|
| Normal | >35% | Proceed normally |
| Warning | >25% to 35% | Complete current profiling task, then trigger ds-handoff |
| Critical | ≤25% | Immediately trigger ds-handoff — do not start new profiling |
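The thresholds above can be read as a simple lookup. The sketch below is illustrative only: `context_level` is a hypothetical helper, not part of the skill, and it assumes the stricter reading of the boundaries (exactly 35% is Warning, exactly 25% is Critical).

```python
def context_level(remaining_pct: float) -> str:
    """Map remaining context (in percent) to a monitoring level.

    Hypothetical helper. Boundary assumption: 35 -> warning, 25 -> critical.
    """
    if remaining_pct > 35:
        return "normal"
    if remaining_pct > 25:
        return "warning"
    return "critical"


for pct in (60, 35, 30, 25, 10):
    print(pct, context_level(pct))
```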
# Planning (Data Profiling + Task Breakdown)
Profile the data and create an analysis plan based on the spec.
**Requires `.planning/SPEC.md` from /ds first.**
**Load shared enforcement first.**
Auto-load all constraints matching `applies-to: ds-plan`:
!`uv run python3 ${CLAUDE_SKILL_DIR}/../../scripts/load-constraints.py ds-plan`
**You MUST have these constraints loaded before proceeding. No claiming you "remember" them.** The `ds-external-skill-discovery` constraint governs Step 5b (External Skill Discovery Gate); `ds-data-pull-profile` governs Step 5c (Data Pull Profiling Gate).
<EXTREMELY-IMPORTANT>
## The Iron Law of DS Planning
**SPEC MUST EXIST BEFORE PLANNING. This is not negotiable.**
Before exploring data or creating tasks, you MUST have:
1. `.planning/SPEC.md` with objectives and constraints
2. Clear success criteria
3. User-approved spec
**If `.planning/SPEC.md` doesn't exist, run /ds first.**
</EXTREMELY-IMPORTANT>
### Profiling Facts
- Real-world data is never clean on arrival, and `.head()` samples the clean front of the file — nulls, type drift, and grain problems live in the tail and in rare groups. A plan built from a head-sample is built on assumptions; it crashes 3 tasks into implementation and the user redoes hours of work. Delivering that plan fast is not helpful — it is counterproductive, and it reads as incompetent.
- Data-quality checking is your job whether or not the user mentions it. A plan that silently assumes clean data asserts a verification you never performed — an unverified claim presented as fact is a form of dishonesty.
- Profiling costs minutes; a wrong plan costs hours. Skipping it to save time triples the work — the shortcut is counterproductive on its own terms.
- Thin, vague tasks push the guessing onto the implementer, who executes the plan literally and guesses wrong. Speed achieved by under-specifying is not efficiency; it is deferred confusion delivered to someone else.
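A small synthetic illustration of the first point, assuming nothing about the real data: the frame below is invented for the demonstration, with nulls placed only in the last 50 rows, so a head-sample sees none of them.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): 1,000 rows,
# nulls only in the final 50 rows of "amount".
n = 1_000
amount = np.arange(n, dtype="float64")
amount[-50:] = np.nan
df = pd.DataFrame({"id": range(n), "amount": amount})

# What a head-sample reports versus what a full-frame check reports.
head_nulls = int(df.head(10).isnull().sum().sum())
full_nulls = int(df.isnull().sum().sum())

print("nulls seen by head(10):", head_nulls)
print("nulls in the full frame:", full_nulls)
```

The head-sample reports a clean column while the full-frame check does not, which is why the profiling steps call for `df.isnull().sum()` on the whole frame rather than on a sample.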
### No Pause After Completion
After writing `.planning/PLAN.md`, initializing `.planning/LEARNINGS.md`, and receiving APPROVED from the exit-gate `ds-plan-reviewer` (see the Process flowchart), IMMEDIATELY discover and load ds-implement:
Read `${CLAUDE_SKILL_DIR}/../../skills/ds-implement/SKILL.md` and follow its instructions.
DO NOT:
- Ask "should I proceed with implementation?"
- Summarize the plan
- Wait for user confirmation (they approved SPEC already)
- Write status updates
The workflow phases are SEQUENTIAL. Complete plan → immediately start implement.
## What Plan Does
| DO | DON'T |
|-------|----------|
| Read .planning/SPEC.md | Skip brainstorm phase |
| Profile data (shape, types, stats) | Skip to analysis |
| Identify data quality issues | Ignore missing/duplicate data |
| Create ordered task list | Write final analysis code |
| Write .planning/PLAN.md | Make completion claims |
**Brainstorm answers: WHAT and WHY**
**Plan answers: HOW and DATA QUALITY**
## Process
**This flowchart IS the specification. If prose elsewhere and this diagram disagree, the diagram wins.** The two sub-gates (5b External Skill Discovery, 5c Data Pull Profiling) and the exit Plan Review are mandatory when their triggers fire — they are not optional steps a fast path can skip.
```
1. Verify SPEC.md exists ──(missing)──▶ STOP, run /ds first
│
▼
2. Profile data ──(2+ sources)──▶ parallel read-only profiler per source
│
▼
3. Identify DQ issues (nulls, dups, row counts)
│
▼
4. ETL strategy ──(heavy ETL trigger)──▶ server-side / chunked plan
│
▼
5b. External Skill Discovery ──(SPEC names wrds/gemini-batch/etc.)──▶ Glob refs/examples, ADOPT/PATCH
│
▼
5c. Data Pull Profiling gate ──(source ≥50M rows / ≥500MB / "large")──▶ read-only size profile → decision table
│
▼
6. Task breakdown (each task carries implements: [REQ-ID])
│
▼
7. Write .planning/PLAN.md
│
▼
Exit gate ──▶ dispatch ds-plan-reviewer ──(ISSUES)──▶ fix PLAN.md, re-dispatch (max 5)
└──(APPROVED)──▶ ds-implement
```
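The 5c trigger and the exit-gate loop in the flowchart can be sketched as plain functions. Everything named here is hypothetical (`needs_pull_profiling`, `run_exit_gate`, the `review` and `fix` callables), and two readings are assumptions: "500MB" is taken as decimal megabytes, and "max 5" is taken as at most five reviewer dispatches in total. What happens after the fifth rejection is not specified in the flowchart, so the sketch only reports failure.

```python
from typing import Callable, Optional

ROW_THRESHOLD = 50_000_000       # ">=50M rows" from the flowchart
BYTE_THRESHOLD = 500 * 1000**2   # ">=500MB"; decimal MB is an assumption
MAX_REVIEW_ROUNDS = 5            # "max 5"; read here as total dispatches


def needs_pull_profiling(rows: Optional[int],
                         size_bytes: Optional[int],
                         described_as_large: bool = False) -> bool:
    """Return True if any 5c trigger fires (row count, size, or 'large')."""
    if described_as_large:
        return True
    if rows is not None and rows >= ROW_THRESHOLD:
        return True
    return size_bytes is not None and size_bytes >= BYTE_THRESHOLD


def run_exit_gate(review: Callable[[], str], fix: Callable[[], None]) -> bool:
    """Dispatch the reviewer until APPROVED or the round limit is reached.

    `review` returns "APPROVED" or "ISSUES"; `fix` revises PLAN.md.
    """
    for attempt in range(1, MAX_REVIEW_ROUNDS + 1):
        if review() == "APPROVED":
            return True
        if attempt < MAX_REVIEW_ROUNDS:
            fix()  # only fix when another review round will follow
    return False


# Usage with stand-in callables: two rejections, then approval.
verdicts = iter(["ISSUES", "ISSUES", "APPROVED"])
fixes = []
approved = run_exit_gate(lambda: next(verdicts), lambda: fixes.append("fix"))
print(approved, len(fixes))
```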
### 1. Verify Spec Exists
```bash
cat .planning/SPEC.md # verify-spec: read SPEC file to confirm it exists
```
If missing, stop and run `/ds` first.
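If the check needs to run from Python rather than the shell, a minimal sketch might look like the following. `spec_ready` is a hypothetical helper, and it only confirms that the file exists and is non-empty; it does not check for objectives, success criteria, or user approval, which still need a human-readable review of the SPEC. The demonstration runs against a temporary directory, not a real project.

```python
import tempfile
from pathlib import Path


def spec_ready(spec_path) -> bool:
    """Return True if the SPEC file exists and has non-whitespace content."""
    path = Path(spec_path)
    return path.is_file() and path.read_text(encoding="utf-8").strip() != ""


# Demonstration in a throwaway directory (not the real .planning/ folder).
with tempfile.TemporaryDirectory() as tmp:
    spec = Path(tmp) / ".planning" / "SPEC.md"
    missing = spec_ready(spec)            # file not created yet
    spec.parent.mkdir(parents=True)
    spec.write_text("# Spec\n\nPlaceholder content.\n", encoding="utf-8")
    present = spec_ready(spec)            # file now exists with content

print("before:", missing, "after:", present)
```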
### 2. Data Profiling
**For multiple data sources:** Profile in parallel using background Task agents.
#### Single Data Source (Direct Profiling)
**MANDATORY profiling steps:**
```python
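import pandas as pd
# Placeholders: df is the already-loaded DataFrame under profiling; col,
# date_col, DECLARED_PK, and BUSINESS_KEY stand for column names you supply.
# Substitute real values before running any line below.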
# Basic structure
df.shape # (rows, columns)
df.dtypes # Column types
df.head(10) # Sample data
df.tail(5) # End of data
# Summary statistics
df.describe() # Numeric summaries
df.describe(include='object') # Categorical summaries
df.info() # Memory, non-null counts
# Data quality checks
df.isnull().sum() # Missing values per column
df.duplicated().sum() # Exact-duplicate rows (byte-identical)
df[col].value_counts() # Distribution of categories
# Grain / candidate-key identification (REQUIRED — do not skip)
# Profiling MUST output the row grain, not just a dup count. An all-columns
# df.duplicated() is unreliable in BOTH directions: it misses near-duplicates
# (amended/restated records that changed one field), AND it reports zero dupes
# after a join fan-out — fanned rows differ in the joined columns, so only a
# KEYED check (subset=grain) reveals them. Reporting "no duplicates" from the
# all-columns check is a false clean signal, not a verification.
# Identify the key empirically AND check it against the declared grain.
from itertools import combinations
cand = [c for c in df.columns if df[c].notna().any()]
for k in (1, 2, 3): # smallest unique column-set = de-facto PK
hit = next((c for c in combinations(cand, k)
if not df.duplicated(subset=list(c)).any()), None)
if hit:
print("candidate key:", hit); break
# Declared grain: look it up in the dataset's reference skill (e.g. wrds
# insider-form4.md → row PK (dcn, seqnum); event key (personid, trandate, ...)).
# Record BOTH the row PK and the coarser business/event key in PLAN.md.
df.duplicated(subset=DECLARED_PK).sum() # MUST be 0, else extraction fanned out
df.groupby(BUSINESS_KEY, dropna=False).size().gt(1).sum() # business-key collisions = restatement/amendment signal; dropna=False keeps null-keyed rows in the count
# For time series
df[date_col].min(), df[date_col].max() # Date range
df.groupby(date_col).size() # Records per period
```
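The fan-out case described in the grain comments can be reproduced on a toy example. The tables and column names below (`orders`, `shipments`, `order_id`, `carrier`) are invented for illustration; the point is only that the all-columns check reports zero duplicates while the keyed check does not.

```python
import pandas as pd

# Toy tables (assumed for illustration). The declared grain of `orders`
# is one row per order_id.
orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
# Order 2 has two shipment rows, so the join below is one-to-many.
shipments = pd.DataFrame({"order_id": [1, 2, 2, 3],
                          "carrier": ["A", "B", "C", "A"]})

joined = orders.merge(shipments, on="order_id", how="left")

# All-columns check: fanned rows differ in `carrier`, so nothing is flagged.
all_cols_dupes = int(joined.duplicated().sum())
# Keyed check against the declared grain: the fan-out shows up.
keyed_dupes = int(joined.duplicated(subset=["order_id"]).sum())

print("all-columns duplicates:", all_cols_dupes)
print("keyed duplicates on order_id:", keyed_dupes)
print("amount total before/after join:",
      orders["amount"].sum(), joined["amount"].sum())
```

The inflated `amount` total after the join is the practical cost of the false clean signal. One way to catch this at join time, if it suits the pipeline, is pandas' `validate` argument to `merge` (for example `validate="one_to_one"`), which raises when the key relationship is not what was declared.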
#### Multiple Data Sources (Parallel Profiling)
<EXTREMELY-IMPORTANT>
**Pattern from oh-my-opencode: Launch ALL profiling agents in a SINGLE message.**
**Use `run_in_background: true` for parallel execution.**
When profiling 2+ data sources, launch agents in parallel:
</EXTREMELY-IMPORTANT>
```
# PARALLEL + BACKGROUND: All Task calls in ONE message
Task(
subagent_type="general-purpose",
description="Profile dataset 1",
run_in_background=true,
# STRUCTURAL read-only enforcement — not advisory prose. Profiling is a
# read-only verification step; Write/Edit/NotebookEdit are withheld at the
# tool layer so a profiler CANNOT mutate pipeline files even if the prompt
# is ignored (P17 — agent tool restrictions are structural, never prose).
allowed_tools=["Read", "Glob", "Grep", "Bash"],