
ds-plan

Phase 2 of the /ds workflow — data profiling and task breakdown. Invoked by the ds-brainstorm chain; not user-invocable.

What this skill does


Announce: "Using ds-plan (Phase 2) to profile data and create task breakdown."

## Contents

- [The Iron Law of DS Planning](#the-iron-law-of-ds-planning)
- [What Plan Does](#what-plan-does)
- [Process](#process)
- [Red Flags - STOP If You're About To](#red-flags---stop-if-youre-about-to)
- [Output](#output)

## Context Monitoring

| Level | Remaining Context | Action |
|-------|------------------|--------|
| Normal | >35% | Proceed normally |
| Warning | 25-35% | Complete current profiling task, then trigger ds-handoff |
| Critical | ≤25% | Immediately trigger ds-handoff — do not start new profiling |
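The thresholds in the table above can be read as a small decision function. This is a minimal sketch, not part of the skill itself: the function name `context_action` and the returned strings are hypothetical, and it assumes remaining context is available as a percentage.

```python
def context_action(remaining_pct: float) -> str:
    """Map remaining context (percent) to the action from the table.

    Boundaries follow the table: >35 is Normal, 25-35 is Warning,
    and <=25 is Critical. The return strings are illustrative labels.
    """
    if remaining_pct > 35:
        return "proceed"
    if remaining_pct > 25:
        return "finish current profiling task, then ds-handoff"
    return "ds-handoff now"


for pct in (60, 35, 30, 25, 10):
    print(pct, "->", context_action(pct))
```

Note that exactly 35% falls in the Warning band and exactly 25% in the Critical band, matching the `>35%` and `≤25%` bounds in the table.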

# Planning (Data Profiling + Task Breakdown)

Profile the data and create an analysis plan based on the spec.
**Requires `.planning/SPEC.md` from /ds first.**

**Load shared enforcement first.**

Auto-load all constraints matching `applies-to: ds-plan`:

!`uv run python3 ${CLAUDE_SKILL_DIR}/../../scripts/load-constraints.py ds-plan`

**You MUST have these constraints loaded before proceeding. No claiming you "remember" them.** The `ds-external-skill-discovery` constraint governs Step 5b (External Skill Discovery Gate); `ds-data-pull-profile` governs Step 5c (Data Pull Profiling Gate).

<EXTREMELY-IMPORTANT>
## The Iron Law of DS Planning

**SPEC MUST EXIST BEFORE PLANNING. This is not negotiable.**

Before exploring data or creating tasks, you MUST have:
1. `.planning/SPEC.md` with objectives and constraints
2. Clear success criteria
3. User-approved spec

**If `.planning/SPEC.md` doesn't exist, run /ds first.**
</EXTREMELY-IMPORTANT>

### Profiling Facts

- Real-world data is never clean on arrival, and `.head()` samples the clean front of the file — nulls, type drift, and grain problems live in the tail and in rare groups. A plan built from a head-sample is built on assumptions; it crashes 3 tasks into implementation and the user redoes hours of work. Delivering that plan fast is not helpful — it is counterproductive, and it reads as incompetent.
- Data-quality checking is your job whether or not the user mentions it. A plan that silently assumes clean data asserts a verification you never performed — an unverified claim presented as fact is a form of dishonesty.
- Profiling costs minutes; a wrong plan costs hours. Skipping profiling to save time multiplies the work, so the shortcut is counterproductive on its own terms.
- Thin, vague tasks push the guessing onto the implementer, who executes the plan literally and guesses wrong. Speed achieved by under-specifying is not efficiency; it is deferred confusion delivered to someone else.
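The first point above is easy to reproduce. The following sketch uses synthetic data (an assumption made purely for illustration) in which every null sits in the tail of the file, so a head-sample looks clean while a full scan does not.

```python
import numpy as np
import pandas as pd

# Synthetic data (illustrative assumption): clean front, dirty tail.
n = 1_000
df = pd.DataFrame({"id": range(n), "amount": np.arange(n, dtype="float64")})
df.loc[n - 50:, "amount"] = np.nan   # nulls only in the last 50 rows

head_nulls = int(df.head(10)["amount"].isna().sum())   # what a head-sample sees
full_nulls = int(df["amount"].isna().sum())            # what a full scan sees

print(f"nulls in head(10): {head_nulls}")
print(f"nulls in full data: {full_nulls}")
```

A plan written from `df.head(10)` alone would treat `amount` as complete, which is exactly the unverified assumption the list above warns against.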

### No Pause After Completion

After writing `.planning/PLAN.md` and initializing `.planning/LEARNINGS.md`, IMMEDIATELY discover and load ds-implement:
Read `${CLAUDE_SKILL_DIR}/../../skills/ds-implement/SKILL.md` and follow its instructions.

DO NOT:
- Ask "should I proceed with implementation?"
- Summarize the plan
- Wait for user confirmation (they approved SPEC already)
- Write status updates

The workflow phases are SEQUENTIAL. Complete plan → immediately start implement.

## What Plan Does

| DO | DON'T |
|-------|----------|
| Read .planning/SPEC.md | Skip brainstorm phase |
| Profile data (shape, types, stats) | Skip to analysis |
| Identify data quality issues | Ignore missing/duplicate data |
| Create ordered task list | Write final analysis code |
| Write .planning/PLAN.md | Make completion claims |

**Brainstorm answers: WHAT and WHY**
**Plan answers: HOW and DATA QUALITY**

## Process

**This flowchart IS the specification. If prose elsewhere and this diagram disagree, the diagram wins.** The two sub-gates (5b External Skill Discovery, 5c Data Pull Profiling) and the exit Plan Review are mandatory when their triggers fire — they are not optional steps a fast path can skip.

```
 1. Verify SPEC.md exists ──(missing)──▶ STOP, run /ds first
            │
            ▼
 2. Profile data ──(2+ sources)──▶ parallel read-only profiler per source
            │
            ▼
 3. Identify DQ issues (nulls, dups, row counts)
            │
            ▼
 4. ETL strategy ──(heavy ETL trigger)──▶ server-side / chunked plan
            │
            ▼
 5b. External Skill Discovery ──(SPEC names wrds/gemini-batch/etc.)──▶ Glob refs/examples, ADOPT/PATCH
            │
            ▼
 5c. Data Pull Profiling gate ──(source ≥50M rows / ≥500MB / "large")──▶ read-only size profile → decision table
            │
            ▼
 6. Task breakdown (each task carries implements: [REQ-ID])
            │
            ▼
 7. Write .planning/PLAN.md
            │
            ▼
 Exit gate ──▶ dispatch ds-plan-reviewer ──(ISSUES)──▶ fix PLAN.md, re-dispatch (max 5)
                                          └──(APPROVED)──▶ ds-implement
```

### 1. Verify Spec Exists

```bash
cat .planning/SPEC.md  # verify-spec: read SPEC file to confirm it exists
```

If missing, stop and run `/ds` first.
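When the check runs inside a Python profiling script rather than a shell, the same guard might look like the sketch below. The helper name `require_spec` is hypothetical, and it verifies only that the file exists; whether the spec has clear success criteria and user approval still has to be confirmed by reading it.

```python
from pathlib import Path


def require_spec(path: str = ".planning/SPEC.md") -> str:
    """Return the spec text, or raise if the spec file does not exist."""
    spec = Path(path)
    if not spec.is_file():
        # Mirrors the rule above: no spec means planning must not start.
        raise FileNotFoundError(f"{path} not found; run /ds first")
    return spec.read_text(encoding="utf-8")


try:
    text = require_spec()
    print(f"SPEC loaded ({len(text)} characters)")
except FileNotFoundError as err:
    print(f"STOP: {err}")
```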

### 2. Data Profiling

**For multiple data sources:** Profile in parallel using background Task agents.

#### Single Data Source (Direct Profiling)

**MANDATORY profiling steps:**

```python
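import pandas as pd

# Placeholders: `df` is the source loaded into a DataFrame; `col`, `date_col`,
# DECLARED_PK, and BUSINESS_KEY stand for names taken from the data and SPEC.md.

# Basic structure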
df.shape                    # (rows, columns)
df.dtypes                   # Column types
df.head(10)                 # Sample data
df.tail(5)                  # End of data

# Summary statistics
df.describe()               # Numeric summaries
df.describe(include='object')  # Categorical summaries
df.info()                   # Memory, non-null counts

# Data quality checks
df.isnull().sum()           # Missing values per column
df.duplicated().sum()       # Exact-duplicate rows (identical in every column)
df[col].value_counts()      # Distribution of categories

# Grain / candidate-key identification (REQUIRED — do not skip)
# Profiling MUST output the row grain, not just a dup count. An all-columns
# df.duplicated() is unreliable in BOTH directions: it misses near-duplicates
# (amended/restated records that changed one field), AND it reports zero dupes
# after a join fan-out — fanned rows differ in the joined columns, so only a
# KEYED check (subset=grain) reveals them. Reporting "no duplicates" from the
# all-columns check is a false clean signal, not a verification.
# Identify the key empirically AND check it against the declared grain.
from itertools import combinations
cand = [c for c in df.columns if df[c].notna().any()]
for k in (1, 2, 3):                              # smallest unique column-set = de-facto PK
    hit = next((c for c in combinations(cand, k)
                if not df.duplicated(subset=list(c)).any()), None)
    if hit:
        print("candidate key:", hit)
        break
else:                                            # no unique set of <=3 columns found
    print("no candidate key found; grain is unresolved, record this in PLAN.md")
# Declared grain: look it up in the dataset's reference skill (e.g. wrds
# insider-form4.md → row PK (dcn, seqnum); event key (personid, trandate, ...)).
# Record BOTH the row PK and the coarser business/event key in PLAN.md.
df.duplicated(subset=DECLARED_PK).sum()          # MUST be 0, else extraction fanned out
df.groupby(BUSINESS_KEY).size().gt(1).sum()      # business-key collisions = restatement/amendment signal

# For time series
df[date_col].min(), df[date_col].max()  # Date range
df.groupby(date_col).size()              # Records per period
```
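The false clean signal described in the grain comments above can be demonstrated on a toy join. The tables and column names below are hypothetical and exist only to illustrate the effect: a one-to-many merge fans out the order rows, the all-columns duplicate check reports zero, and only the keyed check on the declared grain exposes the problem.

```python
import pandas as pd

# Hypothetical tables: one row per order, several shipments per order.
orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
shipments = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3, 3],
    "shipment_id": [101, 102, 103, 104, 105, 106],
})

# One-to-many join: each order row is repeated once per shipment.
joined = orders.merge(shipments, on="order_id", how="left")

all_cols_dupes = int(joined.duplicated().sum())                  # all-columns check
keyed_dupes = int(joined.duplicated(subset=["order_id"]).sum())  # check on declared grain

print("rows before join:", len(orders), "rows after join:", len(joined))
print("all-columns duplicates:", all_cols_dupes)
print("keyed duplicates on order_id:", keyed_dupes)
print("amount total before:", orders["amount"].sum(),
      "after:", joined["amount"].sum())
```

The fanned rows differ in `shipment_id`, so the all-columns check sees them as distinct, yet any sum over `amount` is now inflated. Passing `validate="one_to_one"` to `merge` is one way to make pandas raise on this kind of fan-out at join time.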

#### Multiple Data Sources (Parallel Profiling)

<EXTREMELY-IMPORTANT>
**Pattern from oh-my-opencode: Launch ALL profiling agents in a SINGLE message.**

**Use `run_in_background: true` for parallel execution.**

When profiling 2+ data sources, launch agents in parallel:
</EXTREMELY-IMPORTANT>

```
# PARALLEL + BACKGROUND: All Task calls in ONE message

Task(
    subagent_type="general-purpose",
    description="Profile dataset 1",
    run_in_background=true,
    # STRUCTURAL read-only enforcement — not advisory prose. Profiling is a
    # read-only verification step; Write/Edit/NotebookEdit are withheld at the
    # tool layer so a profiler CANNOT mutate pipeline files even if the prompt
    # is ignored (P17 — agent tool restrictions are structural, never prose).
    allowed_tools=["Read", "Glob", "Grep", "Bash"],
   