skill-system-eda
Exploratory Data Analysis skill for CSV and parquet datasets with deterministic profiling, drift/anomaly scans, contract generation and validation, and optional memory writeback into skill-system-memory. The implementation is Polars-first (lazy scan for large files and early `--sample` head), includes high-cardinality guards for profile/importance/contract flows, and supports categorical correlation with Cramer's V. Use when building or reviewing tabular fraud/risk/data-quality workflows, profiling new datasets, checking leakage or drift, or saving/validating data contracts.
# Skill System EDA
Use `scripts/eda.py` for deterministic EDA artifacts. The current stable backend is tabular EDA, and the multimodal entrypoint is `explore`.
## Core Commands
```bash
python3 scripts/eda.py detect-modality --input data_root
python3 scripts/eda.py explore --input data_or_folder
python3 scripts/eda.py explore --input data_or_folder --modality graph
python3 scripts/eda.py graph-viz --input data.csv --features amount,score --id-column account_id --label Class --edge-mode knn --topk 50 --normalize l2 --similarity cosine --output /tmp/eda_graph
python3 scripts/eda.py profile-dataset --input data.csv --target Class --output /tmp/eda
python3 scripts/eda.py distribution-report --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py correlation-matrix --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py anomaly-profiling --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py feature-importance-scan --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py leakage-detector --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py save-contract --profile /tmp/eda/profile.yaml --output /tmp/eda/contract.yaml
python3 scripts/eda.py validate-contract --input new_data.csv --contract /tmp/eda/contract.yaml
```
## Output Model
- `profile-dataset` creates `profile.yaml` and `report.md`
- `explore` reports detected modalities and selected backend; tabular mode may immediately route into existing tabular profiling
- `graph-viz` emits `graph_viz/index.html`, `graph_viz/graph.json`, `graph_viz/sliders.json`, and appends graph-viz references into `profile.yaml` and `report.md`
- `graph-viz` records `renderer_hint`, `preview_applied`, and browser-edge limits so large-graph fallback is explicit rather than silent
- later commands update `profile.yaml` and append sections to `report.md`
- `save-contract` emits `contract.yaml`
- `validate-contract` prints JSON `PASS` / `FAIL` with a violation list
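
The artifact flow above implies an ordering: `profile-dataset` has to run first, the follow-up scans read and update the same `profile.yaml`, and `save-contract` runs last. A minimal sketch of that ordering, using only the flags listed under Core Commands. The helper name `build_pipeline` is hypothetical, and the function only builds argument lists rather than executing anything:

```python
from pathlib import PurePosixPath

# Follow-up subcommands that take --profile, in the order shown under Core Commands.
FOLLOW_UP_COMMANDS = [
    "distribution-report",
    "correlation-matrix",
    "anomaly-profiling",
    "feature-importance-scan",
    "leakage-detector",
]


def build_pipeline(input_path: str, target: str, output_dir: str) -> list[list[str]]:
    """Return argv lists for one full tabular pass; nothing is executed here."""
    base = ["python3", "scripts/eda.py"]
    profile = str(PurePosixPath(output_dir) / "profile.yaml")
    contract = str(PurePosixPath(output_dir) / "contract.yaml")

    # Step 1: profile-dataset creates profile.yaml and report.md.
    steps = [base + ["profile-dataset", "--input", input_path,
                     "--target", target, "--output", output_dir]]
    # Step 2: each later command reads and updates the same profile.yaml.
    for name in FOLLOW_UP_COMMANDS:
        steps.append(base + [name, "--input", input_path,
                             "--target", target, "--profile", profile])
    # Step 3: the contract is derived from the fully updated profile.
    steps.append(base + ["save-contract", "--profile", profile, "--output", contract])
    return steps


if __name__ == "__main__":
    for argv in build_pipeline("data.csv", "Class", "/tmp/eda"):
        print(" ".join(argv))
```

Each returned list can be handed to `subprocess.run(argv, check=True)` so that a failing step stops the pass before the contract is written.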
## Analysis Rules
- Use Polars (not pandas) for data IO/aggregation/profiling flows.
- Keep sampling deterministic with lazy `.head(N)` when `--sample` is used.
- Treat `profile.yaml` as the machine-readable source of truth; `report.md` is the human-readable companion.
- Graph visualization artifacts must stay reusable: no Esun-specific paths, feature names, or binary fraud-only assumptions in the skill contract.
- Large graph behavior is first-class: when full edge count exceeds browser-safe thresholds, the viewer loads preview edges by default and reports the full edge count separately.
- Use Polars + numpy + scipy for profiling, shifts, correlations, KS tests, and Cramer's V.
- Use sklearn feature ranking only when available; otherwise keep tree-based importance explicitly skipped.
- Use lazy scan strategy for large CSV/parquet inputs (`scan_csv`/`scan_parquet`), with materialization delayed until needed.
- Apply high-cardinality guards: feature importance skips one-hot encoding for columns with more than 50 unique values, and profiling truncates categorical columns (more than 100 unique values, or unique values exceeding 50% of rows) to their top-20 values.
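
The high-cardinality thresholds in the last rule can be illustrated with a small standard-library sketch. This is not the code in `scripts/eda.py` (which is Polars-based). The function names and the returned dictionary keys are hypothetical, and only the thresholds come from the rule above:

```python
from collections import Counter

ONE_HOT_MAX_UNIQUE = 50    # above this, feature importance skips one-hot encoding
TRUNCATE_MAX_UNIQUE = 100  # above this, the profile keeps only the top values
TRUNCATE_MAX_RATIO = 0.5   # ...or when unique values exceed this share of rows
TOP_K = 20                 # number of values kept for a truncated column


def should_one_hot(values) -> bool:
    """True when the column is small enough to one-hot encode."""
    return len(set(values)) <= ONE_HOT_MAX_UNIQUE


def profile_categorical(values) -> dict:
    """Summarise a categorical column, truncating high-cardinality ones."""
    counts = Counter(values)
    n_unique = len(counts)
    n_rows = len(values)
    truncated = n_unique > TRUNCATE_MAX_UNIQUE or (
        n_rows > 0 and n_unique / n_rows > TRUNCATE_MAX_RATIO
    )
    # most_common(None) returns every value; most_common(20) keeps the top 20.
    top_values = counts.most_common(TOP_K if truncated else None)
    return {"n_unique": n_unique, "truncated": truncated, "top_values": top_values}
```

An ID-like column (one distinct value per row) trips both guards, while a two-value flag column passes through untouched.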
## Memory Integration
- By default, commands write a summary memory plus one memory per warning/critical finding.
- Prefer `skill-system-memory/scripts/mem.py store` when available.
- If memory writes fail or `EDA_DISABLE_MEM_PY=1` is set, write fallback payloads under `.memory/pending/`.
- Use `--no-memory` for deterministic tests or when no writeback is desired.
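
A rough sketch of the fallback behaviour described above, under stated assumptions: how the payload reaches `mem.py store` (JSON on stdin here) and how pending files are named are both guesses for illustration, not the documented interface. `store_memory` is a hypothetical helper:

```python
import json
import os
import subprocess
import time
import uuid
from pathlib import Path


def store_memory(payload: dict,
                 mem_py: str = "skill-system-memory/scripts/mem.py",
                 pending_dir: str = ".memory/pending") -> str:
    """Try mem.py first; otherwise write a pending JSON payload.

    Returns "stored" on success, or the path of the fallback file.
    """
    disabled = os.environ.get("EDA_DISABLE_MEM_PY") == "1"
    if not disabled and Path(mem_py).is_file():
        try:
            # ASSUMPTION: the payload is passed as JSON on stdin; the real CLI may differ.
            subprocess.run(["python3", mem_py, "store"], input=json.dumps(payload),
                           text=True, check=True, capture_output=True, timeout=30)
            return "stored"
        except (subprocess.SubprocessError, OSError):
            pass  # any failure falls through to the pending-file fallback

    out_dir = Path(pending_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: timestamp plus a random suffix to avoid collisions.
    path = out_dir / f"{int(time.time())}-{uuid.uuid4().hex[:8]}.json"
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return str(path)
```

Writing the fallback file rather than raising keeps the EDA command itself from failing when the memory backend is unavailable.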
## Contract Lifecycle
- `save-contract` derives column requirements from `profile.yaml`.
- Numeric ranges use observed bounds for tiny datasets and profile-derived percentile bounds for larger datasets.
- Truncated categorical columns produce `cardinality_range` rules instead of `allowed_values`.
- `validate-contract` fails closed and returns machine-readable violations.
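
To make the categorical part of the lifecycle concrete, here is a minimal sketch of rule derivation and fail-closed validation. Only the rule names `allowed_values` and `cardinality_range` and the `PASS` / `FAIL` statuses come from this document. The profile keys, the violation fields, and the tolerance band around the observed cardinality are hypothetical, and numeric range rules are left out:

```python
def derive_rule(column_profile: dict) -> dict:
    """Turn one categorical column profile into a contract rule."""
    if column_profile.get("truncated"):
        n = column_profile["n_unique"]
        # Hypothetical tolerance band: half to double the observed cardinality.
        return {"cardinality_range": [max(1, n // 2), n * 2]}
    return {"allowed_values": sorted(column_profile["values"])}


def validate_column(name: str, values: list, rule: dict) -> list[dict]:
    """Return the violations for one column (empty list means it passed)."""
    observed = set(values)
    if "allowed_values" in rule:
        unexpected = sorted(observed - set(rule["allowed_values"]))
        if unexpected:
            return [{"column": name, "rule": "allowed_values", "detail": unexpected}]
        return []
    if "cardinality_range" in rule:
        low, high = rule["cardinality_range"]
        if not low <= len(observed) <= high:
            return [{"column": name, "rule": "cardinality_range", "detail": len(observed)}]
        return []
    # Fail closed: an unrecognised rule is a violation, not a silent pass.
    return [{"column": name, "rule": "unknown", "detail": sorted(rule)}]


def validate(dataset: dict, contract: dict) -> dict:
    """dataset maps column name -> list of values; contract maps name -> rule."""
    violations = []
    for name, rule in contract.items():
        if name not in dataset:
            violations.append({"column": name, "rule": "required", "detail": "column missing"})
            continue
        violations.extend(validate_column(name, dataset[name], rule))
    return {"status": "FAIL" if violations else "PASS", "violations": violations}
```

The point of the final branch in `validate_column` is that anything the validator does not understand counts against the dataset, which is what failing closed means in practice.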
## Graph Viz Notes
- `graph-viz` is a tabular-to-graph visualization flow, not a replacement for graph-native modality EDA.
- `graph.json` is the viewer payload authority; `sliders.json` is the UI-control authority.
- `renderer_hint=canvas` means the full interactive force layout is expected to be browser-safe.
- `renderer_hint=webgl` means dataset scale or edge volume exceeded the canvas-friendly threshold; the shipped viewer still loads, but preview edges are preferred by default.
- `--max-browser-edges` controls when preview fallback is applied. Raising it may crash Chromium on very large graphs.
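
The preview fallback in these notes can be sketched as follows. Two simplifying assumptions are made for illustration: a single threshold drives both `renderer_hint` and `preview_applied`, and the preview keeps the highest-weight edges. The key `full_edge_count` and the function name are hypothetical:

```python
def plan_edges(edges: list[tuple[str, str, float]], max_browser_edges: int) -> dict:
    """Decide what the viewer should load for a list of (source, target, weight) edges."""
    full_count = len(edges)
    if full_count <= max_browser_edges:
        # Small enough for the full interactive force layout.
        return {
            "renderer_hint": "canvas",
            "preview_applied": False,
            "full_edge_count": full_count,
            "edges": list(edges),
        }
    # Too many edges: keep only the strongest ones, but still report the full count
    # so the fallback is explicit rather than silent.
    preview = sorted(edges, key=lambda e: e[2], reverse=True)[:max_browser_edges]
    return {
        "renderer_hint": "webgl",
        "preview_applied": True,
        "full_edge_count": full_count,
        "edges": preview,
    }
```

Reporting `full_edge_count` alongside the preview lets a reader of the payload see how much of the graph was withheld.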
Example: an Esun-style feature-bank dataset, run through the generic EDA input/output conventions:
```bash
python3 scripts/eda.py graph-viz \
--input Work/Study/GNN/FraudDetect/esun_data/combined_features.csv \
--features senior28_01,senior28_02,senior28_03,senior28_04 \
--id-column account_id \
--label is_fraud \
--edge-mode knn \
--topk 50 \
--normalize l2 \
--similarity cosine \
--max-browser-edges 60000 \
--output /tmp/esun_graph_viz
```
Example: generic customer risk dataset with a multi-class label column:
```bash
python3 scripts/eda.py graph-viz \
--input data/customer_risk.parquet \
--features amount,velocity_score,merchant_entropy,geo_distance \
--id-column customer_id \
--label segment \
--edge-mode knn \
--topk 25 \
--normalize l2 \
--similarity cosine \
--output /tmp/customer_graph_viz
```
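
For readers unfamiliar with the flag combination used in both examples (`--normalize l2`, `--similarity cosine`, `--edge-mode knn`, `--topk`), the sketch below shows one plausible reading of it: L2-normalize each feature row, score pairs by cosine similarity, and keep each node's `topk` most similar neighbours as edges. This is a brute-force illustration in plain Python, not the implementation in `scripts/eda.py`, and the function names are hypothetical:

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def knn_edges(ids: list[str], rows: list[list[float]], topk: int) -> list[tuple[str, str, float]]:
    """Return directed (source, target, cosine_similarity) edges, topk per node."""
    unit = [l2_normalize(r) for r in rows]
    edges = []
    for i, u in enumerate(unit):
        # On unit vectors, the dot product equals cosine similarity.
        sims = [
            (sum(a * b for a, b in zip(u, v)), j)
            for j, v in enumerate(unit)
            if j != i
        ]
        sims.sort(reverse=True)
        for sim, j in sims[:topk]:
            edges.append((ids[i], ids[j], sim))
    return edges


if __name__ == "__main__":
    ids = ["a", "b", "c"]
    rows = [[1.0, 0.0], [2.0, 0.1], [0.0, 3.0]]
    for source, target, weight in knn_edges(ids, rows, topk=1):
        print(source, target, round(weight, 3))
```

The pairwise loop is quadratic in the number of rows, so it only suits small inputs. The edges are directed, so a mutual pair appears twice.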
```skill-manifest
{
"schema_version": "2.0",
"id": "skill-system-eda",
"version": "1.1.0",
"capabilities": [
"eda-detect",
"eda-graph-viz",
"eda-profile",
"eda-distribution",
"eda-correlation",
"eda-anomaly",
"eda-feature-importance",
"eda-leakage",
"eda-contract-save",
"eda-contract-validate"
],
"effects": ["fs.read", "fs.write", "proc.exec"],
"operations": {
"profile-dataset": {
"description": "Profile a CSV/parquet dataset and generate profile.yaml plus report.md.",
"input": {
"input": { "type": "string", "required": true },
"target": { "type": "string", "required": false },
"output": { "type": "string", "required": true },
"sample": { "type": "integer", "required": false },
"no_memory": { "type": "boolean", "required": false }
},
"output": {
"description": "Artifact paths for the generated EDA profile",
"fields": { "profile": "string", "report": "string" }
},
"entrypoints": {
"unix": ["python3", "scripts/eda.py", "profile-dataset", "--input", "{input}", "--output", "{output}"]
}
},
"detect-modality": {
"description": "Detect dataset modality and return all matching modality tags.",
"input": {
"input": { "type": "string", "required": true }
},
"output": {
"description": "Detected modalities",
"fields": { "modalities": "array", "path": "string" }
},
"entrypoints": {
"unix": ["python3", "scripts/eda.py", "detect-modality", "--input", "{input}"]
}
},
"explore-dataset": {
"description": "Detect dataset modality and route to the appropriate EDA backend.",
"input": {
"input": { "type": "string", "required": true },
"modality": { "type": "string", "required": false },
"output": { "type": "string", "required": false }
},
"output": {
"description": "Detected modalities, selected modality, and backend routing result",
"fields": { "detected_modalities": "array", "selected_modality": "string", "status": "string" }
},
"entrypoints": {
"unix": ["python3", "scripts/eda.py", "explore", "--input", "{input}"]
}
},
"graph-viz": {
"description": "Build reusable graph visualization artifacts for tabular or graph datasets.",
"input": {
"input": { "type": "string", "required": true },
"features": { "type": "string", "required": false },
"id_column