
ssmd-dq-run


How to run ssmd DQ checks locally and in-cluster, interpret scores, trigger email reports, and verify results. Use when running data quality checks, re-sending DQ emails, or verifying pipeline health after deployments or backfills.


# ssmd-dq-run

Procedures for running ssmd Data Quality checks and interpreting results.

## Source Files

| File | Purpose |
|------|---------|
| `data/dq.py` | DQRunner engine — 13 checks, scoring, CLI |
| `data/dq_email.py` | Email report wrapper — runs all feeds, HTML output |
| `data/Dockerfile` | DQ image: python:3.12-slim + duckdb + gcloud monitoring |

## Running DQ Locally

Requires `gcloud auth application-default login` for GCS access.

```bash
# Single feed
uv run data/dq.py --date 2026-02-17 --feed kalshi --stream crypto

# With verbose progress
uv run data/dq.py --date 2026-02-17 --feed kalshi --stream crypto --verbose

# JSON output (for programmatic use)
uv run data/dq.py --date 2026-02-17 --feed kalshi --stream crypto --json

# Non-default prefix (pass --prefix when the GCS prefix is not the default `kalshi`)
uv run data/dq.py --date 2026-02-17 --feed kraken-futures --stream futures --prefix kraken-futures
uv run data/dq.py --date 2026-02-17 --feed polymarket --stream markets --prefix polymarket
```

### All Three Feeds

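Run all three feeds for full pipeline verification. As written, the commands below run one after another; to run them in parallel, start each in its own shell, or append `&` to each line and finish with `wait`.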

```bash
uv run data/dq.py --date 2026-02-17 --feed kalshi --stream crypto
uv run data/dq.py --date 2026-02-17 --feed kraken-futures --stream futures --prefix kraken-futures
uv run data/dq.py --date 2026-02-17 --feed polymarket --stream markets --prefix polymarket
```

### Feed Parameters

| Feed | `--feed` | `--stream` | `--prefix` |
|------|----------|------------|------------|
| Kalshi | `kalshi` | `crypto` | (default: `kalshi`) |
| Kraken Futures | `kraken-futures` | `futures` | `kraken-futures` |
| Polymarket | `polymarket` | `markets` | `polymarket` |
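
The three invocations can also be launched concurrently from a small script. The following is a minimal sketch and not part of ssmd: `build_command` and `run_all` are hypothetical helper names, the `run` parameter exists only so the launcher can be swapped out, and omitting `--prefix` for Kalshi assumes the default shown in the table above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Feed parameters copied from the table above: (feed, stream, prefix).
# prefix=None means "omit --prefix and rely on the default".
FEEDS = [
    ("kalshi", "crypto", None),
    ("kraken-futures", "futures", "kraken-futures"),
    ("polymarket", "markets", "polymarket"),
]


def build_command(date, feed, stream, prefix=None):
    """Return the argv list for one dq.py invocation."""
    cmd = ["uv", "run", "data/dq.py", "--date", date, "--feed", feed, "--stream", stream]
    if prefix is not None:
        cmd += ["--prefix", prefix]
    return cmd


def run_all(date, run=subprocess.run):
    """Run every feed concurrently and return {feed: exit code}."""
    commands = {
        feed: build_command(date, feed, stream, prefix)
        for feed, stream, prefix in FEEDS
    }
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        futures = {feed: pool.submit(run, cmd) for feed, cmd in commands.items()}
        return {feed: future.result().returncode for feed, future in futures.items()}
```

Calling `run_all("2026-02-17")` from the repository root returns a mapping of feed name to exit code. Output from the three processes interleaves on the terminal, so add `--json` or redirect each process to a file if the reports need to stay separate.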

## Running DQ In-Cluster

The DQ CronJob runs at **03:30 UTC daily** (after parquet-gen at 02:00 UTC).

**Manifest**: `clusters/gke-prod/apps/ssmd/cronjobs/dq-daily.yaml`

### Trigger a manual DQ email run

```bash
# Replace MMDD with a month-day suffix (for example 0217) so the job name is unique
kubectl create job --from=cronjob/ssmd-dq-daily ssmd-dq-manual-MMDD -n ssmd
```

### Watch progress

```bash
kubectl logs -n ssmd job/ssmd-dq-manual-MMDD -f
```

### Re-run for a specific date

The CronJob defaults to yesterday. To override:

```bash
kubectl create job --from=cronjob/ssmd-dq-daily ssmd-dq-rerun-MMDD -n ssmd --dry-run=client -o yaml | \
  sed 's|dq_email.py|dq_email.py --date 2026-02-17|' | \
  kubectl apply -f -
```

## Interpreting Scores

### Grades

| Grade | Score Range | Meaning |
|-------|-------------|---------|
| GREEN | >= 98 | Pipeline healthy, all checks passing |
| YELLOW | >= 85 and < 98 | Minor issues, investigate when convenient |
| RED | < 85 | Significant issues, investigate promptly |

### Check Statuses

| Status | Weight | Meaning |
|--------|--------|---------|
| pass | 1.0 | Check passed |
| warn | 0.7 | Threshold exceeded but not critical |
| fail | 0.0 | Check failed |
| skip | excluded | Not enough data to run, excluded from score |

Score = (sum of the weights of all non-skipped checks / number of non-skipped checks) × 100.
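
The scoring and grading rules above can be reproduced in a few lines. This is an illustrative sketch built only from the two tables, not the code in `data/dq.py`; the function names are hypothetical, and returning `None` when every check is skipped is an assumption.

```python
# Weights from the Check Statuses table; "skip" is left out of the average.
WEIGHTS = {"pass": 1.0, "warn": 0.7, "fail": 0.0}


def score(statuses):
    """Return the 0-100 score for a list of check statuses, or None if all were skipped."""
    counted = [WEIGHTS[s] for s in statuses if s != "skip"]
    if not counted:
        return None  # assumption: no score when nothing could be checked
    return 100.0 * sum(counted) / len(counted)


def grade(value):
    """Map a score to the grade bands from the Grades table."""
    if value >= 98:
        return "GREEN"
    if value >= 85:
        return "YELLOW"
    return "RED"
```

With 13 counted checks, 12 `pass` and 1 `warn` gives (12 + 0.7) / 13 × 100 ≈ 97.7, which is already YELLOW, so GREEN effectively requires every counted check to pass.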

### Exit Codes

- `dq.py` exits 1 if any check has status `fail`
- `dq_email.py` always exits 0 (email is the alert mechanism)
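
When wrapping `dq.py` in automation, the exit code can be classified as in the sketch below. The function names and labels are hypothetical. Note that Python also exits with status 1 on an uncaught exception, so a code of 1 alone does not prove that a check failed; treating every other non-zero code as an unexpected error is an assumption.

```python
import subprocess


def classify(returncode):
    """Translate a dq.py exit code into a coarse label."""
    if returncode == 0:
        return "ok"
    if returncode == 1:
        return "check-failed-or-crashed"  # inspect the output to tell which
    return "unexpected-error"


def run_and_classify(cmd):
    """Run a command (argv list) and return (label, exit code)."""
    completed = subprocess.run(cmd)
    return classify(completed.returncode), completed.returncode
```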

## Notebook / Programmatic Usage

```python
from dq import DQRunner

runner = DQRunner(bucket="ssmd-data", feed="kalshi", stream="crypto")
results = runner.run("2026-02-12")
results.summary()       # print human-readable report
results.score()         # float 0-100
results.to_json()       # JSON string

# Ad-hoc queries via the shared DuckDB connection
runner.con.execute(
    "SELECT * FROM read_parquet('gcs://ssmd-data/kalshi/crypto/2026-02-12/ticker_*.parquet') LIMIT 10"
).fetchdf()

# Date range
all_results = runner.run_range("2026-02-10", "2026-02-17")
```

## Email Report

`dq_email.py` runs all 3 feeds, generates an HTML email with per-feed grades and check details, and sends it via SMTP.

- **Required env vars**: `SMTP_USER`, `SMTP_PASS`, `SMTP_TO`
- **Optional**: `SMTP_HOST` (default: `smtp.gmail.com`), `SMTP_PORT` (default: `587`)

These are provided in-cluster via the `ssmd-smtp-credentials` Secret.
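
Before running `dq_email.py` outside the cluster, a pre-flight check along these lines can catch missing variables early. It is a sketch under stated assumptions: `SmtpConfig` and `load_smtp_config` are hypothetical names, and this is not necessarily how `dq_email.py` itself reads its configuration. Only the variable names and defaults come from the list above.

```python
import os
from dataclasses import dataclass

REQUIRED = ("SMTP_USER", "SMTP_PASS", "SMTP_TO")


@dataclass(frozen=True)
class SmtpConfig:
    user: str
    password: str
    to: str
    host: str = "smtp.gmail.com"
    port: int = 587


def load_smtp_config(env=None):
    """Build an SmtpConfig from environment variables, failing fast on missing ones."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))
    return SmtpConfig(
        user=env["SMTP_USER"],
        password=env["SMTP_PASS"],
        to=env["SMTP_TO"],
        host=env.get("SMTP_HOST", "smtp.gmail.com"),
        port=int(env.get("SMTP_PORT", "587")),
    )
```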

## Post-Deploy / Post-Backfill Verification

After deploying a new DQ version or backfilling parquet data:

1. Run DQ locally for all 3 feeds (see commands above)
2. Verify that the checks targeted by the deploy or backfill now report `pass`
3. Optionally trigger an in-cluster email run: `kubectl create job --from=cronjob/ssmd-dq-daily ...`
4. Verify email arrives with corrected scores
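
Step 2 can be automated once the per-check statuses are in hand, for example after parsing the `--json` output. The sketch below deliberately takes a plain mapping of check name to status, because the JSON schema is not described here; the function name and the check names in the usage example are hypothetical.

```python
def not_passing(statuses, targets):
    """Return {check: status} for every target check that is absent or not 'pass'."""
    return {
        name: statuses.get(name, "missing")
        for name in targets
        if statuses.get(name) != "pass"
    }
```

For example, `not_passing({"row_count": "pass", "gaps": "warn"}, ["row_count", "gaps", "schema"])` returns `{"gaps": "warn", "schema": "missing"}`; an empty result means every targeted check passed.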

## Image Build

The DQ image is built from `data/Dockerfile`. Builds are triggered by `dq-v*` tags in the **899bushwick** repo (not ssmd).

See the `ssmd-deploy` skill for full deployment procedure.