data-research
Structured data research: search sources, extract structured data, archive raw sources, maintain canonical tracker pages, deduplicate. Parameterized via YAML recipes for investor updates, donations, company updates, or any email-to-structured-data pipeline.
# Data Research
Structured research pipeline: search sources, extract structured data,
archive raw, deduplicate, update canonical trackers, backlink entities.
## Contract
One skill for any email-to-structured-data pipeline. The only differences
between tracking investor updates, expenses, and company metrics
are the **search queries**, **extraction schemas**, and **tracker page format**.
All three use the same 7-phase pipeline with parameterized recipes.
## When to Use
- User wants to track structured data from email, web, or API sources
- User says "research", "track", "extract from email", "build a tracker"
- User mentions investor updates, donations, company metrics, filings
- User wants to set up recurring data collection (with cron recipe)
## Phases
### Phase 1: Define Research Recipe
Ask the user what they want to track. Either:
- Pick a built-in recipe: investor-updates, expense-tracker, company-updates
- Define a custom recipe with: source queries, classification rules, extraction schema,
tracker page path, tracker format
Recipes are YAML files at `~/.gbrain/recipes/{name}.yaml`. Use `gbrain research init`
to scaffold a new one.
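
As a rough illustration of what a recipe carries, the five parts listed above can be modeled as a small Python structure. This is only a sketch: apart from `tracker_page`, every field name, query string, pattern, and path below is a hypothetical placeholder, not the actual GBrain recipe schema.

```python
from dataclasses import dataclass


@dataclass
class Recipe:
    """Hypothetical in-memory shape of a recipe; not the real YAML schema."""
    name: str
    source_queries: list[str]                    # queries run in Phase 2
    classification_rules: dict[str, list[str]]   # label -> regex patterns (Phase 3)
    extraction_schema: dict[str, str]            # field -> regex with one capture group (Phase 4)
    tracker_page: str                            # brain page that holds the tracker
    tracker_format: list[str]                    # column order of the tracker table

    def problems(self) -> list[str]:
        """Return human-readable problems; an empty list means the recipe is usable."""
        issues = []
        if not self.source_queries:
            issues.append("no source queries")
        if not self.extraction_schema:
            issues.append("no extraction schema")
        if not self.tracker_page:
            issues.append("no tracker page path")
        return issues


# All values here are made up for illustration.
investor_updates = Recipe(
    name="investor-updates",
    source_queries=["subject:(investor update)"],
    classification_rules={"investor_update": [r"\bMRR\b", r"\brunway\b"]},
    extraction_schema={"mrr": r"MRR[:\s]+\$([\d.,]+[KM]?)"},
    tracker_page="trackers/investor-updates",
    tracker_format=["Date", "Company", "MRR", "ARR", "Growth"],
)
```

Checking a recipe for missing parts before running the pipeline keeps later phases from failing halfway through a batch.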
### Phase 2: Search Sources
Search the brain first, since the data may already be there. Then search:
- **Email** via credential gateway: windowed queries (quarterly by default, monthly if a window's results are truncated)
- **Web** via search: public filings, press releases, regulatory data
- **APIs**: any structured data source the recipe defines
- **Attachments**: PDF extraction, HTML stripping
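
The windowed email search could look roughly like the sketch below. It assumes a hypothetical `search(start, end)` callable that returns a list of hits, and it treats a result count at or above `limit` as a sign of truncation; the real gateway may signal truncation differently.

```python
from datetime import date


def quarter_windows(year: int) -> list[tuple[date, date]]:
    """Return four half-open [start, end) windows covering one calendar year."""
    starts = [date(year, m, 1) for m in (1, 4, 7, 10)] + [date(year + 1, 1, 1)]
    return list(zip(starts[:-1], starts[1:]))


def month_windows(start: date, end: date) -> list[tuple[date, date]]:
    """Split a [start, end) window that begins on the 1st into monthly windows."""
    windows = []
    cur = start
    while cur < end:
        # First day of the following month (rolls over the year after December).
        nxt = date(cur.year + (cur.month == 12), cur.month % 12 + 1, 1)
        windows.append((cur, min(nxt, end)))
        cur = nxt
    return windows


def search_windowed(search, year: int, limit: int = 100) -> list:
    """Query quarter by quarter; re-query month by month when a quarter looks truncated.

    `search` and `limit` are assumptions made for this sketch.
    """
    results = []
    for start, end in quarter_windows(year):
        hits = search(start, end)
        if len(hits) >= limit:  # likely truncated: retry at monthly granularity
            hits = [h for s, e in month_windows(start, end) for h in search(s, e)]
        results.extend(hits)
    return results
```

Half-open windows avoid fetching the same message twice at a boundary date.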
### Phase 3: Classify
Classify deterministically first (regex patterns from the recipe), with an LLM fallback.
Log every LLM fallback so the regex rules can be improved later (fail-improve loop).
Skip marketing, newsletters, and other noise based on the recipe's classification rules.
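
A minimal sketch of the regex-first classifier with a logged LLM fallback follows. The rule shape (label mapped to a list of patterns), the `llm_classify` callable, and the log entry format are assumptions for illustration, not the pipeline's actual interfaces.

```python
import re


def classify(text: str, rules: dict, llm_classify, fallback_log: list):
    """Return (label, method), where method is "regex" or "llm".

    rules: label -> list of regex patterns (hypothetical recipe shape).
    llm_classify: hypothetical callable, used only when no pattern matches.
    fallback_log: list that collects every fallback for later rule tuning.
    """
    for label, patterns in rules.items():
        if any(re.search(p, text, re.IGNORECASE) for p in patterns):
            return label, "regex"
    label = llm_classify(text)
    # Record what the regexes missed so a new pattern can be added later.
    fallback_log.append({"text": text[:200], "label": label})
    return label, "llm"
```

Listing the noise rules first in `rules` means marketing mail is skipped before any other pattern is tried, since dictionaries preserve insertion order.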
### Phase 4: Extract Structured Data
**EXTRACTION INTEGRITY RULE:**
1. Save raw source immediately (before any extraction)
2. Extract fields using deterministic regex first, LLM fallback
3. When summarizing batch results: **re-read from saved files**
4. Never trust LLM working memory after batch processing
This prevents a known hallucination failure: in one batch run, all 13 of 13
amounts summarized from LLM working memory were wrong, while the saved files were correct.
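
The rule can be sketched as two separate steps: a per-message step that writes the raw source and the extracted record to disk, and a summary step that reads only from disk. The file naming, the amount pattern, and the `llm_extract` callable are all assumptions made for this example.

```python
import json
import re
from pathlib import Path

# Illustrative pattern only; real recipes would supply their own.
AMOUNT_RE = re.compile(r"\$\s?([\d,]+(?:\.\d{2})?)")


def process(message_id: str, body: str, out_dir: Path, llm_extract=None) -> dict:
    """Save the raw body first, then extract and persist one record."""
    raw_path = out_dir / f"{message_id}.raw.txt"
    raw_path.write_text(body, encoding="utf-8")      # 1. raw source before anything else

    match = AMOUNT_RE.search(body)                   # 2. deterministic regex first
    if match:
        amount, method = float(match.group(1).replace(",", "")), "regex"
    elif llm_extract is not None:                    #    hypothetical LLM fallback
        amount, method = llm_extract(body), "llm"
    else:
        amount, method = None, "none"

    record = {"id": message_id, "amount": amount, "method": method, "raw": raw_path.name}
    (out_dir / f"{message_id}.json").write_text(json.dumps(record), encoding="utf-8")
    return record


def summarize(out_dir: Path) -> list[dict]:
    """3. Build the batch summary by re-reading saved records, never from memory."""
    return [json.loads(p.read_text(encoding="utf-8")) for p in sorted(out_dir.glob("*.json"))]
```

Because `summarize` takes only a directory, there is no way for in-memory batch state to leak into the reported amounts.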
### Phase 5: Archive Raw Sources
- `put_raw_data` for email bodies, API responses
- `file_upload` for PDF attachments, documents
- Create `.redirect.yaml` pointers for large files in storage
- Every tracker entry must link back to its raw source
### Phase 6: Deduplicate
Before adding to tracker:
- Exact match (same key fields) → skip
- Fuzzy match (same entity + date + similar amount within tolerance) → flag for review
- Different amount for same entity+date → add with note (could be correction)
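
The three rules above can be expressed as a single decision function. This is a sketch under stated assumptions: the key fields are taken to be entity, date, and amount, and "within tolerance" is read as a relative difference of at most 1%; a recipe could define either differently.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    entity: str
    day: date
    amount: float


def dedup_decision(new: Entry, existing: list[Entry], tolerance: float = 0.01) -> str:
    """Return "add", "skip", "review", or "add_with_note" for a candidate entry."""
    same_key = [e for e in existing if e.entity == new.entity and e.day == new.day]
    if not same_key:
        return "add"                # nothing recorded for this entity and date
    if any(e.amount == new.amount for e in same_key):
        return "skip"               # exact match on all key fields
    if any(abs(e.amount - new.amount) <= tolerance * max(abs(e.amount), abs(new.amount))
           for e in same_key):
        return "review"             # fuzzy match: similar amount, flag for a human
    return "add_with_note"          # clearly different amount: possible correction
```

Returning a decision rather than mutating the tracker keeps the flagged cases visible to whoever reviews them.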
### Phase 7: Update Canonical Tracker + Backlink
- Parse existing tracker page (markdown table)
- Append new entries in correct section (grouped by year/quarter/entity)
- Compute running totals
- Backlink every mentioned entity (person → people/ page, company → companies/ page)
- Use the enrichment service for entity pages
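
Parsing the tracker table, appending a row, and recomputing a total might look like the sketch below. It assumes a simple pipe-delimited table with no escaped pipes inside cells and amounts written like `$188K` or `$2.3M`; the helper names and the `Amount` column in the usage example are hypothetical.

```python
from decimal import Decimal


def _cells(line: str) -> list[str]:
    """Split one table line into trimmed cell values."""
    return [c.strip() for c in line.strip().strip("|").split("|")]


def parse_table(markdown: str):
    """Parse a pipe-delimited markdown table into (header, list of row dicts)."""
    lines = [l for l in markdown.splitlines() if l.strip().startswith("|")]
    header = _cells(lines[0])
    rows = [dict(zip(header, _cells(l))) for l in lines[2:]]  # lines[1] is the |---| separator
    return header, rows


def render_table(header: list[str], rows: list[dict]) -> str:
    """Render rows back into a markdown table in header order."""
    out = ["| " + " | ".join(header) + " |", "|" + "|".join("---" for _ in header) + "|"]
    out += ["| " + " | ".join(r[h] for h in header) + " |" for r in rows]
    return "\n".join(out)


def parse_amount(text: str) -> Decimal:
    """Convert strings such as "$188K" or "$1,200" to a Decimal."""
    text = text.strip().lstrip("$").replace(",", "")
    multiplier = {"K": Decimal(1000), "M": Decimal(1000000)}.get(text[-1].upper())
    return Decimal(text[:-1]) * multiplier if multiplier else Decimal(text)


def column_total(rows: list[dict], column: str) -> Decimal:
    """Running total for one amount column."""
    return sum((parse_amount(r[column]) for r in rows), Decimal(0))
```

A typical use is to parse the section, append the new row as a dict keyed by column name, then render the table and write the total below it. `Decimal` is used instead of `float` so that totals do not pick up binary rounding error.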
## Built-In Recipes
Three example recipes ship with GBrain (see `~/.gbrain/recipes/`):
1. **investor-updates** — extract MRR, ARR, growth, burn, runway, headcount from investor update emails
2. **expense-tracker** — extract amounts, recipients, platforms from receipt emails (subscriptions, services, recurring charges)
3. **company-updates** — extract revenue, users, key metrics from portfolio company update emails
## Anti-Patterns
- Trusting LLM working memory for amounts after batch processing (use extraction integrity rule)
- Creating tracker entries without raw source links
- Running without deduplication (leads to double-counted entries)
- Hardcoding source-specific patterns in the pipeline code (use recipes)
## Output Format
Brain page at the recipe's `tracker_page` path with markdown tables:
```markdown
### 2026
| Date | Company | MRR | ARR | Growth | Source |
|------|---------|-----|-----|--------|--------|
| 2026-04-01 | Example Co | $188K | $2.3M | +14.7% MoM | [Source](link) |
```
Each entry links to its raw source. Running totals at the bottom of each section.
## Conventions
See `skills/conventions/quality.md` for citation and back-linking rules.