hubspot-contact-dedup
Deduplicate HubSpot contacts at production scale — surviving import storms, wrong-winner merges, fuzzy-match blind spots, association orphans, rate-limit exhaustion, and silent merge failures on conflicting lifecycle or opt-out status. Use when cleaning a CRM after a bulk import, running a nightly dedup pipeline on millions of records, recovering from a merge that destroyed the wrong timeline, or building fuzzy matching beyond HubSpot's native email-uniqueness. Trigger with "hubspot dedup", "hubspot merge contacts", "hubspot duplicate contacts", "hubspot contact cleanup", "hubspot import duplicates", "hubspot fuzzy match contacts".
What this skill does
# HubSpot Contact Deduplication

## Overview

Merge duplicate contacts in HubSpot and operate that process in production, at scale, without data loss. This is not a one-click cleanup guide. It is the logic your pipeline runs when a sales ops team imports 80,000 leads from a tradeshow CSV that already exist in the CRM, when a merge destroys the "winner" contact's email history, when a fuzzy match on "Jon" vs "John" leaves a six-figure deal associated to a ghost record, and when on-call discovers that 40,000 contacts were merged without checking opt-out flags.

The six production failures this skill prevents:

1. **Import storms creating thousands of exact duplicates**: HubSpot enforces email uniqueness only at the property level; the merge API has no dedup-all-at-once endpoint. A 100K-row CSV import where 60% of rows already exist creates 60,000 duplicates that must be found and merged one pair at a time within a 100 req/10s rate envelope.
2. **Merge destroying the wrong timeline**: `POST /crm/v3/objects/contacts/merge` requires a `primaryObjectId`. Picking the wrong one demotes the older contact's full activity timeline (calls, emails, form submissions) to the discarded record's history.
3. **Property-based dedup missing fuzzy matches**: Email-exact dedup leaves "[email protected]" and "[email protected]" as separate records. Phone dedup leaves "+1 (512) 867-5309" and "5128675309" as separate records. Without normalization your CRM accumulates a shadow population of semantically identical but technically distinct contacts.
4. **Post-merge association orphans**: When a secondary contact has deals, tickets, or company associations, HubSpot re-parents most automatically, but not all. Custom object associations and some third-party-integration links may not follow.
5. **Rate-limit exhaustion on large catalogs**: A 1-million-contact dedup scan requires 10,000 batch reads (2.7 hours at full throughput, before merge calls).
   Naive single-threaded loops exhaust the 500K daily quota before the search phase finishes.
6. **Silent merge failures on conflicting lifecycle or opt-out status**: The merge API returns 200 even when the resulting contact has `hs_email_optout=true` overriding the primary's opted-in status. HubSpot's "most recently updated value wins" rule is wrong for compliance flags.

## Auth

Authenticate with a private app token (`pat-na1-*`) or OAuth access token. Pass it on every request:

```bash
Authorization: Bearer {your-token}
```

Required scopes: `crm.objects.contacts.read`, `crm.objects.contacts.write`, `crm.associations.read`, `crm.associations.write`. See the [hubspot-auth skill](https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/main/plugins/saas-packs/hubspot-pack/skills/hubspot-auth/SKILL.md) for token caching, OAuth refresh, and scope-drift detection.

## Prerequisites

- Python 3.10+ (`requests`, `phonenumbers`, `rapidfuzz`) for the full pipeline
- HubSpot Professional or Enterprise account (batch merge at scale)
- Private app token with required scopes (above)
- `jq` for shell examples
- For catalogs >500K contacts: confirm daily quota with HubSpot support

## Instructions

### Step 1. Discover duplicates with search

Find exact duplicates by email using the search API. Never pull all contacts into memory for comparison; use the search endpoint with specific filter values.
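As a rough Python sketch of the "specific filter values" advice, the helper below builds one search body per email and follows the paging cursor instead of loading the whole portal. The names `build_email_search`, `iter_matches`, and the injected `post` callable are hypothetical. The `after` request field and the `paging.next.after` response field are assumed to match HubSpot's CRM search paging, so verify them against the current API reference before relying on this.

```python
from typing import Callable, Iterator, Optional

SEARCH_URL = "https://api.hubapi.com/crm/v3/objects/contacts/search"


def build_email_search(email: str, after: Optional[str] = None, limit: int = 100) -> dict:
    """Build a search body that filters on one exact email value."""
    body = {
        "filterGroups": [{"filters": [
            {"propertyName": "email", "operator": "EQ", "value": email}
        ]}],
        "properties": ["email", "createdate", "hs_email_optout"],
        "sorts": [{"propertyName": "createdate", "direction": "ASCENDING"}],
        "limit": limit,
    }
    if after is not None:
        # Assumed paging cursor field; only sent on follow-up pages.
        body["after"] = after
    return body


def iter_matches(email: str, post: Callable[[dict], dict]) -> Iterator[dict]:
    """Yield every contact matching `email`, one page at a time.

    `post` is a hypothetical callable that sends the body to SEARCH_URL
    and returns the decoded JSON response.
    """
    after = None
    while True:
        page = post(build_email_search(email, after))
        yield from page.get("results", [])
        # Assumed response shape: {"paging": {"next": {"after": "<cursor>"}}}
        after = page.get("paging", {}).get("next", {}).get("after")
        if not after:
            break
```

With `requests`, `post` could be a small wrapper such as `lambda body: requests.post(SEARCH_URL, headers=headers, json=body).json()`, where `headers` carries the bearer token from the Auth section. Injecting the callable keeps the paging logic testable without network access.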
```bash
# Find all contacts sharing a normalized email
curl -s -X POST "https://api.hubapi.com/crm/v3/objects/contacts/search" \
  -H "Authorization: Bearer {your-token}" \
  -H "Content-Type: application/json" \
  -d '{
    "filterGroups": [{"filters": [
      {"propertyName":"email","operator":"EQ","value":"[email protected]"}
    ]}],
    "properties": ["email","firstname","lastname","hs_object_id","createdate",
                   "lifecyclestage","hs_email_optout","hs_email_hard_bounce_reason_enum"],
    "sorts": [{"propertyName":"createdate","direction":"ASCENDING"}],
    "limit": 10
  }' | jq '[.results[] | {id, created:.properties.createdate}]'
```

For full-portal scans across millions of contacts use the four-stage Python pipeline in [implementation-guide.md](references/implementation-guide.md). The pipeline writes a local SQLite checkpoint so rate-limit interruptions do not require starting over.

### Step 2. Select the primary (winner) contact

The oldest contact by `createdate` is the primary; its timeline is most historically complete. Two overrides apply:

- If the oldest contact has `hs_email_optout=true` and the newer one does not, prefer the opted-in record as primary to avoid propagating unsubscribe status.
- If the oldest contact has a test-domain email (`@mailinator.com`, `@example.com`, `@test.com`), always make the real-address contact the primary.

```python
from datetime import datetime

def pick_primary(contacts: list[dict]) -> tuple[dict, list[dict]]:
    """Return (primary, secondaries).
    contacts is a list of HubSpot result dicts."""
    TEST_DOMAINS = {"mailinator.com","example.com","test.com","yopmail.com"}

    def is_test(email: str) -> bool:
        return (email or "").split("@")[-1].lower() in TEST_DOMAINS

    # Sort oldest first (default primary)
    sorted_c = sorted(contacts, key=lambda c: c["properties"]["createdate"])
    primary = sorted_c[0]

    # Opt-out override
    if primary["properties"].get("hs_email_optout") == "true":
        opted_in = next((c for c in sorted_c[1:]
                         if c["properties"].get("hs_email_optout") != "true"), None)
        if opted_in:
            primary = opted_in

    # Test email override
    if is_test(primary["properties"].get("email", "")):
        real = next((c for c in sorted_c
                     if not is_test(c["properties"].get("email", ""))), None)
        if real:
            primary = real

    secondaries = [c for c in contacts if c["id"] != primary["id"]]
    return primary, secondaries
```

### Step 3. Normalize emails and phones for fuzzy matching

Exact-email dedup leaves a shadow population. Normalize before comparing:

```python
import phonenumbers

def normalize_email(raw: str) -> str:
    lower = (raw or "").strip().lower().replace("@googlemail.com", "@gmail.com")
    local, _, domain = lower.partition("@")
    if domain == "gmail.com":
        local = local.split("+")[0].replace(".", "")
    return f"{local}@{domain}" if domain else lower

def normalize_phone(raw: str, region: str = "US") -> str | None:
    try:
        p = phonenumbers.parse((raw or "").strip(), region)
        if phonenumbers.is_valid_number(p):
            return phonenumbers.format_number(p, phonenumbers.PhoneNumberFormat.E164)
    except Exception:
        pass
    return None
```

For name similarity and the full confidence-scoring matrix, see [implementation-guide.md](references/implementation-guide.md) § Stage 2.

### Step 4. Pre-merge compliance check

Before merging, verify neither contact has blocking compliance flags:

```python
def pre_merge_check(a: dict, b: dict) -> tuple[bool, str]:
    """Returns (can_merge, reason).
    False = queue for human review."""
    pa, pb = a["properties"], b["properties"]
    if pa.get("hs_email_hard_bounce_reason_enum") or pb.get("hs_email_hard_bounce_reason_enum"):
        return False, "hard_bounce_present"
    # Asymmetric GDPR legal basis requires human review
    a_gdpr = bool(pa.get("hs_legal_basis"))
    b_gdpr = bool(pb.get("hs_legal_basis"))
    if a_gdpr != b_gdpr:
        return False, "gdpr_basis_asymmetry"
    return True, "ok"

# Expected post-merge opt-out (conservative): opted out if either contact is opted out
def resolve_optout(a: dict, b: dict) -> bool:
    return (a["properties"].get("hs_email_optout") == "true"
            or b["properties"].get("hs_email_optout") == "true")
```

### Step 5. Execute merge with rate limiting

```python
import time, requests

MERGE_URL = "https://api.hubapi.com/crm/v3/objects/contacts/merge"

_window_start = time.monotonic()
_window_calls = 0

def rate_gate(burst_limit: int = 90) -> None:
    "