hubspot-contact-dedup

Deduplicate HubSpot contacts at production scale — surviving import storms, wrong-winner merges, fuzzy-match blind spots, association orphans, rate-limit exhaustion, and silent merge failures on conflicting lifecycle or opt-out status. Use when cleaning a CRM after a bulk import, running a nightly dedup pipeline on millions of records, recovering from a merge that destroyed the wrong timeline, or building fuzzy matching beyond HubSpot's native email-uniqueness. Trigger with "hubspot dedup", "hubspot merge contacts", "hubspot duplicate contacts", "hubspot contact cleanup", "hubspot import duplicates", "hubspot fuzzy match contacts".

Productivity · hubspot · crm · deduplication · data-quality

What this skill does


# HubSpot Contact Deduplication

## Overview

Merge duplicate contacts in HubSpot and operate that process in production, at scale, without data loss. This is not a one-click cleanup guide — it is the logic your pipeline runs when a sales ops team imports 80,000 leads from a tradeshow CSV that already exist in the CRM, when a merge destroys the "winner" contact's email history, when a fuzzy match on "Jon" vs "John" leaves a six-figure deal associated to a ghost record, and when on-call discovers that 40,000 contacts were merged without checking opt-out flags.

The six production failures this skill prevents:

1. **Import storms creating thousands of exact duplicates** — HubSpot enforces email uniqueness only at the property level; the merge API has no dedup-all-at-once endpoint. A 100K-row CSV import where 60% of rows already exist creates 60,000 duplicates that must be found and merged one pair at a time within a 100 req/10s rate envelope.
2. **Merge destroying the wrong timeline** — `POST /crm/v3/objects/contacts/merge` requires a `primaryObjectId`. Picking the wrong one demotes the older contact's full activity timeline — calls, emails, form submissions — to the discarded record's history.
3. **Property-based dedup missing fuzzy matches** — Email-exact dedup leaves variants of the same mailbox, such as Gmail addresses that differ only by dots or a `+tag` suffix, as separate records. Phone dedup leaves "+1 (512) 867-5309" and "5128675309" as separate records. Without normalization your CRM accumulates a shadow population of semantically identical but technically distinct contacts.
4. **Post-merge association orphans** — When a secondary contact has deals, tickets, or company associations, HubSpot re-parents most automatically — but not all. Custom object associations and some third-party-integration links may not follow.
5. **Rate-limit exhaustion on large catalogs** — A 1-million-contact dedup scan requires 10,000 batch reads of 100 contacts each, which is about 17 minutes of pure request time at the 100 req/10s ceiling, before any merge calls. Naive one-request-per-contact loops need 1,000,000 calls and exhaust the 500K daily quota before the search phase finishes.
6. **Silent merge failures on conflicting lifecycle or opt-out status** — The merge API returns 200 even when the resulting contact has `hs_email_optout=true` overriding the primary's opted-in status. HubSpot's "most recently updated value wins" rule is wrong for compliance flags.
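
The request arithmetic behind failures 1 and 5 can be sketched as a small planning helper. This is a rough lower bound under stated assumptions, not a measurement: `scan_budget` and its parameter names are hypothetical, the defaults come from the figures quoted in this section (100 contacts per batch read, 100 requests per 10 seconds, 500K requests per day), and retries, pagination, and network latency are ignored.

```python
import math

def scan_budget(contacts: int, batch_size: int = 100,
                burst_per_10s: int = 100, daily_quota: int = 500_000) -> dict:
    """Estimate request count, minimum request time, and daily-quota share.

    All defaults are assumptions taken from the limits quoted in this guide;
    confirm the real limits for your portal before planning a run.
    """
    requests_needed = math.ceil(contacts / batch_size)
    min_seconds = requests_needed / (burst_per_10s / 10)
    return {
        "requests": requests_needed,
        "min_seconds": min_seconds,
        "quota_fraction": requests_needed / daily_quota,
    }

batched = scan_budget(1_000_000)              # 100 contacts per batch read
naive = scan_budget(1_000_000, batch_size=1)  # one request per contact
print(batched)
print(naive)
```

A `quota_fraction` above 1.0 means the scan cannot finish inside one day's quota, which is the signal to batch reads or split the run across days.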

## Auth

Authenticate with a private app token (`pat-na1-*`) or OAuth access token. Pass it on every request:

```text
Authorization: Bearer {your-token}
```

Required scopes: `crm.objects.contacts.read`, `crm.objects.contacts.write`, `crm.associations.read`, `crm.associations.write`. See the [hubspot-auth skill](https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/main/plugins/saas-packs/hubspot-pack/skills/hubspot-auth/SKILL.md) for token caching, OAuth refresh, and scope-drift detection.
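
For the Python examples, the same header can be built once and reused. The sketch below is a minimal illustration: `auth_headers` and the `HUBSPOT_TOKEN` environment variable are hypothetical names, not something HubSpot or this skill defines.

```python
import os
from typing import Optional

def auth_headers(token: Optional[str] = None) -> dict:
    """Build the headers sent on every HubSpot API request.

    Falls back to the HUBSPOT_TOKEN environment variable (an assumed name)
    when no token is passed explicitly.
    """
    token = token or os.environ.get("HUBSPOT_TOKEN", "")
    if not token:
        raise ValueError("no HubSpot token provided")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

Failing fast on a missing token keeps a misconfigured nightly job from burning quota on requests that can only return 401.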

## Prerequisites

- Python 3.10+ (`requests`, `phonenumbers`, `rapidfuzz`) for the full pipeline
- HubSpot Professional or Enterprise account (batch merge at scale)
- Private app token with required scopes (above)
- `jq` for shell examples
- For catalogs >500K contacts: confirm daily quota with HubSpot support

## Instructions

### Step 1. Discover duplicates with search

Find exact duplicates by email using the search API. Never pull all contacts into memory for comparison — use the search endpoint with specific filter values.

```bash
# Find all contacts sharing a normalized email
curl -s -X POST "https://api.hubapi.com/crm/v3/objects/contacts/search" \
  -H "Authorization: Bearer {your-token}" \
  -H "Content-Type: application/json" \
  -d '{
    "filterGroups": [{"filters": [
      {"propertyName":"email","operator":"EQ","value":"{normalized-email}"}
    ]}],
    "properties": ["email","firstname","lastname","hs_object_id","createdate",
                   "lifecyclestage","hs_email_optout","hs_email_hard_bounce_reason_enum"],
    "sorts": [{"propertyName":"createdate","direction":"ASCENDING"}],
    "limit": 10
  }' | jq '[.results[] | {id, created:.properties.createdate}]'
```

For full-portal scans across millions of contacts use the four-stage Python pipeline in [implementation-guide.md](references/implementation-guide.md). The pipeline writes a local SQLite checkpoint so rate-limit interruptions do not require starting over.
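
When the search is driven from Python rather than curl, the request body and paging cursor can be handled with two small helpers. This is a sketch, not the pipeline from the implementation guide: `build_email_search` and `next_cursor` are hypothetical names, and the `after` request field and `paging.next.after` response field reflect how the CRM search API pages results as far as I know, so check them against the current API reference.

```python
from typing import Optional

def build_email_search(email: str, after: Optional[str] = None, limit: int = 100) -> dict:
    """Build a search body that finds contacts with an exact email, oldest first."""
    body = {
        "filterGroups": [{"filters": [
            {"propertyName": "email", "operator": "EQ", "value": email}
        ]}],
        "properties": ["email", "firstname", "lastname", "createdate",
                       "lifecyclestage", "hs_email_optout"],
        "sorts": [{"propertyName": "createdate", "direction": "ASCENDING"}],
        "limit": limit,
    }
    if after is not None:
        body["after"] = after  # cursor from the previous page, if any
    return body

def next_cursor(response: dict) -> Optional[str]:
    """Return the cursor for the next page, or None when this is the last page."""
    return response.get("paging", {}).get("next", {}).get("after")
```

A caller would loop: POST the body, process `results`, then stop when `next_cursor` returns `None`.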

### Step 2. Select the primary (winner) contact

The oldest contact by `createdate` is the primary — its timeline is most historically complete. Two overrides apply:

- If the oldest contact has `hs_email_optout=true` and the newer one does not, prefer the opted-in record as primary to avoid propagating unsubscribe status.
- If the oldest contact has a test-domain email (`@mailinator.com`, `@example.com`, `@test.com`, `@yopmail.com`), always make the real-address contact the primary.

```python
from datetime import datetime

def pick_primary(contacts: list[dict]) -> tuple[dict, list[dict]]:
    """Return (primary, secondaries). contacts is a list of HubSpot result dicts."""
    TEST_DOMAINS = {"mailinator.com","example.com","test.com","yopmail.com"}

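    def is_test(email: str) -> bool:
        return (email or "").split("@")[-1].lower() in TEST_DOMAINS

    def created(c: dict) -> datetime:
        # Parse ISO 8601 so timestamps with and without fractional seconds sort correctly
        stamp = c["properties"]["createdate"].rstrip("Z")
        base, _, frac = stamp.partition(".")
        return datetime.fromisoformat(f"{base}.{(frac + '000000')[:6]}")

    # Sort oldest first (default primary)
    sorted_c = sorted(contacts, key=created)
    primary = sorted_c[0]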

    # Opt-out override
    if primary["properties"].get("hs_email_optout") == "true":
        opted_in = next((c for c in sorted_c[1:] if c["properties"].get("hs_email_optout") != "true"), None)
        if opted_in:
            primary = opted_in

    # Test email override
    if is_test(primary["properties"].get("email", "")):
        real = next((c for c in sorted_c if not is_test(c["properties"].get("email", ""))), None)
        if real:
            primary = real

    secondaries = [c for c in contacts if c["id"] != primary["id"]]
    return primary, secondaries
```

### Step 3. Normalize emails and phones for fuzzy matching

Exact-email dedup leaves a shadow population. Normalize before comparing:

```python
import phonenumbers

def normalize_email(raw: str) -> str:
    lower = (raw or "").strip().lower().replace("@googlemail.com", "@gmail.com")
    local, _, domain = lower.partition("@")
    if domain == "gmail.com":
        local = local.split("+")[0].replace(".", "")
    return f"{local}@{domain}" if domain else lower

def normalize_phone(raw: str, region: str = "US") -> str | None:
    try:
        p = phonenumbers.parse((raw or "").strip(), region)
        if phonenumbers.is_valid_number(p):
            return phonenumbers.format_number(p, phonenumbers.PhoneNumberFormat.E164)
    except phonenumbers.NumberParseException:
        pass  # unparseable input is treated as "no phone", not as an error
    return None
```

For name similarity and the full confidence-scoring matrix, see [implementation-guide.md](references/implementation-guide.md) § Stage 2.
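
To give a feel for what name similarity adds on top of normalized emails and phones, here is a minimal sketch. It deliberately swaps in the standard library's `difflib.SequenceMatcher` instead of `rapidfuzz` so it runs without extra dependencies; the function names and the 0.8 threshold are illustrative assumptions, not the scoring matrix from the implementation guide.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two names, ignoring case and spacing."""
    a = " ".join((a or "").lower().split())
    b = " ".join((b or "").lower().split())
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()

def is_probable_match(first_a: str, last_a: str, first_b: str, last_b: str,
                      threshold: float = 0.8) -> bool:
    """Treat two contacts as a fuzzy name match only if both name parts are close."""
    return (name_similarity(first_a, first_b) >= threshold
            and name_similarity(last_a, last_b) >= threshold)

print(is_probable_match("Jon", "Smith", "John", "Smith"))
```

A name match on its own is weak evidence; it is best used to raise or lower confidence for pairs that already share a normalized email or phone.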

### Step 4. Pre-merge compliance check

Before merging, verify neither contact has blocking compliance flags:

```python
def pre_merge_check(a: dict, b: dict) -> tuple[bool, str]:
    """Returns (can_merge, reason). False = queue for human review."""
    pa, pb = a["properties"], b["properties"]
    if pa.get("hs_email_hard_bounce_reason_enum") or pb.get("hs_email_hard_bounce_reason_enum"):
        return False, "hard_bounce_present"
    # Asymmetric GDPR legal basis requires human review
    a_gdpr = bool(pa.get("hs_legal_basis"))
    b_gdpr = bool(pb.get("hs_legal_basis"))
    if a_gdpr != b_gdpr:
        return False, "gdpr_basis_asymmetry"
    return True, "ok"

# Expected post-merge opt-out: conservative — opted out if either contact is opted out
def resolve_optout(a: dict, b: dict) -> bool:
    return (a["properties"].get("hs_email_optout") == "true" or
            b["properties"].get("hs_email_optout") == "true")
```
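
Because the merge call can succeed while leaving the wrong opt-out state, the expected value is worth comparing against the merged record after the fact. The sketch below assumes the merged contact has been re-read with `hs_email_optout` in its properties; `verify_optout` and its reason strings are hypothetical names used for illustration.

```python
def verify_optout(merged: dict, expected_optout: bool) -> tuple:
    """Compare the merged contact's opt-out flag with the expected value.

    Returns (ok, reason). A False result means the pair should be queued
    for human review rather than silently corrected.
    """
    actual = merged.get("properties", {}).get("hs_email_optout") == "true"
    if actual == expected_optout:
        return True, "ok"
    if expected_optout and not actual:
        return False, "optout_lost"        # an unsubscribe did not survive the merge
    return False, "unexpected_optout"      # merged record is opted out but neither input was

merged_contact = {"id": "101", "properties": {"hs_email_optout": "false"}}
print(verify_optout(merged_contact, expected_optout=True))
```

Logging the reason alongside both original contact IDs gives compliance reviewers enough context to decide which value should stand.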

### Step 5. Execute merge with rate limiting

```python
import time, requests

MERGE_URL = "https://api.hubapi.com/crm/v3/objects/contacts/merge"
_window_start = time.monotonic()
_window_calls = 0

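def rate_gate(burst_limit: int = 90) -> None:
    """Block until another call fits inside the current 10-second window."""
    # Assumption: a fixed 10-second window matching the 100 req/10s limit;
    # the default of 90 leaves headroom for other integrations on the portal.
    global _window_start, _window_calls
    elapsed = time.monotonic() - _window_start
    if elapsed >= 10:
        _window_start, _window_calls = time.monotonic(), 0
    elif _window_calls >= burst_limit:
        time.sleep(10 - elapsed)
        _window_start, _window_calls = time.monotonic(), 0
    _window_calls += 1
```
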
Files: 3

Size: 62.1 KB

Complexity: 55/100

Category: Productivity

Source: https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/main/plugins/saas-packs/hubspot-pack/skills/hubspot-contact-dedup
