Claude
Skills
Sign in
Back

hubspot-contact-dedup

Included with Lifetime
$97 forever

Deduplicate HubSpot contacts at production scale — surviving import storms, wrong-winner merges, fuzzy-match blind spots, association orphans, rate-limit exhaustion, and silent merge failures on conflicting lifecycle or opt-out status. Use when cleaning a CRM after a bulk import, running a nightly dedup pipeline on millions of records, recovering from a merge that destroyed the wrong timeline, or building fuzzy matching beyond HubSpot's native email-uniqueness. Trigger with "hubspot dedup", "hubspot merge contacts", "hubspot duplicate contacts", "hubspot contact cleanup", "hubspot import duplicates", "hubspot fuzzy match contacts".

Productivityhubspotcrmdeduplicationdata-quality

What this skill does


# HubSpot Contact Deduplication

## Overview

Merge duplicate contacts in HubSpot and operate that process in production, at scale, without data loss. This is not a one-click cleanup guide — it is the logic your pipeline runs when a sales ops team imports 80,000 leads from a tradeshow CSV that already exist in the CRM, when a merge destroys the "winner" contact's email history, when a fuzzy match on "Jon" vs "John" leaves a six-figure deal associated to a ghost record, and when on-call discovers that 40,000 contacts were merged without checking opt-out flags.

The six production failures this skill prevents:

1. **Import storms creating thousands of exact duplicates** — HubSpot enforces email uniqueness only at the property level; the merge API has no dedup-all-at-once endpoint. A 100K-row CSV import where 60% of rows already exist creates 60,000 duplicates that must be found and merged one pair at a time within a 100 req/10s rate envelope.
2. **Merge destroying the wrong timeline** — `POST /crm/v3/objects/contacts/merge` requires a `primaryObjectId`. Picking the wrong one demotes the older contact's full activity timeline — calls, emails, form submissions — to the discarded record's history.
3. **Property-based dedup missing fuzzy matches** — Email-exact dedup leaves "[email protected]" and "[email protected]" as separate records. Phone dedup leaves "+1 (512) 867-5309" and "5128675309" as separate records. Without normalization your CRM accumulates a shadow population of semantically identical but technically distinct contacts.
4. **Post-merge association orphans** — When a secondary contact has deals, tickets, or company associations, HubSpot re-parents most automatically — but not all. Custom object associations and some third-party-integration links may not follow.
5. **Rate-limit exhaustion on large catalogs** — A 1-million-contact dedup scan requires 10,000 batch reads (2.7 hours at full throughput, before merge calls). Naive single-threaded loops exhaust the 500K daily quota before the search phase finishes.
6. **Silent merge failures on conflicting lifecycle or opt-out status** — The merge API returns 200 even when the resulting contact has `hs_email_optout=true` overriding the primary's opted-in status. HubSpot's "most recently updated value wins" rule is wrong for compliance flags.

## Auth

Authenticate with a private app token (`pat-na1-*`) or OAuth access token. Pass it on every request:

```bash
Authorization: Bearer {your-token}
```

Required scopes: `crm.objects.contacts.read`, `crm.objects.contacts.write`, `crm.associations.read`, `crm.associations.write`. See the [hubspot-auth skill](https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/main/plugins/saas-packs/hubspot-pack/skills/hubspot-auth/SKILL.md) for token caching, OAuth refresh, and scope-drift detection.

## Prerequisites

- Python 3.10+ (`requests`, `phonenumbers`, `rapidfuzz`) for the full pipeline
- HubSpot Professional or Enterprise account (batch merge at scale)
- Private app token with required scopes (above)
- `jq` for shell examples
- For catalogs >500K contacts: confirm daily quota with HubSpot support

## Instructions

### Step 1. Discover duplicates with search

Find exact duplicates by email using the search API. Never pull all contacts into memory for comparison — use the search endpoint with specific filter values.

```bash
# Find all contacts sharing a normalized email
curl -s -X POST "https://api.hubapi.com/crm/v3/objects/contacts/search" \
  -H "Authorization: Bearer {your-token}" \
  -H "Content-Type: application/json" \
  -d '{
    "filterGroups": [{"filters": [
      {"propertyName":"email","operator":"EQ","value":"[email protected]"}
    ]}],
    "properties": ["email","firstname","lastname","hs_object_id","createdate",
                   "lifecyclestage","hs_email_optout","hs_email_hard_bounce_reason_enum"],
    "sorts": [{"propertyName":"createdate","direction":"ASCENDING"}],
    "limit": 10
  }' | jq '[.results[] | {id, created:.properties.createdate}]'
```

For full-portal scans across millions of contacts use the four-stage Python pipeline in [implementation-guide.md](references/implementation-guide.md). The pipeline writes a local SQLite checkpoint so rate-limit interruptions do not require starting over.

### Step 2. Select the primary (winner) contact

The oldest contact by `createdate` is the primary — its timeline is most historically complete. Two overrides apply:

- If the oldest contact has `hs_email_optout=true` and the newer one does not, prefer the opted-in record as primary to avoid propagating unsubscribe status.
- If the oldest contact has a test-domain email (`@mailinator.com`, `@example.com`, `@test.com`), always make the real-address contact the primary.

```python
from datetime import datetime

def pick_primary(contacts: list[dict]) -> tuple[dict, list[dict]]:
    """Return (primary, secondaries). contacts is a list of HubSpot result dicts."""
    TEST_DOMAINS = {"mailinator.com","example.com","test.com","yopmail.com"}

    def is_test(email: str) -> bool:
        return (email or "").split("@")[-1].lower() in TEST_DOMAINS

    # Sort oldest first (default primary)
    sorted_c = sorted(contacts, key=lambda c: c["properties"]["createdate"])
    primary = sorted_c[0]

    # Opt-out override
    if primary["properties"].get("hs_email_optout") == "true":
        opted_in = next((c for c in sorted_c[1:] if c["properties"].get("hs_email_optout") != "true"), None)
        if opted_in:
            primary = opted_in

    # Test email override
    if is_test(primary["properties"].get("email", "")):
        real = next((c for c in sorted_c if not is_test(c["properties"].get("email", ""))), None)
        if real:
            primary = real

    secondaries = [c for c in contacts if c["id"] != primary["id"]]
    return primary, secondaries
```

### Step 3. Normalize emails and phones for fuzzy matching

Exact-email dedup leaves a shadow population. Normalize before comparing:

```python
import phonenumbers

def normalize_email(raw: str) -> str:
    lower = (raw or "").strip().lower().replace("@googlemail.com", "@gmail.com")
    local, _, domain = lower.partition("@")
    if domain == "gmail.com":
        local = local.split("+")[0].replace(".", "")
    return f"{local}@{domain}" if domain else lower

def normalize_phone(raw: str, region: str = "US") -> str | None:
    try:
        p = phonenumbers.parse((raw or "").strip(), region)
        if phonenumbers.is_valid_number(p):
            return phonenumbers.format_number(p, phonenumbers.PhoneNumberFormat.E164)
    except Exception:
        pass
    return None
```

For name similarity and the full confidence-scoring matrix, see [implementation-guide.md](references/implementation-guide.md) § Stage 2.

### Step 4. Pre-merge compliance check

Before merging, verify neither contact has blocking compliance flags:

```python
def pre_merge_check(a: dict, b: dict) -> tuple[bool, str]:
    """Returns (can_merge, reason). False = queue for human review."""
    pa, pb = a["properties"], b["properties"]
    if pa.get("hs_email_hard_bounce_reason_enum") or pb.get("hs_email_hard_bounce_reason_enum"):
        return False, "hard_bounce_present"
    # Asymmetric GDPR legal basis requires human review
    a_gdpr = bool(pa.get("hs_legal_basis"))
    b_gdpr = bool(pb.get("hs_legal_basis"))
    if a_gdpr != b_gdpr:
        return False, "gdpr_basis_asymmetry"
    return True, "ok"

# Expected post-merge opt-out: conservative — opted out if either contact is opted out
def resolve_optout(a: dict, b: dict) -> bool:
    return (a["properties"].get("hs_email_optout") == "true" or
            b["properties"].get("hs_email_optout") == "true")
```

### Step 5. Execute merge with rate limiting

```python
import time, requests

MERGE_URL = "https://api.hubapi.com/crm/v3/objects/contacts/merge"
_window_start = time.monotonic()
_window_calls = 0

def rate_gate(burst_limit: int = 90) -> None:
    "

Related in Productivity