theharvester
Passive email, subdomain, and IP harvesting from public sources using theHarvester. Use when: gathering corporate email lists, enumerating subdomains passively, pre-engagement recon, finding exposed employee contacts without triggering alerts.
What this skill does
# theHarvester
## Overview
theHarvester is a passive OSINT tool that aggregates information about a target domain from multiple public sources. It finds email addresses, subdomains, hostnames, and IP ranges without making any direct requests to the target — making it ideal for stealth recon during the pre-engagement phase of penetration tests or OSINT investigations.
**Sources include:** Google, Bing, DuckDuckGo, LinkedIn, Shodan, Hunter.io, CertSpotter, DNSDumpster, VirusTotal, and more.
## Instructions
### Step 1: Install theHarvester
```bash
# Option 1: pip (in a virtual environment recommended)
pip install theHarvester
# Option 2: Clone from GitHub (most up-to-date)
git clone https://github.com/laramies/theHarvester.git
cd theHarvester
pip install -r requirements/base.txt
# Option 3: Docker
docker pull ghcr.io/laramies/theharvester
docker run ghcr.io/laramies/theharvester -d example.com -b google
```
### Step 2: Basic usage
```bash
# Syntax: theHarvester -d <domain> -b <source> [options]
# -d target domain
# -b data source(s)
# -l limit results (default: 500)
# -f output filename (supports XML and JSON)
# -n DNS lookup on discovered hosts
# -v verify host via DNS resolution
# Search a single source
theHarvester -d example.com -b google
# Search all available sources
theHarvester -d example.com -b all
# Limit results, enable DNS lookup, save output
theHarvester -d example.com -b google,bing,linkedin -l 200 -n -f results_example
# Run from cloned repo
python3 theHarvester.py -d example.com -b all -l 500 -f output
```
### Step 3: Choose sources strategically
```bash
# Email harvesting — best sources
theHarvester -d example.com -b google,bing,hunter,linkedin
# Subdomain enumeration — best sources
theHarvester -d example.com -b certspotter,dnsdumpster,virustotal,shodan
# Comprehensive (slower, uses all sources)
theHarvester -d example.com -b all -l 1000 -f full_recon_example
# LinkedIn employee discovery (requires LinkedIn API key in api-keys.yaml)
theHarvester -d example.com -b linkedin -l 200
```
### Step 4: Configure API keys
```yaml
# api-keys.yaml (place in theHarvester directory or specify with -c flag)
apikeys:
hunter:
key: YOUR_HUNTER_IO_KEY
shodan:
key: YOUR_SHODAN_KEY
virustotal:
key: YOUR_VIRUSTOTAL_KEY
binaryedge:
key: YOUR_BINARYEDGE_KEY
fullhunt:
key: YOUR_FULLHUNT_KEY
securityTrails:
key: YOUR_SECURITYTRAILS_KEY
github:
key: YOUR_GITHUB_TOKEN
```
### Step 5: Parse and process output with Python
```python
import json
import subprocess
import re
def run_harvester(domain, sources="google,bing,certspotter,dnsdumpster", limit=500):
"""Run theHarvester and return parsed results."""
output_file = f"harvester_{domain.replace('.', '_')}"
cmd = [
"theHarvester",
"-d", domain,
"-b", sources,
"-l", str(limit),
"-f", output_file,
]
print(f"Running: {' '.join(cmd)}")
result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
print(result.stdout)
# Parse JSON output
json_file = f"{output_file}.json"
try:
with open(json_file) as f:
data = json.load(f)
return data
except FileNotFoundError:
# Fall back to parsing stdout
return parse_stdout(result.stdout)
def parse_stdout(output):
"""Extract emails, hosts, and IPs from raw stdout."""
emails = set(re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', output))
# Filter out false positives
emails = {e for e in emails if not e.endswith(('.png', '.jpg', '.css', '.js'))}
hosts = set(re.findall(r'[\w\.-]+\.\w{2,}', output))
ips = set(re.findall(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', output))
return {"emails": list(emails), "hosts": list(hosts), "ips": list(ips)}
def deduplicate_and_report(data, domain):
"""Clean and summarize harvested data."""
emails = sorted(set(data.get("emails", [])))
hosts = sorted(set(data.get("hosts", [])))
ips = sorted(set(data.get("ips", [])))
# Filter to target domain
domain_emails = [e for e in emails if domain in e]
domain_hosts = [h for h in hosts if domain in h]
print(f"\n=== Harvest Report: {domain} ===")
print(f"Emails found: {len(domain_emails)}")
print(f"Subdomains: {len(domain_hosts)}")
print(f"IP addresses: {len(ips)}")
if domain_emails:
print("\nEmails:")
for e in domain_emails[:20]:
print(f" {e}")
if domain_hosts:
print("\nSubdomains:")
for h in domain_hosts[:20]:
print(f" {h}")
return {
"emails": domain_emails,
"subdomains": domain_hosts,
"ips": ips,
}
# Usage
results = run_harvester("target-company.com", sources="google,bing,certspotter,hunter")
clean = deduplicate_and_report(results, "target-company.com")
# Save cleaned results
with open("clean_results.json", "w") as f:
json.dump(clean, f, indent=2)
```
### Step 6: Combine with other tools
```bash
# Pass discovered subdomains to nmap (only with explicit authorization)
theHarvester -d example.com -b all -f hosts
cat hosts.json | python3 -c "
import json, sys
data = json.load(sys.stdin)
for host in data.get('hosts', []):
print(host)
" > subdomains.txt
# Feed subdomains into amass for deeper DNS enumeration
cat subdomains.txt | amass enum -df - -passive
# Check emails against breach databases
cat emails.txt | while read email; do
curl -s "https://haveibeenpwned.com/api/v3/breachedaccount/$email" \
-H "hibp-api-key: YOUR_HIBP_KEY"
done
```
## Available Sources Reference
| Source | Data Type | API Key Required |
|--------|-----------|-----------------|
| `google` | Emails, subdomains | No |
| `bing` | Emails, subdomains | No |
| `duckduckgo` | Emails, subdomains | No |
| `linkedin` | Employees, emails | Optional |
| `hunter` | Emails | Yes |
| `certspotter` | Subdomains (SSL certs) | No |
| `dnsdumpster` | Subdomains, IPs | No |
| `virustotal` | Subdomains | Yes |
| `shodan` | IPs, open ports | Yes |
| `securitytrails` | Subdomains, DNS | Yes |
| `github` | Emails, code | Yes |
| `binaryedge` | IPs, services | Yes |
## Guidelines
- **Always get authorization** before running theHarvester against a target — passive does not mean invisible. Data queries may be logged by third-party services.
- **Rate limits**: Without API keys, theHarvester relies on scraping search engines which may throttle or block requests. Add API keys for reliable results.
- **Combine sources**: No single source is complete. Use multiple sources and deduplicate.
- **Email format detection**: Once you have a few emails (e.g., `[email protected]`, `[email protected]`), infer the naming convention and use it to generate a target list.
- **DNS verification**: Always use `-n` or `-v` to verify discovered hosts are live before reporting.
Related in Productivity
gitea-workflow
IncludedOrchestrate agile development workflows for Gitea repositories using the tea CLI. Use when working with Gitea-hosted repos and asking to 'run the workflow', 'continue working', 'what's next', 'complete the task cycle', 'start my day', 'end the sprint', 'implement the next task', or wanting guided step-by-step development assistance. Keywords: workflow, orchestrate, agile, task cycle, sprint, daily, implement, review, PR, standup, retrospective, gitea, tea.
microsoft-graph-gateway
IncludedRoute Microsoft Graph work in this workspace. Use when users want to read or write Outlook mail, calendar events, contacts, OneDrive or SharePoint files, Teams, Planner, To Do, users, groups, directory data, or arbitrary Microsoft Graph endpoints from VS Code. Prefer WorkIQ for common read scenarios. Use Microsoft Graph for write actions and gap-read scenarios that need exact Graph properties, filters, permissions, or endpoints.
copilotkit
IncludedUse when building with CopilotKit — setup, development, integrations, debugging, upgrading, or contributing. Routes to the appropriate specialized skill based on the task.
wordly-wisdom
IncludedProvides calibrated decision analysis using Charlie Munger-style multiple mental models, inversion, incentive mapping, circle-of-competence checks, misjudgment audits, second-order effects, and forecast updates. Use when the user asks for an oracle take, a hard call, a decision memo, a premortem, an outside view, a red-team, a sanity-check, what am I missing, think this through, or wants a strategy, hire, investment, plan, product, partnership, or major life choice analysed. Avoid for simple factual lookups or time-sensitive legal, medical, or market questions without fresh evidence.
swain-session
IncludedSession management and project status dashboard. Owns the full session lifecycle (start/work/close/resume), focus lane, bookmarks, worktree detection, and tab naming. Also serves as the project status dashboard — shows active epics, progress, actionable next steps, blocked items, tasks, GitHub issues, and recommendations. Worktree creation is deferred to swain-do task dispatch (SPEC-195). Triggers on: 'session', 'status', 'what's next', 'dashboard', 'overview', 'where are we', 'what should I work on', 'show me priorities', 'bookmark', 'focus on', 'session info'.
gandi
IncludedComprehensive Gandi domain registrar integration for domain and DNS management. Register and manage domains, create/update/delete DNS records (A, AAAA, CNAME, MX, TXT, SRV, and more), configure email forwarding and aliases, check SSL certificate status, create DNS snapshots for safe rollback, bulk update zone files, and monitor domain expiration. Supports multi-domain management, zone file import/export, and automated DNS backups. Includes both read-only and destructive operations with safety controls.