bio-local-blast

Build local BLAST databases and run searches using NCBI BLAST+ command-line tools. Use when running >50 queries, building custom databases with -parse_seqids and -taxid, downloading prebuilt NCBI databases via update_blastdb.pl, choosing -task variants (megablast/dc-megablast/blastn/blastn-short), tuning soft/hard masking, scaling threads, or extracting hits with blastdbcmd. Encodes BLAST v5 vs v4 database format, taxonomy filtering, makeblastdb pitfalls.


## Version Compatibility

Reference examples tested with: NCBI BLAST+ 2.15+

Before using code patterns, verify installed versions match. If versions differ:
- CLI: `blastn -version` then `blastn -help` to confirm flags
- CLI: `makeblastdb -help` to confirm database build options

If a flag is unrecognized or behavior changes, introspect with `-help` and adapt the example to match the installed version rather than retrying.
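
The version check above can be sketched in Python. This is a minimal sketch, assuming `blastn -version` prints a first line such as `blastn: 2.15.0+` (confirm against the installed binary). The helper names `parse_blast_version`, `installed_blast_version`, and `meets_minimum` are hypothetical, not part of BLAST+.

```python
import re
import subprocess

MIN_VERSION = (2, 15, 0)  # version the reference examples were tested with


def parse_blast_version(text):
    """Extract (major, minor, patch) from `blastn -version` output.

    Assumes a line such as 'blastn: 2.15.0+'; returns None if nothing matches.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)\+?", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_blast_version(program="blastn"):
    """Run `<program> -version` and parse it (requires BLAST+ on PATH)."""
    result = subprocess.run(
        [program, "-version"], capture_output=True, text=True, check=True
    )
    return parse_blast_version(result.stdout)


def meets_minimum(version, minimum=MIN_VERSION):
    """True when a parsed version tuple is at least the tested minimum."""
    return version is not None and version >= minimum
```

Calling `meets_minimum(installed_blast_version())` before running a pipeline gives an early, explicit failure instead of an unrecognized-flag error midway through.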

# Local BLAST

**"Run BLAST locally for speed and control"** -> Build or download a BLAST+ database, run the appropriate program with carefully chosen `-task`, masking, and thread settings, parse tabular output. Local BLAST is the right tool when remote is rate-limited or when the database must be reproducible (frozen).

The biggest mistakes are (a) using `nt`/`nr` without realizing they're >250 GB and grow weekly, (b) not building with `-parse_seqids` and then being unable to extract hit sequences with `blastdbcmd`, (c) using default `blastn` for cross-species when `dc-megablast` is correct, and (d) thinking `-num_threads 32` will scale -- past ~16 threads BLAST is I/O bound.

- CLI: `makeblastdb`, `blastn`/`blastp`, `blastdbcmd`, `update_blastdb.pl` (NCBI BLAST+)
- Python: `subprocess` wrapper (preferred); `Bio.Blast.Applications` was deprecated and removed -- do not use
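
A `subprocess` wrapper of the kind recommended above might look like the following. It is a sketch under stated assumptions: `build_blastn_cmd` and `run_command` are hypothetical helper names, and the default field list is only an example. The flags used (`-query`, `-db`, `-task`, `-num_threads`, `-outfmt`, `-out`) are standard `blastn` options.

```python
import subprocess


def build_blastn_cmd(query, db, out, task="megablast", threads=4,
                     outfmt="6 qseqid sseqid pident length evalue bitscore"):
    """Return the argv list for a blastn run; nothing is executed here."""
    return [
        "blastn",
        "-query", str(query),
        "-db", str(db),
        "-task", task,
        "-num_threads", str(threads),
        "-outfmt", outfmt,  # passed as one argv element, so no shell quoting
        "-out", str(out),
    ]


def run_command(cmd):
    """Run a command without a shell; raise with stderr attached on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout
```

Building the command as a list and avoiding `shell=True` keeps the multi-word `-outfmt` string intact and sidesteps quoting problems with file paths.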

## Installation

```bash
# conda (preferred)
conda install -c bioconda blast

# macOS
brew install blast

# Ubuntu
sudo apt install ncbi-blast+

# Verify
blastn -version    # NCBI BLAST+ 2.15+ expected
update_blastdb.pl --showall pretty | head
```

## Database format: v5 vs v4

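NCBI introduced BLAST database v5 in BLAST+ 2.8 (2018) and made it the default format in BLAST+ 2.10. v5 stores taxonomy IDs inside the database, which enables `-taxids` and `-taxidlist` filtering at search time. v4 databases have no taxid filtering. The companion `taxonomy4blast.sqlite3` file belongs to the v5 distribution (it ships with prebuilt v5 downloads), so keep it present and discoverable next to the database files.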

| Feature | v4 | v5 |
|---|---|---|
| Default for prebuilt NCBI dbs | No (legacy) | Yes (since 2020) |
| `-taxids`, `-taxidlist` support | No | Yes |
| `blastdbcmd -taxids` | No | Yes |
| New `-info` output fields | No | Yes |

`update_blastdb.pl` downloads v5 by default. `makeblastdb` has also written v5 by default since BLAST+ 2.10, so `-blastdb_version 5` is strictly needed only on older releases; passing it explicitly is harmless and documents intent. **Always pass `-blastdb_version 5` and `-parse_seqids` when building from scratch.**
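
To confirm which format an existing database uses, one option is to inspect the report from `blastdbcmd -db <prefix> -info`. The sketch below assumes that report contains a line of the form `BLASTDB Version: 5`; verify this against your installed version. The function names are hypothetical.

```python
import re


def blastdb_version_from_info(info_text):
    """Return the integer after 'BLASTDB Version:' in `blastdbcmd -info` output.

    Assumption: the report contains a line like 'BLASTDB Version: 5'.
    Returns None when no such line is found.
    """
    match = re.search(r"BLASTDB Version:\s*(\d+)", info_text)
    return int(match.group(1)) if match else None


def supports_taxid_filtering(info_text):
    """Per the table above, only v5 databases support -taxids / -taxidlist."""
    return blastdb_version_from_info(info_text) == 5
```

The text passed in would come from capturing the standard output of `blastdbcmd -db custom_db -info`, for example with `subprocess.run(..., capture_output=True, text=True)`.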

## `makeblastdb` flag taxonomy

| Flag | Effect | When |
|---|---|---|
| `-dbtype nucl` or `-dbtype prot` | Required | Always |
| `-parse_seqids` | Indexes accessions so `blastdbcmd -entry <acc>` works | Almost always (downstream extraction) |
| `-hash_index` | Speeds up extraction by accession | Large dbs |
| `-blastdb_version 5` | Use v5 format | Always |
| `-taxid 9606` | Single taxid for all seqs | Single-species DB |
| `-taxid_map file.tsv` | Per-sequence taxid mapping (seqid<TAB>taxid) | Multi-species DB |
| `-mask_data masking.asnb` | Apply precomputed soft-masking | Production pipelines |
| `-title "..."` | Free-text label | Cosmetic |
| `-out path/prefix` | DB file path prefix | Always |

```bash
makeblastdb -in reference.fasta -dbtype nucl \
            -blastdb_version 5 \
            -parse_seqids \
            -hash_index \
            -title "Custom reference 2026-05" \
            -out custom_db
```
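
For a multi-species database, the `-taxid_map` file from the flag table has to be generated alongside the build command. The following is a minimal sketch, assuming you already hold a mapping from sequence ID to taxid; `write_taxid_map`, `build_makeblastdb_cmd`, and `build_blastdbcmd_entry_cmd` are hypothetical helpers that only assemble arguments and do not run BLAST+.

```python
from pathlib import Path


def write_taxid_map(seq_to_taxid, path):
    """Write the two-column 'seqid<TAB>taxid' file that -taxid_map expects."""
    lines = [f"{seqid}\t{taxid}" for seqid, taxid in seq_to_taxid.items()]
    path = Path(path)
    path.write_text("\n".join(lines) + "\n")
    return path


def build_makeblastdb_cmd(fasta, out_prefix, dbtype="nucl",
                          taxid_map=None, title=None):
    """Return a makeblastdb argv list with the flags recommended above."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    cmd = [
        "makeblastdb", "-in", str(fasta), "-dbtype", dbtype,
        "-blastdb_version", "5",
        "-parse_seqids",  # needed for blastdbcmd -entry and for -taxid_map
        "-hash_index",
        "-out", str(out_prefix),
    ]
    if taxid_map is not None:
        cmd += ["-taxid_map", str(taxid_map)]
    if title is not None:
        cmd += ["-title", title]
    return cmd


def build_blastdbcmd_entry_cmd(db, accession):
    """Return the argv list that extracts one sequence by accession."""
    return ["blastdbcmd", "-db", str(db), "-entry", accession]
```

The extraction helper shows why `-parse_seqids` matters: without it, `blastdbcmd -entry <acc>` has no accession index to look up.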

## `-task` taxonomy (the most-misused BLAST setting)

For `blastn`, the `-task` flag picks among heuristics with different word sizes and gap parameters.

| `-task` | Word | Gapped | Use case | Mistake to avoid |
|---|---|---|---|---|
| `megablast` (default) | 28 | linear | >=95% identity, intra-species, primer hits, contamination check | Used for cross-species and misses everything |
| `dc-megablast` | 11 (discontiguous) | yes | Cross-species mRNA homology | Underused -- this is what `blastn` "should" be for cross-species |
| `blastn` | 11 | yes | General sensitive DNA | Slower than dc-megablast for same job |
| `blastn-short` | 7 | yes | Queries <50 nt (primers, small RNAs) | Default megablast can't seed at length 7 |
| `rmblastn` | 11 | yes | Repeat masking; bundled with RepeatModeler | Specialized |

For `blastp`:

| `-task` | Word | Use case |
|---|---|---|
| `blastp` (default) | 3 | General protein similarity |
| `blastp-fast` | 6 | Faster, less sensitive |
| `blastp-short` | 2 | Peptides <30 aa, with PAM30 + word_size=2 typical |
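
The `blastn` task table can be turned into a small decision helper. This is a sketch of one possible encoding: the 50 nt and 95% thresholds come from the table, while the function name `choose_blastn_task` and the `max_sensitivity` switch are assumptions introduced for illustration.

```python
def choose_blastn_task(query_length, expected_identity, max_sensitivity=False):
    """Pick a blastn -task value from query length (nt) and expected % identity."""
    if query_length < 50:
        return "blastn-short"  # word size 7 can seed on primers and small RNAs
    if expected_identity >= 95.0:
        return "megablast"     # word size 28, fast for near-identical hits
    if max_sensitivity:
        return "blastn"        # word size 11, slower general-purpose search
    return "dc-megablast"      # discontiguous seeds for cross-species hits


def task_args(query_length, expected_identity, max_sensitivity=False):
    """Return the ['-task', value] pair ready to append to a blastn argv list."""
    task = choose_blastn_task(query_length, expected_identity, max_sensitivity)
    return ["-task", task]
```

Expected identity is usually a guess made before the search, so treat the result as a starting point and rerun with a more sensitive task if a query that should match returns nothing.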

## Soft vs hard masking

| Setting | Effect on seed | Effect on extension | Effect on score |
|---|---|---|---|
| `-soft_masking true` (default for several tasks) | Skip masked positions when seeding | Allow extension through masked | Score includes masked positions |
| `-soft_masking false` + `-dust yes` / `-seg yes` | Skip masked positions when seeding | Skip masked positions in extension | Score excludes masked positions |
| Hard-mask in input FASTA (N or X) | Hard exclusion everywhere | Hard exclusion | Treated as mismatches |

Soft masking is correct for almost all cases. Hard masking creates artificial mismatches at masked boundaries and can split true alignments. The exception: searching against a database of repeats explicitly, where hard masking on the query is the right choice.

## Thread scaling

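By default `-num_threads` splits the work of each search across the database rather than across queries; recent BLAST+ releases also offer `-mt_mode 1` to thread by query (confirm with `blastn -help`). Either way, BLAST is I/O bound past ~16 threads on most hardware. For batches of >100,000 queries, the better answer is splitting the input FASTA into N chunks and running N parallel `blastn` invocations -- this saturates CPUs better than `-num_threads 64`.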

| Threads | Typical speedup vs single | Notes |
|---|---|---|
| 1-8 | Near-linear | Default sweet spot |
| 8-16 | Sub-linear (1.5-2x over 8) | Useful on big SMP boxes |
| 16-32 | Diminishing returns | I/O bound for most DBs |
| 32+ | Often slower | Cache thrash + I/O contention |
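
The chunk-splitting approach described above can be sketched as follows. This is a minimal in-memory version, assuming the query FASTA fits in RAM; `read_fasta_records` and `split_fasta` are hypothetical names, and records are dealt round-robin so chunks end up with similar record counts.

```python
def read_fasta_records(text):
    """Yield each FASTA record (header plus sequence lines) as one string."""
    record = []
    for line in text.splitlines():
        if line.startswith(">") and record:
            yield "\n".join(record) + "\n"
            record = []
        if line.strip():  # drop blank lines between records
            record.append(line)
    if record:
        yield "\n".join(record) + "\n"


def split_fasta(text, n_chunks):
    """Distribute records round-robin into n_chunks strings (some may be empty)."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(read_fasta_records(text)):
        chunks[index % n_chunks].append(record)
    return ["".join(chunk) for chunk in chunks]
```

Each chunk would then be written to its own file and searched by a separate `blastn` process with a modest `-num_threads`; with `-outfmt 6` the per-chunk outputs can simply be concatenated afterwards.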

For massive workflows, prefer **DIAMOND** (Buchfink et al. 2021 *Nat Methods* 18:366) or **MMseqs2** (Steinegger & Söding 2017 *Nat Biotechnol* 35:1026) -- reported to be 100-10,000x faster than BLASTP at comparable sensitivity. See `remote-homology` skill.

## Output format reference (`-outfmt`)

| `-outfmt` | Description | Use |
|---|---|---|
| 0 | Pairwise (default; human-readable) | Debugging, inspection |
| 5 | XML | Programmatic parsing (Bio.SearchIO) |
| 6 | Tabular (no header) | Most pipelines |
| 7 | Tabular with comment headers | Self-documenting |
| 11 | BLAST archive (ASN.1) | Reformat into other `-outfmt` values later with `blast_formatter`, without rerunning the search |

Custom tabular fields:
```bash
blastn -query q.fa -db db -outfmt "6 qseqid sseqid pident length qcovs qcovhsp evalue bitscore staxids sscinames stitle"
```

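Key fields for analysis:
- `pident` = percent identical positions within the HSP alignment (NOT across the whole query); pair it with `qcovhsp` or `qcovs` to judge how much of the query is covered
- `qcovs` = total query coverage by all HSPs of this subject (the "coverage" most users want)
- `qcovhsp` = query coverage by the HSP on that row alone (equals `qcovs` when the subject has only one HSP)
- `staxids` = subject taxonomy IDs; requires taxids stored in the database (prebuilt v5 databases, or custom builds using `-taxid` / `-taxid_map`); critical for any "what species" workflow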
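
Because `-outfmt 6` has no header, the parser has to know the column order used on the command line. The sketch below assumes the exact field list from the custom tabular example above; `parse_outfmt6` and `filter_hits` are hypothetical helpers, and the 90% identity and 80% coverage cutoffs are illustrative defaults, not recommendations from the source.

```python
import csv
import io

# Must match the field list given to -outfmt on the command line.
FIELDS = ["qseqid", "sseqid", "pident", "length", "qcovs", "qcovhsp",
          "evalue", "bitscore", "staxids", "sscinames", "stitle"]
NUMERIC = {"pident": float, "length": int, "qcovs": float,
           "qcovhsp": float, "evalue": float, "bitscore": float}


def parse_outfmt6(text, fields=FIELDS):
    """Parse tab-separated BLAST output into a list of dicts with typed values."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    hits = []
    for row in reader:
        if not row or row[0].startswith("#"):  # skip blanks and outfmt 7 comments
            continue
        hit = dict(zip(fields, row))
        for name, cast in NUMERIC.items():
            if name in hit:
                hit[name] = cast(hit[name])
        hits.append(hit)
    return hits


def filter_hits(hits, min_pident=90.0, min_qcovs=80.0):
    """Keep hits that pass both an identity and a query-coverage threshold."""
    return [h for h in hits
            if h["pident"] >= min_pident and h["qcovs"] >= min_qcovs]
```

Filtering on `pident` and `qcovs` together avoids the common trap of accepting a short, perfectly matching HSP that covers only a small part of the query.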

## Prebuilt NCBI databases via `update_blastdb.pl`

```bash
# List available
update_blastdb.pl --showall pretty | grep -E 'refseq|swissprot|nt|nr'

# Download (with decompress)
update_blastdb.pl --decompress refseq_select_rna

# Download a multi-volume (split) database; all volumes are fetched
update_blastdb.pl --decompress refseq_protein

# Download with parallelism
update_blastdb.pl --decompress --num_threads 4 refseq_select_rna
```

Sizes (approximate, 2026):
- `refseq_select_rna`: ~5 GB
- `refseq_protein`: ~30 GB
- `swissprot`: <1 GB
- `nt`: ~250 GB
- `nr`: ~300 GB

For most use cases, `refseq_select_*` is the right starting point. `nt`/`nr` are storage-heavy and reproducibility-hostile.
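
Given those sizes, a pre-download disk check is cheap insurance. The sketch below reuses the approximate figures listed above; the 2x `headroom` factor is an assumption (compressed archives and extracted files may coexist during `--decompress`), and the function names are hypothetical.

```python
import shutil

# Approximate sizes in GB, copied from the list above.
APPROX_SIZE_GB = {
    "refseq_select_rna": 5,
    "refseq_protein": 30,
    "swissprot": 1,
    "nt": 250,
    "nr": 300,
}


def required_bytes(db_name, headroom=2.0):
    """Estimated bytes needed: approximate size times an assumed safety factor."""
    return int(APPROX_SIZE_GB[db_name] * headroom * 1024 ** 3)


def has_room_for(db_name, free_bytes, headroom=2.0):
    """True when free_bytes covers the estimated requirement for db_name."""
    return free_bytes >= required_bytes(db_name, headroom)


def free_bytes_at(path="."):
    """Free space on the filesystem that holds path."""
    return shutil.disk_usage(path).free
```

A typical use is `has_room_for("refseq_protein", free_bytes_at("/path/to/blastdb"))` before invoking `update_blastdb.pl`; the path here is a placeholder.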

## Code patterns

### Build and search a custom protein database

**Goal:** Build a BLAST+ protein database from a custom FASTA and search against it.

**Approach:** `makeblastdb` with v5 + parse_seqids + hash_index; `blastp` with explicit outfmt.

**Reference (NCBI BLAST+ 2.15+):**
```bash
#!/bin/bash
# Reference: NCBI BLAST+ 2.15+ | Verify API if version differs

REF=reference_proteins.fasta
DB=ref_prot_db
QUERY=query.fasta
OUT=hits.tsv

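makeblastdb -in "$REF" -dbtype prot \
            -blastdb_version 5 \
            -parse_seqids \
            -hash_index \
            -out "$DB"

# Source listing is truncated here; this blastp call and its field list are an assumed completion.
blastp -query "$QUERY" -db "$DB" \
       -outfmt "6 qseqid sseqid pident length qcovs evalue bitscore" \
       -out "$OUT"
```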

Source: https://github.com/gptomics/bioskills/tree/main/database-access/local-blast
