
bio-local-blast

Build local BLAST databases and run searches using NCBI BLAST+ command-line tools. Use when running >50 queries, building custom databases with -parse_seqids and -taxid, downloading prebuilt NCBI databases via update_blastdb.pl, choosing -task variants (megablast/dc-megablast/blastn/blastn-short), tuning soft/hard masking, scaling threads, or extracting hits with blastdbcmd. Encodes BLAST v5 vs v4 database format, taxonomy filtering, makeblastdb pitfalls.

## Version Compatibility

Reference examples tested with: NCBI BLAST+ 2.15+

Before using code patterns, verify installed versions match. If versions differ:
- CLI: `blastn -version` then `blastn -help` to confirm flags
- CLI: `makeblastdb -help` to confirm database build options

If a flag is unrecognized or behavior changes, introspect with `-help` and adapt the example to match the installed version rather than retrying.
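
The version check above can also be scripted. The following is a minimal sketch, assuming `blastn -version` prints a line such as `blastn: 2.15.0+`; the helper names and `MIN_VERSION` are illustrative, not part of BLAST+.

```python
import re
import subprocess

# Version the reference examples were tested with (from the note above).
MIN_VERSION = (2, 15, 0)


def parse_blast_version(text):
    """Extract (major, minor, patch) from the output of `<program> -version`."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in: {text!r}")
    return tuple(int(part) for part in match.groups())


def installed_blast_version(program="blastn"):
    """Run `<program> -version` and parse it (requires BLAST+ on PATH)."""
    result = subprocess.run(
        [program, "-version"], capture_output=True, text=True, check=True
    )
    return parse_blast_version(result.stdout)


def is_supported(version, minimum=MIN_VERSION):
    """True when the installed version is at least the tested version."""
    return version >= minimum
```

A pipeline could call `is_supported(installed_blast_version())` once at startup and fall back to inspecting `-help` when it returns `False`.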

# Local BLAST

**"Run BLAST locally for speed and control"** -> Build or download a BLAST+ database, run the appropriate program with carefully chosen `-task`, masking, and thread settings, then parse the tabular output. Local BLAST is the right tool when remote BLAST is rate-limited or when the database must be reproducible (frozen).

The biggest mistakes are (a) using `nt`/`nr` without realizing they're >250 GB and grow weekly, (b) not building with `-parse_seqids` and then being unable to extract hit sequences with `blastdbcmd`, (c) using default `blastn` for cross-species when `dc-megablast` is correct, and (d) thinking `-num_threads 32` will scale -- past ~16 threads BLAST is I/O bound.

- CLI: `makeblastdb`, `blastn`/`blastp`, `blastdbcmd`, `update_blastdb.pl` (NCBI BLAST+)
- Python: `subprocess` wrapper (preferred); `Bio.Blast.Applications` was deprecated and removed -- do not use
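
A `subprocess` wrapper of the kind recommended above might look like the sketch below. The function names and defaults are hypothetical; only the BLAST+ flags (`-query`, `-db`, `-out`, `-outfmt`, `-num_threads`, `-task`) come from the command-line tools themselves.

```python
import subprocess


def build_blast_command(program, query, db, out, task=None,
                        num_threads=4, outfmt="6"):
    """Assemble an argv list for a BLAST+ search program.

    Passing a list (not a shell string) means a multi-word -outfmt such as
    "6 qseqid sseqid pident" stays a single argument with no extra quoting.
    """
    cmd = [
        program,
        "-query", str(query),
        "-db", str(db),
        "-out", str(out),
        "-outfmt", outfmt,
        "-num_threads", str(num_threads),
    ]
    if task is not None:
        cmd += ["-task", task]
    return cmd


def run_blast(cmd):
    """Run the command; raises CalledProcessError (stderr attached) on failure."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

For example, `run_blast(build_blast_command("blastn", "q.fa", "custom_db", "hits.tsv", task="dc-megablast"))` runs a cross-species search, provided BLAST+ is on `PATH`.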

## Installation

```bash
# conda (preferred)
conda install -c bioconda blast

# macOS
brew install blast

# Ubuntu
sudo apt install ncbi-blast+

# Verify
blastn -version    # NCBI BLAST+ 2.15+ expected
update_blastdb.pl --showall pretty | head
```

## Database format: v5 vs v4

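NCBI introduced BLAST database v5 in BLAST+ 2.8 and made it the default format in BLAST+ 2.10. v5 stores per-sequence taxonomy IDs in the database files, enabling `-taxids` and `-taxidlist` filtering. v4 databases do not support taxid filtering at all. In recent releases, expanding a higher-level (non-leaf) taxid to its descendants additionally relies on `taxonomy4blast.sqlite3` being present and discoverable.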

| Feature | v4 | v5 |
|---|---|---|
| Default for prebuilt NCBI dbs | No (legacy) | Yes (since 2020) |
| `-taxids`, `-taxidlist` support | No | Yes |
| `blastdbcmd -taxids` | No | Yes |
| New `-info` output fields | No | Yes |

`update_blastdb.pl` downloads v5 by default. `makeblastdb` in BLAST+ 2.10 and later also writes v5 by default, but passing `-blastdb_version 5` explicitly documents the intent and guards against older installs. **Always pass `-blastdb_version 5` and `-parse_seqids` when building from scratch.**

## `makeblastdb` flag taxonomy

| Flag | Effect | When |
|---|---|---|
| `-dbtype nucl` or `-dbtype prot` | Required | Always |
| `-parse_seqids` | Indexes accessions so `blastdbcmd -entry <acc>` works | Almost always (downstream extraction) |
| `-hash_index` | Builds an index of sequence hash values for fast exact-sequence lookups | Large dbs |
| `-blastdb_version 5` | Use v5 format | Always |
| `-taxid 9606` | Single taxid for all seqs | Single-species DB |
| `-taxid_map file.tsv` | Per-sequence taxid mapping (seqid<TAB>taxid) | Multi-species DB |
| `-mask_data masking.asnb` | Apply precomputed soft-masking | Production pipelines |
| `-title "..."` | Free-text label | Cosmetic |
| `-out path/prefix` | DB file path prefix | Always |
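
For a multi-species database, the `-taxid_map` file can be generated rather than written by hand. This is a minimal sketch assuming the `seqid<TAB>taxid` layout described in the table; the function names and example IDs are illustrative. As far as I know, `-taxid_map` only takes effect together with `-parse_seqids`.

```python
from pathlib import Path


def format_taxid_map(seqid_to_taxid):
    """Render one 'seqid<TAB>taxid' line per sequence."""
    lines = [f"{seqid}\t{int(taxid)}" for seqid, taxid in seqid_to_taxid.items()]
    return "\n".join(lines) + "\n"


def write_taxid_map(seqid_to_taxid, path):
    """Write the mapping to `path` for use with `makeblastdb -taxid_map`."""
    target = Path(path)
    target.write_text(format_taxid_map(seqid_to_taxid))
    return target
```

The sequence IDs must match the FASTA identifiers exactly, otherwise `makeblastdb` cannot assign the taxid.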

```bash
makeblastdb -in reference.fasta -dbtype nucl \
            -blastdb_version 5 \
            -parse_seqids \
            -hash_index \
            -title "Custom reference 2026-05" \
            -out custom_db
```
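
Because the database above is built with `-parse_seqids`, hit sequences can be pulled back out by accession. The sketch below writes an ID list and assembles a `blastdbcmd -entry_batch` call; the function name and the default file names are hypothetical.

```python
from pathlib import Path


def build_extract_command(db, accessions, id_file="hit_ids.txt",
                          out_fasta="hits.fasta"):
    """Write unique accessions to `id_file` and return a blastdbcmd argv list.

    Only works against databases built with -parse_seqids.
    """
    unique = list(dict.fromkeys(accessions))  # de-duplicate, keep first-seen order
    Path(id_file).write_text("\n".join(unique) + "\n")
    return [
        "blastdbcmd",
        "-db", str(db),
        "-entry_batch", str(id_file),
        "-out", str(out_fasta),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; a single accession can instead be fetched with `-entry <acc>`.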

## `-task` taxonomy (the most-misused BLAST setting)

For `blastn`, the `-task` flag picks among heuristics with different word sizes and gap parameters.

| `-task` | Word size | Gap model | Use case | Mistake to avoid |
|---|---|---|---|---|
| `megablast` (default) | 28 | linear | >=95% identity, intra-species, primer hits, contamination check | Used for cross-species searches, where it misses most divergent hits |
| `dc-megablast` | 11 (discontiguous) | affine | Cross-species mRNA homology | Underused -- this is what `blastn` "should" be for cross-species |
| `blastn` | 11 | affine | General sensitive DNA | Slower than dc-megablast for the same job |
| `blastn-short` | 7 | affine | Queries <50 nt (primers, small RNAs) | Leaving the default megablast on, whose 28-nt word cannot seed short queries |
| `rmblastn` | 11 | affine | Repeat masking; used by RepeatMasker/RepeatModeler | Specialized |

For `blastp`:

| `-task` | Word | Use case |
|---|---|---|
| `blastp` (default) | 3 | General protein similarity |
| `blastp-fast` | 6 | Faster, less sensitive |
| `blastp-short` | 2 | Peptides <30 aa, with PAM30 + word_size=2 typical |
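
The `blastn` table above reduces to a small decision rule. The sketch below encodes it with the thresholds the table states (<50 nt, >=95% identity); the function and its arguments are illustrative and not a BLAST+ feature.

```python
def choose_blastn_task(query_length, expected_identity=None, cross_species=False):
    """Pick a blastn -task value from query length and expected divergence.

    query_length      -- query length in nucleotides
    expected_identity -- expected percent identity to the target, if known
    cross_species     -- True when query and database are different species
    """
    if query_length < 50:
        return "blastn-short"      # short queries need the small word size
    if cross_species:
        return "dc-megablast"      # discontiguous seeds tolerate divergence
    if expected_identity is not None and expected_identity >= 95.0:
        return "megablast"         # fast path for near-identical sequences
    return "blastn"                # general sensitive search
```

When nothing is known about the expected identity, the rule falls back to `blastn`, trading speed for sensitivity.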

## Soft vs hard masking

| Setting | Effect on seed | Effect on extension | Effect on score |
|---|---|---|---|
| `-soft_masking true` (default for several tasks) | Skip masked positions when seeding | Allow extension through masked | Score includes masked positions |
| `-soft_masking false` + `-dust yes` / `-seg yes` | Skip masked positions when seeding | Skip masked positions in extension | Score excludes masked positions |
| Hard-mask in input FASTA (N or X) | Hard exclusion everywhere | Hard exclusion | Treated as mismatches |

Soft masking is correct for almost all cases. Hard masking creates artificial mismatches at masked boundaries and can split true alignments. The exception: searching against a database of repeats explicitly, where hard masking on the query is the right choice.
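
To make the difference concrete: in FASTA files, soft-masked regions are conventionally written in lowercase, while hard masking overwrites them with `N` (or `X` for protein). The sketch below, with hypothetical helper names, shows the lossy conversion that the paragraph above advises against in most cases.

```python
def hard_mask(sequence, mask_char="N"):
    """Replace every lowercase (soft-masked) residue with `mask_char`.

    This discards the original bases, which is why hard-masked regions
    can only ever count as mismatches in an alignment.
    """
    return "".join(mask_char if base.islower() else base for base in sequence)


def masked_fraction(sequence):
    """Fraction of the sequence that is soft-masked (lowercase)."""
    if not sequence:
        return 0.0
    return sum(1 for base in sequence if base.islower()) / len(sequence)
```

Note that BLAST+ only treats lowercase query residues as masked when `-lcase_masking` is passed; without it, case is ignored.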

## Thread scaling

BLAST+ multithreads a single search with `-num_threads` (by default splitting work across the database; recent releases add `-mt_mode 1` to split by query instead), but is I/O bound past ~16 threads on most hardware. For batches of more than 100,000 queries, the better answer is to split the input FASTA into N chunks and run N parallel `blastn` invocations -- this saturates CPUs better than `-num_threads 64`.

| Threads | Typical speedup vs single | Notes |
|---|---|---|
| 1-8 | Near-linear | Default sweet spot |
| 8-16 | Sub-linear (1.5-2x over 8) | Useful on big SMP boxes |
| 16-32 | Diminishing returns | I/O bound for most DBs |
| 32+ | Often slower | Cache thrash + I/O contention |
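
The chunking strategy described above can be sketched as follows. Records are dealt round-robin, which balances record counts rather than total sequence length; the function names are illustrative.

```python
def read_fasta_records(lines):
    """Yield each FASTA record as a list of lines (header plus sequence lines)."""
    record = []
    for line in lines:
        if line.startswith(">") and record:
            yield record
            record = []
        if line.strip():                      # skip blank lines
            record.append(line.rstrip("\n"))
    if record:
        yield record


def split_fasta(lines, n_chunks):
    """Distribute records round-robin into at most `n_chunks` lists of lines."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be >= 1")
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(read_fasta_records(lines)):
        chunks[index % n_chunks].extend(record)
    return [chunk for chunk in chunks if chunk]   # drop empty chunks
```

Each chunk can be written to its own file and searched by a separate `blastn` process with a modest `-num_threads`, then the tabular outputs concatenated.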

For massive workflows, prefer **DIAMOND** (Buchfink et al. 2021 *Nat Methods* 18:366) or **MMseqs2** (Steinegger & Söding 2017 *Nat Biotechnol* 35:1026) -- 100-10,000x faster than BLASTP at comparable sensitivity. See the `remote-homology` skill.

## Output format reference (`-outfmt`)

| `-outfmt` | Description | Use |
|---|---|---|
| 0 | Pairwise (default; human-readable) | Debugging, inspection |
| 5 | XML | Programmatic parsing (Bio.SearchIO) |
| 6 | Tabular (no header) | Most pipelines |
| 7 | Tabular with comment headers | Self-documenting |
| 11 | BLAST archive (ASN.1) | Convert later to any other format with `blast_formatter` |

Custom tabular fields:
```bash
blastn -query q.fa -db db -outfmt "6 qseqid sseqid pident length qcovs qcovhsp evalue bitscore staxids sscinames stitle"
```

Key fields for analysis:
- `pident` = percent identity over the aligned HSP (NOT over the whole query); pair it with `qcovhsp` to judge how much of the query that HSP covers
- `qcovs` = total query coverage by all HSPs of this subject (the "coverage" most users want)
- `qcovhsp` = query coverage of this HSP alone (equals `qcovs` when there's only one HSP per hit)
- `staxids` = subject taxonomy IDs (v5 only); critical for any "what species" workflow. `sscinames` additionally needs the NCBI `taxdb` files to be discoverable
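
Parsing the custom tabular output shown above needs only the standard library. This sketch assumes the exact field order of that `-outfmt` string; the filter thresholds are hypothetical placeholders, not recommendations.

```python
import csv
import io

# Field order must match the -outfmt string used for the search.
FIELDS = ["qseqid", "sseqid", "pident", "length", "qcovs", "qcovhsp",
          "evalue", "bitscore", "staxids", "sscinames", "stitle"]
NUMERIC = {"pident": float, "length": int, "qcovs": float,
           "qcovhsp": float, "evalue": float, "bitscore": float}


def parse_tabular(handle, fields=FIELDS):
    """Yield one dict per hit from -outfmt 6 (or 7; '#' lines are skipped)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(fields, row))
        for name, cast in NUMERIC.items():
            if name in hit:
                hit[name] = cast(hit[name])
        yield hit


def filter_hits(hits, min_pident=90.0, min_qcovs=80.0):
    """Keep hits passing both identity and total-coverage thresholds."""
    return [h for h in hits
            if h["pident"] >= min_pident and h["qcovs"] >= min_qcovs]
```

Usage with a results file: `with open("hits.tsv") as fh: good = filter_hits(parse_tabular(fh))`. Filtering on `qcovs` as well as `pident` avoids keeping short, high-identity fragments.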

## Prebuilt NCBI databases via `update_blastdb.pl`

```bash
# List available
update_blastdb.pl --showall pretty | grep -E 'refseq|swissprot|nt|nr'

# Download (with decompress)
update_blastdb.pl --decompress refseq_select_rna

# Download a multi-volume (split) database; all volumes are fetched
update_blastdb.pl --decompress refseq_protein

# Download with parallelism
update_blastdb.pl --decompress --num_threads 4 refseq_select_rna
```

Sizes (approximate, 2026):
- `refseq_select_rna`: ~5 GB
- `refseq_protein`: ~30 GB
- `swissprot`: <1 GB
- `nt`: ~250 GB
- `nr`: ~300 GB

For most use cases, `refseq_select_*` is the right starting point. `nt`/`nr` are storage-heavy and reproducibility-hostile.
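
Given the sizes above, a pre-download disk check is cheap insurance. The sketch below uses the approximate figures from the list; the 2x headroom factor is an assumption (compressed archives and decompressed files can coexist during `--decompress`), and the helper names are hypothetical.

```python
import shutil

# Approximate sizes in GB, taken from the list above (swissprot rounded up to 1).
APPROX_SIZE_GB = {
    "refseq_select_rna": 5,
    "refseq_protein": 30,
    "swissprot": 1,
    "nt": 250,
    "nr": 300,
}


def has_room(db_name, free_bytes, headroom=2.0):
    """True when `free_bytes` covers the database size times `headroom`."""
    needed = APPROX_SIZE_GB[db_name] * headroom * 1024 ** 3
    return free_bytes >= needed


def free_space_bytes(path="."):
    """Free bytes on the filesystem holding `path`."""
    return shutil.disk_usage(path).free
```

For example, `has_room("nt", free_space_bytes("/data/blastdb"))` can gate the call to `update_blastdb.pl`; the directory path here is only a placeholder.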

## Code patterns

### Build and search a custom protein database

**Goal:** Build a BLAST+ protein database from a custom FASTA and search against it.

**Approach:** `makeblastdb` with v5 + parse_seqids + hash_index; `blastp` with explicit outfmt.

**Reference (NCBI BLAST+ 2.15+):**
```bash
#!/bin/bash
# Reference: NCBI BLAST+ 2.15+ | Verify API if version differs

REF=reference_proteins.fasta
DB=ref_prot_db
QUERY=query.fasta
OUT=hits.tsv

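makeblastdb -in "$REF" -dbtype prot \
            -blastdb_version 5 -parse_seqids -hash_index \
            -out "$DB"

# Explicit tabular output; the field list is illustrative
blastp -query "$QUERY" -db "$DB" \
       -outfmt "6 qseqid sseqid pident length qcovs evalue bitscore" \
       -out "$OUT"
```
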