
bio-immunoinformatics-tcr-epitope-binding


Infer or annotate TCR antigen specificity by unsupervised clustering (TCRdist/tcrdist3, GLIPH2, clusTCR, GIANA) and database lookup (VDJdb, IEDB, McPAS-TCR), and rank candidates with supervised predictors (ERGO-II, NetTCR-2.x, pMTnet) under explicit caveats. Encodes the central truth that general TCR-epitope prediction for UNSEEN epitopes essentially does not work (collapses to near-random; IMMREP22, Grazioli 2022) because labeled data is dominated by a few immunodominant epitopes and there is no true negative set — so clustering for discovery is the honest task and de-novo binding needs wet-lab validation. Use when annotating TCR specificity or grouping a repertoire. Epitope/MHC context lives in mhc-binding-prediction.



## Version Compatibility

Reference examples tested with: tcrdist3 0.2+, pandas 2.2+, scipy 1.12+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

Notes specific to this skill: tcrdist3 expects its own column names (`cdr3_b_aa`, `v_b_gene`, `j_b_gene`, and the `_a_` analogs, plus `count`), with V/J genes in IMGT nomenclature, usually including the allele (e.g. `TRBV19*01`). Tools disagree on whether CDR3 keeps the leading Cys and the trailing Phe/Trp, which is a common silent input mismatch. Supervised predictors (ERGO-II, NetTCR, pMTnet) are separate repos with pretrained weights; their reported AUCs depend heavily on the train/test split and negative-sampling scheme. Re-verify before trusting any number.
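
The CDR3 boundary mismatch described above can be detected before any distance is computed. The following is a minimal sketch, assuming amino-acid CDR3 strings and assuming "anchored" means a leading Cys plus a trailing Phe or Trp; `cdr3_anchor_report` is a hypothetical helper, not part of tcrdist3 or any other tool named here.

```python
from collections import Counter

def cdr3_anchor_report(cdr3s):
    """Count how many CDR3 strings keep the conserved anchor residues.

    Assumption: an anchored CDR3 starts with C and ends with F or W.
    A repertoire where most sequences lack anchors was probably exported
    with them trimmed and should be reconciled with the tool's convention.
    """
    counts = Counter()
    for seq in cdr3s:
        seq = seq.strip().upper()
        has_cys = seq.startswith("C")
        has_fw = seq.endswith(("F", "W"))
        if has_cys and has_fw:
            counts["both_anchors"] += 1
        elif has_cys or has_fw:
            counts["one_anchor"] += 1
        else:
            counts["no_anchor"] += 1
    return dict(counts)

# Two anchored sequences and one exported with both anchors trimmed.
report = cdr3_anchor_report(["CASSIRSSYEQYF", "ASSIRSSYEQY", "CASSLAPGATNEKLFF"])
```

A mixed report (some anchored, some not) within one input file usually means two sources with different conventions were concatenated; that is worth resolving before clustering or lookup.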

# TCR-Epitope Binding

**"What antigen does this TCR recognize / which TCRs share specificity?"** -> Annotate specificity by clustering + database lookup; predict de-novo binding only as a validation-bound hypothesis.
- Python: `tcrdist3` (TCRrep distance + meta-clonotypes), GLIPH2, clusTCR, GIANA for clustering
- Python: ERGO-II / NetTCR-2.x / pMTnet for supervised scoring (caveated); VDJdb/IEDB/McPAS-TCR for lookup

## The Single Most Important Modern Insight -- general prediction for unseen epitopes does not work; clustering does

Every supervised TCR-epitope predictor performs respectably on epitopes seen in training and collapses to near-random on epitopes it has never seen (Grazioli 2022; IMMREP22, Meysman 2023 across 23 models). The cause is the data, not the architecture: the labeled TCR-pMHC universe is dominated by a few immunodominant epitopes (NLVPMVATV/CMV, GILGFVFTL/influenza M1, SARS-CoV-2 spike), so a model learns "is this an anti-CMV TCR" rather than the rules of TCR-peptide docking.

Compounding this, there is no true negative set. Experiments report binders, and absence of a measured non-binder is not non-binding, so every supervised model manufactures negatives, and that choice dominates the reported metric more than the architecture (Dens 2023).

The honest, defensible task is unsupervised specificity clustering: "these TCRs are sequence-similar enough to likely share a specificity," a discovery statement used within one dataset and propagated by guilt-by-association to a known member. Clustering is honest because it never extrapolates into unseen-epitope space; per-pair prediction is dishonest when it pretends to. Route the user to the honest task and refuse to let a supervised per-pair probability substitute for a tetramer.

## Tool Taxonomy

| Tool | Citation | Task | Input | Note |
|------|----------|------|-------|------|
| TCRdist / tcrdist3 | Dash 2017; Mayer-Blackwell 2021 | Clustering (distance) | CDR3 + V/J, both chains | Multi-loop distance, 3x weight on CDR3; meta-clonotypes |
| GLIPH2 | Huang 2020 | Clustering (global + motif) | CDR3β + V/J + HLA | Predicts restricting allele; background-repertoire dependent |
| clusTCR | Valkiers 2021 | Clustering (Faiss+MCL) | CDR3β | Scales to millions; trades some specificity for speed |
| GIANA / iSMART | Zhang 2021; Zhang 2020 | Clustering (fast) | CDR3β | Small high-specificity clusters |
| ERGO-II | Springer 2021 | Supervised prediction | CDR3β(+α,V,J,MHC) | Degrades gracefully; seen-epitope only |
| NetTCR-2.x | Montemurro 2021 | Supervised prediction | paired CDR3α+β | Paired beats single-chain; ~150 pos/epitope needed |
| pMTnet / PanPep | Lu 2021; Gao 2023 | Supervised, neoantigen-aimed | CDR3β + peptide + MHC | Zero-shot claims need skepticism |

## Reference Databases (training set AND lookup table)

| Database | Citation | Content | Caveat |
|----------|----------|---------|--------|
| VDJdb | Shugay 2018; Bagaev 2020 | Curated TCR-pMHC with confidence 0-3 | Filter on confidence; skewed to HLA-A*02:01 |
| IEDB | Vita 2019 | TCR + pMHC assays | The corpus most predictors draw on |
| McPAS-TCR | Tickotsky 2017 | Pathology-organized (infection/cancer/autoimmune) | Human + mouse |
| 10x dextramer | Zhang 2021 (Sci Adv) | Largest paired-chain set, 4 donors | Labels are threshold calls, not gold; multiplets/background |

## Decision Tree by Scenario

| Scenario | Recommended | Why |
|----------|-------------|-----|
| Have known specificities (tetramer sort / DB hits) | Cluster (tcrdist3/GLIPH2) + lookup, propagate labels | The honest, bounded question |
| Group a repertoire by likely shared specificity | tcrdist3 or clusTCR within one cohort | Discovery within dataset; keep HLA as covariate |
| Truly de-novo novel epitope (e.g. neoantigen) | Rank with pMTnet/PanPep, label as hypothesis, validate | Prediction does not generalize; tetramer/functional assay decides |
| Millions of CDR3s | clusTCR (Faiss+MCL) | Speed at modest specificity cost |
| Predict restricting HLA from sequence | GLIPH2 | Infers allele from cross-donor co-occurrence |
| "Does this TCR bind peptide X?" for unseen X | No reliable computational answer | State plainly; there is no third branch |

## Cluster TCRs by Specificity (tcrdist3)

**Goal:** Group TCRs likely to share an antigen, for discovery within one cohort.

**Approach:** Build a TCRrep (which computes the position-weighted multi-loop distance, 3x on CDR3), then cluster the pairwise matrix and annotate clusters containing a known-specificity member. Keep HLA as an explicit covariate — the same CDR3 on a different allele is a different specificity.

```python
from tcrdist.repertoire import TCRrep
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

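def cluster_tcrs(df, max_dist=50):
    '''df needs tcrdist3 columns: cdr3_b_aa, v_b_gene, j_b_gene, count.
    Returns tr.clone_df with a `cluster` column. TCRrep deduplicates rows by
    default, so labels align with clone_df, not with the input df.'''
    tr = TCRrep(cell_df=df, organism='human', chains=['beta'])
    # pw_beta is a square matrix; linkage() expects the condensed form.
    condensed = squareform(tr.pw_beta, checks=False)
    labels = fcluster(linkage(condensed, method='average'), t=max_dist, criterion='distance')
    clones = tr.clone_df.copy()
    clones['cluster'] = labels  # annotate clusters that contain a database/tetramer hit
    return clones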
```
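
The annotation step in the approach (labeling clusters that contain a known-specificity member) can be sketched as below. This is a minimal illustration, assuming each clonotype is a plain record with a `cluster` id and an `epitope` that is `None` when unknown; both key names and the `propagate_labels` helper are hypothetical. Clusters whose annotated members disagree are flagged rather than resolved.

```python
from collections import defaultdict

def propagate_labels(records):
    """Map each cluster id to a guilt-by-association label.

    records: list of dicts with 'cluster' and 'epitope' (None if unknown).
    Returns {cluster_id: label} where label is the single known epitope,
    'ambiguous' if annotated members disagree, or None if nothing is known.
    """
    known = defaultdict(set)
    clusters = set()
    for rec in records:
        clusters.add(rec["cluster"])
        if rec.get("epitope"):
            known[rec["cluster"]].add(rec["epitope"])

    labels = {}
    for cid in clusters:
        epitopes = known.get(cid, set())
        if not epitopes:
            labels[cid] = None          # no anchor: stays a discovery cluster
        elif len(epitopes) == 1:
            labels[cid] = next(iter(epitopes))
        else:
            labels[cid] = "ambiguous"   # conflicting hits: do not propagate
    return labels

example = [
    {"cluster": 1, "epitope": "GILGFVFTL"},
    {"cluster": 1, "epitope": None},
    {"cluster": 2, "epitope": None},
    {"cluster": 3, "epitope": "NLVPMVATV"},
    {"cluster": 3, "epitope": "GILGFVFTL"},
]
cluster_labels = propagate_labels(example)
```

The sketch deliberately ignores HLA; in practice a propagated label should only be accepted when the donor carries the restricting allele of the annotated member. Records of this shape can be produced from a DataFrame with `df.to_dict('records')`.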

## Annotate by Database Lookup

**Goal:** Assign specificity to TCRs that match known TCR-pMHC pairs.

**Approach:** Match exactly or near-exactly against VDJdb/IEDB/McPAS, filtering VDJdb on its confidence score, and report the database hit and HLA restriction driving each annotation — not a per-pair probability dressed as certainty.

```python
import pandas as pd

def lookup_vdjdb(query_cdr3b, vdjdb, min_confidence=1):
    '''Exact CDR3b match against a confidence-filtered VDJdb. Near-matches (edit
    distance 1) belong to the clustering route, not a binding claim.'''
    # VDJdb mixes TRA and TRB rows; keep beta-chain records for a CDR3b query.
    db = vdjdb[(vdjdb['vdjdb.score'] >= min_confidence) & (vdjdb['gene'] == 'TRB')]
    hits = db[db['cdr3'].isin(set(query_cdr3b))]
    return hits[['cdr3', 'antigen.epitope', 'antigen.species', 'mhc.a']]
```
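
Because the HLA restriction drives each annotation, a lookup hit is only plausible when the donor carries the restricting allele. The following is a minimal sketch of that check, assuming hits are plain records with an `mhc.a` field and that donor and database alleles use the same nomenclature (e.g. `HLA-A*02:01`); `hla_consistent_hits` is a hypothetical helper. Alleles are compared field by field, so a lower-resolution database entry matches any donor allele that refines it.

```python
def _allele_fields(allele):
    """Split 'HLA-A*02:01:01' into ['HLA-A*02', '01', '01']."""
    return allele.strip().split(":")

def hla_consistent_hits(hits, donor_alleles):
    """Split lookup hits by whether the donor carries the restricting allele.

    hits: list of dicts with an 'mhc.a' key.
    donor_alleles: iterable of typed alleles for the donor.
    Returns (supported, unsupported) lists of hits.
    """
    donor = [_allele_fields(a) for a in donor_alleles]
    supported, unsupported = [], []
    for hit in hits:
        db_fields = _allele_fields(hit["mhc.a"])
        # Match when the database allele is a field-wise prefix of a donor allele.
        if any(d[:len(db_fields)] == db_fields for d in donor):
            supported.append(hit)
        else:
            unsupported.append(hit)
    return supported, unsupported

example_hits = [
    {"cdr3": "CASSIRSSYEQYF", "mhc.a": "HLA-A*02:01"},
    {"cdr3": "CASSLAPGATNEKLFF", "mhc.a": "HLA-B*07:02"},
]
supported, unsupported = hla_consistent_hits(
    example_hits, ["HLA-A*02:01:01", "HLA-B*08:01"]
)
```

Unsupported hits are worth reporting separately rather than discarding silently: they may indicate missing HLA typing rather than a wrong annotation.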

## Per-Method Failure Modes

### Unseen-epitope collapse
**Trigger:** using a supervised model on an epitope absent from training. **Mechanism:** models learn a few well-sampled specificities, not docking rules. **Symptom:** great benchmark AUC, near-random on novel epitopes. **Fix:** route de-novo questions to ranking-plus-validation; never report a confident per-pair call.
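
One way to apply this fix is to check how well the query epitope is represented in the training data before reporting any score. The sketch below is an illustration under stated assumptions: `epitope_support` is a hypothetical helper, and the default of 150 positives is taken from the NetTCR-2.x note in the tool table, not a universal threshold.

```python
from collections import Counter

def epitope_support(train_epitopes, query_epitope, min_positives=150):
    """Classify a query epitope by its support in the training positives.

    train_epitopes: iterable with one epitope string per training positive.
    Returns 'unseen', 'undersampled', or 'seen'. Only 'seen' justifies
    reading a supervised score as more than a ranking hypothesis.
    """
    n = Counter(train_epitopes)[query_epitope]
    if n == 0:
        return "unseen"
    if n < min_positives:
        return "undersampled"
    return "seen"

# Toy training set: one well-sampled and one sparsely sampled epitope.
train = ["GILGFVFTL"] * 200 + ["NLVPMVATV"] * 20
status_seen = epitope_support(train, "GILGFVFTL")
status_sparse = epitope_support(train, "NLVPMVATV")
status_novel = epitope_support(train, "KLGGALQAK")
```

Anything other than `'seen'` routes to ranking plus wet-lab validation, per the decision tree above.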

### Negative-sampling artifact
**Trigger:** trusting a headline AUC. **Mechanism:** manufactured negatives (shuffled or random-TCR) create artificial separability; repeated-negative leakage lets the model count TCR frequency. **Symptom:** AUC > 0.85 with no discussion of negatives/splits. **Fix:** read the negative-sampling sentence first; require epitope-disjoint evaluation.
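
Epitope-disjoint evaluation means whole epitopes are held out, so no test epitope contributes a single training pair. A minimal sketch, assuming the data is a list of `(cdr3, epitope)` tuples; `epitope_disjoint_split` and its parameters are hypothetical names, and the split does not balance pair counts across folds.

```python
import random

def epitope_disjoint_split(pairs, test_fraction=0.2, seed=0):
    """Hold out whole epitopes rather than individual TCR-epitope pairs.

    pairs: list of (cdr3, epitope) tuples.
    Returns (train, test) such that no epitope appears in both.
    """
    epitopes = sorted({ep for _, ep in pairs})  # sorted for reproducibility
    rng = random.Random(seed)
    rng.shuffle(epitopes)
    n_test = max(1, round(len(epitopes) * test_fraction))
    test_epitopes = set(epitopes[:n_test])
    train = [p for p in pairs if p[1] not in test_epitopes]
    test = [p for p in pairs if p[1] in test_epitopes]
    return train, test

toy_pairs = [
    ("CASSIRSSYEQYF", "GILGFVFTL"),
    ("CASSIRSAYEQYF", "GILGFVFTL"),
    ("CASSLAPGATNEKLFF", "NLVPMVATV"),
    ("CASSPVTGGIYGYTF", "NLVPMVATV"),
    ("CASSYSTGDEQYF", "YLQPRTFLL"),
]
train_pairs, test_pairs = epitope_disjoint_split(toy_pairs)
```

A random pair-level split would put the same epitope on both sides and measure seen-epitope performance only, which is the number the headline AUC usually reflects.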

### CDR3β-only ceiling
**Trigger:** strong claims from a β-only model. **Mechanism:** alpha chain and V/J carry heavy signal; bulk β-only is information-poor. **Symptom:** big AUC from the least informative input (i.e. from artifacts). **Fix:** prefer paired-chain data; add V/J; discount β-only headline numbers.

### Clustering con
