bio-immunoinformatics-tcr-epitope-binding
Infer or annotate TCR antigen specificity by unsupervised clustering (TCRdist/tcrdist3, GLIPH2, clusTCR, GIANA) and database lookup (VDJdb, IEDB, McPAS-TCR), and rank candidates with supervised predictors (ERGO-II, NetTCR-2.x, pMTnet) under explicit caveats. Encodes the central truth that general TCR-epitope prediction for UNSEEN epitopes essentially does not work (collapses to near-random; IMMREP22, Grazioli 2022) because labeled data is dominated by a few immunodominant epitopes and there is no true negative set — so clustering for discovery is the honest task and de-novo binding needs wet-lab validation. Use when annotating TCR specificity or grouping a repertoire. Epitope/MHC context lives in mhc-binding-prediction.
What this skill does
## Version Compatibility
Reference examples tested with: tcrdist3 0.2+, pandas 2.2+, scipy 1.12+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
Notes specific to this skill: tcrdist3 expects IMGT-style columns (`cdr3_b_aa`, `v_b_gene`, `j_b_gene`, and the `_a_` analogs, plus `count`). Tools disagree on whether CDR3 keeps the leading Cys / trailing Phe-Trp — a common silent input mismatch. Supervised predictors (ERGO-II, NetTCR, pMTnet) are separate repos with pretrained weights; their reported AUCs depend heavily on the train/test split and negative-sampling scheme. Re-verify before trusting any number.
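The Cys / Phe-Trp mismatch noted above can be caught before any tool runs. The following is a minimal sketch, assuming CDR3 amino-acid strings where the anchored form starts with the conserved C and ends with F or W. The helper names `cdr3_anchor_status`, `strip_anchors`, and `mixed_conventions` are hypothetical and are not part of tcrdist3 or any tool listed here.

```python
def cdr3_anchor_status(cdr3):
    '''Report whether a CDR3 string carries the conserved leading C and trailing F/W.'''
    seq = cdr3.strip().upper()
    return {'leading_C': seq.startswith('C'), 'trailing_FW': seq.endswith(('F', 'W'))}


def strip_anchors(cdr3):
    '''Drop the leading C and trailing F/W when present, for tools that expect the
    anchor-free form. Only strips residues that are actually there.'''
    seq = cdr3.strip().upper()
    if seq.startswith('C'):
        seq = seq[1:]
    if seq.endswith(('F', 'W')):
        seq = seq[:-1]
    return seq


def mixed_conventions(cdr3s):
    '''True when a column mixes fully anchored and not-fully-anchored sequences.'''
    flags = set()
    for s in cdr3s:
        status = cdr3_anchor_status(s)
        flags.add(status['leading_C'] and status['trailing_FW'])
    return len(flags) > 1
```

This is a heuristic: an anchor-free sequence can start with C or end with F by chance, so a mixed column is a prompt to check the upstream export, not proof of a mismatch.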
# TCR-Epitope Binding
**"What antigen does this TCR recognize / which TCRs share specificity?"** -> Annotate specificity by clustering + database lookup; predict de-novo binding only as a validation-bound hypothesis.
- Python: `tcrdist3` (TCRrep distance + meta-clonotypes), GLIPH2, clusTCR, GIANA for clustering
- Python: ERGO-II / NetTCR-2.x / pMTnet for supervised scoring (caveated); VDJdb/IEDB/McPAS-TCR for lookup
## The Single Most Important Modern Insight -- general prediction for unseen epitopes does not work; clustering does
Every supervised TCR-epitope predictor performs respectably on epitopes seen in training and collapses to near-random on epitopes it has never seen (Grazioli 2022; IMMREP22, Meysman 2023 across 23 models). The cause is the data, not the architecture: the labeled TCR-pMHC universe is dominated by a few immunodominant epitopes (NLVPMVATV/CMV, GILGFVFTL/influenza M1, SARS-CoV-2 spike), so a model learns "is this an anti-CMV TCR" rather than the rules of TCR-peptide docking.

Compounding this, there is no true negative set: experiments report binders, and a pair that was never measured as binding is not thereby a non-binder. Every supervised model therefore manufactures its negatives, and that choice dominates the reported metric more than the architecture does (Dens 2023).

The honest, defensible task is unsupervised specificity clustering: "these TCRs are sequence-similar enough to likely share a specificity," a discovery statement used within one dataset, with labels propagated by guilt-by-association from a known member to its cluster mates. Clustering is honest because it never extrapolates into unseen-epitope space; per-pair prediction is dishonest when it pretends to. Route the user to the honest task and refuse to let a supervised per-pair probability substitute for a tetramer.
## Tool Taxonomy
| Tool | Citation | Task | Input | Note |
|------|----------|------|-------|------|
| TCRdist / tcrdist3 | Dash 2017; Mayer-Blackwell 2021 | Clustering (distance) | CDR3 + V/J, both chains | Multi-loop distance, 3x weight on CDR3; meta-clonotypes |
| GLIPH2 | Huang 2020 | Clustering (global + motif) | CDR3β + V/J + HLA | Predicts restricting allele; background-repertoire dependent |
| clusTCR | Valkiers 2021 | Clustering (Faiss+MCL) | CDR3β | Scales to millions; trades some specificity for speed |
| GIANA / iSMART | Zhang 2021; Zhang 2020 | Clustering (fast) | CDR3β | Small high-specificity clusters |
| ERGO-II | Springer 2021 | Supervised prediction | CDR3β(+α,V,J,MHC) | Degrades gracefully when optional inputs are missing; seen-epitope only |
| NetTCR-2.x | Montemurro 2021 | Supervised prediction | paired CDR3α+β | Paired beats single-chain; ~150 pos/epitope needed |
| pMTnet / PanPep | Lu 2021; Gao 2023 | Supervised, neoantigen-aimed | CDR3β + peptide + MHC | Zero-shot claims need skepticism |
## Reference Databases (training set AND lookup table)
| Database | Citation | Content | Caveat |
|----------|----------|---------|--------|
| VDJdb | Shugay 2018; Bagaev 2020 | Curated TCR-pMHC with confidence 0-3 | Filter on confidence; skewed to HLA-A*02:01 |
| IEDB | Vita 2019 | TCR + pMHC assays | The corpus most predictors draw on |
| McPAS-TCR | Tickotsky 2017 | Pathology-organized (infection/cancer/autoimmune) | Human + mouse |
| 10x dextramer | Zhang 2021 (Sci Adv) | Largest paired-chain set, 4 donors | Labels are threshold calls, not gold; multiplets/background |
## Decision Tree by Scenario
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Have known specificities (tetramer sort / DB hits) | Cluster (tcrdist3/GLIPH2) + lookup, propagate labels | The honest, bounded question |
| Group a repertoire by likely shared specificity | tcrdist3 or clusTCR within one cohort | Discovery within dataset; keep HLA as covariate |
| Truly de-novo novel epitope (e.g. neoantigen) | Rank with pMTnet/PanPep, label as hypothesis, validate | Prediction does not generalize; tetramer/functional assay decides |
| Millions of CDR3s | clusTCR (Faiss+MCL) | Speed at modest specificity cost |
| Predict restricting HLA from sequence | GLIPH2 | Infers allele from cross-donor co-occurrence |
| "Does this TCR bind peptide X?" for unseen X | No reliable computational answer | State plainly; there is no third branch |
## Cluster TCRs by Specificity (tcrdist3)
**Goal:** Group TCRs likely to share an antigen, for discovery within one cohort.
**Approach:** Build a TCRrep (which computes the position-weighted multi-loop distance, 3x on CDR3), then cluster the pairwise matrix and annotate clusters containing a known-specificity member. Keep HLA as an explicit covariate — the same CDR3 on a different allele is a different specificity.
```python
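from tcrdist.repertoire import TCRrep
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def cluster_tcrs(df, max_dist=50):
    '''df needs IMGT columns: cdr3_b_aa, v_b_gene, j_b_gene (+ _a_ analogs), count.
    Returns tr.clone_df with a 'cluster' column; annotate clusters that contain a
    database/tetramer hit.'''
    # TCRrep deduplicates clones and computes pairwise distances on construction.
    tr = TCRrep(cell_df=df, organism='human', chains=['beta'])
    # pw_beta is a square symmetric matrix; linkage expects the condensed form.
    condensed = squareform(tr.pw_beta, checks=False)
    labels = fcluster(linkage(condensed, method='average'), t=max_dist, criterion='distance')
    # Labels follow the row order of tr.clone_df, which can differ from the input df.
    return tr.clone_df.assign(cluster=labels)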
```
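The annotation step described above (labeling clusters that contain a known-specificity member) can be sketched without any tcrdist3 dependency. This is a minimal illustration, assuming cluster labels as a plain list and known hits as a dict keyed by clone position; `propagate_labels` and `annotate` are hypothetical names, and the `direct` / `inferred` / `conflict` tags are an illustrative convention, not a standard.

```python
from collections import defaultdict


def propagate_labels(cluster_ids, known):
    '''cluster_ids: list of cluster labels, one per clone.
    known: dict mapping clone index -> epitope for clones with a database/tetramer hit.
    Returns dict cluster_id -> sorted list of epitopes seen among its known members.'''
    by_cluster = defaultdict(set)
    for idx, epitope in known.items():
        by_cluster[cluster_ids[idx]].add(epitope)
    return {cid: sorted(eps) for cid, eps in by_cluster.items()}


def annotate(cluster_ids, known):
    '''Per-clone (epitope, evidence) tuples: 'direct' for the hit itself, 'inferred'
    for cluster mates, 'conflict' when a cluster holds more than one epitope, and
    (None, None) when no member of the cluster is known.'''
    cluster_eps = propagate_labels(cluster_ids, known)
    out = []
    for idx, cid in enumerate(cluster_ids):
        eps = cluster_eps.get(cid)
        if idx in known:
            out.append((known[idx], 'direct'))
        elif eps is None:
            out.append((None, None))
        elif len(eps) > 1:
            out.append((None, 'conflict'))
        else:
            out.append((eps[0], 'inferred'))
    return out
```

Keeping `inferred` separate from `direct` preserves the distinction between a measured specificity and a guilt-by-association hypothesis. The sketch ignores HLA; in practice the restricting allele should be carried alongside each epitope so that the same CDR3 on a different allele is not merged.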
## Annotate by Database Lookup
**Goal:** Assign specificity to TCRs that match known TCR-pMHC pairs.
**Approach:** Match exactly or near-exactly against VDJdb/IEDB/McPAS, filtering VDJdb on its confidence score, and report the database hit and HLA restriction driving each annotation — not a per-pair probability dressed as certainty.
```python
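import pandas as pd


def lookup_vdjdb(query_cdr3b, vdjdb, min_confidence=1):
    '''Exact CDR3b match against a confidence-filtered VDJdb. Near-matches (edit
    distance 1) belong to the clustering route, not a binding claim.'''
    # Keep beta-chain records at or above the confidence floor (vdjdb.score runs 0-3).
    # Assumes the standard VDJdb export, which carries a 'gene' column (TRA/TRB).
    db = vdjdb[(vdjdb['gene'] == 'TRB') & (vdjdb['vdjdb.score'] >= min_confidence)]
    hits = db[db['cdr3'].isin(set(query_cdr3b))]
    # Report the epitope, source species, and restricting HLA behind each hit.
    return hits[['cdr3', 'antigen.epitope', 'antigen.species', 'mhc.a']]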
```
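Near-matches at edit distance 1 are kept out of the lookup above and handed to the clustering route. One way to collect them is sketched below, assuming plain lists of CDR3β strings; `within_one_edit` and `near_matches` are hypothetical helpers, and the all-pairs scan is only suitable for small reference sets.

```python
def within_one_edit(a, b):
    '''True if a and b differ by at most one substitution, insertion, or deletion.'''
    if a == b:
        return True
    la, lb = len(a), len(b)
    if abs(la - lb) > 1:
        return False
    if la > lb:
        a, b, la, lb = b, a, lb, la  # make a the shorter (or equal-length) string
    i = 0
    while i < la and a[i] == b[i]:
        i += 1
    if la == lb:
        return a[i + 1:] == b[i + 1:]  # single substitution at position i
    return a[i:] == b[i + 1:]          # single extra residue in b at position i


def near_matches(query_cdr3b, reference_cdr3b):
    '''Map each query to reference CDR3s at edit distance exactly 1 (exact hits excluded).'''
    refs = set(reference_cdr3b)
    return {q: sorted(r for r in refs if r != q and within_one_edit(q, r))
            for q in query_cdr3b}
```

The output is a list of candidate cluster mates for each query, to be reported as sequence neighbors of a known binder, not as binding calls.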
## Per-Method Failure Modes
### Unseen-epitope collapse
**Trigger:** using a supervised model on an epitope absent from training. **Mechanism:** models learn a few well-sampled specificities, not docking rules. **Symptom:** great benchmark AUC, near-random on novel epitopes. **Fix:** route de-novo questions to ranking-plus-validation; never report a confident per-pair call.
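The fix above amounts to a guard that runs before any supervised score is reported. A minimal sketch follows, assuming the caller can supply per-epitope positive counts from the model's training set; `route_prediction` is a hypothetical name, and the default `min_positives=150` is an assumption borrowed from the ~150 positives/epitope note for NetTCR-2.x in the tool table.

```python
def route_prediction(epitope, training_counts, min_positives=150):
    '''training_counts: dict epitope -> number of positive TCRs in the training set.
    min_positives is an assumed threshold; tune it per model.'''
    n = training_counts.get(epitope, 0)
    if n == 0:
        # Epitope absent from training: scores are expected to be near-random.
        return 'unseen: rank only, label as hypothesis, validate in wet lab'
    if n < min_positives:
        return 'under-sampled: treat score as weak evidence'
    return 'seen: supervised score usable with caveats'
```

For example, `route_prediction('NLVPMVATV', {'NLVPMVATV': 4000})` returns the `seen` route, while any epitope missing from `training_counts` returns the `unseen` route.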
### Negative-sampling artifact
**Trigger:** trusting a headline AUC. **Mechanism:** manufactured negatives (shuffled or random-TCR) create artificial separability; repeated-negative leakage lets the model count TCR frequency. **Symptom:** AUC > 0.85 with no discussion of negatives/splits. **Fix:** read the negative-sampling sentence first; require epitope-disjoint evaluation.
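Epitope-disjoint evaluation means every epitope lands wholly in train or wholly in test. A minimal standard-library sketch follows, assuming the data is a list of `(cdr3, epitope)` tuples; `epitope_disjoint_split` and its parameters are hypothetical, not taken from any of the predictors above.

```python
import random


def epitope_disjoint_split(pairs, test_fraction=0.2, seed=0):
    '''pairs: list of (cdr3, epitope) tuples. Whole epitopes go to train or test,
    so no test epitope is ever seen in training.'''
    epitopes = sorted({ep for _, ep in pairs})
    rng = random.Random(seed)
    rng.shuffle(epitopes)
    # Hold out at least one epitope, even for small epitope counts.
    n_test = max(1, round(len(epitopes) * test_fraction))
    test_eps = set(epitopes[:n_test])
    train = [p for p in pairs if p[1] not in test_eps]
    test = [p for p in pairs if p[1] in test_eps]
    return train, test
```

A random split over pairs would scatter each epitope across both sets and measure seen-epitope performance only. Negatives should be generated after this split, separately within train and test, so that manufactured negatives do not leak across the boundary.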
### CDR3β-only ceiling
**Trigger:** strong claims from a β-only model. **Mechanism:** alpha chain and V/J carry heavy signal; bulk β-only is information-poor. **Symptom:** big AUC from the least informative input (i.e. from artifacts). **Fix:** prefer paired-chain data; add V/J; discount β-only headline numbers.
### Clustering con