
bio-molecular-descriptors


Calculates molecular fingerprints (ECFP/Morgan, FCFP, MACCS, RDKit, AtomPair, TopologicalTorsion, Avalon, MAP4, MHFP6) and physicochemical descriptors (Lipinski, QED, TPSA, Crippen LogP, 3D shape) with explicit choice tables, bit vs count semantics, and partial-charge model selection. Use when featurizing molecules for similarity, QSAR, virtual screening, or ML, or selecting the correct fingerprint for a chemotype-aware task.



## Version Compatibility

Reference examples tested with: RDKit 2024.09+, numpy 1.26+, pandas 2.2+, mapchiral 0.1+ (MAP4), mhfp 1.9+.

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
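
The version check described above can be sketched with the standard library alone. The helper names `installed_version` and `signature_of` are illustrative and not part of any package. Note that many compiled functions (RDKit's among them) expose no Python-level signature, in which case `help()` remains the fallback.

```python
import inspect
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def signature_of(func):
    """Return the call signature as text, or a note if it cannot be introspected."""
    try:
        return str(inspect.signature(func))
    except (TypeError, ValueError):
        # Compiled (C/C++) callables often carry no introspectable signature
        return "<signature unavailable; use help()>"


# Report what is actually installed before adapting any example
for pkg in ("rdkit", "numpy", "pandas", "mapchiral", "mhfp"):
    print(pkg, installed_version(pkg) or "not installed")
```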

# Molecular Descriptors

Featurize molecules for similarity search, QSAR, virtual screening, or ML. The fingerprint or descriptor choice is **chemotype-aware**: ECFP4 dominates drug-like organic similarity, AtomPair and TopologicalTorsion outperform for scaffold hopping, MAP4/MHFP6 win on metabolomics-scale chemical diversity, and 3D conformer-based descriptors are essential when shape and stereochemistry matter.

For canonicalization before featurization, see `chemoinformatics/molecular-standardization`. For 3D-only descriptors, see `chemoinformatics/conformer-generation`.

## Fingerprint Taxonomy

| Fingerprint | Type | Radius/Path | Bits | Use case | Fails when |
|-------------|------|-------------|------|----------|------------|
| Morgan (ECFP) | Circular | r=2 (ECFP4), r=3 (ECFP6) | 2048 typical | Drug-like similarity, ML default | Loses long-range topology; bit collisions at low nBits |
| FCFP | Functional Morgan | r=2 default | 2048 | Pharmacophore-aware similarity | Same caveats as ECFP; less specific |
| MACCS | Substructure key | 166 fixed bits | 167 | Quick fingerprint, drug-likeness | Too sparse for large diverse libraries |
| RDKit FP | Path-based | subgraph paths up to 7 bonds | 2048 | RDKit-native ECFP alternative | Drug-like only; not optimal for scaffold hopping |
| AtomPair | Pair + topological distance | All atom pairs | 2048 | Scaffold hopping; flexible molecules | Slower than ECFP; harder to interpret |
| TopologicalTorsion | 4-atom torsion | All TT | 2048 | Scaffold hopping; lower hit rate | Like AP, slower than ECFP |
| Avalon | Substructure + atom pairs | Mixed | 512/1024 | Fast similarity | Less standard; older |
| MAP4 (MinHashed atom-pair) | MinHash atom-pair | r=1,2 | 1024/2048 | Biological + metabolite diversity | Library required (mapchiral); slower hash |
| MHFP6 (MinHash) | MinHash ECFP-like | r=3 (diam 6) | 2048 | Big-data nearest-neighbor (Annoy) | Different distance (Jaccard on MinHash) |
| Pharm2D | 2D pharmacophore | feature pairs/triplets | sparse | Pharmacophore search | Sparse, slower |

**Decision:** For drug-like similarity ranking, use **ECFP4 at 2048 bits**: it is the established baseline, fast, and well understood. For diverse libraries (>1M compounds, metabolomics, peptides), **MHFP6** outperforms ECFP4 on analog recovery (Probst & Reymond 2018). For scaffold hopping, **AtomPair** beats ECFP4 on retrospective scaffold-hopping benchmarks but loses to it on retrospective single-target screens.

## Bit vs Count Vectors

| Form | Use | Library impact |
|------|-----|----------------|
| Bit (0/1) | Tanimoto similarity, BulkTanimotoSimilarity, RDKit fingerprint folding | Standard for similarity |
| Count (integer) | Some ML methods, RF on counts, neural fingerprints | Loses bit-level fast operations; richer signal |
| Sparse (dict) | Direct chemical interpretation (which fragments at which atoms) | Use for SHAP / atomic attribution |

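```python
from rdkit import Chem
from rdkit.Chem import AllChem, rdFingerprintGenerator as rfg  # AllChem is used by later snippets

mol = Chem.MolFromSmiles('CCO')

# One generator yields all three forms (radius=2 is ECFP4); this replaces the
# deprecated AllChem.GetMorganFingerprint* functions
gen = rfg.GetMorganGenerator(radius=2, fpSize=2048)

ecfp4_bit = gen.GetFingerprint(mol)                # folded 0/1 bit vector
ecfp4_count = gen.GetCountFingerprint(mol)         # folded integer counts
ecfp4_sparse = gen.GetSparseCountFingerprint(mol)  # unfolded feature ID -> count
```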
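
To show why the bit/count choice matters for similarity, here is a minimal pure-Python sketch. The feature dictionaries are hypothetical stand-ins for fingerprint output (feature ID mapped to occurrence count), and the min/max formula is one common count generalisation of Tanimoto, not necessarily the one a given library applies.

```python
def tanimoto_bits(a, b):
    """Tanimoto on the sets of present features (bit semantics)."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0


def tanimoto_counts(a, b):
    """Min/max generalisation of Tanimoto for integer counts."""
    keys = set(a) | set(b)
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    total = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return shared / total if total else 0.0


# Hypothetical feature-ID -> count maps: same fragments, different repeat counts
short_chain = {101: 2, 202: 1, 303: 1}
long_chain = {101: 8, 202: 1, 303: 1}

print(tanimoto_bits(short_chain, long_chain))    # 1.0: bits cannot tell them apart
print(tanimoto_counts(short_chain, long_chain))  # 0.4: counts see the repeat difference
```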

## Morgan / ECFP Radius Math

ECFP-X notation: X is the **diameter** in bonds. RDKit's `radius` parameter is half of X.

| Notation | RDKit radius | Diameter | Captures |
|----------|--------------|----------|----------|
| ECFP0 | 0 | 0 | Atom identity only |
| ECFP2 | 1 | 2 | Atom + immediate neighbors |
| ECFP4 | 2 | 4 | Atom + 2-bond environment |
| ECFP6 | 3 | 6 | Atom + 3-bond environment |
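
The notation-to-radius mapping in the table can be captured in a small helper. `morgan_params` is a hypothetical name introduced for illustration; it returns the RDKit radius and whether functional-class (FCFP) invariants are implied.

```python
import re


def morgan_params(name):
    """Map 'ECFP4' / 'FCFP6' style names to (radius, use_features)."""
    match = re.fullmatch(r"([EF])CFP(\d+)", name.strip().upper())
    if match is None:
        raise ValueError(f"not an ECFP/FCFP name: {name!r}")
    diameter = int(match.group(2))
    if diameter % 2:
        raise ValueError("the number in ECFP-X is a diameter and must be even")
    # RDKit's radius is half the diameter; 'F' selects functional-class invariants
    return diameter // 2, match.group(1) == "F"


print(morgan_params("ECFP4"))  # (2, False)
print(morgan_params("FCFP6"))  # (3, True)
```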

**Trade-off:** A larger radius captures a more specific local environment but inflates the bit-collision rate at fixed nBits. For QSAR with <10k compounds, **ECFP4 2048** is the established default (Rogers & Hahn 2010; MoleculeNet benchmarks Wu 2018). For large libraries (>1M), use **nBits=4096** or an unhashed sparse representation to reduce the ~1-5% bit-collision rate (O'Boyle 2016).
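
As a rough back-of-the-envelope check on the collision trade-off, the sketch below estimates the fraction of distinct features lost to folding. It assumes features hash uniformly and independently into the bit vector, which real fingerprints only approximate, and the figure of 50 features per molecule is a hypothetical example value.

```python
def expected_collision_fraction(n_features, n_bits):
    """Expected share of distinct features that collide when folded into n_bits."""
    if n_features <= 0:
        return 0.0
    # Expected number of occupied bits after hashing n_features uniformly
    occupied = n_bits * (1.0 - (1.0 - 1.0 / n_bits) ** n_features)
    return 1.0 - occupied / n_features


# Hypothetical molecule with 50 distinct circular environments
for n_bits in (1024, 2048, 4096):
    print(n_bits, round(expected_collision_fraction(50, n_bits), 4))
```

Doubling the vector length roughly halves the estimate, which is the reasoning behind moving to 4096 bits or an unhashed representation.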

## FCFP vs ECFP

FCFP (Functional-Class Fingerprint) uses atom invariants based on pharmacophore role (donor, acceptor, aromatic, halogen, basic, acidic) instead of atom identity. It trades atom specificity for functional equivalence.

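```python
from rdkit.Chem import rdFingerprintGenerator as rfg

# Same radius and size; only the atom invariants differ
ecfp4 = rfg.GetMorganGenerator(radius=2, fpSize=2048).GetFingerprint(mol)
fcfp_gen = rfg.GetMorganGenerator(radius=2, fpSize=2048, atomInvariantsGenerator=rfg.GetMorganFeatureAtomInvGen())
fcfp4 = fcfp_gen.GetFingerprint(mol)
```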

**When to use FCFP4:** Scaffold-hopping campaigns, pharmacophore-driven similarity, cross-target activity prediction.

**When to use ECFP4:** Within-series QSAR, lead optimization, when chemotype identity matters.

## 3D Descriptors and Conformer Dependence

Conformer-dependent descriptors (asphericity, eccentricity, principal moments of inertia, RDF) require a generated 3D structure. A **single conformer** is rarely sufficient: descriptor variance across the conformer ensemble can exceed the descriptor signal.

**Goal:** Compute 3D shape descriptors over a conformer ensemble rather than from a single (possibly unrepresentative) conformer.

**Approach:** Add explicit hydrogens, embed N conformers with ETKDGv3, MMFF-optimize them all, then evaluate the descriptor across each conformer for downstream averaging.

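```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors3D

mol = Chem.MolFromSmiles('CCCCO')
mol = Chem.AddHs(mol)  # explicit hydrogens are needed for sensible 3D geometry

params = AllChem.ETKDGv3()
params.randomSeed = 42  # reproducible embedding
# EmbedMultipleConfs returns the conformer IDs, not a count
conf_ids = list(AllChem.EmbedMultipleConfs(mol, numConfs=20, params=params))
AllChem.MMFFOptimizeMoleculeConfs(mol)

# One descriptor value per conformer, ready for averaging
asphericities = [Descriptors3D.Asphericity(mol, confId=c) for c in conf_ids]
```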

**Decision:** For QSAR / ML, compute over a conformer ensemble (n=20-100) and report mean or Boltzmann-weighted average. Single-conformer 3D descriptors are unreliable.
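
A Boltzmann-weighted average can be sketched in plain Python as follows. The per-conformer energies are assumed to be in kcal/mol (for instance, the energies that `MMFFOptimizeMoleculeConfs` returns alongside its convergence flags), and `boltzmann_average` is an illustrative helper, not an RDKit function.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def boltzmann_average(values, energies_kcal, temperature=298.15):
    """Average per-conformer values weighted by exp(-dE / RT)."""
    if not values or len(values) != len(energies_kcal):
        raise ValueError("need one energy per descriptor value")
    e_min = min(energies_kcal)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - e_min) / (R_KCAL * temperature)) for e in energies_kcal]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical asphericity values and relative energies for three conformers
values = [0.10, 0.30, 0.80]
energies = [0.0, 0.5, 4.0]
print(boltzmann_average(values, energies))
```

Low-energy conformers dominate the result, so a high-energy outlier shifts the weighted value far less than it would shift a plain mean.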

## Partial Charge Methods

| Method | Software | Cost | Accuracy | Use for |
|--------|----------|------|----------|---------|
| Gasteiger-Marsili | RDKit, Open Babel | <0.1s/mol | Empirical, rough | AutoDock Vina, fast screening |
| MMFF94 | RDKit | 0.1s/mol | Force-field consistent | MMFF energy, conformer ranking |
| AM1-BCC | antechamber (AmberTools) | ~10s/mol | Semi-empirical | MD setup, FEP, GAFF |
| RESP | psi4, Gaussian | minutes/mol | DFT ESP-fitted | High-accuracy MD, FEP |
| OpenFF Recharge | openff-recharge | seconds | DFT-derived but cached | OpenFF / SAGE setup |
| ABCG2 | antechamber (AmberTools) | ~10s/mol | AM1-BCC variant reparameterized for GAFF2 | GAFF2 setup, solvation free energies |

```python
from rdkit.Chem import AllChem

# Assigns the per-atom '_GasteigerCharge' property in place
AllChem.ComputeGasteigerCharges(mol)
for atom in mol.GetAtoms():
    print(atom.GetIdx(), atom.GetDoubleProp('_GasteigerCharge'))
```

**Critical:** The charge method must match the downstream force field or scoring function. For example, Gasteiger charges in an AMBER MD run violate the assumptions of the protein force field.

## MAP4 and MHFP6 for Diverse Libraries

For libraries spanning drug-like + natural products + peptides + metabolites, ECFP4 saturates Tanimoto similarity (most pairs report 0.1-0.3, hard to rank). MAP4 and MHFP6 use MinHash + atom-pair / circular substructures and discriminate better.

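```python
from mhfp.encoder import MHFPEncoder

encoder = MHFPEncoder(2048)  # number of MinHash permutations
# encode() expects a SMILES string; encode_mol() takes an RDKit Mol
mhfp6 = encoder.encode_mol(mol, radius=3)
```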

MHFP6 distance is Jaccard on MinHash, not standard Tanimoto. Use `MHFPEncoder.distance(fp1, fp2)`.
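
To make the "Jaccard on MinHash" idea concrete, here is a toy pure-Python sketch. It is not the mhfp implementation: the linear hash family, the `crc32` feature IDs, and the substructure strings are all assumptions chosen for illustration. The share of signature slots that disagree estimates the Jaccard distance between the underlying feature sets.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus


def make_permutations(n_perm, seed=42):
    """Draw (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n_perm)]


def minhash(features, perms):
    """Signature: the minimum hash per function over a non-empty feature set."""
    ids = [zlib.crc32(f.encode("utf-8")) for f in features]
    return [min((a * x + b) % PRIME for x in ids) for a, b in perms]


def minhash_distance(sig_a, sig_b):
    """Estimated Jaccard distance: share of signature slots that disagree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return 1.0 - matches / len(sig_a)


perms = make_permutations(2048)
mol_a = {"C-O", "C-C", "C-C-O", "O-H"}  # hypothetical substructure strings
mol_b = {"C-O", "C-C", "C-C-O", "C-N"}
sig_a, sig_b = minhash(mol_a, perms), minhash(mol_b, perms)

# Exact Jaccard distance for these sets is 1 - 3/5 = 0.4; the estimate should be close
print(minhash_distance(sig_a, sig_b))
```

Because signatures have a fixed length regardless of molecule size, this comparison stays cheap on very large libraries, which is what makes approximate nearest-neighbor indexing practical.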

## Physicochemical Descriptors

| Descriptor | Source | Range | Drug-like cutoff |
|------------|--------|-------|-------------------|
| MolWt | RDKit `Descriptors.MolWt` | ~50-2000 Da | <=500 (Lipinski) |
| MolLogP (Crippen) | RDKit `Descriptors.MolLogP` | -5 to 8 | <=5 (Lipinski) |
| HBD | `Lipinski.NumHDonors` | 0-10 | <=5 (Li
