# synth_pdb.plm – Protein Language Model Embeddings
ESM-2 per-residue embeddings via HuggingFace Transformers.
Install the optional dependency first:
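```shell
pip install synth-pdb[plm]
```

The `plm` extra pulls in `torch` and `transformers`; this is the same command referenced in the `ImportError` guidance in the API reference below.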
## Quick Start
```python
from synth_pdb.plm import ESM2Embedder

embedder = ESM2Embedder()  # lazy – model loads on first embed() call

# Per-residue embeddings
emb = embedder.embed("MQIFVKTLTGKTITLEVEPS")
print(emb.shape)  # (20, 320) – 20 residues × 320-dim float32

# From a biotite AtomArray
emb = embedder.embed_structure(atom_array)  # same shape as embed()

# Sequence-level cosine similarity
sim = embedder.sequence_similarity("ACDEF", "ACDEF")  # → 1.0
sim = embedder.sequence_similarity("ACDEF", "VWLYG")  # → ~0.7–0.9
```
## Lazy loading

`ESM2Embedder()` does nothing until you call `embed()`. This means
`from synth_pdb.plm import ESM2Embedder` is always safe, even
without `torch` or `transformers` installed.
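The deferral pattern can be sketched generically. The class below is illustrative only, not synth_pdb's actual internals: the heavy import and model load happen inside the first method call, so constructing the object never touches `torch`.

```python
class LazyEmbedder:
    """Illustrative sketch of the lazy-loading pattern (not synth_pdb's code)."""

    def __init__(self, model_name="facebook/esm2_t6_8M_UR50D"):
        self.model_name = model_name
        self._model = None  # nothing loaded yet; __init__ is instantaneous

    def _load(self):
        # Heavy import and the model load happen here, exactly once.
        if self._model is None:
            try:
                import transformers  # deferred, so importing this module is always safe
            except ImportError as exc:
                raise ImportError(
                    "Install the optional dependency: pip install synth-pdb[plm]"
                ) from exc
            self._model = transformers.AutoModel.from_pretrained(self.model_name)
        return self._model

    def embed(self, sequence):
        model = self._load()  # first call pays the load cost; later calls hit the cache
        # ... tokenize `sequence` and run the forward pass here ...
        return model
```

The key design point is that `import transformers` lives inside `_load()`, not at module top level, which is what makes the bare import of the module safe.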
## Using a Larger Model
All ESM-2 variants share the same API:
```python
# Default (8M params, 320-dim, ~30 MB)
embedder = ESM2Embedder()

# Better accuracy (35M params, 480-dim)
embedder = ESM2Embedder(model_name="facebook/esm2_t12_35M_UR50D")

# Near-production (150M params, 640-dim)
embedder = ESM2Embedder(model_name="facebook/esm2_t30_150M_UR50D")
```
## API Reference

### ESM2Embedder
Per-residue protein language model embeddings from ESM-2.
The model is loaded lazily on the first call to `embed()`, not at
init time. This means:

- `import synth_pdb.plm` is always safe with no `torch`/`transformers` installed
- `ESM2Embedder()` is instantaneous
- The ~5-second model load occurs once, then is cached in `self._model`
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str` | HuggingFace model ID. Upgrade to `"facebook/esm2_t12_35M_UR50D"` for 480-dim embeddings with better accuracy; the API is identical. | `_DEFAULT_MODEL` (`"facebook/esm2_t6_8M_UR50D"`) |
| `device` | `Optional[str]` | Torch device string (`"cpu"`, `"cuda"`, `"mps"`). | `None` (auto-detect) |
### Attributes

#### embedding_dim (property)

Embedding dimensionality for this model variant.

Determined from the model config after the first `embed()` call. Before the first call, returns the known default for common models.
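That fallback could look something like the sketch below. The class and attribute names are hypothetical; only the model-name-to-dimension pairs come from this document.

```python
class EmbeddingDimSketch:
    """Hypothetical sketch of embedding_dim's pre-load fallback behaviour."""

    # Known dims for the ESM-2 checkpoints named in this doc
    KNOWN_DIMS = {
        "facebook/esm2_t6_8M_UR50D": 320,
        "facebook/esm2_t12_35M_UR50D": 480,
        "facebook/esm2_t30_150M_UR50D": 640,
    }

    def __init__(self, model_name="facebook/esm2_t6_8M_UR50D"):
        self.model_name = model_name
        self._model = None  # populated lazily on first embed()

    @property
    def embedding_dim(self):
        if self._model is not None:
            # Authoritative value once the model is loaded
            return self._model.config.hidden_size
        # Before any load, fall back to the known default for common models
        return self.KNOWN_DIMS.get(self.model_name, 320)
```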
### Functions

#### embed(sequence)
Embed a protein sequence using ESM-2.
Each amino acid is represented as a D-dimensional float32 vector encoding evolutionary, structural, and chemical context learned from 250M protein sequences.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `sequence` | `str` | Single-letter amino acid string, e.g. `"MQIFVKTLTG"`. Standard 20 amino acids only; unknown residues → `'X'`. | *required* |
Returns:

| Type | Description |
|---|---|
| `np.ndarray` | Shape `(L, D)`, dtype `float32`, where `L = len(sequence)` and `D = self.embedding_dim` (320 for the default model). |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If `torch` or `transformers` are not installed. Install with `pip install synth-pdb[plm]`. |
Example:

```python
>>> embedder = ESM2Embedder()
>>> emb = embedder.embed("ACDEFGHIKLMNPQRSTVWY")
>>> emb.shape
(20, 320)
```
#### embed_structure(structure)
Embed a protein given its biotite AtomArray.
Extracts the amino acid sequence from the structure (using residue names in the AtomArray), then delegates to embed().
This is a convenience method: the embeddings are purely sequence-based and do not use any 3D coordinate information.
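The extraction step can be sketched as a plain three-letter-to-one-letter mapping. The helper name and table below are illustrative, not synth_pdb's actual code; the real method reads residue names out of the `AtomArray` first.

```python
# Standard 20 amino acids: three-letter residue name -> one-letter code
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def sequence_from_res_names(res_names):
    """One residue name per residue, e.g. as recovered from an AtomArray."""
    # Unknown residue names map to 'X', matching embed()'s convention above
    return "".join(THREE_TO_ONE.get(name, "X") for name in res_names)
```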
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `structure` | `Any` | `biotite.structure.AtomArray`. Must contain at least one atom per residue (e.g. CA atoms suffice). | *required* |
Returns:

| Type | Description |
|---|---|
| `np.ndarray` | Shape `(n_residues, embedding_dim)`, dtype `float32`. |
Example:

```python
>>> from synth_pdb.generator import ProteinGenerator
>>> structure = ProteinGenerator().generate(20, ss_type="helix")
>>> emb = embedder.embed_structure(structure)
>>> emb.shape
(20, 320)
```
#### mean_embed(sequence)
Return the mean-pooled sequence-level embedding.
Mean pooling averages the per-residue vectors:

mean_embed(seq) = (1/L) Σ embed(seq)[i] for i in 0..L-1

This gives a single D-dimensional vector representing the whole sequence. It loses positional information but enables fast sequence comparison.
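In NumPy terms the pooling is just a mean over axis 0. The per-residue array here is random stand-in data with the default model's shape, not real embeddings:

```python
import numpy as np

L, D = 20, 320
per_residue = np.random.rand(L, D).astype(np.float32)  # stand-in for embed(seq)

# Average over the residue axis: (L, D) -> (D,)
mean_vec = per_residue.mean(axis=0)

print(mean_vec.shape)  # (320,)
```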
Returns:

| Type | Description |
|---|---|
| `np.ndarray` | Shape `(D,)`, dtype `float32`. |
#### sequence_similarity(seq_a, seq_b)
Cosine similarity between the mean embeddings of two sequences.
Returns a value in [-1, 1]:

- 1.0 → identical embeddings (same sequence)
- 0.0 → orthogonal (no similarity)
- -1.0 → opposite (very unlikely for protein embeddings)

**Why cosine, not L2?** Cosine similarity is magnitude-invariant: it measures the angle between vectors, not their length. Longer proteins have higher-norm embeddings simply because there are more residues, not because they are more similar. Cosine corrects for this.
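The magnitude-invariance claim is easy to check with a standalone NumPy snippet (an illustration, not the library's code):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product of the two vectors over their norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([1.0, 2.0, 3.0])
w = 10.0 * v  # same direction, 10x the magnitude

print(cosine(v, w))           # ≈ 1.0: cosine ignores the scale difference
print(np.linalg.norm(v - w))  # ≈ 33.67: L2 distance grows with magnitude
```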
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `seq_a` | `str` | First single-letter amino acid string. | *required* |
| `seq_b` | `str` | Second single-letter amino acid string. | *required* |
Returns:

| Type | Description |
|---|---|
| `float` | Cosine similarity of the mean embeddings. |
Example:

```python
>>> embedder.sequence_similarity("AAAAAAA", "VIVIVIV")
0.832...  # high – both are simple repetitive peptides
>>> embedder.sequence_similarity("ACDEFGHIK", "WQMPLRNTS")
0.71...   # lower – very different character
```
## Practical Examples

### Feed into GNN as node features
```python
from synth_pdb.plm import ESM2Embedder
import numpy as np

plm = ESM2Embedder()
plm_features = plm.embed_structure(structure)  # (L, 320)

# Concatenate with your existing per-residue geometry features
node_features = np.concatenate([geometry_features, plm_features], axis=-1)
```
### Secondary structure linear probe
```python
import torch
import torch.nn as nn

from synth_pdb.plm import ESM2Embedder

plm = ESM2Embedder()
emb = torch.tensor(plm.embed("MQIFVKTLTGKTITLEVEPS"))  # (20, 320)

probe = nn.Linear(320, 3)  # 3 classes: Helix / Strand / Coil
logits = probe(emb)        # (20, 3)
probs = logits.softmax(-1)
```
### Pairwise similarity matrix over a sequence library
```python
import numpy as np

from synth_pdb.plm import ESM2Embedder

sequences = ["ACDEF", "ACDEF", "VWLYG", "RRKKK"]
plm = ESM2Embedder()
mean_embs = np.stack([plm.mean_embed(s) for s in sequences])  # (N, 320)

# Normalise rows, then dot-product → cosine similarity matrix
norms = np.linalg.norm(mean_embs, axis=1, keepdims=True)
normed = mean_embs / (norms + 1e-8)
sim_matrix = normed @ normed.T  # (N, N)
```
## Background
See Protein Language Models for the full scientific background, model architecture diagram, and explanation of what the embedding dimensions encode.