
synth_pdb.plm – Protein Language Model Embeddings

ESM-2 per-residue embeddings via HuggingFace Transformers.

Install the optional dependency first:

pip install synth-pdb[plm]

Quick Start

from synth_pdb.plm import ESM2Embedder

embedder = ESM2Embedder()   # lazy: model loads on first embed() call

# Per-residue embeddings
emb = embedder.embed("MQIFVKTLTGKTITLEVEPS")
print(emb.shape)   # (20, 320): 20 residues × 320-dim float32

# From a biotite AtomArray
emb = embedder.embed_structure(atom_array)   # same shape as embed()

# Sequence-level cosine similarity
sim = embedder.sequence_similarity("ACDEF", "ACDEF")   # → 1.0
sim = embedder.sequence_similarity("ACDEF", "VWLYG")   # → ~0.7–0.9

Lazy loading

ESM2Embedder() does nothing until you call embed(). This means from synth_pdb.plm import ESM2Embedder is always safe, even without torch or transformers installed.


Using a Larger Model

All ESM-2 variants share the same API:

# Default (8M params, 320-dim, ~30 MB)
embedder = ESM2Embedder()

# Better accuracy (35M params, 480-dim)
embedder = ESM2Embedder(model_name="facebook/esm2_t12_35M_UR50D")

# Near-production (150M params, 640-dim)
embedder = ESM2Embedder(model_name="facebook/esm2_t30_150M_UR50D")

API Reference

ESM2Embedder

Per-residue protein language model embeddings from ESM-2.

The model is loaded lazily on the first call to embed(), not at init time. This means:

- import synth_pdb.plm is always safe with no torch/transformers installed
- ESM2Embedder() is instantaneous
- The ~5-second model load occurs once, then is cached in self._model
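The lazy-loading pattern itself can be sketched in a few lines. This is a toy stand-in to show the idea, not the library's actual internals; the class and attribute names here are illustrative.

```python
# Minimal sketch of lazy loading: the expensive resource is created on
# first use, never at construction time.
class LazyEmbedder:
    def __init__(self, model_name="facebook/esm2_t6_8M_UR50D"):
        self.model_name = model_name
        self._model = None            # nothing heavy happens at init time

    def _load(self):
        # Stand-in for the real HuggingFace load; the torch/transformers
        # imports would live inside this method, so merely constructing
        # LazyEmbedder never requires them.
        return f"loaded:{self.model_name}"

    def embed(self, sequence):
        if self._model is None:       # first call pays the load cost once
            self._model = self._load()
        return [0.0] * len(sequence)  # dummy per-residue output

embedder = LazyEmbedder()
print(embedder._model)                # None: still unloaded after __init__
embedder.embed("ACDEF")
print(embedder._model)                # loaded and cached after first embed()
```

Keeping the heavy imports inside the load step is what makes the module import safe without torch installed.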

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model_name | str | HuggingFace model ID. Upgrade to "facebook/esm2_t12_35M_UR50D" for 480-dim embeddings with better accuracy; the API is identical. | "facebook/esm2_t6_8M_UR50D" |
| device | Optional[str] | Torch device string ("cpu", "cuda", "mps"). | None (auto-detect) |

Attributes

embedding_dim property

Embedding dimensionality for this model variant.

Determined from the model config after the first embed() call. Before the first call, returns the known default for common models.

Functions

embed(sequence)

Embed a protein sequence using ESM-2.

Each amino acid is represented as a D-dimensional float32 vector encoding evolutionary, structural, and chemical context learned from 250M protein sequences.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| sequence | str | Single-letter amino acid string, e.g. "MQIFVKTLTG". Standard 20 amino acids only; unknown residues are mapped to 'X'. | required |
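The "unknown residues map to 'X'" rule can be illustrated with a small sanitizer. This is a hypothetical helper written for this page, not the library's actual implementation:

```python
# The 20 standard single-letter amino acid codes.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def sanitize(sequence: str) -> str:
    """Replace any non-standard residue letter with 'X' (illustrative)."""
    return "".join(aa if aa in STANDARD_AA else "X" for aa in sequence.upper())

print(sanitize("MQIFVBKTLTGZ"))   # MQIFVXKTLTGX: 'B' and 'Z' become 'X'
```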

Returns:

| Type | Description |
|------|-------------|
| ndarray | np.ndarray of shape (L, D), dtype float32, where L = len(sequence) and D = self.embedding_dim (320 for the default model). |

Raises:

| Type | Description |
|------|-------------|
| ImportError | If torch or transformers is not installed. Install with: pip install synth-pdb[plm] |

Example

>>> embedder = ESM2Embedder()
>>> emb = embedder.embed("ACDEFGHIKLMNPQRSTVWY")
>>> emb.shape
(20, 320)

embed_structure(structure)

Embed a protein given its biotite AtomArray.

Extracts the amino acid sequence from the structure (using residue names in the AtomArray), then delegates to embed().

This is a convenience method: the embeddings are purely sequence-based and do not use any 3D coordinate information.
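The extraction step (residue names to a one-letter sequence) can be sketched as follows. This is a standalone illustration of the mapping, not the library's actual code; biotite itself also ships such a conversion.

```python
# Standard three-letter residue names mapped to one-letter codes, as
# would be read from the res_name annotation of an AtomArray.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def sequence_from_res_names(res_names):
    # Unknown residue names fall back to 'X', matching embed()'s contract.
    return "".join(THREE_TO_ONE.get(name, "X") for name in res_names)

print(sequence_from_res_names(["MET", "GLN", "ILE", "SEC"]))   # MQIX
```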

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| structure | Any | biotite.structure.AtomArray. Must contain at least one atom per residue (e.g. CA atoms suffice). | required |

Returns:

| Type | Description |
|------|-------------|
| ndarray | np.ndarray of shape (n_residues, embedding_dim), float32. |

Example

>>> from synth_pdb.generator import ProteinGenerator
>>> structure = ProteinGenerator().generate(20, ss_type="helix")
>>> emb = embedder.embed_structure(structure)
>>> emb.shape
(20, 320)

mean_embed(sequence)

Return the mean-pooled sequence-level embedding.

Mean pooling averages the per-residue vectors:

mean_embed(seq) = (1/L) Σ embed(seq)[i]   for i in 0..L-1

This gives a single D-dim vector representing the whole sequence. Loses positional information but enables fast sequence comparison.
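The formula above, applied to a toy 3-residue, 3-dimensional example in plain Python (a stand-in for the real (L, D) float32 array):

```python
# Toy per-residue embeddings: L = 3 residues, D = 3 dimensions.
per_residue = [
    [1.0, 2.0, 3.0],   # residue 0
    [3.0, 2.0, 1.0],   # residue 1
    [2.0, 2.0, 2.0],   # residue 2
]

L = len(per_residue)
D = len(per_residue[0])

# Mean pooling: average each dimension over all residues.
pooled = [sum(vec[d] for vec in per_residue) / L for d in range(D)]
print(pooled)   # [2.0, 2.0, 2.0]: one D-dim vector for the whole sequence
```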

Returns:

| Type | Description |
|------|-------------|
| ndarray | np.ndarray of shape (D,), float32. |

sequence_similarity(seq_a, seq_b)

Cosine similarity between the mean embeddings of two sequences.

Returns a value in [-1, 1]:

- 1.0: identical embeddings (same sequence)
- 0.0: orthogonal (no similarity)
- -1.0: opposite (very unlikely for protein embeddings)

WHY COSINE, NOT L2?

Cosine similarity is magnitude-invariant: it measures the angle between vectors, not their length. Longer proteins have higher-norm embeddings simply because there are more residues, not because they are more similar. Cosine corrects for this.
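The magnitude-invariance point can be demonstrated numerically: scaling a vector leaves its cosine similarity unchanged while its L2 distance grows. A small self-contained sketch:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]        # same direction, twice the magnitude

print(cosine(u, v))        # effectively 1.0 (up to float rounding)
print(math.dist(u, v))     # L2 distance is nonzero and grows with scale
```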

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| seq_a | str | First single-letter amino acid string. | required |
| seq_b | str | Second single-letter amino acid string. | required |

Returns:

| Type | Description |
|------|-------------|
| float | Cosine similarity of the mean embeddings. |

Example

>>> embedder.sequence_similarity("AAAAAAA", "VIVIVIV")
0.832...   # high: both are simple repetitive peptides
>>> embedder.sequence_similarity("ACDEFGHIK", "WQMPLRNTS")
0.71...    # lower: very different character


Practical Examples

Feed into GNN as node features

from synth_pdb.plm import ESM2Embedder
import numpy as np

plm = ESM2Embedder()
plm_features = plm.embed_structure(structure)   # (L, 320)

# Concatenate with your existing per-residue geometry features
node_features = np.concatenate([geometry_features, plm_features], axis=-1)

Secondary structure linear probe

import torch
import torch.nn as nn

plm = ESM2Embedder()
emb = torch.tensor(plm.embed("MQIFVKTLTGKTITLEVEPS"))   # (20, 320)

probe = nn.Linear(320, 3)   # 3 classes: Helix / Strand / Coil
logits = probe(emb)          # (20, 3)
probs = logits.softmax(-1)

Pairwise similarity matrix over a sequence library

import numpy as np

sequences = ["ACDEF", "ACDEF", "VWLYG", "RRKKK"]
plm = ESM2Embedder()
mean_embs = np.stack([plm.mean_embed(s) for s in sequences])   # (N, 320)

# Normalise rows, then dot product → cosine similarity matrix
norms = np.linalg.norm(mean_embs, axis=1, keepdims=True)
normed = mean_embs / (norms + 1e-8)
sim_matrix = normed @ normed.T   # (N, N)

Background

See Protein Language Models for the full scientific background, model architecture diagram, and explanation of what the embedding dimensions encode.