# synth_pdb.plm – Protein Language Model Embeddings
ESM-2 per-residue embeddings via HuggingFace Transformers.
Install the optional dependency first:
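```shell
pip install synth-pdb[plm]
```

The `plm` extra pulls in `torch` and `transformers`; this is the same command referenced in the `ImportError` guidance in the API reference below.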
## Quick Start
```python
from synth_pdb.plm import ESM2Embedder

embedder = ESM2Embedder()  # lazy – model loads on first embed() call

# Per-residue embeddings
emb = embedder.embed("MQIFVKTLTGKTITLEVEPS")
print(emb.shape)  # (20, 320) – 20 residues × 320-dim float32

# From a biotite AtomArray
emb = embedder.embed_structure(atom_array)  # same shape as embed()

# Sequence-level cosine similarity
sim = embedder.sequence_similarity("ACDEF", "ACDEF")  # → 1.0
sim = embedder.sequence_similarity("ACDEF", "VWLYG")  # → ~0.7–0.9
```
## Lazy loading

`ESM2Embedder()` does nothing until you call `embed()`. This means
`from synth_pdb.plm import ESM2Embedder` is always safe, even
without `torch` or `transformers` installed.
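The deferral pattern can be sketched generically. The class below is illustrative only, not synth_pdb's actual internals: the heavy import and model load happen inside the first method call, so constructing the object never touches `torch`.

```python
class LazyEmbedder:
    """Illustrative sketch of the lazy-loading pattern (not synth_pdb's code)."""

    def __init__(self, model_name="facebook/esm2_t6_8M_UR50D"):
        self.model_name = model_name
        self._model = None  # nothing loaded yet; __init__ is instantaneous

    def _load(self):
        # Heavy import and the model load happen here, exactly once.
        if self._model is None:
            try:
                import transformers  # deferred, so importing this module is always safe
            except ImportError as exc:
                raise ImportError(
                    "Install the optional dependency: pip install synth-pdb[plm]"
                ) from exc
            self._model = transformers.AutoModel.from_pretrained(self.model_name)
        return self._model

    def embed(self, sequence):
        model = self._load()  # first call pays the load cost; later calls hit the cache
        # ... tokenize `sequence` and run the forward pass here ...
        return model
```

The key design point is that `import transformers` lives inside `_load()`, not at module top level, which is what makes the bare import of the module safe.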
## Using a Larger Model
All ESM-2 variants share the same API:
```python
# Default (8M params, 320-dim, ~30 MB)
embedder = ESM2Embedder()

# Better accuracy (35M params, 480-dim)
embedder = ESM2Embedder(model_name="facebook/esm2_t12_35M_UR50D")

# Near-production (150M params, 640-dim)
embedder = ESM2Embedder(model_name="facebook/esm2_t30_150M_UR50D")
```
## API Reference

### ESM2Embedder
Per-residue protein language model embeddings from ESM-2.
The model is loaded lazily on the first call to `embed()`, not at
init time. This means:

- `import synth_pdb.plm` is always safe with no `torch`/`transformers` installed
- `ESM2Embedder()` is instantaneous
- The ~5-second model load occurs once, then is cached in `self._model`
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str` | HuggingFace model ID. Upgrade to `"facebook/esm2_t12_35M_UR50D"` for 480-dim embeddings with better accuracy; the API is identical. | `_DEFAULT_MODEL` (`"facebook/esm2_t6_8M_UR50D"`) |
| `device` | `Optional[str]` | Torch device string (`"cpu"`, `"cuda"`, `"mps"`). | `None` (auto-detect) |
### Attributes

#### embedding_dim (property)

Embedding dimensionality for this model variant.

Determined from the model config after the first `embed()` call. Before the first call, returns the known default for common models.
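That fallback could look something like the sketch below. The class and attribute names are hypothetical; only the model-name-to-dimension pairs come from this document.

```python
class EmbeddingDimSketch:
    """Hypothetical sketch of embedding_dim's pre-load fallback behaviour."""

    # Known dims for the ESM-2 checkpoints named in this doc
    KNOWN_DIMS = {
        "facebook/esm2_t6_8M_UR50D": 320,
        "facebook/esm2_t12_35M_UR50D": 480,
        "facebook/esm2_t30_150M_UR50D": 640,
    }

    def __init__(self, model_name="facebook/esm2_t6_8M_UR50D"):
        self.model_name = model_name
        self._model = None  # populated lazily on first embed()

    @property
    def embedding_dim(self):
        if self._model is not None:
            # Authoritative value once the model is loaded
            return self._model.config.hidden_size
        # Before any load, fall back to the known default for common models
        return self.KNOWN_DIMS.get(self.model_name, 320)
```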
### Functions

#### embed(sequence)
Embed a protein sequence using ESM-2.
Each amino acid is represented as a D-dimensional float32 vector encoding evolutionary, structural, and chemical context learned from 250M protein sequences.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `sequence` | `str` | Single-letter amino acid string, e.g. `"MQIFVKTLTG"`. Standard 20 amino acids only; unknown residues → `'X'`. | *required* |
Returns:

| Type | Description |
|---|---|
| `np.ndarray` | Shape `(L, D)`, dtype `float32`, where `L = len(sequence)` and `D = self.embedding_dim` (320 for the default model). |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If `torch` or `transformers` are not installed. Install with `pip install synth-pdb[plm]`. |
Example:

```python
>>> embedder = ESM2Embedder()
>>> emb = embedder.embed("ACDEFGHIKLMNPQRSTVWY")
>>> emb.shape
(20, 320)
```
#### embed_structure(structure)
Embed a protein given its biotite AtomArray.
Extracts the amino acid sequence from the structure (using residue names in the AtomArray), then delegates to embed().
This is a convenience method: the embeddings are purely sequence-based and do not use any 3D coordinate information.
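The extraction step can be sketched as a plain three-letter-to-one-letter mapping. The helper name and table below are illustrative, not synth_pdb's actual code; the real method reads residue names out of the `AtomArray` first.

```python
# Standard 20 amino acids: three-letter residue name -> one-letter code
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def sequence_from_res_names(res_names):
    """One residue name per residue, e.g. as recovered from an AtomArray."""
    # Unknown residue names map to 'X', matching embed()'s convention above
    return "".join(THREE_TO_ONE.get(name, "X") for name in res_names)
```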
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `structure` | `Any` | `biotite.structure.AtomArray`. Must contain at least one atom per residue (e.g. CA atoms suffice). | *required* |
Returns:

| Type | Description |
|---|---|
| `np.ndarray` | Shape `(n_residues, embedding_dim)`, dtype `float32`. |
Example:

```python
>>> from synth_pdb.generator import ProteinGenerator
>>> structure = ProteinGenerator().generate(20, ss_type="helix")
>>> emb = embedder.embed_structure(structure)
>>> emb.shape
(20, 320)
```
#### mean_embed(sequence)
Return the mean-pooled sequence-level embedding.
Mean pooling averages the per-residue vectors:

mean_embed(seq) = (1/L) Σ embed(seq)[i] for i in 0..L-1

This gives a single D-dimensional vector representing the whole sequence. It loses positional information but enables fast sequence comparison.
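In NumPy terms the pooling is just a mean over axis 0. The per-residue array here is random stand-in data with the default model's shape, not real embeddings:

```python
import numpy as np

L, D = 20, 320
per_residue = np.random.rand(L, D).astype(np.float32)  # stand-in for embed(seq)

# Average over the residue axis: (L, D) -> (D,)
mean_vec = per_residue.mean(axis=0)

print(mean_vec.shape)  # (320,)
```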
Returns:

| Type | Description |
|---|---|
| `np.ndarray` | Shape `(D,)`, dtype `float32`. |
#### sequence_similarity(seq_a, seq_b)
Cosine similarity between the mean embeddings of two sequences.
Returns a value in [-1, 1]:

- 1.0 → identical embeddings (same sequence)
- 0.0 → orthogonal (no similarity)
- -1.0 → opposite (very unlikely for protein embeddings)

**Why cosine, not L2?** Cosine similarity is magnitude-invariant: it measures the angle between vectors, not their length. Longer proteins have higher-norm embeddings simply because there are more residues, not because they are more similar. Cosine corrects for this.
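The magnitude-invariance claim is easy to check with a standalone NumPy snippet (an illustration, not the library's code):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product of the two vectors over their norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([1.0, 2.0, 3.0])
w = 10.0 * v  # same direction, 10x the magnitude

print(cosine(v, w))           # ≈ 1.0: cosine ignores the scale difference
print(np.linalg.norm(v - w))  # ≈ 33.67: L2 distance grows with magnitude
```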
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `seq_a` | `str` | First single-letter amino acid string. | *required* |
| `seq_b` | `str` | Second single-letter amino acid string. | *required* |
Returns:

| Type | Description |
|---|---|
| `float` | Cosine similarity of the mean embeddings. |
Example:

```python
>>> embedder.sequence_similarity("AAAAAAA", "VIVIVIV")
0.832...  # high – both are simple repetitive peptides
>>> embedder.sequence_similarity("ACDEFGHIK", "WQMPLRNTS")
0.71...   # lower – very different character
```
## Practical Examples

### Feed into GNN as node features
```python
from synth_pdb.plm import ESM2Embedder
import numpy as np

plm = ESM2Embedder()
plm_features = plm.embed_structure(structure)  # (L, 320)

# Concatenate with your existing per-residue geometry features
node_features = np.concatenate([geometry_features, plm_features], axis=-1)
```
### Secondary structure linear probe
```python
import torch
import torch.nn as nn

from synth_pdb.plm import ESM2Embedder

plm = ESM2Embedder()
emb = torch.tensor(plm.embed("MQIFVKTLTGKTITLEVEPS"))  # (20, 320)

probe = nn.Linear(320, 3)  # 3 classes: Helix / Strand / Coil
logits = probe(emb)        # (20, 3)
probs = logits.softmax(-1)
```
### Pairwise similarity matrix over a sequence library
```python
import numpy as np

from synth_pdb.plm import ESM2Embedder

sequences = ["ACDEF", "ACDEF", "VWLYG", "RRKKK"]
plm = ESM2Embedder()
mean_embs = np.stack([plm.mean_embed(s) for s in sequences])  # (N, 320)

# Normalise rows, then dot-product → cosine similarity matrix
norms = np.linalg.norm(mean_embs, axis=1, keepdims=True)
normed = mean_embs / (norms + 1e-8)
sim_matrix = normed @ normed.T  # (N, N)
```
## Background
See Protein Language Models for the full scientific background, model architecture diagram, and explanation of what the embedding dimensions encode.