Protein Language Model Embeddings with ESM-2¶
This notebook can also be run on Google Colab.
Who is this for? Chemists and biologists who are curious about machine learning applied to protein sequences -- no prior ML experience required. Every technical concept is explained in plain language as it first appears.
Background: What is a Protein Language Model?¶
A language model is software that learns patterns from vast amounts of text. ChatGPT, for example, was trained on billions of sentences and learned grammar, facts, and reasoning from the statistical patterns in that text.
A protein language model (PLM) does exactly the same thing -- but its "alphabet" is the 20 standard amino acids (A, C, D, ... Y) instead of English words. ESM-2 (Evolutionary Scale Modeling, version 2) was trained on tens of millions of protein sequences from the UniRef database.
During training, the model played a fill-in-the-blank game: random residues were masked out, and the model had to predict them from context. To succeed, it had to implicitly learn evolutionary constraints, structural preferences, and functional signals for every position in a sequence.
Key insight: ESM-2 never sees 3D coordinates during training -- yet its internal representations encode structural information anyway, because sequence and structure are deeply coupled through evolution.
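The fill-in-the-blank objective can be illustrated with a deliberately crude stand-in for the model: predict a masked position from what related sequences show at that column. The sequences and helper below are invented for illustration, and this is closer to an alignment consensus than to a transformer -- but it shows why evolutionary constraints make the game winnable.

```python
from collections import Counter

# Toy illustration of the masked-residue objective (NOT the real ESM-2 model):
# given several related sequences, predict a masked position from what the
# other sequences show at that column.
aligned = [
    "MKTAYIAK",
    "MKTAYLAK",
    "MRTAYIAK",
    "MKTGYIAK",
]

def predict_masked(column: int, mask_row: int) -> str:
    """Predict the residue at `column` in `mask_row` from the other rows."""
    votes = Counter(seq[column] for i, seq in enumerate(aligned) if i != mask_row)
    return votes.most_common(1)[0][0]

# Mask position 1 of the third sequence ('R') and predict from the rest:
print(predict_masked(column=1, mask_row=2))  # majority of the other rows: 'K'
```

A real PLM replaces the column vote with a learned function of the *entire* sequence context, which is what lets it generalise to sequences it has never seen.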
What you will learn¶
This notebook demonstrates ESM-2 protein language model embeddings via synth-pdb.
| What is encoded | Why it is useful |
|---|---|
| Evolutionary conservation | Co-varying positions across species |
| Structural context | Buried vs. solvent-exposed |
| Chemical environment | Charged / polar / hydrophobic neighbourhoods |
| Functional signals | Active-site residues |
All from sequence alone -- no 3D coordinates.
Notebook outline¶
- Setup & Installation
- Generate synthetic protein structures with synth-pdb
- Embed sequences with ESM-2 -- (L, 320) float32 arrays
- Visualise per-residue embedding heatmaps
- Similarity heatmap -- compare proteins in embedding space
- Cluster residues in 2D with UMAP
- Linear structural probe -- predict secondary structure from embeddings alone
- Embed a structure generated by synth-pdb
1. Setup & Installation¶
⚠️ Colab users: The setup cell installs dependencies and automatically restarts the kernel once. This is expected — just wait ~10 seconds and Run All Cells again.
# @title Setup & Installation { display-mode: "form" }
import os
import sys
from pathlib import Path

# ── Local development path (ignored on Colab) ────────────────────────────
try:
    repo_root = Path(".").resolve().parent.parent
    if (repo_root / "synth_pdb").exists():
        if str(repo_root) not in sys.path:
            sys.path.insert(0, str(repo_root))
        print(f"📌 Local library: {repo_root}")
except Exception:
    pass

# ── Colab installation ────────────────────────────────────────────────────
if 'google.colab' in str(get_ipython()):
    if not os.path.exists("plm_installed.marker"):
        print("🔧 Installing synth-pdb[plm] and visualisation dependencies ...")
        get_ipython().run_line_magic('pip', 'install -q "synth-pdb[plm]" umap-learn matplotlib seaborn')
        with open("plm_installed.marker", "w") as f:
            f.write("done")
        print("🔄 Installation complete. KERNEL RESTARTING ...")
        print("⚠️ Wait ~10 seconds, then Run All Cells again.")
        os.kill(os.getpid(), 9)
    else:
        print("✅ Dependencies ready.")
else:
    import synth_pdb
    print(f"✅ Local synth-pdb {synth_pdb.__version__}")
    print("   Make sure you have installed: pip install 'synth-pdb[plm]' umap-learn seaborn")
What is an 'embedding'?¶
An embedding is a list of numbers (a vector) that represents something in a form a computer can do maths on. Think of it as a coordinate in a very high-dimensional space, where similar things end up close together.
For example, in word embeddings, the vectors for king and queen point in similar directions, and so do cat and dog. In protein embeddings, residues that play similar biochemical roles -- e.g., positively charged Lys and Arg -- end up in nearby regions of the embedding space.
ESM-2 produces one 320-number vector per residue. A 20-residue peptide therefore gives a matrix of shape 20 x 320. The 320 dimensions have no single human-interpretable meaning -- they are learned automatically by the model to be useful for predicting masked residues.
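To make the "similar things end up close together" idea concrete, here is a toy with hand-made 3-number "embeddings". The vectors and dimension meanings below are invented for illustration -- they are not real ESM-2 values, which have 320 learned dimensions with no assigned meaning.

```python
import numpy as np

# Hand-made 3-number "embeddings" -- purely illustrative, NOT ESM-2 values.
# The toy dimensions loosely mean: [positive charge, negative charge, size].
emb = {
    "K (Lys)": np.array([0.9, 0.0, 0.6]),  # positively charged, medium-large
    "R (Arg)": np.array([1.0, 0.0, 0.7]),  # positively charged, large
    "D (Asp)": np.array([0.0, 0.9, 0.4]),  # negatively charged, smaller
}

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"Lys vs Arg: {cosine(emb['K (Lys)'], emb['R (Arg)']):.2f}")  # close to 1
print(f"Lys vs Asp: {cosine(emb['K (Lys)'], emb['D (Asp)']):.2f}")  # much lower
```

The two positively charged residues point in nearly the same direction; the negatively charged one does not -- the same geometry that the real 320-dimensional embeddings exhibit.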
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from synth_pdb.plm import ESM2Embedder
# Instantiate the embedder — model loads lazily on first embed() call
embedder = ESM2Embedder()
print(f"ESM2Embedder ready (model: {embedder.model_name})")
print(f"Embedding dim: {embedder.embedding_dim} (will be 320 for the default t6_8M model)")
2. Generate Synthetic Structures with synth-pdb¶
We use synth-pdb to create three homopolymers -- peptides made of a single
amino acid type repeated many times. Why homopolymers?
- Controlled test cases: every residue sees the same local sequence context, so differences in embeddings between the three homopolymers must arise from amino acid identity, not from varying neighbours.
- Known secondary structure preferences: Ala strongly prefers alpha-helices; Val (beta-branched, bulky side chain) strongly prefers beta-strands; Gly (no side chain) is the most flexible and tends to form coil or turns.
| Protein | Sequence | Expected conformation | Chemical reason |
|---|---|---|---|
| Polyalanine | AAAAAAAAAAAAAAAAAA (18 res) | alpha-helix | Ala has the highest helix propensity of all amino acids |
| Polyvaline | VVVVVVVVVVVVVVVVVV (18 res) | beta-strand | Val's beta-branching clashes sterically in a helix |
| Polyglycine | GGGGGGGGGGGGGGGGGG (18 res) | Coil | Gly has no side chain, giving it near-unrestricted backbone freedom |
| Ubiquitin N-term | MQIFVKTLTGKTITLEVEPSDT (22 res) | Mixed | Fragment of a real 76-residue eukaryotic regulatory protein |
Helix propensity (Pace & Scholtz, 1998, Biophysical Journal) is the thermodynamic tendency of an amino acid to adopt an alpha-helical backbone conformation (phi ~-57 degrees, psi ~-47 degrees). It can be measured experimentally by substituting single residues into an alanine-based host peptide and measuring the change in helix content by circular dichroism (CD) spectroscopy.
# Our test sequences — we'll embed these and compare
sequences = {
    "Poly-Ala (helix)": "AAAAAAAAAAAAAAAAAA",
    "Poly-Val (strand)": "VVVVVVVVVVVVVVVVVV",
    "Poly-Gly (coil)": "GGGGGGGGGGGGGGGGGG",
    "Ubiquitin N-term": "MQIFVKTLTGKTITLEVEPSDT",
    "Villin HP35": "LSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF",
    "Trp-cage": "NLYIQWLKDGGPSSGRPPPS",
}

print("Sequences loaded:")
for name, seq in sequences.items():
    print(f"  {name:25s} {len(seq):3d} residues  {seq[:20]}{'...' if len(seq) > 20 else ''}")
3. Embed with ESM-2¶
The first call downloads the model weights (~30 MB for the smallest variant) and loads the transformer. Subsequent calls are fast (~10 ms per protein on CPU).
Output shape: (L, 320) -- one 320-dimensional float32 vector per residue.
Under the hood -- how ESM-2 produces embeddings¶
ESM-2 is a Transformer neural network (the same architecture family behind ChatGPT, but trained on protein sequences rather than text). The key operation is self-attention:
- Each residue starts as a lookup vector based on its amino acid type (like a dictionary: A maps to one vector, V maps to another, ...).
- The transformer stacks 6 layers. In each layer, every residue attends to every other residue -- it updates its own representation by asking: 'which other positions in this sequence are most relevant to understanding me?'
- After 6 such rounds, each residue's 320-number vector encodes not just its own identity but also what its sequence neighbourhood looks like.
This is why an Ala in the middle of a helix gets a different vector than the same Ala next to a Pro (which breaks helices) -- context changes the embedding.
Reference: Lin et al. (2023) Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123-1130. https://doi.org/10.1126/science.ade2574
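A single round of self-attention can be sketched in a few lines of numpy. Random matrices stand in for the learned projection weights here; real ESM-2 adds multiple attention heads, feed-forward blocks, and layer normalisation, so this is a minimal sketch of the core operation only.

```python
import numpy as np

# Minimal single-head self-attention over a toy "sequence" of 8 residues.
rng = np.random.default_rng(0)
L, d = 8, 16                       # 8 residues, 16-dim toy embeddings
x = rng.normal(size=(L, d))        # one vector per residue

# Query/key/value projections -- learned in a real model, random here.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / np.sqrt(d)                   # (L, L): relevance of j to i
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # softmax: each row sums to 1
out = weights @ V                               # each residue = weighted mix

print(weights.shape, out.shape)  # (8, 8) (8, 16)
```

Row i of `weights` is exactly the "which other positions are most relevant to understanding me?" distribution described above; stacking six such layers (with learned weights) gives the context-dependent vectors ESM-2 returns.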
import time
print("Embedding sequences with ESM-2 t6_8M (320-dim) ...")
print("(First call downloads ~30 MB model weights — takes ~5-30 s)\n")
embeddings = {}
for name, seq in sequences.items():
    t0 = time.time()
    emb = embedder.embed(seq)  # (L, 320) float32
    dt_ms = (time.time() - t0) * 1000
    embeddings[name] = emb
    print(f"  {name:25s} shape={emb.shape} dtype={emb.dtype} {dt_ms:.1f} ms")
print("\n✅ All embeddings computed.")
4. Per-Residue Embedding Heatmaps¶
Each row = one residue; each column = one of the 320 latent dimensions. The colour shows the activation value, normalised per row so the brightest colour in each row is 1.0 and the darkest is 0.0.
What to look for:
| Pattern | Interpretation |
|---|---|
| Uniform rows (same colour across a row) | Every residue sees the same context -- expected for a homopolymer |
| Vertical stripes (same column highlighted for many rows) | That dimension is activated by a particular residue type |
| Highly variable rows | Each residue has a unique chemical neighbourhood |
Poly-Ala should look very uniform (all Ala, all identical context), while Ubiquitin N-term should have noticeably more variation between rows.
Analogy for NMR spectroscopists: these heatmaps are analogous to a 2D HSQC spectrum, where each cross-peak encodes the chemical environment of one NH group. Just as HSQC peak positions shift when you mutate a neighbouring residue, the embedding changes when sequence context changes.
fig, axes = plt.subplots(2, 3, figsize=(18, 8))
axes = axes.flatten()
for ax, (name, emb) in zip(axes, embeddings.items()):
    # Normalise each row (residue) to [0, 1] for visual comparison
    row_min = emb.min(axis=1, keepdims=True)
    row_max = emb.max(axis=1, keepdims=True)
    normed = (emb - row_min) / (row_max - row_min + 1e-8)
    im = ax.imshow(normed, aspect="auto", cmap="RdBu_r",
                   interpolation="nearest", vmin=0, vmax=1)
    ax.set_title(f"{name}\n({emb.shape[0]} residues × {emb.shape[1]} dims)",
                 fontsize=10, fontweight="bold")
    ax.set_xlabel("ESM-2 embedding dimension", fontsize=8)
    ax.set_ylabel("Residue position", fontsize=8)
    ax.tick_params(labelsize=7)
fig.colorbar(im, ax=axes[-1], label="Activation (row-normalised)", shrink=0.8)
fig.suptitle("ESM-2 Per-Residue Embedding Heatmaps",
             fontsize=14, fontweight="bold", y=1.01)
plt.tight_layout()
plt.show()
print("""
Observations:
• Poly-Ala rows are nearly identical — every residue sees the same context.
• Ubiquitin rows vary noticeably — each residue has a unique neighborhood.
• Vertical stripes = dimensions strongly activated by a residue-type pattern.
""")
5. Sequence Similarity in Embedding Space¶
We collapse each (L, 320) matrix to a single (320,) vector by mean pooling
(averaging the 320-number vectors over all L residues). This gives one vector per
protein representing the whole sequence.
We then compute cosine similarity between all pairs:
$$\text{sim}(A, B) = \frac{\overline{\text{emb}}(A) \cdot \overline{\text{emb}}(B)}{\|\overline{\text{emb}}(A)\| \cdot \|\overline{\text{emb}}(B)\|}$$

What is cosine similarity?¶
Think of two arrows starting from the same point in 320-dimensional space. Cosine similarity measures the angle between them -- not their lengths:
- 1.0 -- identical directions -- the two proteins look the same to ESM-2
- 0.0 -- perpendicular -- completely uncorrelated representations
- Proteins with the same amino acid composition but different order can still have high similarity if they fold similarly.
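The mean-pool-then-cosine recipe is a few lines of numpy. The sketch below uses random stand-in matrices in place of real `(L, 320)` embeddings from `embedder.embed(...)`, but the function is exactly the formula above.

```python
import numpy as np

def mean_pooled_cosine(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity of the per-protein mean embeddings (each (L, D))."""
    a = emb_a.mean(axis=0)  # collapse (L, D) -> (D,)
    b = emb_b.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
A = rng.normal(size=(18, 320))                 # stand-in (L, 320) embedding
B = A + rng.normal(scale=0.05, size=A.shape)   # near-duplicate protein
C = rng.normal(size=(22, 320))                 # unrelated protein

print(f"A vs B: {mean_pooled_cosine(A, B):.3f}")  # close to 1
print(f"A vs C: {mean_pooled_cosine(A, C):.3f}")  # close to 0
```

Note that mean pooling handles proteins of different lengths naturally: both `(18, 320)` and `(22, 320)` collapse to `(320,)` before comparison.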
What to expect:
- Poly-Ala and Poly-Val: high similarity -- both are simple, repetitive homopolymers
- Poly-Gly: lower -- Gly is biochemically unique (only achiral amino acid, no side chain, exceptional backbone flexibility)
- Ubiquitin / Villin / Trp-cage: moderate -- all are real compact folded proteins
Why not just use BLAST sequence identity? Two proteins with <20% sequence identity (the 'twilight zone') can still perform identical functions and have the same fold. Cosine similarity in PLM embedding space can detect functional similarity even between evolutionarily distant homologues where BLAST gives no meaningful alignment.
names = list(sequences.keys())
n = len(names)
# Compute mean embeddings → (N, 320)
mean_embs = np.stack([embeddings[nm].mean(axis=0) for nm in names])
# Normalise rows → unit vectors, then dot product = cosine similarity
norms = np.linalg.norm(mean_embs, axis=1, keepdims=True)
normed = mean_embs / (norms + 1e-8)
sim_matrix = normed @ normed.T # (N, N)
# Plot
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(
    sim_matrix,
    xticklabels=names, yticklabels=names,
    annot=True, fmt=".3f",
    cmap="YlOrRd", vmin=0.5, vmax=1.0,
    linewidths=0.5,
    ax=ax,
)
ax.set_title("ESM-2 Sequence Similarity (cosine, mean-pooled)",
             fontsize=13, fontweight="bold", pad=12)
ax.set_xticklabels(ax.get_xticklabels(), rotation=30, ha="right", fontsize=9)
ax.set_yticklabels(ax.get_yticklabels(), rotation=0, fontsize=9)
plt.tight_layout()
plt.show()
# Print commentary
for i, a in enumerate(names):
    for j, b in enumerate(names):
        if j > i:
            s = sim_matrix[i, j]
            label = "high" if s > 0.85 else ("moderate" if s > 0.75 else "low")
            print(f"  {a:25s} ↔ {b:25s} sim={s:.3f} ({label})")
6. Residue-Level UMAP Clustering¶
Each residue is a point in 320-dimensional space. We use UMAP to project down to 2D so we can visualise it.
What is dimensionality reduction?¶
Imagine flattening a globe onto a flat map -- you inevitably lose some information, but relative positions are preserved as faithfully as possible. UMAP (Uniform Manifold Approximation and Projection, McInnes et al. 2018) is a widely used algorithm that preserves local neighbourhood structure: if two residues are close in 320-D, they will be close in the 2D plot.
Key question: do residues cluster by amino acid identity in this 2D map, even though ESM-2 was never explicitly told what a residue is -- only which positions were masked?
The answer is yes. ESM-2 implicitly learns amino acid identity through the pattern of masked predictions. In UMAP space, Ala, Val, Gly etc. tend to form distinct clouds.
Note on homopolymers: Poly-Ala, Poly-Val, and Poly-Gly will each appear as a single tight dot (all residues share identical context), while Ubiquitin and Villin will appear as clouds of distinct points -- one point per residue.
Reference: McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426.
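For contrast with UMAP, the simplest dimensionality reduction -- linear PCA -- fits in a few lines of numpy via SVD. It rarely separates clusters as cleanly as UMAP's non-linear mapping, but it makes the "project 320-D down to 2-D" idea concrete. The data below is a random stand-in, not real embeddings.

```python
import numpy as np

def pca_2d(X: np.ndarray) -> np.ndarray:
    """Project the rows of X onto their top-2 principal components via SVD."""
    Xc = X - X.mean(axis=0)                       # centre each dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                          # (N, 2) coordinates

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 320))  # stand-in for 120 residue embeddings
Z = pca_2d(X)
print(Z.shape)  # (120, 2)
```

PCA preserves the directions of greatest global variance; UMAP instead preserves local neighbourhoods, which is why it tends to show tighter, better-separated clusters for this kind of data.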
try:
    import umap

    # Stack all residue embeddings with labels
    all_embs = []
    all_labels = []
    all_source = []
    for name, seq in sequences.items():
        emb = embeddings[name]  # (L, 320)
        for i, (vec, aa) in enumerate(zip(emb, seq)):
            all_embs.append(vec)
            all_labels.append(aa)  # single-letter amino acid code
            all_source.append(name)
    X = np.stack(all_embs)         # (N_residues, 320)
    labels = np.array(all_labels)  # amino acid 1-letter codes

    print(f"Running UMAP on {len(X)} residue embeddings (320-dim → 2D) ...")
    reducer = umap.UMAP(n_components=2, random_state=42, n_neighbors=8, min_dist=0.3)
    Z = reducer.fit_transform(X)   # (N_residues, 2)

    # Colour by amino acid identity
    unique_aas = sorted(set(labels))
    cmap = plt.cm.get_cmap("tab20", len(unique_aas))
    color_map = {aa: cmap(i) for i, aa in enumerate(unique_aas)}
    colors = [color_map[aa] for aa in labels]

    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    # Left: color by amino acid
    ax = axes[0]
    for aa in unique_aas:
        mask = labels == aa
        ax.scatter(Z[mask, 0], Z[mask, 1], c=[color_map[aa]],
                   label=aa, s=40, alpha=0.75, edgecolors="white", linewidths=0.3)
    ax.legend(title="Amino acid", bbox_to_anchor=(1.0, 1), loc="upper left",
              fontsize=8, title_fontsize=9, ncol=2)
    ax.set_title("UMAP of ESM-2 residue embeddings\nColoured by amino acid identity",
                 fontweight="bold")
    ax.set_xlabel("UMAP-1")
    ax.set_ylabel("UMAP-2")

    # Right: color by source protein
    ax = axes[1]
    source_labels = np.array(all_source)
    unique_sources = list(sequences.keys())
    scmap = plt.cm.get_cmap("Set2", len(unique_sources))
    for i, src in enumerate(unique_sources):
        mask = source_labels == src
        ax.scatter(Z[mask, 0], Z[mask, 1], c=[scmap(i)],
                   label=src, s=40, alpha=0.75, edgecolors="white", linewidths=0.3)
    ax.legend(title="Protein", bbox_to_anchor=(1.0, 1), loc="upper left",
              fontsize=8, title_fontsize=9)
    ax.set_title("UMAP of ESM-2 residue embeddings\nColoured by source protein",
                 fontweight="bold")
    ax.set_xlabel("UMAP-1")
    ax.set_ylabel("UMAP-2")

    plt.tight_layout()
    plt.show()

    print("""
What to look for:
  • Same amino acid in different proteins → clusters together (ESM-2 learned residue identity)
  • Points from the same protein → may also cluster (ESM-2 learned local context)
  • Poly-Ala / Poly-Val / Poly-Gly → tight single-residue dots (all identical context within each)
""")
except ImportError:
    print("umap-learn not installed. Run: pip install umap-learn")
    print("Then re-run this cell.")
7. Linear Secondary Structure Probe¶
Can a single linear layer on top of ESM-2 embeddings predict secondary structure?
What is a 'linear probe'?¶
A linear probe is the simplest possible ML classifier: it multiplies the embedding vector by a weight matrix and adds a bias -- just one matrix multiplication. If this step achieves high accuracy, secondary structure is already explicitly encoded in the embedding and requires no further non-linear processing to extract.
Analogy: if you can read the temperature off a thermometer by looking at a number (linear), the information is visually explicit. If you needed to do a complex chemical assay (non-linear), the information would be deeply implicit.
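In code, a linear probe really is one matrix multiplication plus a softmax. The weights below are random placeholders purely to show the shape of the computation; in the experiment that follows, they are fitted by logistic regression.

```python
import numpy as np

# A linear probe: class probabilities = softmax(x @ W + b). Nothing else.
rng = np.random.default_rng(0)
D, n_classes = 320, 3                 # embedding dim; {helix, strand, coil}
W = rng.normal(size=(D, n_classes))   # weight matrix (learned in practice)
b = np.zeros(n_classes)               # bias vector (learned in practice)

def probe(embedding: np.ndarray) -> np.ndarray:
    """Class probabilities for one residue embedding."""
    logits = embedding @ W + b
    p = np.exp(logits - logits.max())  # numerically stable softmax
    return p / p.sum()

x = rng.normal(size=D)  # stand-in for one residue's 320-dim embedding
p = probe(x)
print(p.round(3))       # three probabilities summing to 1
```

If this one affine map suffices for high accuracy, the class information must already lie along (near-)linear directions in the embedding space -- that is the entire point of the probe.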
The experiment¶
We build a small labelled dataset from synth-pdb:
- 30 residues of Poly-Ala (dominant helix tendency) -- label 0 (Helix)
- 30 residues of Poly-Val (dominant strand tendency) -- label 1 (Strand)
- 30 residues of Poly-Gly (dominant coil tendency) -- label 2 (Coil)
Then a logistic regression (a linear layer followed by softmax to produce class probabilities) is evaluated by 5-fold cross-validation: the data is split into 5 equal chunks; the model trains on 4 and tests on the 5th, rotating through all combinations. The 5 accuracy values are averaged to give a robust estimate that avoids memorising any single train/test split.
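The splitting step of cross-validation can be sketched in plain Python (sklearn's cross_val_score does this internally, plus the fitting and scoring on each fold):

```python
# Plain-Python sketch of k-fold cross-validation index splitting.
def k_fold_indices(n_samples: int, k: int = 5):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

# 90 residues (30 Ala + 30 Val + 30 Gly) split into 5 folds of 18:
for fold, (train, test) in enumerate(k_fold_indices(90, k=5)):
    print(f"Fold {fold}: train on {len(train)} residues, test on {len(test)}")
```

Every sample lands in exactly one test fold, so each of the 5 accuracy values is measured on data the model never saw during that fold's training. (In practice sklearn also shuffles or stratifies the indices; the contiguous split above is the bare-bones version.)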
Literature benchmark: Rao et al. (2019) Evaluating protein transfer learning with TAPE (NeurIPS) reported ~70-80% SS3 accuracy on real protein datasets with a linear probe over ESM embeddings. Our synthetic homopolymer dataset is simpler, so accuracy should be very high -- but the principle is identical.
What to expect: Near-perfect accuracy. The three homopolymers represent very different biochemical environments; ESM-2 should encode helix/strand/coil preferences linearly from the amino acid type alone.
import warnings
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
# Build a labelled dataset of residue embeddings
# Label 0 = helix (Ala) | Label 1 = strand (Val) | Label 2 = coil (Gly)
probe_seqs = [
    ("A" * 30, 0, "Helix (Ala)"),
    ("V" * 30, 1, "Strand (Val)"),
    ("G" * 30, 2, "Coil (Gly)"),
    # Additional diversity: mixed sequences
    ("AAAVVVAAA" * 3, None, "Mixed"),  # ground truth is ambiguous, skip
]

X_probe, y_probe = [], []
for seq, label, _desc in probe_seqs:
    if label is None:
        continue
    emb = embedder.embed(seq)  # (L, 320)
    for row in emb:
        X_probe.append(row)
        y_probe.append(label)
X_probe = np.array(X_probe)
y_probe = np.array(y_probe)
print(f"Dataset: {X_probe.shape[0]} residues × {X_probe.shape[1]} embedding dims")
print(f"Classes: 0=Helix ({np.sum(y_probe==0)}), 1=Strand ({np.sum(y_probe==1)}), 2=Coil ({np.sum(y_probe==2)})")
# Standardise features (logistic regression converges better)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_probe)
# 5-fold cross-validation with a linear probe
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    clf = LogisticRegression(max_iter=500, C=1.0, random_state=42)
    cv_scores = cross_val_score(clf, X_scaled, y_probe, cv=5, scoring="accuracy")
print(f"\n5-fold CV accuracy: {cv_scores.mean():.1%} ± {cv_scores.std():.1%}")
print(f"Individual folds: {[f'{s:.1%}' for s in cv_scores]}")
# Fit on full data and show confusion matrix
clf.fit(X_scaled, y_probe)
y_pred = clf.predict(X_scaled)
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
cm = confusion_matrix(y_probe, y_pred)
fig, ax = plt.subplots(figsize=(5, 4))
disp = ConfusionMatrixDisplay(cm, display_labels=["Helix (Ala)", "Strand (Val)", "Coil (Gly)"])
disp.plot(ax=ax, cmap="Blues", colorbar=False)
ax.set_title(f"Linear Probe on ESM-2 Embeddings\nCV accuracy: {cv_scores.mean():.1%}",
             fontweight="bold", pad=10)
plt.tight_layout()
plt.show()
print("""
Interpretation:
High accuracy → secondary structure is LINEARLY ENCODED in ESM-2 embeddings.
The model learned to distinguish helix/strand/coil from sequence context alone,
with no 3D coordinates in the training data.
This is the same principle that enables ESMFold to predict full 3D structure.
""")
8. Embedding a Structure from synth-pdb¶
We can also pass a biotite AtomArray (a table of atoms with 3D coordinates)
directly to embed_structure(). The method extracts the amino acid sequence from
the residue names and calls embed() under the hood.
Why combine structure generation + PLM embeddings?¶
| synth-pdb gives you | ESM-2 gives you |
|---|---|
| Realistic 3D atomic coordinates | Sequence-derived feature vectors |
| Backbone geometry (phi/psi angles, bond lengths) | Evolutionary & structural context per residue |
| PDB format for downstream software | Compact 320-dim representation for ML |
Combining both lets you build models that use geometric features (from 3D structure) and sequence context features (from ESM-2) -- a combination used in state-of-the-art protein quality assessment and design pipelines.
Note: embed_structure() only uses the sequence extracted from the AtomArray; it ignores the 3D coordinates. To use both, extract the embeddings here and compute geometric descriptors (angles, SASA, etc.) separately, then concatenate them as features.
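The sequence-extraction step amounts to a 3-letter-to-1-letter lookup over the per-residue names. The sketch below uses the standard IUPAC code; the actual synth_pdb implementation may differ in details (e.g. handling of non-standard residues).

```python
# Standard IUPAC 3-letter -> 1-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def residue_names_to_sequence(res_names: list) -> str:
    """Convert per-residue 3-letter codes to a 1-letter sequence string."""
    return "".join(THREE_TO_ONE[name] for name in res_names)

# E.g. the first five residues of the ubiquitin fragment used earlier:
print(residue_names_to_sequence(["MET", "GLN", "ILE", "PHE", "VAL"]))  # MQIFV
```

The resulting string is what gets passed to embed(), so the embeddings are identical whether you start from a sequence or a structure.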
import io
import biotite.structure as struc
import biotite.structure.io.pdb as pdb
from synth_pdb.generator import generate_pdb_content
# Generate a synthetic 20-residue peptide using synth-pdb.
# 'conformation="alpha"' sets all backbone torsion angles to canonical
# alpha-helix values (phi ~ -57 deg, psi ~ -47 deg).
# 'minimize_energy=False' skips OpenMM energy minimisation for speed.
pdb_content = generate_pdb_content(
    sequence_str="ALA-ALA-ALA-ALA-ALA-GLY-LEU-ALA-ALA-ALA-ALA-ALA-SER-ALA-ALA-ALA-ALA-ALA-ALA-ALA",
    conformation="alpha",
    minimize_energy=False,
)
# Parse the PDB text into a biotite AtomArray --
# a structured table of atoms with coordinates, chain IDs, residue names, etc.
pdb_file = pdb.PDBFile.read(io.StringIO(pdb_content))
structure = pdb_file.get_structure(model=1)
# Keep only backbone atoms (N, C-alpha, C, O) to discard hydrogens
# and side chain atoms. For embedding we only need the residue sequence,
# but filtering to backbone atoms is good practice for downstream geometry work.
backbone = structure[np.isin(structure.atom_name, ["N", "CA", "C", "O"])]
print(f"Generated structure: {len(struc.get_residues(backbone)[0])} residues, {len(backbone)} atoms")
# embed_structure() reads the 3-letter residue names from the AtomArray,
# converts them to 1-letter codes (ALA->A, GLY->G, etc.), then calls embed().
# Output: (L, 320) float32 numpy array -- one embedding vector per residue.
struct_emb = embedder.embed_structure(backbone)
print(f"PLM embedding shape: {struct_emb.shape}")
print(f"dtype: {struct_emb.dtype}")
# Plot the embedding matrix transposed so residues run along the x-axis.
# Each vertical slice is the 320-number 'fingerprint' for one residue.
fig, ax = plt.subplots(figsize=(12, 3))
im = ax.imshow(struct_emb.T, aspect="auto", cmap="RdBu_r",
               interpolation="nearest")
ax.set_xlabel("Residue position", fontsize=10)
ax.set_ylabel("ESM-2 dimension (320)", fontsize=10)
ax.set_title("synth-pdb -> ESM-2 embedding: 20-residue synthetic helix",
             fontsize=11, fontweight="bold")
plt.colorbar(im, ax=ax, label="Activation")
plt.tight_layout()
plt.show()
print("\nThe embedder extracted the sequence from the AtomArray residue names,")
print("passed it through ESM-2, and returned a (L, 320) float32 matrix.")
print("This can be used directly as node features in a GNN, or concatenated")
print("with geometric features (bond angles, SASA, etc.) for richer input representations.")
9. Summary & Next Steps¶
What we demonstrated¶
| Step | Result |
|---|---|
| embedder.embed(seq) | (L, 320) per-residue float32 matrix |
| embedder.embed_structure(arr) | Same, directly from an AtomArray |
| Heatmaps | Poly-Ala uniform; ubiquitin diverse — context is encoded |
| Similarity matrix | Repetitive peptides cluster; real proteins moderate similarity |
| UMAP | Residues cluster by amino acid identity in 2D |
| Linear probe | High SS accuracy → structural info linearly accessible |
Using a more powerful model¶
The API is identical for all ESM-2 variants:
# Current: 8M params, 320-dim
embedder = ESM2Embedder()
# Better: 35M params, 480-dim (~140 MB)
embedder = ESM2Embedder(model_name="facebook/esm2_t12_35M_UR50D")
# Best for research: 650M params, 1280-dim (~2.5 GB, use GPU)
embedder = ESM2Embedder(model_name="facebook/esm2_t33_650M_UR50D", device="cuda")
Downstream integration ideas¶
# 1. Feed into the synth-pdb GNN quality scorer as node features
plm_feats = embedder.embed_structure(structure) # (L, 320)
node_features = np.concatenate([geo_feats, plm_feats], axis=-1)
# 2. Pairwise contact prediction (outer product trick)
emb = embedder.embed(sequence) # (L, 320)
outer = np.einsum('id,jd->ij', emb, emb) # (L, L) — feed into CNN
# 3. Retrieval by function similarity
sim = embedder.sequence_similarity(query_seq, candidate_seq)
Reference¶
Lin, Z. et al. (2023) Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130.
https://doi.org/10.1126/science.ade2574