🌌 AI Latent Space Explorer¶
Visualizing How Protein AI Models "See" Structural Diversity¶
🎯 What You'll Learn¶
Protein folding AI models like trRosetta and the original AlphaFold don't predict 3D coordinates directly. Instead, they:
- Predict inter-residue geometry (distances and orientations)
- Use these 2D "maps" to reconstruct 3D structure
In this tutorial:
- 🚀 Generate 500 protein conformations in parallel using BatchedGenerator
- 📐 Compute 6D trRosetta orientograms (the AI's "view" of structure)
- 🎨 Use PCA to visualize the high-dimensional "latent space" in 2D
- 🔍 Explore individual structures from the latent space galaxy
💡 Why This Matters: Understanding how AI models represent proteins is crucial for:
- Protein structure prediction
- Generative protein design
- Transfer learning in structural biology
# 🔧 Environment Detection & Setup
import os
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('🌐 Running in Google Colab')
    try:
        import synth_pdb
        print('   ✅ synth-pdb already installed')
    except ImportError:
        print('   📦 Installing synth-pdb...')
        !pip install -q synth-pdb py3Dmol biotite
        print('   ✅ Installation complete')
    import plotly.io as pio
    pio.renderers.default = 'colab'
else:
    print('💻 Running in local Jupyter environment')
    sys.path.append(os.path.abspath('../../'))

print('✅ Environment configured!')
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
import plotly.graph_objects as go
import py3Dmol
from IPython.display import HTML, clear_output, display
from ipywidgets import IntSlider
from sklearn.decomposition import PCA
from synth_pdb import PeptideGenerator
from synth_pdb.batch_generator import BatchedGenerator
print('✅ Latent Space Explorer Ready!')
📚 Theoretical Foundation¶
Protein Representation Learning¶
Why not use Cartesian coordinates directly?
Cartesian coordinates (x, y, z) have several problems:
- Not rotation-invariant: Same structure, different orientation = different coordinates
- Not translation-invariant: Same structure, different position = different coordinates
- High dimensional: N atoms × 3 coordinates = 3N dimensions
Solution: Inter-residue Geometry
Instead, AI models use pairwise geometric relationships between residues:
| Feature | Symbol | Description | Range |
|---|---|---|---|
| Distance | d | Cβ-Cβ distance | 0-20 Å |
| Omega | ω | Dihedral Cα-Cβ-Cβ-Cα (symmetric) | -180° to +180° |
| Theta | θ | Dihedral N-Cα-Cβ-Cβ (asymmetric) | -180° to +180° |
| Phi | φ | Planar angle Cα-Cβ-Cβ (asymmetric) | 0° to 180° |
Because θ and φ depend on which residue of the pair comes first, they are computed in both directions. Each residue pair therefore contributes six values ($d$, $\omega$, $\theta_{ij}$, $\theta_{ji}$, $\varphi_{ij}$, $\varphi_{ji}$): the 6D orientogram.
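The distance component of the orientogram is simple enough to sketch directly. Below is a minimal illustration (not the library's implementation) that computes a pairwise Cβ-Cβ distance map from a hypothetical (N, 3) coordinate array using NumPy broadcasting:

```python
import numpy as np

def distance_map(cb: np.ndarray) -> np.ndarray:
    """Pairwise Cβ-Cβ distances for an (N, 3) coordinate array."""
    diff = cb[:, None, :] - cb[None, :, :]   # (N, N, 3) displacement vectors
    return np.linalg.norm(diff, axis=-1)     # (N, N) symmetric distance matrix

# Toy example: 3 residues on a straight line, 3.8 Å apart
cb = np.array([[0.0, 0.0, 0.0],
               [3.8, 0.0, 0.0],
               [7.6, 0.0, 0.0]])
d = distance_map(cb)
```

The resulting matrix is symmetric with a zero diagonal, which is exactly the structure of the `dist` feature computed later in this tutorial.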
Dimensionality Reduction: PCA¶
Principal Component Analysis finds the directions of maximum variance in high-dimensional data.
Given data matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ (n samples, d features):
1. Center the data: $\mathbf{X}_{centered} = \mathbf{X} - \bar{\mathbf{X}}$
2. Compute covariance: $\mathbf{C} = \frac{1}{n-1}\mathbf{X}_{centered}^T \mathbf{X}_{centered}$
3. Eigendecomposition: $\mathbf{C} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^T$
4. Project: $\mathbf{Z} = \mathbf{X}_{centered}\mathbf{V}_k$ (keep the top $k$ eigenvectors)
Why PCA for proteins?
- Reveals the "principal modes" of structural variation
- Reduces ~1000 dimensions to 2-3 for visualization
- Preserves maximum variance (information)
🔬 Alternative: t-SNE preserves local structure better but is non-linear and slower
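The four steps above can be sketched in a few lines of NumPy. This is an illustrative re-derivation on synthetic data, not a replacement for sklearn's `PCA`, which the tutorial uses later:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))  # correlated features

# 1. Center the data
Xc = X - X.mean(axis=0)
# 2. Covariance matrix (unbiased, divide by n-1)
C = Xc.T @ Xc / (X.shape[0] - 1)
# 3. Eigendecomposition (eigh: C is symmetric; eigenvalues come back ascending)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]   # sort descending by variance
Vk = eigvecs[:, order[:2]]          # keep the top-2 eigenvectors
# 4. Project into the 2D latent space
Z = Xc @ Vk                         # (200, 2)

var_ratio = eigvals[order[:2]] / eigvals.sum()
```

A useful sanity check: the variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalue ratio doubles as "variance explained".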
1. Parallel Structure Generation¶
We'll generate 500 diverse conformations of a 7-residue peptide using BatchedGenerator, which uses vectorized NumPy operations for speed.
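For intuition, here is a minimal single-atom sketch of the NeRF (Natural Extension Reference Frame) idea: each new atom is placed from the three preceding atoms using a bond length, bond angle, and torsion. The real `BatchedGenerator` vectorizes this over whole batches; the scalar function below is only an illustration of the geometry:

```python
import numpy as np

def nerf_place(a, b, c, r, theta, phi):
    """Place atom d given the three preceding atoms a, b, c,
    bond length r (c-d), bond angle theta (b-c-d), and
    torsion phi (a-b-c-d), angles in radians."""
    # Coordinates of d in the local frame defined by a, b, c
    d_local = r * np.array([-np.cos(theta),
                            np.sin(theta) * np.cos(phi),
                            np.sin(theta) * np.sin(phi)])
    bc = c - b
    bc /= np.linalg.norm(bc)
    n = np.cross(b - a, bc)            # normal to the a-b-c plane
    n /= np.linalg.norm(n)
    m = np.cross(n, bc)
    M = np.stack([bc, m, n], axis=1)   # rotation matrix: local frame axes as columns
    return M @ d_local + c

# Place an atom at 1 Å from c, 90° bond angle, 0° torsion
a = np.array([0.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 0.0])
c = np.array([1.0, 0.0, 0.0])
d = nerf_place(a, b, c, r=1.0, theta=np.deg2rad(90), phi=0.0)
```

Chaining this placement along the backbone, with φ/ψ torsions drawn per structure, is what lets the generator sample many conformations of one sequence.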
sequence = "TRP-SER-GLY-ALA-VAL-PRO-ILE"
n_batch = 500
display(HTML(f"""
<div style='background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white; padding: 15px; border-radius: 10px;
font-family: monospace; margin-bottom: 15px;'>
<b>🚀 Batch Generation</b><br>
Sequence: {sequence}<br>
Structures: {n_batch}<br>
Method: Vectorized NeRF algorithm
</div>
"""))
# drift=20.0 adds Gaussian noise (σ=20°) to backbone φ/ψ angles so each
# structure samples a *different* conformation. Without drift every replica is
# identical, all feature vectors are the same, and PCA gives PC1=PC2=0 for all.
# We mix α-helix and extended conformations (250 each) so PCA shows two
# well-separated clusters — a much more instructive latent space.
n_alpha = n_batch // 2
n_extended = n_batch - n_alpha
print(f"Generating {n_alpha} α-helix + {n_extended} extended structures (drift=20°)...")
bg_alpha = BatchedGenerator(sequence, n_batch=n_alpha)
bg_ext = BatchedGenerator(sequence, n_batch=n_extended)
batch_alpha = bg_alpha.generate_batch(conformation='alpha', drift=20.0, seed=42)
batch_extended = bg_ext.generate_batch(conformation='extended', drift=20.0, seed=99)
class _CombinedBatch:
    """Thin wrapper that concatenates two BatchedPeptide objects along the batch axis."""

    def __init__(self, a, b):
        self.coords = np.concatenate([a.coords, b.coords], axis=0)
        self._a, self._b = a, b

    def get_6d_orientations(self):
        oa = self._a.get_6d_orientations()
        ob = self._b.get_6d_orientations()
        return {k: np.concatenate([oa[k], ob[k]], axis=0) for k in oa}

batch = _CombinedBatch(batch_alpha, batch_extended)
print(f"✅ Generated {batch.coords.shape[0]} structures")
print(f" Shape: {batch.coords.shape} (Batch, Atoms, XYZ)")
2. Computing 6D Orientograms¶
For every pair of residues in every protein, we calculate the 6D geometric relationship. This is exactly what trRosetta predicts before reconstructing 3D structure.
The 6D representation:
- Distance (d): Cβ-Cβ distance
- Omega (ω): inter-residue dihedral, symmetric in i and j
- Theta (θ): dihedral measured from each residue's backbone frame, asymmetric (computed in both directions)
- Phi (φ): planar angle, asymmetric (computed in both directions)
Together these give six values per residue pair and create a rotation- and translation-invariant representation of structure.
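The invariance claim is easy to verify numerically: applying a random rigid-body transform (rotation plus translation) to a set of coordinates leaves the pairwise distance map unchanged. A small sketch with toy coordinates:

```python
import numpy as np

rng = np.random.default_rng(1)
coords = rng.normal(size=(7, 3)) * 5.0       # toy "Cβ" coordinates for 7 residues

def dist_map(x):
    return np.linalg.norm(x[:, None] - x[None, :], axis=-1)

# Random orthogonal matrix via QR decomposition (may include a reflection;
# pairwise distances are invariant either way)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
t = rng.normal(size=3) * 10.0                # random translation

moved = coords @ Q.T + t                     # rigid-body transform
```

The same holds for the angular features, since they are defined relative to each residue's local frame rather than to the global axes.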
print("Computing 6D orientograms...")
orients = batch.get_6d_orientations()
print(f"✅ Computed orientations for {n_batch} structures")
print(f" Available features: {list(orients.keys())}")
print(f" Shape per feature: {orients['dist'].shape} (Batch, Residues, Residues)")
3. Dimensionality Reduction with PCA¶
We flatten the 2D geometry maps into high-dimensional feature vectors and use PCA to project them into 2D for visualization.
Feature vector construction:
- Distance maps: 7×7 = 49 features
- Omega maps: 7×7 = 49 features
- Theta maps: 7×7 = 49 features
- Phi maps: 7×7 = 49 features
- Total: 196 dimensions → 2 dimensions via PCA
# Flatten all features into one vector per structure
feature_vector = np.concatenate([
    orients['dist'].reshape(n_batch, -1),
    orients['omega'].reshape(n_batch, -1),
    orients['theta'].reshape(n_batch, -1),
    orients['phi'].reshape(n_batch, -1)
], axis=1)
print(f"Feature vector dimensionality: {feature_vector.shape[1]}")
# Apply PCA
pca = PCA(n_components=2)
latent_points = pca.fit_transform(feature_vector)
# Show variance explained
var_explained = pca.explained_variance_ratio_
print("\n✅ PCA complete")
print(f" PC1 variance: {var_explained[0]:.1%}")
print(f" PC2 variance: {var_explained[1]:.1%}")
print(f" Total variance captured: {var_explained.sum():.1%}")
4. Interactive Latent Space Galaxy¶
Each point represents one protein conformation. Points close together have similar geometric properties. This is the "latent space" - a compressed representation of structural diversity.
import os
fig = go.Figure(data=[go.Scatter(
    x=latent_points[:, 0],
    y=latent_points[:, 1],
    mode='markers',
    marker={
        'size': 8,
        'color': np.arange(n_batch),
        'colorscale': 'Viridis',
        'showscale': True,
        'colorbar': {'title': 'Structure ID'},
        'line': {'width': 0.5, 'color': 'white'}
    },
    text=[f"Structure {i}" for i in range(n_batch)],
    hovertemplate='%{text}<br>PC1: %{x:.2f}<br>PC2: %{y:.2f}<extra></extra>'
)])

fig.update_layout(
    title={
        'text': f'Protein Latent Space (PCA of 6D Orientograms)<br><sub>Variance explained: PC1={var_explained[0]:.1%}, PC2={var_explained[1]:.1%}</sub>',
        'x': 0.5,
        'xanchor': 'center'
    },
    xaxis_title='Principal Component 1',
    yaxis_title='Principal Component 2',
    width=900, height=600,
    template="plotly_dark",
    hovermode='closest'
)

fig.show(renderer='json' if os.getenv('CI') else None)
5. Structure Explorer¶
Use the slider to browse individual structures and see their corresponding distance maps (how the AI "sees" them).
import os

IN_CI = bool(os.getenv("CI"))

if not IN_CI:
    # Output widget for clean updates
    out = widgets.Output()

    # Slider
    slider = IntSlider(min=0, max=n_batch - 1, step=1, value=0,
                       description='Structure:',
                       layout=widgets.Layout(width='500px'))

    # Track initialization
    _initializing = True

    def view_from_latent(change=None):
        global _initializing
        if _initializing and change is not None:
            return
        index = slider.value
        coords = batch.coords[index]

        # Create structure
        pgen = PeptideGenerator(sequence)
        res = pgen.generate()
        if res.structure.array_length() == coords.shape[0]:
            res.structure.coord = coords

        with out:
            clear_output(wait=True)
            print(f'Structure {index} | PC1={latent_points[index, 0]:.2f}, PC2={latent_points[index, 1]:.2f}\n')

            # 3D viewer
            view = py3Dmol.view(width=500, height=400)
            view.addModel(res.pdb, "pdb")
            view.setStyle({'stick': {'colorscheme': 'chainHetatm'}})
            view.setBackgroundColor('#1a1a1a')
            view.zoomTo()
            display(view.show())

            # Distance map
            fig, ax = plt.subplots(1, 1, figsize=(4, 4))
            im = ax.imshow(orients['dist'][index], cmap='magma', vmin=0, vmax=20)
            ax.set_title("Distance Map (AI View)", color='white')
            ax.set_xlabel("Residue", color='white')
            ax.set_ylabel("Residue", color='white')
            ax.tick_params(colors='white')
            fig.patch.set_facecolor('#1a1a1a')
            ax.set_facecolor('#1a1a1a')
            plt.colorbar(im, ax=ax, label='Distance (Å)')
            plt.tight_layout()
            plt.show()
            plt.close(fig)  # Important: close figure to prevent memory leak

    # Connect slider
    slider.observe(view_from_latent, 'value')

    # Display UI
    display(widgets.VBox([slider, out]))

    # Initialize
    _initializing = False

    # 3Dmol.js initializes asynchronously in the browser. Calling view_from_latent()
    # at kernel execution time races with the library bootstrap and produces
    # the 'failed to load' error. Show a placeholder instead; the first
    # slider change fires view_from_latent() after 3Dmol.js is fully ready.
    with out:
        from IPython.display import HTML as _HTML
        from IPython.display import display as _disp
        _disp(_HTML(
            '<div style="text-align:center;padding:40px;color:#aaa;'
            'border:1px dashed #555;border-radius:8px;'
            'font-style:italic;background:#1a1a1a;">'
            '🔄 Move the slider above to load the 3D structure'
            '</div>'
        ))
# (Widget output skipped in CI)
🎓 Key Insights¶
- Geometric Representation: AI models use inter-residue geometry (distances + orientations) instead of raw coordinates
- Rotation Invariance: 6D orientograms are invariant to rotation and translation
- Latent Space: PCA reveals the "principal modes" of structural variation
- Dimensionality: 196D → 2D while preserving most of the variance (the exact fraction is printed by the PCA cell above)
📖 Further Reading¶
Protein Structure Prediction:
- Jumper et al. (2021). "Highly accurate protein structure prediction with AlphaFold." Nature 596:583-589. DOI: 10.1038/s41586-021-03819-2
- Yang et al. (2020). "Improved protein structure prediction using predicted interresidue orientations." PNAS 117:1496-1503. DOI: 10.1073/pnas.1914677117
Protein Representation Learning:
- Rao et al. (2019). "Evaluating protein transfer learning with TAPE." NeurIPS 2019. arXiv:1906.08230
- Greener et al. (2018). "Design of metalloproteins and novel protein folds using variational autoencoders." Sci Rep 8:16189. DOI: 10.1038/s41598-018-34533-1
Dimensionality Reduction:
- Pearson, K. (1901). "On lines and planes of closest fit to systems of points in space." Phil Mag 2:559-572.
- van der Maaten & Hinton (2008). "Visualizing data using t-SNE." JMLR 9:2579-2605.
🎉 Exploration Complete!
You've mastered protein latent space visualization!