🌌 AI Latent Space Explorer¶
Visualizing How Protein AI Models "See" Structural Diversity¶
🎯 What You'll Learn¶
Protein folding AI models like trRosetta and the original AlphaFold don't predict 3D coordinates directly. Instead, they:
- Predict inter-residue geometry (distances and orientations)
- Use these 2D "maps" to reconstruct 3D structure
In this tutorial:
- 🚀 Generate 500 protein conformations in parallel using BatchedGenerator
- 📐 Compute 6D trRosetta orientograms (the AI's "view" of structure)
- 🎨 Use PCA to visualize the high-dimensional "latent space" in 2D
- 🔍 Explore individual structures from the latent space galaxy
💡 Why This Matters: Understanding how AI models represent proteins is crucial for:
- Protein structure prediction
- Generative protein design
- Transfer learning in structural biology
# 🔧 Environment Detection & Setup
import os
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('🌐 Running in Google Colab')
    try:
        import synth_pdb
        print('   ✅ synth-pdb already installed')
    except ImportError:
        print('   📦 Installing synth-pdb...')
        !pip install -q synth-pdb py3Dmol biotite
        print('   ✅ Installation complete')
    import plotly.io as pio
    pio.renderers.default = 'colab'
else:
    print('💻 Running in local Jupyter environment')
    sys.path.append(os.path.abspath('../../'))

print('✅ Environment configured!')
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
import plotly.graph_objects as go
import py3Dmol
from IPython.display import HTML, clear_output, display
from ipywidgets import IntSlider
from sklearn.decomposition import PCA
from synth_pdb import PeptideGenerator
from synth_pdb.batch_generator import BatchedGenerator
print('✅ Latent Space Explorer Ready!')
📚 Theoretical Foundation¶
Protein Representation Learning¶
Why not use Cartesian coordinates directly?
Cartesian coordinates (x, y, z) have several problems:
- Not rotation-invariant: Same structure, different orientation = different coordinates
- Not translation-invariant: Same structure, different position = different coordinates
- High dimensional: N atoms × 3 coordinates = 3N dimensions
Solution: Inter-residue Geometry
Instead, AI models use pairwise geometric relationships between residues:
| Feature | Symbol | Description | Range |
|---|---|---|---|
| Distance | d | Cβ-Cβ distance | 0-20 Å |
| Omega | ω | Dihedral Cα-Cβ-Cβ-Cα (symmetric) | -180° to +180° |
| Theta | θ | Dihedral N-Cα-Cβ-Cβ (asymmetric) | -180° to +180° |
| Phi | φ | Planar angle Cα-Cβ-Cβ (asymmetric) | 0° to 180° |
Because θ and φ depend on which residue of the pair comes first, they are computed in both directions. Each residue pair therefore contributes six values ($d$, $\omega$, $\theta_{ij}$, $\theta_{ji}$, $\varphi_{ij}$, $\varphi_{ji}$): the 6D orientogram.
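The distance component of the orientogram is simple enough to sketch directly. Below is a minimal illustration (not the library's implementation) that computes a pairwise Cβ-Cβ distance map from a hypothetical (N, 3) coordinate array using NumPy broadcasting:

```python
import numpy as np

def distance_map(cb: np.ndarray) -> np.ndarray:
    """Pairwise Cβ-Cβ distances for an (N, 3) coordinate array."""
    diff = cb[:, None, :] - cb[None, :, :]   # (N, N, 3) displacement vectors
    return np.linalg.norm(diff, axis=-1)     # (N, N) symmetric distance matrix

# Toy example: 3 residues on a straight line, 3.8 Å apart
cb = np.array([[0.0, 0.0, 0.0],
               [3.8, 0.0, 0.0],
               [7.6, 0.0, 0.0]])
d = distance_map(cb)
```

The resulting matrix is symmetric with a zero diagonal, which is exactly the structure of the `dist` feature computed later in this tutorial.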
Dimensionality Reduction: PCA¶
Principal Component Analysis finds the directions of maximum variance in high-dimensional data.
Given data matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ (n samples, d features):
1. Center the data: $\mathbf{X}_{centered} = \mathbf{X} - \bar{\mathbf{X}}$
2. Compute covariance: $\mathbf{C} = \frac{1}{n-1}\mathbf{X}_{centered}^T \mathbf{X}_{centered}$
3. Eigendecomposition: $\mathbf{C} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^T$
4. Project: $\mathbf{Z} = \mathbf{X}_{centered}\mathbf{V}_k$ (keep the top $k$ eigenvectors)
Why PCA for proteins?
- Reveals the "principal modes" of structural variation
- Reduces ~1000 dimensions to 2-3 for visualization
- Preserves maximum variance (information)
🔬 Alternative: t-SNE preserves local structure better but is non-linear and slower
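The four steps above can be sketched in a few lines of NumPy. This is an illustrative re-derivation on synthetic data, not a replacement for sklearn's `PCA`, which the tutorial uses later:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))  # correlated features

# 1. Center the data
Xc = X - X.mean(axis=0)
# 2. Covariance matrix (unbiased, divide by n-1)
C = Xc.T @ Xc / (X.shape[0] - 1)
# 3. Eigendecomposition (eigh: C is symmetric; eigenvalues come back ascending)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]   # sort descending by variance
Vk = eigvecs[:, order[:2]]          # keep the top-2 eigenvectors
# 4. Project into the 2D latent space
Z = Xc @ Vk                         # (200, 2)

var_ratio = eigvals[order[:2]] / eigvals.sum()
```

A useful sanity check: the variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalue ratio doubles as "variance explained".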
1. Parallel Structure Generation¶
We'll generate 500 diverse conformations of a 7-residue peptide using BatchedGenerator, which uses vectorized NumPy operations for speed.
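For intuition, here is a minimal single-atom sketch of the NeRF (Natural Extension Reference Frame) idea: each new atom is placed from the three preceding atoms using a bond length, bond angle, and torsion. The real `BatchedGenerator` vectorizes this over whole batches; the scalar function below is only an illustration of the geometry:

```python
import numpy as np

def nerf_place(a, b, c, r, theta, phi):
    """Place atom d given the three preceding atoms a, b, c,
    bond length r (c-d), bond angle theta (b-c-d), and
    torsion phi (a-b-c-d), angles in radians."""
    # Coordinates of d in the local frame defined by a, b, c
    d_local = r * np.array([-np.cos(theta),
                            np.sin(theta) * np.cos(phi),
                            np.sin(theta) * np.sin(phi)])
    bc = c - b
    bc /= np.linalg.norm(bc)
    n = np.cross(b - a, bc)            # normal to the a-b-c plane
    n /= np.linalg.norm(n)
    m = np.cross(n, bc)
    M = np.stack([bc, m, n], axis=1)   # rotation matrix: local frame axes as columns
    return M @ d_local + c

# Place an atom at 1 Å from c, 90° bond angle, 0° torsion
a = np.array([0.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 0.0])
c = np.array([1.0, 0.0, 0.0])
d = nerf_place(a, b, c, r=1.0, theta=np.deg2rad(90), phi=0.0)
```

Chaining this placement along the backbone, with φ/ψ torsions drawn per structure, is what lets the generator sample many conformations of one sequence.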
sequence = "TRP-SER-GLY-ALA-VAL-PRO-ILE"
n_batch = 500
display(HTML(f"""
<div style='background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white; padding: 15px; border-radius: 10px;
font-family: monospace; margin-bottom: 15px;'>
<b>🚀 Batch Generation</b><br>
Sequence: {sequence}<br>
Structures: {n_batch}<br>
Method: Vectorized NeRF algorithm
</div>
"""))
# drift=20.0 adds Gaussian noise (σ=20°) to backbone φ/ψ angles so each
# structure samples a *different* conformation. Without drift every replica is
# identical, all feature vectors are the same, and PCA gives PC1=PC2=0 for all.
# We mix α-helix and extended conformations (250 each) so PCA shows two
# well-separated clusters — a much more instructive latent space.
n_alpha = n_batch // 2
n_extended = n_batch - n_alpha
print(f"Generating {n_alpha} α-helix + {n_extended} extended structures (drift=20°)...")
bg_alpha = BatchedGenerator(sequence, n_batch=n_alpha)
bg_ext = BatchedGenerator(sequence, n_batch=n_extended)
batch_alpha = bg_alpha.generate_batch(conformation='alpha', drift=20.0, seed=42)
batch_extended = bg_ext.generate_batch(conformation='extended', drift=20.0, seed=99)
class _CombinedBatch:
    """Thin wrapper that concatenates two BatchedPeptide objects along the batch axis."""

    def __init__(self, a, b):
        self.coords = np.concatenate([a.coords, b.coords], axis=0)
        self._a, self._b = a, b

    def get_6d_orientations(self):
        oa = self._a.get_6d_orientations()
        ob = self._b.get_6d_orientations()
        return {k: np.concatenate([oa[k], ob[k]], axis=0) for k in oa}

batch = _CombinedBatch(batch_alpha, batch_extended)
print(f"✅ Generated {batch.coords.shape[0]} structures")
print(f" Shape: {batch.coords.shape} (Batch, Atoms, XYZ)")
2. Computing 6D Orientograms¶
For every pair of residues in every protein, we calculate the 6D geometric relationship. This is exactly what trRosetta predicts before reconstructing 3D structure.
The 6D representation:
- Distance (d): Cβ-Cβ distance
- Omega (ω): inter-residue dihedral, symmetric in i and j
- Theta (θ): dihedral measured from each residue's backbone frame, asymmetric (computed in both directions)
- Phi (φ): planar angle, asymmetric (computed in both directions)
Together these give six values per residue pair and create a rotation- and translation-invariant representation of structure.
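The invariance claim is easy to verify numerically: applying a random rigid-body transform (rotation plus translation) to a set of coordinates leaves the pairwise distance map unchanged. A small sketch with toy coordinates:

```python
import numpy as np

rng = np.random.default_rng(1)
coords = rng.normal(size=(7, 3)) * 5.0       # toy "Cβ" coordinates for 7 residues

def dist_map(x):
    return np.linalg.norm(x[:, None] - x[None, :], axis=-1)

# Random orthogonal matrix via QR decomposition (may include a reflection;
# pairwise distances are invariant either way)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
t = rng.normal(size=3) * 10.0                # random translation

moved = coords @ Q.T + t                     # rigid-body transform
```

The same holds for the angular features, since they are defined relative to each residue's local frame rather than to the global axes.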
print("Computing 6D orientograms...")
orients = batch.get_6d_orientations()
print(f"✅ Computed orientations for {n_batch} structures")
print(f" Available features: {list(orients.keys())}")
print(f" Shape per feature: {orients['dist'].shape} (Batch, Residues, Residues)")
3. Dimensionality Reduction with PCA¶
We flatten the 2D geometry maps into high-dimensional feature vectors and use PCA to project them into 2D for visualization.
Feature vector construction:
- Distance maps: 7×7 = 49 features
- Omega maps: 7×7 = 49 features
- Theta maps: 7×7 = 49 features
- Phi maps: 7×7 = 49 features
- Total: 196 dimensions → 2 dimensions via PCA
# Flatten all features into one vector per structure
feature_vector = np.concatenate([
    orients['dist'].reshape(n_batch, -1),
    orients['omega'].reshape(n_batch, -1),
    orients['theta'].reshape(n_batch, -1),
    orients['phi'].reshape(n_batch, -1)
], axis=1)
print(f"Feature vector dimensionality: {feature_vector.shape[1]}")
# Apply PCA
pca = PCA(n_components=2)
latent_points = pca.fit_transform(feature_vector)
# Show variance explained
var_explained = pca.explained_variance_ratio_
print("\n✅ PCA complete")
print(f" PC1 variance: {var_explained[0]:.1%}")
print(f" PC2 variance: {var_explained[1]:.1%}")
print(f" Total variance captured: {var_explained.sum():.1%}")
4. Interactive Latent Space Galaxy¶
Each point represents one protein conformation. Points close together have similar geometric properties. This is the "latent space" - a compressed representation of structural diversity.
import os
fig = go.Figure(data=[go.Scatter(
    x=latent_points[:, 0],
    y=latent_points[:, 1],
    mode='markers',
    marker={
        'size': 8,
        'color': np.arange(n_batch),
        'colorscale': 'Viridis',
        'showscale': True,
        'colorbar': {'title': 'Structure ID'},
        'line': {'width': 0.5, 'color': 'white'}
    },
    text=[f"Structure {i}" for i in range(n_batch)],
    hovertemplate='%{text}<br>PC1: %{x:.2f}<br>PC2: %{y:.2f}<extra></extra>'
)])

fig.update_layout(
    title={
        'text': f'Protein Latent Space (PCA of 6D Orientograms)<br><sub>Variance explained: PC1={var_explained[0]:.1%}, PC2={var_explained[1]:.1%}</sub>',
        'x': 0.5,
        'xanchor': 'center'
    },
    xaxis_title='Principal Component 1',
    yaxis_title='Principal Component 2',
    width=900, height=600,
    template="plotly_dark",
    hovermode='closest'
)

fig.show(renderer='json' if os.getenv('CI') else None)
5. Structure Explorer¶
Use the slider to browse individual structures and see their corresponding distance maps (how the AI "sees" them).
import os

IN_CI = bool(os.getenv("CI"))

if not IN_CI:
    # Output widget for clean updates
    out = widgets.Output()

    # Slider
    slider = IntSlider(min=0, max=n_batch - 1, step=1, value=0,
                       description='Structure:',
                       layout=widgets.Layout(width='500px'))

    # Track initialization
    _initializing = True

    def view_from_latent(change=None):
        global _initializing
        if _initializing and change is not None:
            return
        index = slider.value
        coords = batch.coords[index]

        # Create structure
        pgen = PeptideGenerator(sequence)
        res = pgen.generate()
        if res.structure.array_length() == coords.shape[0]:
            res.structure.coord = coords

        with out:
            clear_output(wait=True)
            print(f'Structure {index} | PC1={latent_points[index, 0]:.2f}, PC2={latent_points[index, 1]:.2f}\n')

            # 3D viewer
            view = py3Dmol.view(width=500, height=400)
            view.addModel(res.pdb, "pdb")
            view.setStyle({'stick': {'colorscheme': 'chainHetatm'}})
            view.setBackgroundColor('#1a1a1a')
            view.zoomTo()
            display(view.show())

            # Distance map
            fig, ax = plt.subplots(1, 1, figsize=(4, 4))
            im = ax.imshow(orients['dist'][index], cmap='magma', vmin=0, vmax=20)
            ax.set_title("Distance Map (AI View)", color='white')
            ax.set_xlabel("Residue", color='white')
            ax.set_ylabel("Residue", color='white')
            ax.tick_params(colors='white')
            fig.patch.set_facecolor('#1a1a1a')
            ax.set_facecolor('#1a1a1a')
            plt.colorbar(im, ax=ax, label='Distance (Å)')
            plt.tight_layout()
            plt.show()
            plt.close(fig)  # Important: close figure to prevent memory leak

    # Connect slider
    slider.observe(view_from_latent, 'value')

    # Display UI
    display(widgets.VBox([slider, out]))

    # Initialize
    _initializing = False

    # 3Dmol.js initializes asynchronously in the browser. Calling view_from_latent()
    # at kernel execution time races with the library bootstrap and produces
    # the 'failed to load' error. Show a placeholder instead; the first
    # slider change fires view_from_latent() after 3Dmol.js is fully ready.
    with out:
        from IPython.display import HTML as _HTML
        from IPython.display import display as _disp
        _disp(_HTML(
            '<div style="text-align:center;padding:40px;color:#aaa;'
            'border:1px dashed #555;border-radius:8px;'
            'font-style:italic;background:#1a1a1a;">'
            '🔄 Move the slider above to load the 3D structure'
            '</div>'
        ))
# (Widget output skipped in CI)
🎓 Key Insights¶
- Geometric Representation: AI models use inter-residue geometry (distances + orientations) instead of raw coordinates
- Rotation Invariance: 6D orientograms are invariant to rotation and translation
- Latent Space: PCA reveals the "principal modes" of structural variation
- Dimensionality: 196D → 2D while preserving most of the variance (the exact fraction is printed by the PCA cell above)
📖 Further Reading¶
Protein Structure Prediction:
- Jumper et al. (2021). "Highly accurate protein structure prediction with AlphaFold." Nature 596:583-589. DOI: 10.1038/s41586-021-03819-2
- Yang et al. (2020). "Improved protein structure prediction using predicted interresidue orientations." PNAS 117:1496-1503. DOI: 10.1073/pnas.1914677117
Protein Representation Learning:
- Rao et al. (2019). "Evaluating protein transfer learning with TAPE." NeurIPS 2019. arXiv:1906.08230
- Greener et al. (2018). "Design of metalloproteins and novel protein folds using variational autoencoders." Sci Rep 8:16189. DOI: 10.1038/s41598-018-34533-1
Dimensionality Reduction:
- Pearson, K. (1901). "On lines and planes of closest fit to systems of points in space." Phil Mag 2:559-572.
- van der Maaten & Hinton (2008). "Visualizing data using t-SNE." JMLR 9:2579-2605.
🎉 Exploration Complete!
You've mastered protein latent space visualization!