
📐 Geometry Factory: The trRosetta 6D Orientogram ⚛️¶
Objective: Understand the "Inter-Residue Reference Frame" and how we translate 3D protein folds into 6-dimensional mathematical signatures for AI training.
🌟 Why 1D distances aren't enough?¶
Traditional structural models (like early GNNs) often relied solely on Distance Maps. While helpful, distance only tells you how close two points are; it doesn't tell you their relative "twist" or orientation.
🌟 The Philosophy: From Distances to Frames¶
The trRosetta (transform-restrained Rosetta) paper advanced AI structure prediction by predicting, in addition to distances, the relative orientation between every pair of residues, for six values per pair in total. This lets models learn the 3D assembly of helices and sheets with much higher precision than distances alone.
Imagine trying to describe a dance to someone. If you only tell them the distance between the dancers' feet, they can't see the full performance. They don't know if the dancers are facing each other, looking away, or leaning in. They are missing the Orientations.
In structural biology, residues (amino acids) are like those dancers.
- Early AI (Coarse-Grained): Used simple Distance Maps ($L \times L$). They knew where the residues were, but not how they "faced" each other.
- Modern AI (Frame-Based): Uses 6D Orientations. Every residue is treated as a "Rigid Body" or local coordinate frame. By measuring the 6 relative values between every pair of residues, we capture the full 3D assembly with mathematical completeness.
This notebook demonstrates how synth-pdb generates these advanced descriptors, which underpin trRosetta and are closely related to the per-residue frames used in AlphaFold.
# @title Setup & Installation { display-mode: "form" }
import os
import sys
from pathlib import Path
try:
    current_path = Path(".").resolve()
    repo_root = current_path.parent.parent
    if (repo_root / "synth_pdb").exists():
        if str(repo_root) not in sys.path:
            sys.path.insert(0, str(repo_root))
            print(f"📌 Added local library to path: {repo_root}")
except Exception:
    pass
if 'google.colab' in str(get_ipython()):
    if not os.path.exists("installed.marker"):
        print("Running on Google Colab. Installing dependencies...")
        get_ipython().run_line_magic('pip', 'install synth-pdb numpy matplotlib py3Dmol biotite')
        with open("installed.marker", "w") as f:
            f.write("done")
        print("🔄 Installation complete. KERNEL RESTARTING AUTOMATICALLY...")
        os.kill(os.getpid(), 9)
    else:
        print("✅ Dependencies Ready.")
else:
    import synth_pdb
    print(f"✅ Running locally. Using synth-pdb version: {synth_pdb.__version__}")
import matplotlib.pyplot as plt
from synth_pdb.batch_generator import BatchedGenerator
print("Geometric kernels loaded. Ready to compute orientograms. 📐")
1. Defining the 6D Descriptors¶
For any two residues $i$ and $j$, we define the orientation of residue $j$ relative to $i$ using their $C\alpha$ and $C\beta$ positions (plus $N$ to fix the rotation). We calculate 4 primary tensors:
1. $d$ (Distance): The straight-line distance between $C\beta_i$ and $C\beta_j$. This is the foundation of the "Contact Map".
2. $\omega$ (Omega): The dihedral (twist) angle $C\alpha_i - C\beta_i - C\beta_j - C\alpha_j$. It tells us how the backbones of the two residues are rotated relative to each other.
3. $\theta$ (Theta): The plane angle $C\alpha_i - C\beta_i - C\beta_j$. It describes how residue $i$ "looks at" residue $j$.
4. $\phi$ (Phi): The polar dihedral $N_i - C\alpha_i - C\beta_i - C\beta_j$. It anchors the orientation to the backbone's local coordinate system.
Because $\theta$ and $\phi$ depend on which residue is the source, each pair contributes $\theta_{ij}$, $\theta_{ji}$, $\phi_{ij}$ and $\phi_{ji}$; together with $d$ and $\omega$ this gives the six values behind the "6D" name. Naming note: the original trRosetta paper assigns the two labels the other way around ($\theta$ is the dihedral, $\phi$ the planar angle); the descriptions here follow the labels used in this notebook's plots.
Next, we generate a peptide and analyze its geometric footprint.
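To make the four definitions above concrete, here is a minimal NumPy sketch that computes them for a single ordered pair of residues. The helper names (`angle_deg`, `dihedral_deg`, `pair_descriptors`) are hypothetical and are not part of synth-pdb; the sign convention of the dihedral is an assumption and may differ from what the library returns.

```python
import numpy as np

def angle_deg(a, b, c):
    """Planar angle a-b-c in degrees, with the vertex at b."""
    u, v = a - b, c - b
    cos_t = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))))

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral p0-p1-p2-p3 in degrees, in the range [-180, 180]."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Project the outer bonds onto the plane perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))

def pair_descriptors(n_i, ca_i, cb_i, ca_j, cb_j):
    """d, omega, theta, phi for the ordered pair (i, j), as defined in the list above."""
    return {
        "dist": float(np.linalg.norm(cb_j - cb_i)),
        "omega": dihedral_deg(ca_i, cb_i, cb_j, ca_j),
        "theta": angle_deg(ca_i, cb_i, cb_j),
        "phi": dihedral_deg(n_i, ca_i, cb_i, cb_j),
    }

# Toy coordinates (in Å) chosen so the result is easy to check by hand;
# they are not a realistic residue geometry.
n_i = np.array([1.0, 0.0, 1.0])
ca_i = np.array([1.0, 0.0, 0.0])
cb_i = np.array([0.0, 0.0, 0.0])
ca_j = np.array([0.0, 1.0, 4.0])
cb_j = np.array([0.0, 0.0, 4.0])

desc = pair_descriptors(n_i, ca_i, cb_i, ca_j, cb_j)
print(desc)  # dist = 4, omega = 90, theta = 90, phi = 0 for these toy points
```

Calling the same function with the roles of $i$ and $j$ swapped (which also needs $N_j$) yields the remaining two of the six values.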
Let's generate a Beta Sheet fold, where these angles are particularly well-defined and structured.
sequence = "-".join(["ALA-VAL-LEU-ILE-SER-GLY-MET-TRP"] * 4) # 32 residues; join keeps the hyphen between repeats
generator = BatchedGenerator(sequence, n_batch=1, full_atom=False) # Backbone only
batch = generator.generate_batch(conformation='beta') # Beta sheets have distinctive 6D signals
print("Structure Batch Generated.")
print("Computing 6D Orientations...")
orientations = batch.get_6d_orientations()
print(f"Orientations computed for {batch.n_residues} residues.")
print(f"Tensors available: {list(orientations.keys())}")
2. Visualizing the Orientogram¶
Let's look at the 4 primary descriptors for our batch member 0:
- Distance ($d$): $C\beta - C\beta$ Euclidean distance.
- $\omega$: Dihedral $C\alpha_i - C\beta_i - C\beta_j - C\alpha_j$ about the $C\beta - C\beta$ axis.
- $\theta$: Plane angle $C\alpha_i - C\beta_i - C\beta_j$.
- $\phi$: Dihedral $N_i - C\alpha_i - C\beta_i - C\beta_j$.
Below we plot the four tensors as $L \times L$ heatmaps.
Educational Insight: Note how the 6D tensors capture the diagonal structure of the fold differently than a simple distance map.
How to read these "Images":
- Symmetry: $d$ and $\omega$ are symmetric ($d_{ij} = d_{ji}$), but $\theta$ and $\phi$ are not! Both are defined relative to the "source" residue $i$, so swapping $i$ and $j$ generally changes the value.
- Regularity: The dashed patterns you see reflect the regular geometry of the fold. Beta sheets create rhythmic, staggered patterns in these maps because of the alternating "up-down" nature of the amino acid sidechains in a sheet.
- AI Readiness: For a Computer Vision model (like a CNN), these are 4 "channels" (like Red, Green, Blue) that together give a compact, image-like description of the structure.
fig, axes = plt.subplots(2, 2, figsize=(12, 11))
plt.subplots_adjust(hspace=0.3, wspace=0.2)
# Using raw strings (r"...") to ensure Python 3.12 compatibility with LaTeX
titles = {
    'dist': r'A. Distance Map (C-beta) ($d$) [$\AA$]',
    'omega': r'B. Omega Dihedral (Torsion) ($\omega$) [$^\circ$]',
    'theta': r'C. Theta Angle (Plane) ($\theta$) [$^\circ$]',
    'phi': r'D. Phi Dihedral (Polar) ($\phi$) [$^\circ$]'
}
cmaps = {'dist': 'viridis_r', 'omega': 'hsv', 'theta': 'magma', 'phi': 'twilight'}
for i, key in enumerate(['dist', 'omega', 'theta', 'phi']):
    ax = axes[i // 2, i % 2]
    data = orientations[key][0] # First batch member
    if key == 'dist':
        im = ax.imshow(data, cmap=cmaps[key], vmax=15.0) # Cap distance for visual clarity
    else:
        # Angular values wrap from -180 to 180
        im = ax.imshow(data, cmap=cmaps[key], vmin=-180, vmax=180)
    ax.set_title(titles[key], fontweight='bold')
    fig.colorbar(im, ax=ax, shrink=0.8)
plt.suptitle("The 6D Orientogram: A 'Computer Vision' View of Protein Structure", fontsize=16, y=0.95)
plt.show()
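The symmetry remark in the reading guide above can be checked numerically. The sketch below builds a distance map and a $\theta$ map from random toy coordinates (not a real protein) and compares each with its transpose. `toy_maps` is a hypothetical helper, not a synth-pdb function, and it uses the plane-angle definition of $\theta$ given in Section 1.

```python
import numpy as np

def toy_maps(ca, cb):
    """Return (d, theta) maps in Å and degrees for (L, 3) arrays of Cα and Cβ."""
    # vec[i, j] points from Cβ_i to Cβ_j
    vec = cb[None, :, :] - cb[:, None, :]
    d = np.linalg.norm(vec, axis=-1)
    # Direction from Cβ_i back to Cα_i, broadcast over all partners j
    back = (ca - cb)[:, None, :]
    dot = np.sum(back * vec, axis=-1)
    denom = np.linalg.norm(back, axis=-1) * d
    # The diagonal has zero-length vectors; set its cosine to 1 (angle 0)
    with np.errstate(invalid="ignore", divide="ignore"):
        cos_t = np.where(denom > 0, dot / denom, 1.0)
    theta = np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))
    return d, theta

rng = np.random.default_rng(0)
ca = rng.normal(size=(6, 3)) * 5.0          # six toy "residues"
cb = ca + rng.normal(size=(6, 3))           # Cβ placed near each Cα
d, theta = toy_maps(ca, cb)

print("d symmetric:    ", np.allclose(d, d.T))
print("theta symmetric:", np.allclose(theta, theta.T))
```

The same transpose comparison can be applied to the arrays in the notebook's orientations dictionary, assuming they behave like NumPy arrays of shape (batch, L, L) as the indexing in the plotting cell suggests.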
3. Handling the "Invisible" Residue: Glycine¶
In 6D geometry, you must have a $C\beta$ atom to define the residue's orientation frame. But there's a problem: Glycine (GLY) has no $C\beta$! It is the only amino acid whose sidechain is just a single Hydrogen atom.
However, AI models require a consistent $C\beta$ node for every residue to maintain a rigid frame.
The Fix: Virtual Reconstruction¶
Data pipelines for AI models solve this by reconstructing a Virtual C-beta. Even though it's not physically there in Glycine, we can calculate where it would be if Glycine were an L-Alanine.
synth-pdb automatically reconstructs the "Ideal L-Alanine Position" for any Glycine in your sequence, ensuring your tensors are compatible with model requirements. It uses the positions of $N$, $C\alpha$, and $C$ to "project" the virtual $C\beta$ into space using ideal geometry. This ensures your data tensors are always contiguous and complete, even for highly flexible Glycine-rich loops.
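As an illustration of the projection described above, here is a minimal sketch of a formula that is widely used in structure-prediction preprocessing to place an idealized $C\beta$ from $N$, $C\alpha$ and $C$. Whether synth-pdb uses these exact coefficients is an assumption; `virtual_cbeta` is a hypothetical helper name, and the backbone below is an idealized toy geometry.

```python
import numpy as np

def virtual_cbeta(n, ca, c):
    """Place an idealized Cβ from backbone N, Cα and C coordinates (in Å)."""
    b = ca - n                 # N -> Cα bond vector
    c_vec = c - ca             # Cα -> C bond vector
    a = np.cross(b, c_vec)     # normal to the N-Cα-C plane
    # Empirical coefficients commonly used for ideal L-amino-acid geometry
    return -0.58273431 * a + 0.56802827 * b - 0.54067466 * c_vec + ca

# Idealized backbone: Cα at the origin, N-Cα = 1.458 Å, Cα-C = 1.525 Å,
# N-Cα-C angle = 111 degrees.
ca = np.zeros(3)
n = np.array([1.458, 0.0, 0.0])
angle = np.radians(111.0)
c = 1.525 * np.array([np.cos(angle), np.sin(angle), 0.0])

cb = virtual_cbeta(n, ca, c)
print("virtual Cβ:", cb)
print("Cα-Cβ distance (Å):", np.linalg.norm(cb - ca))
```

For a backbone with ideal bond lengths and angles, the resulting $C\alpha - C\beta$ distance should come out close to the usual single-bond length of about 1.5 Å, which is a quick sanity check on any reconstruction.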
gly_res_idx = [i for i, r in enumerate(batch.sequence) if r == "GLY"]
print(f"Analyzing Glycine at indices: {gly_res_idx}")
# Look at distance to neighboring residues for a GLY entry
for idx in gly_res_idx[:1]:
    dist_row = orientations['dist'][0, idx, :]
    print(f"\n🔎 Virtual C-beta mapping for GLY {idx+1}:")
    print(f"Distances to neighbors: {dist_row[max(0, idx-2):min(idx+3, len(dist_row))]}")
print("Note how the values are consistent with the rest of the chain!")
print("✅ Virtual C-beta mapping successful.")
4. Why does this exist?¶
This pipeline exists because predicting 3D coordinates directly is hard, whereas predicting 2D tensors is a fast, image-like task.
- Training AI: We generate millions of such tensors from synthetic PDBs. The AI learns the "Language" of these heatmaps.
- Prediction: When we give the AI a new sequence, it predicts these heatmaps.
- Recovery: We then use a process called "Minimization" or "Folding" to reconstruct the 3D structure that best fits those predicted 6D heatmaps.
By providing these descriptors, synth-pdb allows you to bench-test the entire lifecycle of an AI model, from data production to descriptor analysis.
🏆 Experiment for the User¶
- Ensemble Variance: Generate a batch with drift=10.0 and plot the Standard Deviation of the distance maps. 📉
- Feature Engineering: Standardize these tensors (e.g. log(dist)) to prepare them as direct inputs for a Convolutional Neural Network (CNN) classifier.
Try generating structures with conformation='alpha' instead of 'beta'.
Predict: How will the distance map change? (Hint: Alpha helices stay closer to their immediate neighbors, creating a thick diagonal line!).
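As a starting point for the two experiments above, the sketch below works on synthetic random arrays rather than real synth-pdb output. It assumes the maps have shape (batch, L, L) with angles in degrees. The function names, the small `eps` offset, and the sine/cosine encoding of the angles are illustrative choices, not part of the library.

```python
import numpy as np

def ensemble_std(dist):
    """Per-pair standard deviation across the batch axis of a (B, L, L) array."""
    return dist.std(axis=0)

def to_cnn_channels(dist, omega_deg, theta_deg, phi_deg, eps=1e-3):
    """Stack one structure's (L, L) maps into a (7, L, L) float32 array."""
    channels = [np.log(dist + eps)]            # eps keeps the zero diagonal finite
    for angle_deg in (omega_deg, theta_deg, phi_deg):
        rad = np.radians(angle_deg)
        # sin/cos encoding makes -180 and +180 map to the same features
        channels.extend([np.sin(rad), np.cos(rad)])
    return np.stack(channels).astype(np.float32)

# Synthetic stand-in data: 4 structures of 5 residues each
rng = np.random.default_rng(1)
B, L = 4, 5
dist = np.abs(rng.normal(8.0, 2.0, size=(B, L, L)))
omega = rng.uniform(-180, 180, size=(B, L, L))
theta = rng.uniform(0, 180, size=(B, L, L))
phi = rng.uniform(-180, 180, size=(B, L, L))

std_map = ensemble_std(dist)
features = to_cnn_channels(dist[0], omega[0], theta[0], phi[0])
print(std_map.shape, features.shape)
```

The standard-deviation map can be passed straight to the same imshow call used for the distance map earlier in the notebook.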
You are now extracting the rich 3D information that powers modern structural AI. Happy building. 📐🤖
The structural signatures are yours to explore. 🧬📐🤖