
🧬 The Hard Decoy Challenge¶
Objective: Learn how to generate high-quality negative samples for training Protein AI models.
In the world of Protein AI (like AlphaFold-3 or RoseTTAFold), generating "good" structures is only half the battle. To train robust models, researchers need Hard Decoys—structures that look physically plausible (correct bond lengths, no overlaps) but are biologically or topologically incorrect.
Why do we need Hard Decoys?¶
- Teaching the Global Minimum: If a model only ever sees perfect structures, it won't know why they are better than slightly distorted ones.
- Improving Discriminators: To train a model to score protein quality, you need a balanced dataset of 'Natives' (Score 1.0) and 'Decoys' (Score 0.0).
- Robustness: Hard decoys test if a model is just memorizing patterns or actually understanding biophysics.
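The native/decoy labeling scheme above can be sketched without the library. Below is a minimal, hypothetical dataset builder (the mock coordinates and function name are illustrative, not the synth-pdb API): natives get score 1.0, perturbed copies get 0.0, and the result is balanced by construction.

```python
import numpy as np

def build_discriminator_labels(n_pairs, noise_scale=1.0, seed=0):
    """Pair 'native' (label 1.0) and 'decoy' (label 0.0) samples.

    Structures are mocked as flat coordinate vectors; in practice they
    would come from a generator such as synth-pdb.
    """
    rng = np.random.default_rng(seed)
    natives = rng.normal(0.0, 0.1, size=(n_pairs, 30))               # tight native cluster
    decoys = natives + rng.normal(0.0, noise_scale, natives.shape)   # perturbed copies
    X = np.concatenate([natives, decoys])
    y = np.concatenate([np.ones(n_pairs), np.zeros(n_pairs)])        # 1.0 = native, 0.0 = decoy
    return X, y

X, y = build_discriminator_labels(50)
print(X.shape, y.mean())  # (100, 30) 0.5 — half natives, half decoys
```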
⚠️ How to Run (Important!)¶
This notebook requires a specific environment setup. Follow these steps strictly:
- Run All Cells (Runtime -> Run all or Ctrl+F9).
- Wait for the Crash: on Colab, the setup cell will automatically restart the session to load libraries. This is normal.
- Local Users: if you are running locally after editing the library code, restart your kernel manually to ensure changes take effect.
- Wait 10 Seconds: allow the session to reconnect.
- Run All Cells AGAIN: this time the setup will detect it is ready ('✅ Dependencies Ready') and proceed normally.
# @title Setup & Installation { display-mode: "form" }
import os
import sys
from pathlib import Path
# Ensure the local synth_pdb source code is prioritized if running from the repo
try:
    current_path = Path(".").resolve()
    repo_root = current_path.parent.parent
    if (repo_root / "synth_pdb").exists():
        if str(repo_root) not in sys.path:
            sys.path.insert(0, str(repo_root))
            print(f"📌 Added local library to path: {repo_root}")
except Exception:
    pass

if 'google.colab' in str(get_ipython()):
    if not os.path.exists("installed.marker"):
        print("Running on Google Colab. Installing dependencies...")
        get_ipython().run_line_magic('pip', 'install synth-pdb py3Dmol')
        with open("installed.marker", "w") as f:
            f.write("done")
        print("🔄 Installation complete. KERNEL RESTARTING AUTOMATICALLY...")
        print("⚠️ Please wait 10 seconds, then Run All Cells again.")
        os.kill(os.getpid(), 9)
    else:
        print("✅ Dependencies Ready.")
else:
    import synth_pdb
    print(f"✅ Running locally. Using synth-pdb version: {synth_pdb.__version__} from {synth_pdb.__file__}")
import matplotlib.pyplot as plt
import numpy as np
import py3Dmol
from synth_pdb.batch_generator import BatchedGenerator, BatchedPeptide
print("Libraries Loaded.")
Strategy 1: Torsion Angle Drift (Conformational Noise)¶
Objective: Generate "Near-Native" decoys by adding controlled Gaussian noise to the ideal Ramachandran angles.
In AI training, we use the --drift parameter to test model sensitivity to backbone precision.
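Conceptually, torsion drift amounts to adding Gaussian noise (with sigma equal to the drift, in degrees) to the ideal backbone angles. A minimal sketch, assuming canonical alpha-helix angles of roughly -57°/-47°; this is illustrative, not the synth-pdb implementation:

```python
import numpy as np

IDEAL_HELIX_PHI, IDEAL_HELIX_PSI = -57.0, -47.0  # canonical alpha-helix torsions (degrees)

def drift_torsions(n_res, drift_deg, seed=0):
    """Return (phi, psi) arrays perturbed by Gaussian noise with sigma = drift_deg."""
    rng = np.random.default_rng(seed)
    phi = IDEAL_HELIX_PHI + rng.normal(0.0, drift_deg, n_res)
    psi = IDEAL_HELIX_PSI + rng.normal(0.0, drift_deg, n_res)
    return phi, psi

phi0, psi0 = drift_torsions(16, drift_deg=0.0)     # native: exactly ideal angles
phi15, psi15 = drift_torsions(16, drift_deg=15.0)  # hard decoy: noisy backbone
```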
# Generate an ensemble with increasing amounts of noise
sequence = "LEU-LYS-GLU-LEU-GLU-LYS-GLU-LEU-GLU-LYS-GLU-LEU-GLU-LYS-GLU-LEU" # Zipper fragment
generator = BatchedGenerator(sequence, n_batch=100, full_atom=True)
print("Generating Native (Drift = 0.0)...")
native = generator.generate_batch(drift=0.0)
print("Generating Hard Decoy (Drift = 15.0)...")
hard_decoy = generator.generate_batch(drift=15.0)
print("Generation Complete.")
📈 Visualization A: The Ramachandran Plot¶
A Ramachandran Plot maps the backbone torsion angles (Phi and Psi) for every residue.
- Natives cluster tightly in favored regions (e.g., the bottom-left for alpha-helices).
- Decoys leak into disallowed regions as the noise increases, breaking the "physical law" of protein folding.
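The extraction below delegates the torsion math to biotite, but the underlying four-point dihedral can be sketched in plain NumPy (the standard signed-torsion formula, shown here only for intuition):

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle (degrees) defined by four points."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond b1
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))

# Planar test cases: eclipsed (cis, 0 deg) vs. opposite sides (trans, 180 deg)
pts = [np.array(p, dtype=float) for p in [(1, 1, 0), (1, 0, 0), (0, 0, 0), (0, 1, 0)]]
print(dihedral(*pts))  # 0.0 — eclipsed arrangement
```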
def get_rama_angles(pdb_str):
    """Extract phi/psi angles using biotite."""
    from io import StringIO
    import biotite.structure as struc
    import biotite.structure.io.pdb as pdb
    text_file = StringIO(pdb_str)
    array = pdb.PDBFile.read(text_file).get_structure(model=1)
    # dihedral_backbone returns (phi, psi, omega) arrays
    phi, psi, omega = struc.dihedral_backbone(array)
    return np.degrees(phi), np.degrees(psi)
def plot_ramachandran(batch, title):
    all_phi = []
    all_psi = []
    # Sample 10 structures from the batch
    for i in range(min(10, batch.coords.shape[0])):
        phi, psi = get_rama_angles(batch.to_pdb(i))
        all_phi.extend(phi[~np.isnan(phi)])
        all_psi.extend(psi[~np.isnan(psi)])
    plt.figure(figsize=(6, 6))
    plt.scatter(all_phi, all_psi, alpha=0.5, s=10, color='#667eea')
    plt.xlim(-180, 180)
    plt.ylim(-180, 180)
    plt.axhline(0, color='grey', lw=1, alpha=0.3)
    plt.axvline(0, color='grey', lw=1, alpha=0.3)
    plt.title(f"Ramachandran: {title}")
    plt.xlabel("Phi (Φ)")
    plt.ylabel("Psi (Ψ)")
    plt.grid(alpha=0.2)
    plt.show()
plot_ramachandran(native, "Native (Ideal Alpha Helix)")
plot_ramachandran(hard_decoy, "Hard Decoy (15° Noise)")
🗺️ Visualization B: The Contact Map¶
A Contact Map is a 2D matrix where each pixel $(i, j)$ represents the distance between residues $i$ and $j$.
- Perfect structures have clear patterns (helices show a tight band parallel to the diagonal).
- High-drift decoys smear these patterns, showing the model "what not to predict".
def plot_contact_map(batch, title):
    # Get CA atom coordinates for the first model
    c = batch.coords[0]
    atom_names = batch.atom_names
    ca_mask = np.array([name == "CA" for name in atom_names])
    ca_coords = c[ca_mask]
    # Calculate pairwise distances
    diff = ca_coords[:, np.newaxis, :] - ca_coords[np.newaxis, :, :]
    dist_matrix = np.sqrt((diff**2).sum(-1))
    plt.figure(figsize=(6, 5))
    plt.imshow(dist_matrix, cmap='viridis_r')
    plt.colorbar(label="Distance (Å)")
    plt.title(f"Contact Map: {title}")
    plt.xlabel("Residue Index")
    plt.ylabel("Residue Index")
    plt.show()
plot_contact_map(native, "Native Contacts")
plot_contact_map(hard_decoy, "Decoy Contacts (Scattered)")
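Strictly speaking, the plots above are distance maps; a contact map in the usual sense thresholds the distances (a common choice is 8 Å between CA atoms — an assumed cutoff, not taken from synth-pdb). A minimal sketch of that conversion:

```python
import numpy as np

def contact_map(ca_coords, cutoff=8.0):
    """Binary contact map: True where the CA-CA distance is under the cutoff (Å)."""
    diff = ca_coords[:, np.newaxis, :] - ca_coords[np.newaxis, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    return dist < cutoff

# Toy CA trace with ~3.8 Å spacing along x: only near-diagonal contacts survive
ca = np.column_stack([np.arange(10) * 3.8, np.zeros(10), np.zeros(10)])
cm = contact_map(ca)
print(cm.sum())  # 44: the diagonal plus the first two off-diagonals
```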
Strategy 2: Label Shuffling (Chemical Mismatch)¶
Objective: Create a physically perfect structure that is chemically impossible.
By shuffling residue labels, we create structures where bulky residues are forced into cramped spaces, or hydrophobic residues are exposed to solvent.
import random
def create_shuffled_decoy(batch):
    original_seq = batch.sequence
    shuffled_seq = original_seq.copy()
    random.shuffle(shuffled_seq)
    print(f"Native Sequence: {' '.join(original_seq[:8])}...")
    print(f"Shuffled Decoy: {' '.join(shuffled_seq[:8])}...")
    return shuffled_seq
shuffled_labels = create_shuffled_decoy(native)
print("\n✅ This structural data now points to a nonsensical chemical identity.")
🔬 Strategic Insight: Residue-to-Structure Mismatch¶
Imagine training a model to predict side-chain orientations (Rotamers). If you provide a Shuffled Decoy, the backbone will suggest a tiny Glycine spot, but the label will say "Tryptophan".
This forces the model to learn that Backbone Geometry must match Sidechain Chemistry.
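The mismatch can be quantified crudely with side-chain volume. The numbers below are rough literature figures for residue volumes (illustrative values, not part of synth-pdb), and the helper function is hypothetical:

```python
# Approximate residue volumes in Å^3 (rough literature values, for illustration only)
RESIDUE_VOLUME = {"GLY": 60.0, "ALA": 89.0, "LEU": 167.0, "TRP": 228.0}

def volume_mismatch(backbone_res, label_res):
    """Extra volume (Å^3) the shuffled label demands vs. the backbone's original slot."""
    return RESIDUE_VOLUME[label_res] - RESIDUE_VOLUME[backbone_res]

# A Glycine-sized pocket labeled as Tryptophan: a severe steric conflict
print(volume_mismatch("GLY", "TRP"))  # 168.0 Å^3 of physically impossible volume
```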
Strategy 3: Sequence Threading (Fold Mismatch)¶
Objective: Force a sequence onto a fold it cannot naturally adopt.
Example: Threading a Poly-Glycine sequence onto the backbone of a Poly-Tryptophan alpha helix.
template_seq = "TRP-TRP-TRP-TRP-TRP-TRP-TRP-TRP-TRP"
thread_seq = "GLY-GLY-GLY-GLY-GLY-GLY-GLY-GLY-GLY"
generator = BatchedGenerator(template_seq, n_batch=1, full_atom=True)
batch = generator.generate_batch()
print(f"Backbone generated for Template: {template_seq}")
print(f"Threaded with Decoy Sequence: {thread_seq}")
view = py3Dmol.view(width=400, height=300)
view.setBackgroundColor("#fdfdfd")
# ROBUST CENTERING (Ported from ml_handover_demo)
c = batch.coords[0].copy()
mask = np.any(c != 0, axis=1)
c_clean = c[mask]
center = (c_clean.min(axis=0) + c_clean.max(axis=0)) / 2
c_centered = c_clean - center
p_tmp = BatchedPeptide(
    c_centered[np.newaxis, ...],
    batch.sequence,
    np.array(batch.atom_names)[mask].tolist(),
    np.array(batch.residue_indices)[mask].tolist()
)
view.addModel(p_tmp.to_pdb(0), 'pdb')
view.setStyle({'stick': {'radius': 0.15}, 'cartoon': {'color': 'spectrum'}})
view.zoomTo()
view.center()
view.zoom(1.2)
view.show()
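The centering trick in the cell above (mask out zero-padded atom rows, then subtract the bounding-box midpoint) generalizes to any padded coordinate array. A standalone sketch of the same logic:

```python
import numpy as np

def center_coords(coords):
    """Drop zero-padding rows and center on the bounding-box midpoint."""
    mask = np.any(coords != 0, axis=1)  # padded atoms are all-zero rows
    clean = coords[mask]
    center = (clean.min(axis=0) + clean.max(axis=0)) / 2
    return clean - center

# Two real atoms plus two zero-padding rows
c = np.array([[0., 0, 0], [2, 2, 2], [4, 0, 0], [0, 0, 0]])
centered = center_coords(c)
print(centered.min(axis=0) + centered.max(axis=0))  # [0. 0. 0.] — symmetric about origin
```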
🏆 The Challenge: Mass Dataset Generation¶
In a production pipeline, you would use these strategies to generate millions of rows:
# Mock Training Loop logic
for i in range(1000):
    is_native = (i % 2 == 0)
    drift = 0.0 if is_native else 10.0
    data = generator.generate_batch(drift=drift)
    # Feed to GNN/Transformer...
By generating hard decoys on the fly, you create an infinite stream of diverse training data that prevents your model from overfitting! 🚀
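The loop above can be made concrete without synth-pdb by mocking the generator. The sketch below (hypothetical names, mock angle data) yields an on-the-fly stream that alternates labeled native and decoy samples:

```python
import numpy as np

def decoy_stream(n_samples, n_angles=16, decoy_drift=10.0, seed=0):
    """Yield (angles, label) pairs: even steps native (1.0), odd steps decoy (0.0)."""
    rng = np.random.default_rng(seed)
    ideal = np.full(n_angles, -57.0)  # mock ideal phi angles
    for i in range(n_samples):
        is_native = (i % 2 == 0)
        if is_native:
            angles = ideal.copy()                               # exact native backbone
        else:
            angles = ideal + rng.normal(0.0, decoy_drift, n_angles)  # drifted decoy
        yield angles, float(is_native)

labels = [lbl for _, lbl in decoy_stream(1000)]
print(sum(labels))  # 500.0 — perfectly balanced natives and decoys
```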
🌀 Visualization C: The Hard Decoy Ensemble¶
Let's visualize a hard decoy ensemble.
view = py3Dmol.view(width=800, height=400)
view.setBackgroundColor("#fdfdfd")
colors = ["#ff9999", "#66b3ff", "#99ff99", "#ffcc99", "#c2c2f0"]
for i in range(min(5, hard_decoy.coords.shape[0])):
    # ROBUST PER-MODEL CENTERING
    c = hard_decoy.coords[i].copy()
    mask = np.any(c != 0, axis=1)
    c_clean = c[mask]
    center = (c_clean.min(axis=0) + c_clean.max(axis=0)) / 2
    c_centered = c_clean - center
    p_tmp = BatchedPeptide(
        c_centered[np.newaxis, ...],
        hard_decoy.sequence,
        np.array(hard_decoy.atom_names)[mask].tolist(),
        np.array(hard_decoy.residue_indices)[mask].tolist()
    )
    view.addModel(p_tmp.to_pdb(0), 'pdb')
    view.setStyle({'model': i}, {'stick': {'radius': 0.15, 'color': colors[i], 'opacity': 0.8}})
view.zoomTo()
view.center()
view.zoom(1.8)
view.show()
🏆 Next Steps¶
- Increase the drift to 30.0. How does the 3D ensemble change?
- Try generating a batch with full_atom=False. 🚀