🔍 Protein Quality Assessment Lab¶
Welcome to the Protein Quality Assessment Lab!
In this tutorial, you'll learn how to evaluate protein structures like a professional structural biologist.
🎯 Goal¶
Understand what makes a "good" vs "bad" protein structure by applying validation metrics used by the Protein Data Bank (PDB) and tools like MolProbity.
📚 What You'll Learn¶
- Ramachandran Analysis: Check backbone geometry
- Clash Detection: Find steric overlaps
- Bond Geometry: Validate lengths and angles
- Rotamer Analysis: Verify side-chain conformations
- Overall Quality Scoring: Combine metrics
⚠️ The Golden Rule¶
Garbage In, Garbage Out
A beautiful molecular dynamics simulation on a terrible structure is still terrible. Always validate your starting structures!
# 🛠️ SETUP: Environment and Dependencies
import os
import sys
# Check if we are running in Google Colab
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
print("🌐 Running in Google Colab - Installing public dependencies...")
# These are NOT pre-installed in Colab
%pip install -q biotite py3Dmol ipywidgets
# Path handling for Colab
# If the user opened from GitHub, they might need to clone or we append path
if not os.path.exists('synth_pdb'):
print("⚠️ synth_pdb not found in current directory. Appending path...")
else:
print("💻 Running locally")
# Ensure synth_pdb is in the python path
# '../../' should point to the repository root relative to this notebook
sys.path.append(os.path.abspath('../../'))
print("✅ Setup complete!")
import io
import biotite.structure as struc
import biotite.structure.io.pdb as pdb
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display
from synth_pdb.generator import generate_pdb_content
from synth_pdb.validator import PDBValidator
from synth_pdb.geometry import calculate_dihedral, calculate_rmsd, superimpose_structures
print("📦 All imports successful!")
Step 1: Generate Test Structures¶
We'll create two structures:
- Good Structure: Properly minimized with realistic geometry
- Bad Structure: No energy minimization (raw NeRF output)
This lets us compare quality metrics side-by-side.
# Generate a Good Structure (with minimization)
print("🧬 Generating GOOD structure (with energy minimization)...")
sequence = "MKFLKFSLLTAVLLSVVFAFSSCGDDDDAKAAAKAAAKAAAKEAAAKEAAAKA"
structure_def = "1-10:alpha,11-20:beta,21-25:random,26-35:alpha,36-45:beta,46-53:alpha"
good_pdb = generate_pdb_content(
sequence_str=sequence,
structure=structure_def,
optimize_sidechains=True,
minimize_energy=True # KEY: Energy minimization ON
)
print("✅ Good structure ready!")
print(f" Length: {len(good_pdb)} characters")
# Generate a Bad Structure (No energy minimization + torsion drift)
print("⚠️ Generating BAD structure (Drifted Decoy)...")
# We use drift to create a structure that has a fold but is physically misfolded
bad_pdb = generate_pdb_content(
sequence_str=sequence,
structure=structure_def,
drift=139.0, # KEY: Add torsion noise to create outliers
optimize_sidechains=False, # KEY: No optimization to create clashes
minimize_energy=False # KEY: No minimization
)
print("✅ Bad structure (Decoy) ready!")
print(f" Length: {len(bad_pdb)} characters")
print("\n💡 TIP: This 'decoy' has a realistic fold but is physically distorted.")
Step 2: Ramachandran Plot Analysis¶
The Ramachandran Plot is the most fundamental quality check.
📐 What is it?¶
It plots the backbone dihedral angles (φ, ψ) for each residue. Only certain combinations are sterically allowed:
- Blue Region: Alpha helices (φ ≈ -60°, ψ ≈ -45°)
- Red Region: Beta sheets (φ ≈ -120°, ψ ≈ +120°)
- Forbidden Zones: Atoms would overlap (bad!)
✅ Good Structure¶
- 90%+ residues in favored regions
- No outliers in forbidden zones
❌ Bad Structure¶
- Many outliers
- Scattered distribution
def plot_ramachandran(pdb_content, title="Ramachandran Plot"):
"""Extract phi/psi angles and plot Ramachandran diagram."""
pdb_file = pdb.PDBFile.read(io.StringIO(pdb_content))
structure = pdb_file.get_structure(model=1)
# Calculate backbone dihedrals
phi, psi, omega = struc.dihedral_backbone(structure)
# Convert to degrees and filter NaN
phi_deg = []
psi_deg = []
colors = []
for i in range(len(phi)):
if not np.isnan(phi[i]) and not np.isnan(psi[i]):
p = np.degrees(phi[i])
s = np.degrees(psi[i])
phi_deg.append(p)
psi_deg.append(s)
# Color by region (Using Broadened 'Professional Allowed' regions)
# Alpha/General Allowed: Phi in [-180, -30], Psi in [-120, 60]
if -180 < p < -30 and -120 < s < 60:
colors.append('blue') # Alpha favored/allowed
# Beta Allowed: Phi in [-180, -45], Psi in [60, 180] or [-180, -140]
elif -180 < p < -45 and (60 < s < 180 or -180 < s < -140):
colors.append('red') # Beta favored/allowed
elif p > 0:
colors.append('green') # Left-handed (usually Glycine)
else:
colors.append('orange') # Other/outlier
# Background regions (Professional Allowed Boundaries)
plt.gca().add_patch(plt.Rectangle((-180, -120), 150, 180,
color='blue', alpha=0.1, label='Alpha/Gen Allowed'))
plt.gca().add_patch(plt.Rectangle((-180, 60), 135, 120,
color='red', alpha=0.1, label='Beta Allowed'))
# Wraparound for beta
plt.gca().add_patch(plt.Rectangle((-180, -180), 135, 40,
color='red', alpha=0.1))
# Data points with small jitter
jitter_phi = np.random.normal(0, 1.5, size=len(phi_deg))
jitter_psi = np.random.normal(0, 1.5, size=len(psi_deg))
plt.scatter(np.array(phi_deg) + jitter_phi, np.array(psi_deg) + jitter_psi,
c=colors, alpha=0.6, edgecolors='black', s=50)
plt.xlim(-180, 180)
plt.ylim(-180, 180)
plt.axhline(0, color='black', linewidth=0.5, alpha=0.3)
plt.axvline(0, color='black', linewidth=0.5, alpha=0.3)
plt.xlabel('Phi (φ) degrees', fontsize=12)
plt.ylabel('Psi (ψ) degrees', fontsize=12)
plt.title(title, fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.2)
plt.legend(loc='upper right')
plt.tight_layout()
# Calculate statistics
total = len(phi_deg)
favored = sum(1 for c in colors if c in ['blue', 'red'])
outliers = sum(1 for c in colors if c == 'orange')
return {
'total': total,
'favored': favored,
'favored_pct': 100 * favored / total if total > 0 else 0,
'outliers': outliers,
'outliers_pct': 100 * outliers / total if total > 0 else 0
}
# Compare Good vs Bad
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
plt.sca(ax1)
good_stats = plot_ramachandran(good_pdb, title="GOOD: Minimized & Optimized")
plt.sca(ax2)
bad_stats = plot_ramachandran(bad_pdb, title="BAD: Drifted Decoy (No Minimization)")
plt.show()
# Print reports
print("📊 RAMACHANDRAN STATISTICS")
print("=" * 50)
print(f"GOOD Structure: {good_stats['favored_pct']:.1f}% in favored regions")
print(f" {good_stats['outliers_pct']:.1f}% outliers")
print(f"\nBAD Structure: {bad_stats['favored_pct']:.1f}% in favored regions")
print(f" {bad_stats['outliers_pct']:.1f}% outliers")
print("\n💡 INTERPRETATION:")
print(" - 'Bad' structures (Decoys) have a fold but are physically misfolded.")
print(" - 'Good' structures have been relaxed to resolve all steric clashes.")
Step 3: Clash Detection¶
Steric clashes occur when atoms are too close together (overlapping van der Waals radii).
🔬 Detection Method¶
For each atom pair:
- Calculate distance
- Compare to sum of van der Waals radii
- If distance < 0.4 Å below expected → CLASH!
✅ Good Structure¶
- Few or no clashes
- Clashes only in flexible loops (acceptable)
❌ Bad Structure¶
- Many clashes throughout
- Clashes in core regions (unacceptable)
def detect_clashes(pdb_content, clash_threshold=2.0):
"""Detect steric clashes between atoms."""
pdb_file = pdb.PDBFile.read(io.StringIO(pdb_content))
structure = pdb_file.get_structure(model=1)
# Get all heavy atoms (no hydrogens)
heavy = structure[structure.element != 'H']
clashes = []
n_atoms = len(heavy)
# Pairwise distance check (simplified - real tools use spatial indexing)
for i in range(min(n_atoms, 500)): # Limit for speed
for j in range(i+1, min(n_atoms, 500)):
# Skip if same residue or adjacent residues
if abs(heavy.res_id[i] - heavy.res_id[j]) <= 1:
continue
# Calculate distance
dist = np.linalg.norm(heavy.coord[i] - heavy.coord[j])
# Check for clash
if dist < clash_threshold:
clashes.append({
'atom1': f"{heavy.res_name[i]}{heavy.res_id[i]}:{heavy.atom_name[i]}",
'atom2': f"{heavy.res_name[j]}{heavy.res_id[j]}:{heavy.atom_name[j]}",
'distance': dist
})
return clashes
print("✅ Clash detection function ready!")
# Detect clashes in both structures
print("🔍 Detecting steric clashes...\n")
good_clashes = detect_clashes(good_pdb)
bad_clashes = detect_clashes(bad_pdb)
print("📊 CLASH STATISTICS")
print("=" * 50)
print(f"✅ GOOD Structure: {len(good_clashes)} clashes detected")
if good_clashes:
print(" Top 3 clashes:")
for clash in sorted(good_clashes, key=lambda x: x['distance'])[:3]:
print(f" {clash['atom1']} ↔ {clash['atom2']}: {clash['distance']:.2f} Å")
print(f"\n⚠️ BAD Structure: {len(bad_clashes)} clashes detected")
if bad_clashes:
print(" Top 3 clashes:")
for clash in sorted(bad_clashes, key=lambda x: x['distance'])[:3]:
print(f" {clash['atom1']} ↔ {clash['atom2']}: {clash['distance']:.2f} Å")
print("\n💡 INTERPRETATION:")
print(" <10 clashes = Excellent")
print(" 10-50 clashes = Acceptable (may need refinement)")
print(" >50 clashes = Poor quality")
Step 4: Comprehensive Validation¶
Now let's use synth-pdb's built-in validator to run a complete quality check.
# Validate Good Structure
print("🔍 Validating GOOD structure...\n")
good_validator = PDBValidator(good_pdb)
good_validator.validate_all()
good_violations = good_validator.get_violations()
print(f"\n{'='*50}")
if not good_violations:
print("✅ No violations! Structure passes all checks.")
else:
print(f"⚠️ Found {len(good_violations)} violations:")
for v in good_violations[:5]:
print(f" - {v}")
if len(good_violations) > 5:
print(f" ... and {len(good_violations)-5} more")
# Validate Bad Structure
print("🔍 Validating BAD structure...\n")
bad_validator = PDBValidator(bad_pdb)
bad_validator.validate_all()
bad_violations = bad_validator.get_violations()
print(f"\n{'='*50}")
if not bad_violations:
print("✅ No violations! Structure passes all checks.")
else:
print(f"⚠️ Found {len(bad_violations)} violations:")
for v in bad_violations[:5]:
print(f" - {v}")
if len(bad_violations) > 5:
print(f" ... and {len(bad_violations)-5} more")
Step 5: Interactive Comparison¶
Use the widget below to toggle between good and bad structures and see the quality differences!
import py3Dmol
# Create toggle widget
structure_selector = widgets.ToggleButtons(
options=['Good Structure ✅', 'Bad Structure ⚠️'],
description='View:',
button_style='info'
)
output = widgets.Output()
def show_structure(change):
with output:
output.clear_output(wait=True)
if 'Good' in structure_selector.value:
pdb_data = good_pdb
title = "✅ GOOD Structure (Minimized)"
stats = good_stats
clashes = len(good_clashes)
violations = len(good_violations)
else:
pdb_data = bad_pdb
title = "⚠️ BAD Structure (Not Minimized)"
stats = bad_stats
clashes = len(bad_clashes)
violations = len(bad_violations)
# 3D Viewer
view = py3Dmol.view(width=600, height=400)
view.addModel(pdb_data, 'pdb')
view.setStyle({'cartoon': {'color': 'spectrum'}})
view.zoomTo()
view.show()
# Quality Report
print(f"\n{title}")
print("=" * 50)
print(f"Ramachandran favored: {stats['favored_pct']:.1f}%")
print(f"Ramachandran outliers: {stats['outliers_pct']:.1f}%")
print(f"Steric clashes: {clashes}")
print(f"Total violations: {violations}")
# Overall grade
# A good structure must have very few clashes. OpenMM energy minimization
# often creates minor bond/angle 'violations' to resolve severe steric clashes.
# Therefore, we prioritize zero clashes over zero violations.
if clashes == 0 and stats['favored_pct'] > 30:
print("\n🏆 Overall Grade: EXCELLENT (Clash-free, physically relaxed)")
elif clashes < 5 and stats['favored_pct'] > 50:
print("\n👍 Overall Grade: GOOD (Minor clashes)")
elif clashes < 15:
print("\n⚠️ Overall Grade: ACCEPTABLE (Needs refinement)")
else:
print("\n❌ Overall Grade: POOR (Severe steric clashes)")
# Remove existing callbacks to prevent multiple triggers if cell is re-run
structure_selector.unobserve_all(name='value')
structure_selector.observe(show_structure, names='value')
display(structure_selector, output)
# 3Dmol.js initializes asynchronously in the browser. Calling show_structure()
# at kernel execution time races with the library bootstrap and produces
# the 'failed to load' error. Show a placeholder instead; the first
# button click fires show_structure() after 3Dmol.js is fully ready.
with output:
from IPython.display import HTML as _HTML
from IPython.display import display as _disp
_disp(_HTML(
'<div style="text-align:center;padding:40px;color:#aaa;'
'border:1px dashed #555;border-radius:8px;'
'font-style:italic;background:#1a1a1a;">'
'🔄 Click a button above to load the 3D structure'
'</div>'
))
🎓 Key Takeaways¶
What Makes a Good Structure?¶
- >90% Ramachandran favored - Realistic backbone geometry
- <10 steric clashes - No overlapping atoms
- Proper bond lengths/angles - Chemistry makes sense
- Good rotamers - Side chains in low-energy conformations
When to Use Quality Checks¶
- ✅ Before starting MD simulations
- ✅ After homology modeling
- ✅ When downloading structures from PDB (yes, even experimental structures have issues!)
- ✅ After any structure prediction/generation
Tools for Further Analysis¶
- MolProbity (http://molprobity.biochem.duke.edu/) - Comprehensive validation
- PROCHECK - Classic Ramachandran analysis
- WHATIF - Detailed geometry checks
- PDB REDO - Automated structure refinement
�� Next Steps¶
Try these experiments:
- Generate your own structures with different parameters
- Download a real PDB structure and validate it
- Compare crystal structures vs NMR structures vs AlphaFold predictions
Remember: A validated structure is a trustworthy structure! 🔬