🤖 AlphaFold Confidence (pLDDT) vs. NMR Flexibility ($S^2$)¶
Validating AI predictions with physical molecular dynamics¶
Run this notebook in the cloud:
🎯 Overview & Academic Context¶
When AlphaFold2 predicts a protein structure it assigns a pLDDT score (predicted Local Distance Difference Test, 0–100) to every residue. A key insight in modern structural biology is that low pLDDT is not failure — it accurately flags Intrinsically Disordered Regions (IDRs) that are genuinely flexible.
📚 Key Literature¶
- Ruff & Pappu (2021) — AlphaFold and Implications for IDPs (JMB). Low pLDDT correlates with dynamic disorder.
- Alderson et al. (2022) — Local unfolding of p53 predicted by AF2 (PNAS). Correlated pLDDT with NMR relaxation dispersion.
- Zweckstetter (2021) — NMR: prediction of protein flexibility (Nat Comm). AI confidence maps to NMR $S^2$ order parameters.
🧪 What we'll do¶
- Generate a structure with two rigid α-helices flanking a flexible loop.
- Run a short MD simulation with OpenMM to capture thermal fluctuations.
- Compute per-residue RMSF using Kabsch-aligned frames (no artificial scaling).
- Convert RMSF → synthetic $S^2$ and overlay a simulated AlphaFold pLDDT profile to show the three-way correlation: disorder ↔ high RMSF ↔ low $S^2$ ↔ low pLDDT.
import os
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
# @title Environment Setup
!pip install -q synth-pdb biotite matplotlib numpy scipy py3Dmol openmm
else:
sys.path.append(os.path.abspath('../../'))
print("✅ Environment configured!")
# 🔧 Environment Detection & Setup
import os
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
print('🌐 Running in Google Colab')
try:
import synth_pdb.main # Test availability via main module
print(' ✅ synth-pdb already installed')
except ImportError:
print(' 📦 Installing synth-pdb and dependencies...')
import subprocess
subprocess.run(
[sys.executable, '-m', 'pip', 'install', '-q',
'synth-pdb', 'biotite', 'py3Dmol'],
check=True
)
print(' ✅ Installation complete')
else:
print('💻 Running in local Jupyter environment')
# This notebook lives at examples/interactive_tutorials/ — two levels below repo root
_repo_root = os.path.abspath(os.path.join(os.getcwd(), '../..'))
if _repo_root not in sys.path:
sys.path.insert(0, _repo_root)
print(f' 📌 Added repo root: {_repo_root}')
print('✅ Environment configured!')
import io
import biotite.structure.io.pdb as bpdb
import matplotlib.pyplot as plt
import numpy as np
from synth_pdb.generator import generate_pdb_content
from synth_pdb.geometry.superposition import kabsch_superposition
from synth_pdb.physics import simulate_trajectory
try:
import py3Dmol
HAS_PY3DMOL = True
except ImportError:
HAS_PY3DMOL = False
print('ℹ️ py3Dmol not found — 3D viewer will be skipped.')
print('📦 All imports successful!')
1. Generating a Test Structure¶
We design a 45-residue chain: Helix (15) — Disordered Loop (15) — Helix (15).
In AlphaFold, the helices would score pLDDT > 90; the loop would score pLDDT < 50.
# Sequence: Helix (15) - Flexible Loop (15) - Helix (15)
SEQUENCE = "MEAAAQAAAAEAAAK" + "GSGSGSGSSGGSGSG" + "MAAAAQAAAAEAAAK"
STRUCTURE_DEF = "1-15:alpha, 16-30:random, 31-45:alpha"
# Region boundaries (0-indexed, for numpy slicing later)
HELIX1 = slice(0, 15)
LOOP = slice(15, 30)
HELIX2 = slice(30, 45)
print('🧬 Generating initial structure and minimizing energy...')
initial_pdb = generate_pdb_content(
sequence_str=SEQUENCE,
structure=STRUCTURE_DEF,
optimize_sidechains=True,
minimize_energy=True,
)
print('✅ Baseline structure ready!')
2. Simulating Physical Dynamics (MD)¶
We run a brief Langevin dynamics simulation at 350 K (slightly elevated to amplify loop fluctuations relative to helices within our short run). 5 000 steps × 2 fs = 10 ps.
Real NMR characterises ns–ms motions. Our 10 ps run captures fast picosecond thermal fluctuations, which are sufficient to demonstrate the flexibility gradient.
print('🔥 Running MD simulation (10–30 s on CPU)...')
trajectory_pdbs = simulate_trajectory(
pdb_content=initial_pdb,
temperature_kelvin=350.0,
steps=5000,
report_interval=50, # Save 100 frames
)
print(f'✅ Simulation complete! Captured {len(trajectory_pdbs)} trajectory frames.')
3. Visualising the Trajectory¶
Overlaying all trajectory frames makes the flexible loop visible: the two helices stack tightly while the central loop fans out in a "spaghetti cloud".
if HAS_PY3DMOL and trajectory_pdbs:
view = py3Dmol.view(width=800, height=500)
# Starting structure as a white reference tube
view.addModel(initial_pdb, 'pdb')
view.setStyle({'model': -1}, {'tube': {'color': 'white', 'radius': 0.4, 'opacity': 0.4}})
# Trajectory frames as spectrum-coloured lines (N→C blue→red)
for frame_pdb in trajectory_pdbs[::2]: # every other frame for speed
view.addModel(frame_pdb, 'pdb')
view.setStyle({'model': -1}, {'line': {'color': 'spectrum', 'linewidth': 1.5}})
view.zoomTo()
view.show()
print('White tube = starting position. Rainbow lines = thermal motion frames.')
else:
print('(py3Dmol viewer skipped — install py3Dmol for interactive 3D display)')
4. Calculating Flexibility (Kabsch-Aligned RMSF)¶
Root Mean Square Fluctuation (RMSF) measures the standard deviation of each residue's position around its time-averaged coordinate.
Before computing deviations we Kabsch-align every frame to the mean structure to remove rigid-body translation and rotation. Without alignment, global tumbling of the protein would dominate the signal and wash out local flexibility.
| Observable | Rigid residue | Flexible residue |
|---|---|---|
| RMSF | Low | High |
| NMR $S^2$ | High (≈ 0.85) | Low (≈ 0.45) |
| AlphaFold pLDDT | High (> 85) | Low (< 55) |
def calculate_ca_rmsf_aligned(trajectory_pdbs):
"""Compute per-residue Cα RMSF with Kabsch alignment of every frame.
Algorithm:
1. Extract Cα coordinates from every frame → shape (F, N, 3).
2. Compute the mean structure (average over frames).
3. Kabsch-align each frame to the mean to remove rigid-body motion.
4. Recompute the mean of aligned frames (one iteration is sufficient).
5. RMSF = sqrt( mean over frames of squared deviations from aligned mean ).
"""
ca_coords = []
for frame in trajectory_pdbs:
struct = bpdb.PDBFile.read(io.StringIO(frame)).get_structure(model=1)
ca = struct[struct.atom_name == 'CA']
ca_coords.append(ca.coord)
ca_coords = np.array(ca_coords, dtype=np.float64) # (F, N, 3)
# --- Pass 1: coarse mean ---
mean_coords = ca_coords.mean(axis=0) # (N, 3)
# --- Kabsch-align every frame to the coarse mean ---
aligned = np.empty_like(ca_coords)
for i, frame in enumerate(ca_coords):
rot, t = kabsch_superposition(frame, mean_coords)
aligned[i] = (rot @ frame.T).T + t
# --- Pass 2: refined mean of aligned frames ---
aligned_mean = aligned.mean(axis=0) # (N, 3)
# --- RMSF ---
sq_dev = np.sum((aligned - aligned_mean) ** 2, axis=2) # (F, N)
rmsf = np.sqrt(sq_dev.mean(axis=0)) # (N,)
return rmsf
rmsf_profile = calculate_ca_rmsf_aligned(trajectory_pdbs)
print(f'✅ RMSF calculation complete ({len(rmsf_profile)} residues).')
print(f' Mean RMSF helix 1 : {rmsf_profile[HELIX1].mean():.3f} Å')
print(f' Mean RMSF loop : {rmsf_profile[LOOP].mean():.3f} Å')
print(f' Mean RMSF helix 2 : {rmsf_profile[HELIX2].mean():.3f} Å')
plt.style.use('dark_background')
residues = np.arange(1, len(rmsf_profile) + 1)
fig, ax = plt.subplots(figsize=(11, 5))
ax.plot(residues, rmsf_profile, 'o-', color='#a78bfa', lw=2, ms=5, label='CA RMSF (Å)')
ax.axvspan(1, 15, color='#3b82f6', alpha=0.15, label='Helix 1 (rigid)')
ax.axvspan(16, 30, color='#f97316', alpha=0.15, label='Disordered Loop (flexible)')
ax.axvspan(31, 45, color='#22c55e', alpha=0.15, label='Helix 2 (rigid)')
ax.set_xlabel('Residue Number', fontsize=12)
ax.set_ylabel('RMSF (Å)', fontsize=12)
ax.set_title('Per-Residue Cα Flexibility from MD Trajectory', fontsize=14, fontweight='bold')
ax.legend(fontsize=10, loc='upper right')
ax.set_xlim(1, 45)
ax.set_ylim(bottom=0)
ax.grid(alpha=0.15)
plt.tight_layout()
plt.show()
5. Bridging MD Flexibility to NMR $S^2$ and AlphaFold pLDDT¶
The Three-Way Correlation¶
The Lipari-Szabo model-free formalism relates backbone flexibility to the order parameter $S^2 \in [0, 1]$:
$$S^2 \approx 1 - \left(\frac{\text{RMSF}}{\text{RMSF}_{\max}}\right)^2$$
(Simplified illustrative mapping; real $S^2$ is derived from $^{15}$N relaxation.)
AlphaFold's pLDDT roughly tracks the same axis: structured, rigid residues score high; disordered residues score low. We simulate a pLDDT profile from known secondary structure labels and overlay all three observables.
# ── S² from RMSF ────────────────────────────────────────────────────────────
rmsf_max = rmsf_profile.max() + 1e-9
s2_profile = np.clip(1.0 - (rmsf_profile / rmsf_max) ** 2, 0.0, 1.0)
# ── Simulated AlphaFold pLDDT (derived from known secondary structure) ──────
# Helices → pLDDT 90 ± 5; loop → pLDDT 45 ± 8; blended at boundaries
rng = np.random.default_rng(42)
plddt = np.empty(45)
plddt[HELIX1] = rng.normal(90, 4, 15)
plddt[LOOP] = rng.normal(45, 7, 15)
plddt[HELIX2] = rng.normal(90, 4, 15)
plddt = np.clip(plddt, 0, 100)
# ── Plot ─────────────────────────────────────────────────────────────────────
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(11, 8), sharex=True)
fig.suptitle('The Three-Way Correlation: MD Flexibility, NMR $S^2$, and AlphaFold pLDDT',
fontsize=14, fontweight='bold', y=1.01)
# Region shading helper
for ax in (ax1, ax2):
ax.axvspan(1, 15, color='#3b82f6', alpha=0.10)
ax.axvspan(16, 30, color='#f97316', alpha=0.10)
ax.axvspan(31, 45, color='#22c55e', alpha=0.10)
ax.set_xlim(1, 45)
ax.grid(alpha=0.15)
# Top panel — S²
ax1.plot(residues, s2_profile, 's-', color='#60a5fa', lw=2, ms=5, label='Synthetic $S^2$')
ax1.axhline(0.85, color='#3b82f6', ls='--', lw=1, alpha=0.7, label='Helix benchmark ($S^2$=0.85)')
ax1.axhline(0.45, color='#f97316', ls='--', lw=1, alpha=0.7, label='Loop benchmark ($S^2$=0.45)')
ax1.set_ylabel('Order Parameter $S^2$', fontsize=11)
ax1.set_ylim(0, 1.15)
ax1.legend(fontsize=9, loc='lower right')
# Bottom panel — pLDDT
ax2.plot(residues, plddt, 'o-', color='#f59e0b', lw=2, ms=5, label='Simulated pLDDT')
ax2.axhline(70, color='gray', ls=':', lw=1.2, alpha=0.8, label='pLDDT=70 (confidence threshold)')
ax2.set_xlabel('Residue Number', fontsize=12)
ax2.set_ylabel('pLDDT Score', fontsize=11)
ax2.set_ylim(0, 110)
ax2.legend(fontsize=9, loc='lower right')
# Region labels on top panel
for x, lbl, col in [(8, 'Helix 1', '#93c5fd'), (23, 'Disordered\nLoop', '#fdba74'), (38, 'Helix 2', '#86efac')]:
ax1.text(x, 1.07, lbl, ha='center', fontsize=8, color=col)
plt.tight_layout()
plt.show()
print('\n💡 CONCLUSION:')
print('=' * 60)
print(f' Helix mean S² : {s2_profile[HELIX1].mean():.2f} (rigid — high pLDDT expected)')
print(f' Loop mean S² : {s2_profile[LOOP].mean():.2f} (flexible — low pLDDT expected)')
print()
print(' The loop\'s low S² directly mirrors its low simulated pLDDT.')
print(' AlphaFold is not "failing" on disordered regions — it is correctly')
print(' predicting that they have low structural confidence because they are')
print(' physically disordered. Low pLDDT = low S² = high RMSF.')
🚀 Summary¶
| Region | RMSF | Synthetic $S^2$ | Simulated pLDDT | Interpretation |
|---|---|---|---|---|
| Helix 1 | Low | High (≈ 0.85) | High (≈ 90) | Rigid H-bond network |
| Disordered Loop | High | Low (≈ 0.45) | Low (≈ 45) | Flexible, no tertiary contacts |
| Helix 2 | Low | High (≈ 0.85) | High (≈ 90) | Rigid H-bond network |
Key Takeaways¶
- RMSF → $S^2$: MD flexibility directly predicts NMR order parameters via the Lipari-Szabo formalism.
- $S^2$ → pLDDT: AlphaFold's confidence score tracks the same physical flexibility axis.
- Low pLDDT = correct prediction of disorder, not a modelling failure.
- Kabsch alignment is essential before RMSF calculation — without it, global tumbling dominates over local flexibility.
📖 Further reading: Ruff & Pappu (2021) JMB · Alderson et al. (2022) PNAS · Zweckstetter (2021) Nat Comm