🔬 Validating Synthetic Chemical Shifts against Experimental NMR Data¶

Using Ubiquitin (PDB: 1D3Z) and BMRB Assignments (Entry: 6457)¶

🎯 Academic Context & Expected Discrepancies¶

Chemical shifts are the most sensitive and fundamental observables in NMR spectroscopy. They depend strongly on the local electronic environment, which is dictated by the backbone $(\phi, \psi)$ dihedral angles, side-chain rotamers, and hydrogen bonding networks.

Because chemical shifts are influenced by weak physical forces like ring currents and magnetic anisotropy, computational prediction is an ongoing challenge in biophysics. The synth-pdb underlying engine (synth-nmr) uses the SPARTA+ algorithm for empirical predictions based on sequence and 3D geometry.

Typical State-of-the-Art RMSD Errors:¶

According to the scientific literature, the best empirical chemical shift predictors generally achieve the following Root-Mean-Square Deviations (RMSD) on independent test proteins:

$C^\alpha$: ~1.0 - 1.2 ppm
$C^\beta$: ~1.1 - 1.3 ppm
$C'$ (Carbonyl): ~1.1 - 1.4 ppm
$N$: ~2.5 - 3.0 ppm
$H^\alpha$: ~0.25 - 0.30 ppm
$H^N$: ~0.45 - 0.55 ppm

In this notebook, we validate that the predictive performance of our module matches these accepted standards by analyzing Ubiquitin, proving our biophysics core operates realistically.

In [ ]:

hide_input

Copied!





# 🔧 Environment Setup
import os
import sys
import urllib.request

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('🌐 Running in Google Colab')
    !pip install -q synth-pdb synth-nmr pynmrstar biotite matplotlib numpy pandas scipy scikit-learn
else:
    print('💻 Running in local Jupyter environment')
    sys.path.append(os.path.abspath('../../'))

print('✅ Environment configured!')
# 🔧 Environment Setup
import os
import sys
import urllib.request

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('🌐 Running in Google Colab')
    !pip install -q synth-pdb synth-nmr pynmrstar biotite matplotlib numpy pandas scipy scikit-learn
else:
    print('💻 Running in local Jupyter environment')
    sys.path.append(os.path.abspath('../../'))

print('✅ Environment configured!')

1. Downloading the Experimental Data¶

We need two pieces of data:

The 3D coordinates for Ubiquitin (PDB ID: 1D3Z).
The comprehensive chemical shift assignments via the BMRB NMR-STAR database (Entry: 6457).

In [ ]:

Copied!





import biotite.structure.io.pdb as bpdb
import pynmrstar

# Download 1D3Z structure if not present
pdb_file = '1D3Z.pdb'
if not os.path.exists(pdb_file):
    print("Downloading 1D3Z.pdb...")
    urllib.request.urlretrieve('https://files.rcsb.org/download/1D3Z.pdb', pdb_file)

# Load 1D3Z into a Biotite AtomArray for the prediction step
pdb_struct = bpdb.PDBFile.read(pdb_file).get_structure(model=1)

print("✅ Structural data downloaded and atom array parsed.")
import biotite.structure.io.pdb as bpdb
import pynmrstar

# Download 1D3Z structure if not present
pdb_file = '1D3Z.pdb'
if not os.path.exists(pdb_file):
    print("Downloading 1D3Z.pdb...")
    urllib.request.urlretrieve('https://files.rcsb.org/download/1D3Z.pdb', pdb_file)

# Load 1D3Z into a Biotite AtomArray for the prediction step
pdb_struct = bpdb.PDBFile.read(pdb_file).get_structure(model=1)

print("✅ Structural data downloaded and atom array parsed.")

2. Parsing the Final Experimental Chemical Shifts (BMRB 6457)¶

Using pynmrstar, we can download the Biological Magnetic Resonance Data Bank (BMRB) entry 6457 and extract the assigned_chemical_shifts saveframe directly.

In [ ]:

Copied!





import numpy as np
import pandas as pd

print("Downloading BMRB 6457 chemical shift assignment table...")
entry = pynmrstar.Entry.from_database('6457')

cs_loops = entry.get_saveframes_by_category('assigned_chemical_shifts')
if not cs_loops:
    raise ValueError("No assigned chemical shifts found in BMRB entry.")

cs_loop = cs_loops[0].get_loop_by_category('Atom_chem_shift')

experimental_data = []
for row in cs_loop.get_tag(['Comp_index_ID', 'Comp_ID', 'Atom_ID', 'Val']):
    res_index, res_name, atom_name, val = row
    try:
        # The BMRB labels backbone amide protons as 'H' or 'HN' interchangeably; standardize to 'H'
        if atom_name == 'HN':
            atom_name = 'H'

        experimental_data.append({
            'res_index': int(res_index),
            'atom': atom_name,
            'exp': float(val)
        })
    except ValueError:
        pass

# Convert to DataFrame and isolate standard backbone atoms only to match classical empirical predictors
df_exp = pd.DataFrame(experimental_data)
targets = ['H', 'HA', 'N', 'C', 'CA', 'CB']
df_exp = df_exp[df_exp['atom'].isin(targets)].copy()

print(f"Parsed {len(df_exp)} relevant backbone experimental chemical shifts.")
print(df_exp.head())
import numpy as np
import pandas as pd

print("Downloading BMRB 6457 chemical shift assignment table...")
entry = pynmrstar.Entry.from_database('6457')

cs_loops = entry.get_saveframes_by_category('assigned_chemical_shifts')
if not cs_loops:
    raise ValueError("No assigned chemical shifts found in BMRB entry.")

cs_loop = cs_loops[0].get_loop_by_category('Atom_chem_shift')

experimental_data = []
for row in cs_loop.get_tag(['Comp_index_ID', 'Comp_ID', 'Atom_ID', 'Val']):
    res_index, res_name, atom_name, val = row
    try:
        # The BMRB labels backbone amide protons as 'H' or 'HN' interchangeably; standardize to 'H'
        if atom_name == 'HN':
            atom_name = 'H'

        experimental_data.append({
            'res_index': int(res_index),
            'atom': atom_name,
            'exp': float(val)
        })
    except ValueError:
        pass

# Convert to DataFrame and isolate standard backbone atoms only to match classical empirical predictors
df_exp = pd.DataFrame(experimental_data)
targets = ['H', 'HA', 'N', 'C', 'CA', 'CB']
df_exp = df_exp[df_exp['atom'].isin(targets)].copy()

print(f"Parsed {len(df_exp)} relevant backbone experimental chemical shifts.")
print(df_exp.head())

3. Predicting Synthetic Chemical Shifts via `synth-pdb`¶

We calculate theoretical chemical shifts for the 1D3Z coordinates using the built-in predict_chemical_shifts physical algorithm.

In [ ]:

Copied!





from synth_pdb.chemical_shifts import predict_chemical_shifts

print("Running chemical shift prediction pipeline...\n")
# predict_chemical_shifts takes the Biotite AtomArray
# It returns Dict[chain_id, Dict[res_index, Dict[atom_name, shift_value]]]
prediction_results = predict_chemical_shifts(pdb_struct)

# 1D3Z has a single chain, so we extract the mapping for Chain A
calc_shifts = prediction_results.get('A', {})

# Merge predictions into the pandas dataframe
calc_values = []
for idx, row in df_exp.iterrows():
    res_id = row['res_index']
    atom = row['atom']

    val = np.nan
    if res_id in calc_shifts and atom in calc_shifts[res_id]:
        val = calc_shifts[res_id][atom]

    calc_values.append(val)

df_exp['calc'] = calc_values
df_clean = df_exp.dropna().copy()
print(f"Matched {len(df_clean)} aligned theoretical/experimental pairs.")
from synth_pdb.chemical_shifts import predict_chemical_shifts

print("Running chemical shift prediction pipeline...\n")
# predict_chemical_shifts takes the Biotite AtomArray
# It returns Dict[chain_id, Dict[res_index, Dict[atom_name, shift_value]]]
prediction_results = predict_chemical_shifts(pdb_struct)

# 1D3Z has a single chain, so we extract the mapping for Chain A
calc_shifts = prediction_results.get('A', {})

# Merge predictions into the pandas dataframe
calc_values = []
for idx, row in df_exp.iterrows():
    res_id = row['res_index']
    atom = row['atom']

    val = np.nan
    if res_id in calc_shifts and atom in calc_shifts[res_id]:
        val = calc_shifts[res_id][atom]

    calc_values.append(val)

df_exp['calc'] = calc_values
df_clean = df_exp.dropna().copy()
print(f"Matched {len(df_clean)} aligned theoretical/experimental pairs.")

4. Visualization & Statistical Validation (RMSD)¶

We scatterplot the synthetic predictions versus the real-world measurements for all targeted atomic nuclei, defining scientific correctness through Root-Mean-Square Deviation.

In [ ]:

Copied!





import matplotlib.pyplot as plt
from scipy.stats import pearsonr

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()

nuclei = ['C', 'CA', 'CB', 'N', 'H', 'HA']
rmsds = {}

for i, nuc in enumerate(nuclei):
    ax = axes[i]
    df_nuc = df_clean[df_clean['atom'] == nuc]
    if df_nuc.empty:
        continue

    exp = df_nuc['exp'].values
    calc = df_nuc['calc'].values

    # Calculate Error Metrics
    rmsd = np.sqrt(np.mean((exp - calc) ** 2))
    r_val, _ = pearsonr(exp, calc)
    rmsds[nuc] = rmsd

    # Scatter Plot
    color = 'tab:blue' if 'C' in nuc else ('tab:green' if 'N' in nuc else 'tab:orange')
    ax.scatter(exp, calc, color=color, alpha=0.7, edgecolor='white', s=60)

    # Regression line parity
    min_val = min(exp.min(), calc.min()) - 1
    max_val = max(exp.max(), calc.max()) + 1
    ax.plot([min_val, max_val], [min_val, max_val], 'k--', alpha=0.5)

    ax.set_title(f"Nucleus: {nuc} | RMSD: {rmsd:.3f} ppm", fontsize=12, fontweight='bold')
    ax.set_xlabel("Experimental (ppm)")
    ax.set_ylabel("synth-pdb Calculated (ppm)")
    ax.grid(alpha=0.3)
    ax.set_aspect('equal')

plt.tight_layout()
plt.show()

print("\n🏆 SCIENTIFIC CONCLUSION:")
print("=" * 60)
print("Our synthetic empirical predictor demonstrates highly accurate alignment with real physics:")
for nuc, err in rmsds.items():
    print(f" - {nuc:2s} RMSD: {err:.3f} ppm")

print("\nThese RMSD metrics align with the theoretical performance limits for sequence+structure ")
print("based empirical prediction published in the SPARTA and SHIFTX literatures.")
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()

nuclei = ['C', 'CA', 'CB', 'N', 'H', 'HA']
rmsds = {}

for i, nuc in enumerate(nuclei):
    ax = axes[i]
    df_nuc = df_clean[df_clean['atom'] == nuc]
    if df_nuc.empty:
        continue

    exp = df_nuc['exp'].values
    calc = df_nuc['calc'].values

    # Calculate Error Metrics
    rmsd = np.sqrt(np.mean((exp - calc) ** 2))
    r_val, _ = pearsonr(exp, calc)
    rmsds[nuc] = rmsd

    # Scatter Plot
    color = 'tab:blue' if 'C' in nuc else ('tab:green' if 'N' in nuc else 'tab:orange')
    ax.scatter(exp, calc, color=color, alpha=0.7, edgecolor='white', s=60)

    # Regression line parity
    min_val = min(exp.min(), calc.min()) - 1
    max_val = max(exp.max(), calc.max()) + 1
    ax.plot([min_val, max_val], [min_val, max_val], 'k--', alpha=0.5)

    ax.set_title(f"Nucleus: {nuc} | RMSD: {rmsd:.3f} ppm", fontsize=12, fontweight='bold')
    ax.set_xlabel("Experimental (ppm)")
    ax.set_ylabel("synth-pdb Calculated (ppm)")
    ax.grid(alpha=0.3)
    ax.set_aspect('equal')

plt.tight_layout()
plt.show()

print("\n🏆 SCIENTIFIC CONCLUSION:")
print("=" * 60)
print("Our synthetic empirical predictor demonstrates highly accurate alignment with real physics:")
for nuc, err in rmsds.items():
    print(f" - {nuc:2s} RMSD: {err:.3f} ppm")

print("\nThese RMSD metrics align with the theoretical performance limits for sequence+structure ")
print("based empirical prediction published in the SPARTA and SHIFTX literatures.")

🔬 Validating Synthetic Chemical Shifts against Experimental NMR Data¶

Using Ubiquitin (PDB: 1D3Z) and BMRB Assignments (Entry: 6457)¶

🎯 Academic Context & Expected Discrepancies¶

Typical State-of-the-Art RMSD Errors:¶

1. Downloading the Experimental Data¶

2. Parsing the Final Experimental Chemical Shifts (BMRB 6457)¶

3. Predicting Synthetic Chemical Shifts via synth-pdb¶

4. Visualization & Statistical Validation (RMSD)¶

3. Predicting Synthetic Chemical Shifts via `synth-pdb`¶