Skip to content

ensemble Module

The synth_pdb.ensemble subpackage provides tools for quantitative analysis of NMR structure bundles — the multi-model ensembles produced by NMR structure calculation software (CYANA, XPLOR-NIH, CNS) that represent the conformational space consistent with experimental restraints.

from synth_pdb.ensemble import DAOPCalculator, EnsembleStatistics

[!NOTE] These tools mirror the analysis performed by PDBStat (Tejero et al. 2013) and are validated against the published NMR benchmarks in the tests/test_scientific_validation.py suite.


Background

An NMR structure ensemble is not a single model but a bundle of conformers — typically 10–40 structures — each consistent with the experimental data (NOE distances, dihedral restraints, RDCs). The spread of the bundle is the key indicator of precision:

  • Tight bundle → well-determined, rigid region
  • Loose bundle → flexible, disordered, or poorly-restrained region

Two complementary metrics characterise ensemble quality:

Metric What it measures
RMSD Pairwise coordinate deviation (Å) — precision of atomic positions
DAOP Dihedral Angle Order Parameter — backbone angular consistency (0–1)

DAOPCalculator

DAOPCalculator

Calculator for Dihedral Angle Order Parameters (DAOP).

The DAOP quantifies the consistency of a dihedral angle across an NMR ensemble. Values range from 0 (completely disordered) to 1 (perfectly ordered). Typical NMR convention: S >= 0.9 means the residue is well-defined (corresponds to <~ +/-24deg circular standard deviation).

All methods are static - no instantiation required.

Examples:

>>> import numpy as np
>>> from synth_pdb.ensemble.daop import DAOPCalculator
>>> # Perfectly ordered angles (all -60deg)
>>> angles = np.full(20, -np.pi / 3)
>>> DAOPCalculator.calculate_order_parameter(angles)
1.0
>>> # Find which residues are well-defined across an ensemble
>>> phi = np.random.normal(-np.pi / 3, np.pi / 36, (50, 20))
>>> psi = np.random.normal( np.pi / 3, np.pi / 36, (50, 20))
>>> mask = DAOPCalculator.find_well_defined_residues(phi, psi)
>>> mask.all()
True

Functions

find_well_defined_residues(phi_angles, psi_angles, threshold=1.8) staticmethod

Identify well-defined residues using the PDBStat S(phi) + S(psi) criterion.

A residue is considered well-defined when:

S(phi_i) + S(psi_i) >= threshold

The default threshold of 1.8 follows the PDBStat convention and corresponds to both angles having S >= 0.9, i.e. a circular standard deviation of <~ +/-24deg for each.

Parameters:

Name Type Description Default
phi_angles NDArray[float64]

Array of shape (n_residues, n_models) with phi angles in radians.

required
psi_angles NDArray[float64]

Array of shape (n_residues, n_models) with psi angles in radians.

required
threshold float

Sum-of-order-parameters cutoff. Default 1.8 (PDBStat convention).

1.8

Returns:

Type Description
NDArray[bool_]

Boolean array of shape (n_residues,). True entries are

NDArray[bool_]

well-defined residues.

Examples:

>>> import numpy as np
>>> from synth_pdb.ensemble.daop import DAOPCalculator
>>> # Well-ordered ensemble
>>> rng = np.random.default_rng(42)
>>> phi = rng.normal(-np.pi / 3, np.pi / 36, (10, 20))
>>> psi = rng.normal( np.pi / 3, np.pi / 36, (10, 20))
>>> mask = DAOPCalculator.find_well_defined_residues(phi, psi)
>>> mask.all()
True

What is DAOP?

The Dihedral Angle Order Parameter (Hyberts et al. 1992) measures the circular variance of backbone dihedral angles (φ, ψ) across all models in an ensemble:

\[S(\alpha) = \left| \frac{1}{N} \sum_{k=1}^{N} e^{i\alpha_k} \right|\]
  • S = 1.0 → all models have identical dihedral angles (perfectly rigid)
  • S = 0.0 → angles are uniformly distributed (completely disordered)

Well-Defined Residues (PDBStat Convention)

A residue is considered well-defined if it satisfies:

\[S(\phi) + S(\psi) \geq 1.8\]

This is the threshold used by PDBStat and the BMRB deposition system. Well-defined residues form the core of the folded structure and are used for global RMSD calculations.

Usage

from synth_pdb.ensemble import DAOPCalculator

# coords_list: List[np.ndarray] — one (N_atoms, 3) array per model
# residue_ids: List[int] — residue numbers matching the atom arrays
calc = DAOPCalculator(coords_list, residue_ids)

daop_per_residue = calc.calculate()
# Returns: Dict[int, Dict[str, float]]
# {residue_id: {"phi": S_phi, "psi": S_psi, "combined": S_phi + S_psi}}

well_defined = calc.find_well_defined_residues(threshold=1.8)
# Returns: List[int] — residue IDs where S(φ)+S(ψ) ≥ threshold
print(f"Well-defined residues: {well_defined}")

EnsembleStatistics

EnsembleStatistics dataclass

Comprehensive, typed container for NMR ensemble quality statistics.

Replaces the plain Dict[str, float] pattern with a type-safe dataclass that adds computed properties, to_dict/from_dict round-trips, and a human-readable __str__.

Attributes:

Name Type Description
n_models int

Number of models in the ensemble.

n_residues int

Number of residues per model.

mean_pairwise_rmsd float

Mean of all pairwise RMSDs (A).

median_pairwise_rmsd float

Median of all pairwise RMSDs (A).

std_pairwise_rmsd float

Standard deviation of pairwise RMSDs (A).

min_pairwise_rmsd float

Minimum pairwise RMSD (A).

max_pairwise_rmsd float

Maximum pairwise RMSD (A).

medoid_index int

0-based index of the medoid model.

medoid_mean_rmsd float

Mean RMSD of the medoid to all other models (A).

rmsd_to_mean float

Mean RMSD-to-ensemble-mean (A). Primary precision indicator; thresholds from Tejero et al. 2013.

mean_rmsf float

Mean per-residue RMSF (A).

max_rmsf float

Maximum per-residue RMSF (A).

well_defined_residues int

Count of residues with RMSF < 1.0 A.

pct_well_defined float

Percentage of well-defined residues (0-100).

Examples:

>>> from synth_pdb.ensemble.statistics import EnsembleStatistics
>>> stats = EnsembleStatistics(
...     n_models=20, n_residues=91,
...     mean_pairwise_rmsd=0.85, rmsd_to_mean=0.72,
...     pct_well_defined=88.0,
... )
>>> stats.precision
'HIGH'
>>> stats.overall_quality
'Excellent quality NMR ensemble'
>>> stats.is_ensemble
True

Attributes

is_single_model property

Return True if the ensemble contains exactly one model.

is_ensemble property

Return True if the ensemble contains more than one model.

precision property

Precision tier based on RMSD-to-mean (Tejero et al. 2013).

Returns:

Type Description
str

"HIGH" if rmsd_to_mean < 1.0 A,

str

"GOOD" if rmsd_to_mean < 2.0 A,

str

"MODERATE" otherwise.

overall_quality property

Human-readable overall quality assessment.

Combines :attr:precision with the percentage of well-defined residues to produce a single summary string.

Returns:

Type Description
str

One of four quality strings ranging from

str

"Excellent quality NMR ensemble" to

str

"Ensemble may need refinement".

Functions

to_dict()

Serialise all fields to a plain dictionary.

Useful for JSON export, backward compatibility with code that previously consumed Dict[str, float] from ensemble analysers, and for persistence.

Returns:

Type Description
dict[str, Any]

Dictionary containing every field of this dataclass.

Examples:

>>> stats = EnsembleStatistics(n_models=5, n_residues=30)
>>> d = stats.to_dict()
>>> d["n_models"]
5
>>> EnsembleStatistics.from_dict(d) == stats
True

from_dict(data) classmethod

Construct an instance from a plain dictionary.

Extra keys in data are silently ignored, making this robust to version skew when deserialising persisted results.

Parameters:

Name Type Description Default
data dict[str, Any]

Dictionary produced by :meth:to_dict (or any compatible mapping). Must contain at least n_models and n_residues.

required

Returns:

Name Type Description
New EnsembleStatistics

class:EnsembleStatistics instance.

Raises:

Type Description
KeyError

If n_models or n_residues are absent.

Examples:

>>> d = {"n_models": 20, "n_residues": 91, "rmsd_to_mean": 0.68}
>>> stats = EnsembleStatistics.from_dict(d)
>>> stats.precision
'HIGH'

Usage

from synth_pdb.ensemble import EnsembleStatistics

stats = EnsembleStatistics(coords_list, residue_ids=residue_ids)

# Pairwise RMSD matrix (Å)
rmsd_matrix = stats.pairwise_rmsd          # np.ndarray (N_models × N_models)
print(f"Mean pairwise RMSD: {rmsd_matrix.mean():.2f} Å")

# RMSF per residue (Å) — after Kabsch alignment
rmsf = stats.rmsf                          # np.ndarray (N_residues,)

# Representative structure (medoid = lowest mean RMSD to all others)
medoid_idx = stats.medoid_index
print(f"Representative model: #{medoid_idx + 1}")

# Well-defined residues
print(f"Well-defined: {stats.well_defined_residues}")

# Overall quality assessment
qa = stats.quality_assessment              # QualityAssessment dataclass
print(f"Quality: {qa.overall_quality}")

QualityAssessment

QualityAssessment dataclass

Interpretive quality assessment for an NMR ensemble.

A lightweight companion to :class:EnsembleStatistics that carries the human-readable precision tier and overall quality string.

Attributes:

Name Type Description
precision str

Precision tier - "HIGH", "GOOD", "MODERATE", or "UNKNOWN".

overall str

Free-text overall quality description.

Examples:

>>> from synth_pdb.ensemble.statistics import QualityAssessment
>>> qa = QualityAssessment(precision="HIGH",
...                        overall="Excellent quality NMR ensemble")
>>> qa.to_dict()
{'precision': 'HIGH', 'overall': 'Excellent quality NMR ensemble'}

Functions

to_dict()

Serialise to a plain dictionary.

Returns:

Type Description
dict[str, str]

{"precision": ..., "overall": ...}

from_dict(data) classmethod

Construct from a plain dictionary.

Parameters:

Name Type Description Default
data dict[str, str]

Must contain "precision" and "overall" keys.

required

Returns:

Name Type Description
New QualityAssessment

class:QualityAssessment instance.

Raises:

Type Description
KeyError

If required keys are absent.

Quality Thresholds (Tejero et al. 2013)

Field Excellent Acceptable Poor
mean_pairwise_rmsd_backbone < 0.5 Å 0.5–1.5 Å > 1.5 Å
mean_pairwise_rmsd_heavy < 1.0 Å 1.0–2.5 Å > 2.5 Å
fraction_well_defined > 0.85 0.60–0.85 < 0.60
overall_quality "excellent" "acceptable" "poor"

Complete Example: Evaluating a Synthetic NMR Ensemble

import numpy as np
from synth_pdb import generate_structure
from synth_pdb.ensemble import DAOPCalculator, EnsembleStatistics
from synth_pdb.geometry.superposition import find_medoid

# --- 1. Generate a 20-model synthetic ensemble ---
sequence = "LKELEKELEKELEKELEKELEKEL"
models = []
for _ in range(20):
    pdb_text = generate_structure(sequence=sequence, conformation="alpha")
    # Parse backbone Cα coordinates (simplified — use biotite in practice)
    # coords shape: (N_residues, 3)
    coords = parse_ca_coords(pdb_text)   # your parser here
    models.append(coords)

residue_ids = list(range(1, len(models[0]) + 1))

# --- 2. Dihedral Angle Order Parameters ---
daop = DAOPCalculator(models, residue_ids)
daop_scores = daop.calculate()
well_def = daop.find_well_defined_residues(threshold=1.8)
print(f"Well-defined residues: {well_def}")

# --- 3. Full ensemble statistics ---
stats = EnsembleStatistics(models, residue_ids=residue_ids)
print(f"Backbone RMSD:  {stats.quality_assessment.mean_pairwise_rmsd_backbone:.2f} Å")
print(f"Heavy-atom RMSD: {stats.quality_assessment.mean_pairwise_rmsd_heavy:.2f} Å")
print(f"Overall quality: {stats.quality_assessment.overall_quality}")
print(f"Representative model: #{stats.medoid_index + 1}")

References

  1. Hyberts, S. G. et al. (1992). The solution structure of eglin c based on measurements of many NOEs and coupling constants. Protein Science, 1(6), 736–751. (Introduces DAOP)
  2. Tejero, R. et al. (2013). PDBStat: a universal restraint converter and restraint quality analyzer for protein NMR structures. J. Biomol. NMR, 56(4), 337–351. (Defines quality thresholds)

See Also