ensemble Module

The synth_pdb.ensemble subpackage provides tools for quantitative analysis of NMR structure bundles — the multi-model ensembles produced by NMR structure calculation software (CYANA, XPLOR-NIH, CNS) that represent the conformational space consistent with experimental restraints.

from synth_pdb.ensemble import DAOPCalculator, EnsembleStatistics

[!NOTE] These tools mirror the analysis performed by PDBStat (Tejero et al. 2013) and are validated against the published NMR benchmarks in the tests/test_scientific_validation.py suite.

Background

An NMR structure ensemble is not a single model but a bundle of conformers — typically 10–40 structures — each consistent with the experimental data (NOE distances, dihedral restraints, RDCs). The spread of the bundle is the key indicator of precision:

Tight bundle → well-determined, rigid region
Loose bundle → flexible, disordered, or poorly-restrained region

Two complementary metrics characterise ensemble quality:

Metric	What it measures
RMSD	Pairwise coordinate deviation (Å) — precision of atomic positions
DAOP	Dihedral Angle Order Parameter — backbone angular consistency (0–1)

`DAOPCalculator`

Calculator for Dihedral Angle Order Parameters (DAOP).

The DAOP quantifies the consistency of a dihedral angle across an NMR ensemble. Values range from 0 (completely disordered) to 1 (perfectly ordered). Typical NMR convention: S >= 0.9 means the residue is well-defined (corresponds to <~ +/-24deg circular standard deviation).

All methods are static - no instantiation required.

Examples:

>>> import numpy as np
>>> from synth_pdb.ensemble.daop import DAOPCalculator

>>> # Perfectly ordered angles (all -60deg)
>>> angles = np.full(20, -np.pi / 3)
>>> DAOPCalculator.calculate_order_parameter(angles)
1.0

>>> # Find which residues are well-defined across an ensemble
>>> phi = np.random.normal(-np.pi / 3, np.pi / 36, (50, 20))
>>> psi = np.random.normal( np.pi / 3, np.pi / 36, (50, 20))
>>> mask = DAOPCalculator.find_well_defined_residues(phi, psi)
>>> mask.all()
True

Functions

`find_well_defined_residues(phi_angles, psi_angles, threshold=1.8)` `staticmethod`

Identify well-defined residues using the PDBStat S(phi) + S(psi) criterion.

A residue is considered well-defined when:

S(phi_i) + S(psi_i) >= threshold

The default threshold of 1.8 follows the PDBStat convention and corresponds to both angles having S >= 0.9, i.e. a circular standard deviation of <~ +/-24deg for each.

Parameters:

Name	Type	Description	Default
`phi_angles`	`NDArray[float64]`	Array of shape (n_residues, n_models) with phi angles in radians.	required
`psi_angles`	`NDArray[float64]`	Array of shape (n_residues, n_models) with psi angles in radians.	required
`threshold`	`float`	Sum-of-order-parameters cutoff. Default 1.8 (PDBStat convention).	`1.8`

Returns:

Type	Description
`NDArray[bool_]`	Boolean array of shape (n_residues,). `True` entries are
`NDArray[bool_]`	well-defined residues.

Examples:

>>> import numpy as np
>>> from synth_pdb.ensemble.daop import DAOPCalculator

>>> # Well-ordered ensemble
>>> rng = np.random.default_rng(42)
>>> phi = rng.normal(-np.pi / 3, np.pi / 36, (10, 20))
>>> psi = rng.normal( np.pi / 3, np.pi / 36, (10, 20))
>>> mask = DAOPCalculator.find_well_defined_residues(phi, psi)
>>> mask.all()
True

What is DAOP?

The Dihedral Angle Order Parameter (Hyberts et al. 1992) measures the circular variance of backbone dihedral angles (φ, ψ) across all models in an ensemble:

\[S(\alpha) = \left| \frac{1}{N} \sum_{k=1}^{N} e^{i\alpha_k} \right|\]

S = 1.0 → all models have identical dihedral angles (perfectly rigid)
S = 0.0 → angles are uniformly distributed (completely disordered)

Well-Defined Residues (PDBStat Convention)

A residue is considered well-defined if it satisfies:

\[S(\phi) + S(\psi) \geq 1.8\]

This is the threshold used by PDBStat and the BMRB deposition system. Well-defined residues form the core of the folded structure and are used for global RMSD calculations.

Usage

from synth_pdb.ensemble import DAOPCalculator

# coords_list: List[np.ndarray] — one (N_atoms, 3) array per model
# residue_ids: List[int] — residue numbers matching the atom arrays
calc = DAOPCalculator(coords_list, residue_ids)

daop_per_residue = calc.calculate()
# Returns: Dict[int, Dict[str, float]]
# {residue_id: {"phi": S_phi, "psi": S_psi, "combined": S_phi + S_psi}}

well_defined = calc.find_well_defined_residues(threshold=1.8)
# Returns: List[int] — residue IDs where S(φ)+S(ψ) ≥ threshold
print(f"Well-defined residues: {well_defined}")

`EnsembleStatistics`

`EnsembleStatistics` `dataclass`

Comprehensive, typed container for NMR ensemble quality statistics.

Replaces the plain Dict[str, float] pattern with a type-safe dataclass that adds computed properties, to_dict/from_dict round-trips, and a human-readable __str__.

Attributes:

Name	Type	Description
`n_models`	`int`	Number of models in the ensemble.
`n_residues`	`int`	Number of residues per model.
`mean_pairwise_rmsd`	`float`	Mean of all pairwise RMSDs (A).
`median_pairwise_rmsd`	`float`	Median of all pairwise RMSDs (A).
`std_pairwise_rmsd`	`float`	Standard deviation of pairwise RMSDs (A).
`min_pairwise_rmsd`	`float`	Minimum pairwise RMSD (A).
`max_pairwise_rmsd`	`float`	Maximum pairwise RMSD (A).
`medoid_index`	`int`	0-based index of the medoid model.
`medoid_mean_rmsd`	`float`	Mean RMSD of the medoid to all other models (A).
`rmsd_to_mean`	`float`	Mean RMSD-to-ensemble-mean (A). Primary precision indicator; thresholds from Tejero et al. 2013.
`mean_rmsf`	`float`	Mean per-residue RMSF (A).
`max_rmsf`	`float`	Maximum per-residue RMSF (A).
`well_defined_residues`	`int`	Count of residues with RMSF < 1.0 A.
`pct_well_defined`	`float`	Percentage of well-defined residues (0-100).

Examples:

>>> from synth_pdb.ensemble.statistics import EnsembleStatistics
>>> stats = EnsembleStatistics(
...     n_models=20, n_residues=91,
...     mean_pairwise_rmsd=0.85, rmsd_to_mean=0.72,
...     pct_well_defined=88.0,
... )
>>> stats.precision
'HIGH'
>>> stats.overall_quality
'Excellent quality NMR ensemble'
>>> stats.is_ensemble
True

Attributes

`is_single_model` `property`

Return True if the ensemble contains exactly one model.

`is_ensemble` `property`

Return True if the ensemble contains more than one model.

`precision` `property`

Precision tier based on RMSD-to-mean (Tejero et al. 2013).

Returns:

Type	Description
`str`	`"HIGH"` if `rmsd_to_mean < 1.0` A,
`str`	`"GOOD"` if `rmsd_to_mean < 2.0` A,
`str`	`"MODERATE"` otherwise.

`overall_quality` `property`

Human-readable overall quality assessment.

Combines :attr:precision with the percentage of well-defined residues to produce a single summary string.

Returns:

Type	Description
`str`	One of four quality strings ranging from
`str`	`"Excellent quality NMR ensemble"` to
`str`	`"Ensemble may need refinement"`.

Functions

`to_dict()`

Serialise all fields to a plain dictionary.

Useful for JSON export, backward compatibility with code that previously consumed Dict[str, float] from ensemble analysers, and for persistence.

Returns:

Type	Description
`dict[str, Any]`	Dictionary containing every field of this dataclass.

Examples:

>>> stats = EnsembleStatistics(n_models=5, n_residues=30)
>>> d = stats.to_dict()
>>> d["n_models"]
5
>>> EnsembleStatistics.from_dict(d) == stats
True

`from_dict(data)` `classmethod`

Construct an instance from a plain dictionary.

Extra keys in data are silently ignored, making this robust to version skew when deserialising persisted results.

Parameters:

Name	Type	Description	Default
`data`	`dict[str, Any]`	Dictionary produced by :meth:`to_dict` (or any compatible mapping). Must contain at least `n_models` and `n_residues`.	required

Returns:

Name	Type	Description
`New`	`EnsembleStatistics`	class:`EnsembleStatistics` instance.

Raises:

Type	Description
`KeyError`	If `n_models` or `n_residues` are absent.

Examples:

>>> d = {"n_models": 20, "n_residues": 91, "rmsd_to_mean": 0.68}
>>> stats = EnsembleStatistics.from_dict(d)
>>> stats.precision
'HIGH'

Usage

from synth_pdb.ensemble import EnsembleStatistics

stats = EnsembleStatistics(coords_list, residue_ids=residue_ids)

# Pairwise RMSD matrix (Å)
rmsd_matrix = stats.pairwise_rmsd          # np.ndarray (N_models × N_models)
print(f"Mean pairwise RMSD: {rmsd_matrix.mean():.2f} Å")

# RMSF per residue (Å) — after Kabsch alignment
rmsf = stats.rmsf                          # np.ndarray (N_residues,)

# Representative structure (medoid = lowest mean RMSD to all others)
medoid_idx = stats.medoid_index
print(f"Representative model: #{medoid_idx + 1}")

# Well-defined residues
print(f"Well-defined: {stats.well_defined_residues}")

# Overall quality assessment
qa = stats.quality_assessment              # QualityAssessment dataclass
print(f"Quality: {qa.overall_quality}")

`QualityAssessment`

`QualityAssessment` `dataclass`

Interpretive quality assessment for an NMR ensemble.

A lightweight companion to :class:EnsembleStatistics that carries the human-readable precision tier and overall quality string.

Attributes:

Name	Type	Description
`precision`	`str`	Precision tier - `"HIGH"`, `"GOOD"`, `"MODERATE"`, or `"UNKNOWN"`.
`overall`	`str`	Free-text overall quality description.

Examples:

>>> from synth_pdb.ensemble.statistics import QualityAssessment
>>> qa = QualityAssessment(precision="HIGH",
...                        overall="Excellent quality NMR ensemble")
>>> qa.to_dict()
{'precision': 'HIGH', 'overall': 'Excellent quality NMR ensemble'}

Functions

`to_dict()`

Serialise to a plain dictionary.

Returns:

Type	Description
`dict[str, str]`	`{"precision": ..., "overall": ...}`

`from_dict(data)` `classmethod`

Construct from a plain dictionary.

Parameters:

Name	Type	Description	Default
`data`	`dict[str, str]`	Must contain `"precision"` and `"overall"` keys.	required

Returns:

Name	Type	Description
`New`	`QualityAssessment`	class:`QualityAssessment` instance.

Raises:

Type	Description
`KeyError`	If required keys are absent.

Quality Thresholds (Tejero et al. 2013)

Field	Excellent	Acceptable	Poor
`mean_pairwise_rmsd_backbone`	< 0.5 Å	0.5–1.5 Å	> 1.5 Å
`mean_pairwise_rmsd_heavy`	< 1.0 Å	1.0–2.5 Å	> 2.5 Å
`fraction_well_defined`	> 0.85	0.60–0.85	< 0.60
`overall_quality`	`"excellent"`	`"acceptable"`	`"poor"`

Complete Example: Evaluating a Synthetic NMR Ensemble

import numpy as np
from synth_pdb import generate_structure
from synth_pdb.ensemble import DAOPCalculator, EnsembleStatistics
from synth_pdb.geometry.superposition import find_medoid

# --- 1. Generate a 20-model synthetic ensemble ---
sequence = "LKELEKELEKELEKELEKELEKEL"
models = []
for _ in range(20):
    pdb_text = generate_structure(sequence=sequence, conformation="alpha")
    # Parse backbone Cα coordinates (simplified — use biotite in practice)
    # coords shape: (N_residues, 3)
    coords = parse_ca_coords(pdb_text)   # your parser here
    models.append(coords)

residue_ids = list(range(1, len(models[0]) + 1))

# --- 2. Dihedral Angle Order Parameters ---
daop = DAOPCalculator(models, residue_ids)
daop_scores = daop.calculate()
well_def = daop.find_well_defined_residues(threshold=1.8)
print(f"Well-defined residues: {well_def}")

# --- 3. Full ensemble statistics ---
stats = EnsembleStatistics(models, residue_ids=residue_ids)
print(f"Backbone RMSD:  {stats.quality_assessment.mean_pairwise_rmsd_backbone:.2f} Å")
print(f"Heavy-atom RMSD: {stats.quality_assessment.mean_pairwise_rmsd_heavy:.2f} Å")
print(f"Overall quality: {stats.quality_assessment.overall_quality}")
print(f"Representative model: #{stats.medoid_index + 1}")

References

Hyberts, S. G. et al. (1992). The solution structure of eglin c based on measurements of many NOEs and coupling constants. Protein Science, 1(6), 736–751. (Introduces DAOP)
Tejero, R. et al. (2013). PDBStat: a universal restraint converter and restraint quality analyzer for protein NMR structures. J. Biomol. NMR, 56(4), 337–351. (Defines quality thresholds)

ensemble Module

Background

DAOPCalculator

DAOPCalculator

Functions

find_well_defined_residues(phi_angles, psi_angles, threshold=1.8) staticmethod

What is DAOP?

Well-Defined Residues (PDBStat Convention)

Usage

EnsembleStatistics

EnsembleStatistics dataclass

Attributes

is_single_model property

is_ensemble property

precision property

overall_quality property

Functions

to_dict()

from_dict(data) classmethod

Usage

QualityAssessment

QualityAssessment dataclass

Functions

to_dict()

from_dict(data) classmethod

Quality Thresholds (Tejero et al. 2013)

Complete Example: Evaluating a Synthetic NMR Ensemble

References

See Also

`DAOPCalculator`

`DAOPCalculator`

`find_well_defined_residues(phi_angles, psi_angles, threshold=1.8)` `staticmethod`

`EnsembleStatistics`

`EnsembleStatistics` `dataclass`

`is_single_model` `property`

`is_ensemble` `property`

`precision` `property`

`overall_quality` `property`

`to_dict()`

`from_dict(data)` `classmethod`

`QualityAssessment`

`QualityAssessment` `dataclass`

`to_dict()`

`from_dict(data)` `classmethod`