Skip to content

benchmark — Structure Prediction Benchmarking

The synth_pdb.benchmark and synth_pdb.benchmark_metrics modules provide a complete suite for evaluating AI structure prediction models (AlphaFold, ESMFold, RoseTTAFold, etc.) against ground-truth synthetic structures.

The key insight

Because synth-pdb controls the ground truth, the benchmark is perfectly objective: there is no ambiguity about which experimental structure is "correct" or whether the reference contains modelling errors. This makes it ideal for blind comparison of structure prediction models.


Quick Start

from synth_pdb.benchmark import run_benchmark

# Score 20 structures predicted by ESMFold
results = run_benchmark(n_structures=20, predictor="esmfold")

# Print formatted summary
print(results.summary())

# Export to CSV for further analysis
results.to_csv("benchmark_results.csv")

Or from the command line:

# Install dependencies
pip install synth-pdb[gnn] transformers accelerate

# Run benchmark (downloads ESMFold ~700 MB on first use)
python scripts/run_benchmark.py --n-structures 20 --output results.csv

run_benchmark()

def run_benchmark(
    n_structures: int = 20,
    lengths: list[int] | None = None,
    conformations: list[str] | None = None,
    predictor: str | Callable[[str], str] = "esmfold",
    *,
    compute_shifts: bool = True,
    compute_gnn: bool = True,
    random_state: int = 42,
) -> BenchmarkResults

Generate synthetic ground-truth structures, fold them from sequence using a structure predictor, then evaluate the predictions against the ground truth using a comprehensive set of structural metrics.

Parameters

Parameter Type Default Description
n_structures int 20 Number of test structures to generate and evaluate.
lengths list[int] [20, 30, 50] Pool of chain lengths to sample from uniformly.
conformations list[str] ["alpha", "beta"] Pool of secondary structure types. Options: "alpha", "beta", "random".
predictor str or Callable "esmfold" "esmfold" for the built-in ESMFold backend, or any predictor_fn(sequence: str) → pdb_str callable.
compute_shifts bool True Compute NMR chemical shift RMSD (requires synth_pdb.chemical_shifts).
compute_gnn bool True Score both structures with the GNN pLDDT classifier.
random_state int 42 RNG seed for reproducibility.

Returns

A BenchmarkResults object.

Using a Custom Predictor

def my_predictor(sequence: str) -> str:
    """Return a PDB string for the given amino acid sequence."""
    # ... call your model here ...
    return pdb_string

results = run_benchmark(n_structures=10, predictor=my_predictor)

This accepts any callable — ColabFold, OmegaFold, a structure database lookup, even a simple homology modelling pipeline.


BenchmarkResults

@dataclass
class BenchmarkResults:
    results:      list[StructureResult]
    predictor:    str
    n_structures: int
    n_success:    int

Methods

summary() → str

Returns a formatted multi-line summary report:

━━ Benchmark: ESMFold (18/20 structures) ━━
  TM-score   mean=0.723  std=0.142  min=0.421  max=0.891
  GDT-TS     mean=0.681  std=0.159
  lDDT       mean=0.714  std=0.128
  Cα-RMSD    mean=2.84 Å  std=1.21 Å
  Shift RMSD mean=0.412 ppm  std=0.193 ppm
  GNN pLDDT  mean=0.834 (predicted structures)

  Structures with TM-score > 0.5 (same fold): 16/18 (89%)

to_csv(path: str) → None

Write full per-structure results (all 12 fields) to a CSV file.

to_dataframe() → pd.DataFrame

Return results as a pandas DataFrame (requires pandas).


StructureResult

Per-structure result returned in BenchmarkResults.results.

Field Type Description
sequence str Amino acid sequence (single-letter code).
length int Number of residues.
conformation str Ground-truth secondary structure type.
tm_score float TM-score ∈ [0, 1]. Values > 0.5 indicate the same fold.
gdt_ts float GDT-TS ∈ [0, 1]. CASP standard metric.
lddt_mean float Mean per-residue lDDT ∈ [0, 1].
rmsd float Cα-RMSD in Å after Kabsch superposition.
shift_rmsd float Weighted chemical shift RMSD in ppm. NaN if unavailable.
gnn_score_ref float GNN global quality score for the ground-truth structure.
gnn_score_pred float GNN global quality score for the predicted structure.
predictor_time_s float Wall-clock inference time in seconds.
error str Non-empty string if prediction failed; other fields are NaN.

Benchmark Metrics Reference

All metric functions live in synth_pdb.benchmark_metrics and operate on numpy arrays with no additional dependencies.

tm_score(ca_pred, ca_ref)

def tm_score(
    ca_pred: np.ndarray,   # [N, 3] predicted Cα coordinates
    ca_ref:  np.ndarray,   # [N, 3] reference Cα coordinates
    *,
    normalise_by: int | None = None,
) -> float

Compute TM-score. Returns a value in (0, 1]. Two unrelated structures score ≈ 0.17; two structures with the same fold score > 0.5.

For the mathematical definition and interpretation, see the scientific background page.

lddt(ca_pred, ca_ref)

def lddt(
    ca_pred:          np.ndarray,             # [N, 3]
    ca_ref:           np.ndarray,             # [N, 3]
    *,
    inclusion_radius: float = 15.0,           # Å
    thresholds:       tuple = (0.5, 1, 2, 4), # Å
) -> np.ndarray  # [N] per-residue lDDT ∈ [0, 1]

Per-residue lDDT. Does not require superposition. The global lDDT is float(np.mean(lddt(...))).

gdt_ts(ca_pred, ca_ref)

def gdt_ts(
    ca_pred: np.ndarray,             # [N, 3]
    ca_ref:  np.ndarray,             # [N, 3]
    *,
    cutoffs: tuple = (1.0, 2.0, 4.0, 8.0),  # Å
) -> float

GDT-TS — average fraction of Cα atoms within {1, 2, 4, 8} Å after superposition.

superpose_kabsch(mobile, reference)

def superpose_kabsch(
    mobile:    np.ndarray,  # [N, 3]
    reference: np.ndarray,  # [N, 3]
) -> tuple[np.ndarray, float]  # (rotated_coords, rmsd)

Optimally superpose mobile onto reference using the Kabsch algorithm (SVD-based). Returns the rotated coordinate array and the Cα-RMSD in Å.

shift_rmsd(pred_shifts, ref_shifts)

def shift_rmsd(
    pred_shifts:      dict[str, np.ndarray],  # nucleus → per-residue shifts
    ref_shifts:       dict[str, np.ndarray],
    *,
    nucleus_weights:  dict[str, float] | None = None,
) -> float  # weighted shift RMSD in ppm

Weighted chemical shift RMSD following SPARTA+ nucleus weights (H=1.0, C=0.25, N=0.1). NaN residues (missing assignments) are automatically excluded.

from synth_pdb.benchmark_metrics import shift_rmsd
import numpy as np

# Compare predicted and reference ¹H shifts for 10 residues
rmsd = shift_rmsd(
    {"H": np.array([8.1, 8.2, 8.3, 8.0, 7.9, 8.4, 8.1, 8.2, 8.0, 7.8])},
    {"H": np.array([8.0, 8.1, 8.4, 8.0, 7.8, 8.3, 8.2, 8.1, 8.0, 7.9])},
)
print(f"¹H shift RMSD: {rmsd:.4f} ppm")

extract_ca_coords(pdb_content)

def extract_ca_coords(pdb_content: str) -> np.ndarray  # [N, 3]

Lightweight, pure-Python PDB parser that extracts Cα coordinates in residue order. Handles duplicate residue numbers (keeps first occurrence per chain/residue pair).

from synth_pdb.benchmark_metrics import extract_ca_coords, tm_score

ca_ref  = extract_ca_coords(open("reference.pdb").read())
ca_pred = extract_ca_coords(open("predicted.pdb").read())

n = min(len(ca_ref), len(ca_pred))
score = tm_score(ca_pred[:n], ca_ref[:n])
print(f"TM-score: {score:.3f}")

CLI Reference

python scripts/run_benchmark.py [OPTIONS]

Options:
  --predictor {esmfold}       Structure prediction backend (default: esmfold)
  --n-structures INT          Number of test structures (default: 20)
  --lengths L [L ...]         Chain lengths to sample (default: 20 30 50)
  --conformations {alpha,beta,random} [...]
                              Secondary structure types (default: alpha beta)
  --output PATH               Save CSV results to this path
  --no-shifts                 Skip chemical shift RMSD
  --no-gnn                    Skip GNN quality scoring
  --random-state INT          RNG seed (default: 42)
  -v, --verbose               Enable DEBUG logging

Example Runs

# Full benchmark with all metrics
python scripts/run_benchmark.py \
    --n-structures 50 \
    --lengths 20 30 50 \
    --output results/esmfold_benchmark.csv

# Fast geometry-only benchmark
python scripts/run_benchmark.py \
    --n-structures 100 \
    --no-shifts --no-gnn \
    --output results/fast_benchmark.csv

# Alpha-helix only
python scripts/run_benchmark.py \
    --conformations alpha \
    --n-structures 30 \
    --output results/helix_benchmark.csv

Full API Reference

benchmark

synth_pdb.benchmark. ~~~~~~~~~~~~~~~~~~~~~~ AlphaFold / ESMFold Benchmarking Suite for synth-pdb.

The benchmark asks a simple but powerful question

"Can AI structure prediction models predict synth-pdb synthetic structures?"

Because synth-pdb controls the ground-truth structure and its NMR chemical shifts, the benchmark is perfectly objective - no ambiguity about which experimental structure is "correct".


SCIENTIFIC RATIONALE - Why benchmark against synthetic structures?

Standard benchmarks for protein structure prediction (like CASP or CAMEO) rely on experimental structures from the PDB. While essential, these have noise: * Resolution limits and refinement errors. * Conformational ensemble averaging in crystals/cryo-EM. * Missing loops or disordered regions.

The synth-pdb benchmark provides a Perfect Control Group: 1. Objective Truth: We control every atomic coordinate exactly. 2. Zero Ambiguity: NMR chemical shifts are predicted directly from the ground truth, eliminating experimental measurement error. 3. Targeted Stress: We can generate structures that specifically challenge AI models (e.g., extremely long helices or unusual linkers).


METRICS EXPLAINED - The Structural Biology Toolkit

We use the same standards used to judge AlphaFold in CASP competitions:

  • TM-score (Template Modeling score): Measures global topology overlap. Ranges [0, 1]. >0.5 typically means the "same fold."
  • GDT-TS (Global Distance Test Total Score): Percentage of residues whose Calpha positions are within 1, 2, 4, or 8 A of the target.
  • lDDT (Local Distance Difference Test): Measures how well the local inter-atomic distances are preserved. Unlike RMSD, it doesn't require superposition, making it robust to hinge motions.

Quick start::

from synth_pdb.benchmark import run_benchmark

results = run_benchmark(n_structures=10, lengths=[20, 30])
print(results.summary())
results.to_csv("benchmark_results.csv")

Or with the CLI::

python scripts/run_benchmark.py --predictor esmfold --n-structures 20

Supported predictors

  • ESMFold (default): Meta's sequence-to-structure model via HuggingFace transformers library. No API key required. ~700 MB model download on first use. pip install transformers accelerate
  • Custom callable: Pass any predictor_fn(sequence: str) -> pdb_str for full flexibility with any model (ColabFold, OmegaFold, RoseTTAFold, etc.)

Metrics computed

For each generated structure:

  • tm_score TM-score between ground-truth and predicted Calpha trace
  • gdt_ts GDT-TS score
  • lddt Mean lDDT across all residues
  • rmsd Calpha-RMSD after Kabsch superposition (A)
  • shift_rmsd Chemical shift RMSD (ppm) - compares shifts predicted from ground-truth structure vs. from the AI-predicted structure
  • gnn_score_ref GNN pLDDT of the ground-truth structure
  • gnn_score_pred GNN pLDDT of the AI-predicted structure

Classes

BenchmarkResults dataclass

Aggregated results for a complete benchmarking run.

Attributes

results : list[StructureResult] Per-structure results. predictor : str Name of the predictor used. n_structures : int Number of structures attempted. n_success : int Structures successfully predicted.

Source code in synth_pdb/benchmark.py
@dataclass
class BenchmarkResults:
    """Aggregated results for a complete benchmarking run.

    Attributes
    ----------
    results       : list[StructureResult]  Per-structure results.
    predictor     : str                    Name of the predictor used.
    n_structures  : int                    Number of structures attempted.
    n_success     : int                    Structures successfully predicted.
    """

    results: list[StructureResult] = field(default_factory=list)
    predictor: str = "unknown"
    n_structures: int = 0
    n_success: int = 0

    def summary(self) -> str:
        """Return a formatted summary of the benchmark results."""
        good = [r for r in self.results if not r.error]
        if not good:
            return f"Benchmark '{self.predictor}': 0/{self.n_structures} succeeded."

        tm_scores = [r.tm_score for r in good if np.isfinite(r.tm_score)]
        gdt = [r.gdt_ts for r in good if np.isfinite(r.gdt_ts)]
        lddts = [r.lddt_mean for r in good if np.isfinite(r.lddt_mean)]
        rmsds = [r.rmsd for r in good if np.isfinite(r.rmsd)]
        shift = [r.shift_rmsd for r in good if np.isfinite(r.shift_rmsd)]
        gnn_pred = [r.gnn_score_pred for r in good if np.isfinite(r.gnn_score_pred)]

        lines = [
            f"== Benchmark: {self.predictor} ({self.n_success}/{self.n_structures} structures) ==",
            f"  TM-score   mean={np.mean(tm_scores):.3f}  std={np.std(tm_scores):.3f}  min={np.min(tm_scores):.3f}  max={np.max(tm_scores):.3f}",
            f"  GDT-TS     mean={np.mean(gdt):.3f}  std={np.std(gdt):.3f}",
            f"  lDDT       mean={np.mean(lddts):.3f}  std={np.std(lddts):.3f}",
            f"  Calpha-RMSD    mean={np.mean(rmsds):.2f} A  std={np.std(rmsds):.2f} A",
        ]
        if shift:
            lines.append(f"  Shift RMSD mean={np.mean(shift):.3f} ppm  std={np.std(shift):.3f} ppm")
        if gnn_pred:
            lines.append(f"  GNN pLDDT  mean={np.mean(gnn_pred):.3f} (predicted structures)")

        # Fraction with TM-score > 0.5 (same fold)
        same_fold = sum(1 for t in tm_scores if t > 0.5)
        lines.append(
            f"\n  Structures with TM-score > 0.5 (same fold): "
            f"{same_fold}/{len(tm_scores)} ({100*same_fold/len(tm_scores):.0f}%)"
        )
        return "\n".join(lines)

    def to_csv(self, path: str) -> None:
        """Write full per-structure results to a CSV file."""
        import csv

        fields = [
            "sequence",
            "length",
            "conformation",
            "tm_score",
            "gdt_ts",
            "lddt_mean",
            "rmsd",
            "shift_rmsd",
            "gnn_score_ref",
            "gnn_score_pred",
            "predictor_time_s",
            "error",
        ]
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            for r in self.results:
                writer.writerow({k: getattr(r, k) for k in fields})
        logger.info("Benchmark results saved to %s", path)

    def to_dataframe(self) -> pd.DataFrame:  # type: ignore[name-defined]
        """Return results as a pandas DataFrame (requires pandas)."""
        try:
            import pandas as pd  # type: ignore[import-untyped]
        except ImportError as exc:
            raise ImportError("pandas is required for to_dataframe().") from exc

        return pd.DataFrame(
            [{k: getattr(r, k) for k in StructureResult.__dataclass_fields__} for r in self.results]
        )
Functions
summary()

Return a formatted summary of the benchmark results.

Source code in synth_pdb/benchmark.py
def summary(self) -> str:
    """Return a formatted summary of the benchmark results."""
    good = [r for r in self.results if not r.error]
    if not good:
        return f"Benchmark '{self.predictor}': 0/{self.n_structures} succeeded."

    tm_scores = [r.tm_score for r in good if np.isfinite(r.tm_score)]
    gdt = [r.gdt_ts for r in good if np.isfinite(r.gdt_ts)]
    lddts = [r.lddt_mean for r in good if np.isfinite(r.lddt_mean)]
    rmsds = [r.rmsd for r in good if np.isfinite(r.rmsd)]
    shift = [r.shift_rmsd for r in good if np.isfinite(r.shift_rmsd)]
    gnn_pred = [r.gnn_score_pred for r in good if np.isfinite(r.gnn_score_pred)]

    lines = [
        f"== Benchmark: {self.predictor} ({self.n_success}/{self.n_structures} structures) ==",
        f"  TM-score   mean={np.mean(tm_scores):.3f}  std={np.std(tm_scores):.3f}  min={np.min(tm_scores):.3f}  max={np.max(tm_scores):.3f}",
        f"  GDT-TS     mean={np.mean(gdt):.3f}  std={np.std(gdt):.3f}",
        f"  lDDT       mean={np.mean(lddts):.3f}  std={np.std(lddts):.3f}",
        f"  Calpha-RMSD    mean={np.mean(rmsds):.2f} A  std={np.std(rmsds):.2f} A",
    ]
    if shift:
        lines.append(f"  Shift RMSD mean={np.mean(shift):.3f} ppm  std={np.std(shift):.3f} ppm")
    if gnn_pred:
        lines.append(f"  GNN pLDDT  mean={np.mean(gnn_pred):.3f} (predicted structures)")

    # Fraction with TM-score > 0.5 (same fold)
    same_fold = sum(1 for t in tm_scores if t > 0.5)
    lines.append(
        f"\n  Structures with TM-score > 0.5 (same fold): "
        f"{same_fold}/{len(tm_scores)} ({100*same_fold/len(tm_scores):.0f}%)"
    )
    return "\n".join(lines)
to_csv(path)

Write full per-structure results to a CSV file.

Source code in synth_pdb/benchmark.py
def to_csv(self, path: str) -> None:
    """Write full per-structure results to a CSV file."""
    import csv

    fields = [
        "sequence",
        "length",
        "conformation",
        "tm_score",
        "gdt_ts",
        "lddt_mean",
        "rmsd",
        "shift_rmsd",
        "gnn_score_ref",
        "gnn_score_pred",
        "predictor_time_s",
        "error",
    ]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for r in self.results:
            writer.writerow({k: getattr(r, k) for k in fields})
    logger.info("Benchmark results saved to %s", path)
to_dataframe()

Return results as a pandas DataFrame (requires pandas).

Source code in synth_pdb/benchmark.py
def to_dataframe(self) -> pd.DataFrame:  # type: ignore[name-defined]
    """Return results as a pandas DataFrame (requires pandas)."""
    try:
        import pandas as pd  # type: ignore[import-untyped]
    except ImportError as exc:
        raise ImportError("pandas is required for to_dataframe().") from exc

    return pd.DataFrame(
        [{k: getattr(r, k) for k in StructureResult.__dataclass_fields__} for r in self.results]
    )

StructureResult dataclass

Quality metrics for a single (ground-truth, predicted) structure pair.

This container tracks the "Delta" between what synth-pdb generated and what an AI model (like ESMFold) predicted. It captures both global geometry (TM-score) and local physics (NMR Chemical Shift RMSD).

Attributes

sequence : str Amino acid sequence (single-letter). length : int Number of residues. conformation : str Ground-truth conformation type (e.g. "alpha"). tm_score : float TM-score in [0, 1]. High values indicate correct fold. gdt_ts : float GDT-TS in [0, 1]. Captures global Calpha trace fidelity. lddt_mean : float Mean lDDT in [0, 1]. Captures local geometric accuracy. rmsd : float Calpha-RMSD in A after superposition. shift_rmsd : float NMR Chemical shift RMSD in ppm. A "Physical Audit": measures if the AI-predicted structure "sounds" like the ground truth structure under a virtual spectrometer. gnn_score_ref : float GNN quality score for the ground-truth structure. gnn_score_pred : float GNN quality score for the predicted structure. predictor_time_s: float Wall-clock inference time for this structure (s). error : str Non-empty if prediction failed (e.g. CUDA OOM).

Source code in synth_pdb/benchmark.py
@dataclass
class StructureResult:
    """Quality metrics for a single (ground-truth, predicted) structure pair.

    This container tracks the "Delta" between what synth-pdb generated and
    what an AI model (like ESMFold) predicted. It captures both global
    geometry (TM-score) and local physics (NMR Chemical Shift RMSD).

    Attributes
    ----------
    sequence        : str    Amino acid sequence (single-letter).
    length          : int    Number of residues.
    conformation    : str    Ground-truth conformation type (e.g. "alpha").
    tm_score        : float  TM-score in [0, 1]. High values indicate correct fold.
    gdt_ts          : float  GDT-TS in [0, 1]. Captures global Calpha trace fidelity.
    lddt_mean       : float  Mean lDDT in [0, 1]. Captures local geometric accuracy.
    rmsd            : float  Calpha-RMSD in A after superposition.
    shift_rmsd      : float  NMR Chemical shift RMSD in ppm. A "Physical Audit":
                             measures if the AI-predicted structure "sounds" like
                              the ground truth structure under a virtual spectrometer.
    gnn_score_ref   : float  GNN quality score for the ground-truth structure.
    gnn_score_pred  : float  GNN quality score for the predicted structure.
    predictor_time_s: float  Wall-clock inference time for this structure (s).
    error           : str    Non-empty if prediction failed (e.g. CUDA OOM).
    """

    sequence: str = ""
    length: int = 0
    conformation: str = ""
    tm_score: float = float("nan")
    gdt_ts: float = float("nan")
    lddt_mean: float = float("nan")
    rmsd: float = float("nan")
    shift_rmsd: float = float("nan")
    gnn_score_ref: float = float("nan")
    gnn_score_pred: float = float("nan")
    predictor_time_s: float = float("nan")
    error: str = ""

Functions

run_benchmark(n_structures=20, lengths=None, conformations=None, predictor='esmfold', *, compute_shifts=True, compute_gnn=True, random_state=42)

Run the synth-pdb vs. AI structure prediction benchmark.

Generates n_structures synthetic protein structures of varying length and secondary structure content, then asks the specified predictor to reconstruct each structure from sequence alone. The prediction is evaluated against the ground-truth on TM-score, GDT-TS, lDDT, Calpha-RMSD, and (optionally) NMR chemical shift RMSD.


BENCHMARK LIFECYCLE

For each structure in the trial: 1. Generate: Create a ground-truth PDB with specific geometry. 2. Extract: Convert the 3D structure to its 1D amino acid sequence. 3. Predict: Blindly ask the AI model to predict the 3D structure. 4. Evaluate: Compute TM-score, GDT-TS, and lDDT between (1) and (3). 5. Audit: (Optional) Compare predicted NMR chemical shifts.

This "Circular Validation" ensures the AI model is learning the true underlying physics of protein folding, not just memorizing the PDB.

Parameters

n_structures : int Total number of test structures to generate. lengths : list[int], optional Pool of chain lengths to sample from (default: [20, 30, 50]). Lengths are sampled uniformly. conformations : list[str], optional Pool of conformations to sample from (default: ["alpha", "beta"]). predictor : str or callable "esmfold" to use the bundled ESMFold backend, or any callable predictor_fn(sequence: str) -> pdb_str. compute_shifts : bool Whether to compute NMR chemical shift RMSD (requires synth_pdb chemical_shifts module). Default True. compute_gnn : bool Whether to score both structures with the GNN quality scorer. Default True. random_state : int RNG seed for reproducible structure generation.

Returns

BenchmarkResults Call .summary() for a formatted text report or .to_csv() to export full results.

Examples

from synth_pdb.benchmark import run_benchmark results = run_benchmark(n_structures=10) print(results.summary())

Source code in synth_pdb/benchmark.py
def run_benchmark(
    n_structures: int = 20,
    lengths: list[int] | None = None,
    conformations: list[str] | None = None,
    predictor: str | Callable[[str], str] = "esmfold",
    *,
    compute_shifts: bool = True,
    compute_gnn: bool = True,
    random_state: int = 42,
) -> BenchmarkResults:
    """Run the synth-pdb vs. AI structure prediction benchmark.

    Generates ``n_structures`` synthetic protein structures of varying length
    and secondary structure content, then asks the specified predictor to
    reconstruct each structure from sequence alone.  The prediction is
    evaluated against the ground-truth on TM-score, GDT-TS, lDDT, Calpha-RMSD,
    and (optionally) NMR chemical shift RMSD.

    -----------------------------------------------------------------------------
    BENCHMARK LIFECYCLE
    -----------------------------------------------------------------------------
    For each structure in the trial:
      1. **Generate**: Create a ground-truth PDB with specific geometry.
      2. **Extract**: Convert the 3D structure to its 1D amino acid sequence.
      3. **Predict**: Blindly ask the AI model to predict the 3D structure.
      4. **Evaluate**: Compute TM-score, GDT-TS, and lDDT between (1) and (3).
      5. **Audit**: (Optional) Compare predicted NMR chemical shifts.

    This "Circular Validation" ensures the AI model is learning the true
    underlying physics of protein folding, not just memorizing the PDB.

    Parameters
    ----------
    n_structures : int
        Total number of test structures to generate.
    lengths : list[int], optional
        Pool of chain lengths to sample from (default: [20, 30, 50]).
        Lengths are sampled uniformly.
    conformations : list[str], optional
        Pool of conformations to sample from (default: ["alpha", "beta"]).
    predictor : str or callable
        "esmfold" to use the bundled ESMFold backend, or any callable
        ``predictor_fn(sequence: str) -> pdb_str``.
    compute_shifts : bool
        Whether to compute NMR chemical shift RMSD (requires ``synth_pdb``
        chemical_shifts module).  Default True.
    compute_gnn : bool
        Whether to score both structures with the GNN quality scorer.
        Default True.
    random_state : int
        RNG seed for reproducible structure generation.

    Returns
    -------
    BenchmarkResults
        Call ``.summary()`` for a formatted text report or ``.to_csv()`` to
        export full results.

    Examples
    --------
    >>> from synth_pdb.benchmark import run_benchmark
    >>> results = run_benchmark(n_structures=10)
    >>> print(results.summary())

    """
    from synth_pdb.benchmark_metrics import (
        extract_ca_coords,
        gdt_ts as _gdt_ts,
        lddt as _lddt,
        shift_rmsd as _shift_rmsd,
        superpose_kabsch,
        tm_score as _tm_score,
    )
    from synth_pdb.generator import generate_pdb_content

    if lengths is None:
        lengths = [20, 30, 50]
    if conformations is None:
        conformations = ["alpha", "beta"]

    # -- Load predictor -------------------------------------------------
    predictor_name: str
    predictor_fn: Callable[[str], str]
    if callable(predictor):
        predictor_fn = predictor
        predictor_name = getattr(predictor, "__name__", "custom")
    elif predictor == "esmfold":
        predictor_fn = _load_esmfold_predictor()
        predictor_name = "ESMFold"
    else:
        raise ValueError(f"Unknown predictor: {predictor!r}. Use 'esmfold' or a callable.")

    # -- Optional GNN scorer --------------------------------------------
    # The GNN scorer provides an independent physical audit of both the
    # reference and predicted structures. It requires torch and torch_geometric.
    gnn_clf = None
    if compute_gnn:
        try:
            from synth_pdb.quality.gnn.gnn_classifier import GNNQualityClassifier

            gnn_clf = GNNQualityClassifier()
        except ImportError:
            logger.warning(
                "torch/torch_geometric not available - GNN scoring disabled. "
                "Install with: pip install synth-pdb[gnn]"
            )

    # -- Optional chemical shift predictor -----------------------------
    # This is the "Gold Standard" of structural validation in synth-pdb.
    # If the AI model predicts a structure with a high TM-score but a bad
    # Shift RMSD, it implies the global fold is correct but the local
    # backbone packaging or H-bonding is physically unrealistic.
    shift_fn = None
    if compute_shifts:
        try:
            from synth_pdb.chemical_shifts import predict_chemical_shifts as _predict_shifts

            shift_fn = _predict_shifts
        except ImportError:
            logger.warning("chemical_shifts module not available - shift RMSD disabled.")

    # -- Generate and evaluate ------------------------------------------
    rng = np.random.default_rng(random_state)
    benchmark_results = BenchmarkResults(predictor=predictor_name, n_structures=n_structures)

    for i in range(n_structures):
        length = int(rng.choice(lengths))
        conformation = str(rng.choice(conformations))
        result = StructureResult(length=length, conformation=conformation)

        logger.info(
            "Structure %d/%d  length=%d  conformation=%s",
            i + 1,
            n_structures,
            length,
            conformation,
        )

        try:
            # -- 1. Generate ground-truth structure ---------------------
            # We skip energy minimisation to ensure the ground truth is
            # mathematically perfect (exactly at Ramachandran centres).
            ref_pdb = cast(
                str,
                generate_pdb_content(
                    length=length,
                    conformation=conformation,
                    minimize_energy=False,
                ),
            )

            # -- 2. Extract sequence ------------------------------------
            # The AI models only get the sequence - no coordinate hints allowed!
            sequence = _extract_sequence(ref_pdb)
            result.sequence = sequence

            # -- 3. Run predictor ---------------------------------------
            t0 = time.perf_counter()
            pred_pdb = predictor_fn(sequence)
            result.predictor_time_s = time.perf_counter() - t0

            # -- 4. Structural metrics ----------------------------------
            # We extract Calpha coordinates (the "backbone trace") to compute
            # global alignment metrics.
            ca_ref = extract_ca_coords(ref_pdb)
            ca_pred = extract_ca_coords(pred_pdb)

            # Trim to the shorter of the two (in case of predictor truncation).
            n_align = min(len(ca_ref), len(ca_pred))
            ca_ref_a = ca_ref[:n_align]
            ca_pred_a = ca_pred[:n_align]

            # Compute standard CASP metrics
            result.tm_score = _tm_score(ca_pred_a, ca_ref_a)
            result.gdt_ts = _gdt_ts(ca_pred_a, ca_ref_a)
            per_res_lddt = _lddt(ca_pred_a, ca_ref_a)
            result.lddt_mean = float(np.mean(per_res_lddt))

            # RMSD after optimal Kabsch superposition (rigid-body alignment)
            _, result.rmsd = superpose_kabsch(ca_pred_a, ca_ref_a)

            # -- 5. Chemical shift RMSD ---------------------------------
            if shift_fn is not None:
                try:
                    ref_shifts_raw = shift_fn(ref_pdb)
                    pred_shifts_raw = shift_fn(pred_pdb)
                    # Convert to nucleus -> np.array format
                    ref_shifts = _shifts_to_arrays(ref_shifts_raw)
                    pred_shifts = _shifts_to_arrays(pred_shifts_raw)
                    result.shift_rmsd = _shift_rmsd(pred_shifts, ref_shifts)
                except Exception as se:
                    logger.warning("Shift RMSD failed for structure %d: %s", i, se)

            # -- 6. GNN quality scores ----------------------------------
            if gnn_clf is not None:
                try:
                    ref_score = gnn_clf.score(ref_pdb)
                    pred_score = gnn_clf.score(pred_pdb)
                    result.gnn_score_ref = ref_score.global_score
                    result.gnn_score_pred = pred_score.global_score
                except Exception as ge:
                    logger.warning("GNN scoring failed for structure %d: %s", i, ge)

            benchmark_results.n_success += 1

        except Exception as exc:
            result.error = str(exc)
            logger.warning("Structure %d failed: %s", i + 1, exc, exc_info=True)

        benchmark_results.results.append(result)

    return benchmark_results

benchmark_metrics

synth_pdb.benchmark_metrics. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Pure-numpy structural biology evaluation metrics for the AlphaFold benchmarking suite.

All functions operate on numpy arrays and have no optional dependencies, making them trivially unit-testable and usable in headless environments.


Metrics implemented

TM-score Template Modelling score in [0, 1]. A TM-score > 0.5 indicates that two proteins share the same global topology regardless of RMSD. It is the primary metric in CASP (Critical Assessment of Structure Prediction). Reference: Zhang & Skolnick (2004) Proteins 57, 702-710.

lDDT Local Distance Difference Test in [0, 1]. Evaluates per-residue local geometry without superposition. Used by AlphaFold to report per-residue confidence (pLDDT) during training. Reference: Mariani et al. (2013) Bioinformatics 29, 2722-2728.

GDT-TS Global Distance Test - Total Score. CASP standard metric computed as the average fraction of Calpha atoms within 1, 2, 4, 8 A of the reference. Reference: Zemla (2003) Nucleic Acids Research 31, 3370-3374.

shift_rmsd NMR Chemical Shift Root-Mean-Square Deviation. Measures the agreement between predicted and reference 1H/13C/15N chemical shifts.

Functions

tm_score(ca_pred, ca_ref, *, normalise_by=None)

Compute TM-score between a predicted and reference Calpha trace.

TM-score is defined as::

TM = (1/L_ref) Sum_i  1 / (1 + (d_i / d0)^2)

where d_i is the distance between the i-th pair of aligned Calpha atoms after optimal superposition, L_ref is the reference chain length, and d0 is a length-normalising constant::

d0 = 1.24 x (L_ref - 15)^(1/3) - 1.8    (for L_ref >= 22)
d0 = 0.5                                   (for L_ref < 22)
Parameters

ca_pred : np.ndarray [N, 3] Predicted Calpha coordinates. ca_ref : np.ndarray [N, 3] Reference Calpha coordinates. normalise_by : int, optional If provided, normalise by this length rather than len(ca_ref). Useful when comparing structures of different lengths.

Returns

float TM-score in (0, 1]. Two random structures ~ 0.17; same fold >= 0.5.

Source code in synth_pdb/benchmark_metrics.py
def tm_score(
    ca_pred: np.ndarray,
    ca_ref: np.ndarray,
    *,
    normalise_by: int | None = None,
) -> float:
    """Compute TM-score between a predicted and reference Calpha trace.

    TM-score is defined as::

        TM = (1/L_ref) Sum_i  1 / (1 + (d_i / d0)^2)

    where ``d_i`` is the distance between the i-th pair of aligned Calpha atoms
    after optimal superposition, ``L_ref`` is the reference chain length, and
    ``d0`` is a length-normalising constant::

        d0 = 1.24 x (L_ref - 15)^(1/3) - 1.8    (for L_ref >= 22)
        d0 = 0.5                                   (for L_ref < 22)

    Parameters
    ----------
    ca_pred      : np.ndarray [N, 3]  Predicted Calpha coordinates.
    ca_ref       : np.ndarray [N, 3]  Reference Calpha coordinates.
    normalise_by : int, optional      If provided, normalise by this length
                                      rather than len(ca_ref).  Useful when
                                      comparing structures of different lengths.

    Returns
    -------
    float  TM-score in (0, 1].  Two random structures ~ 0.17; same fold >= 0.5.

    """
    assert ca_pred.shape == ca_ref.shape and ca_pred.ndim == 2 and ca_pred.shape[1] == 3
    n = len(ca_ref)
    l_ref = normalise_by if normalise_by is not None else n

    # -- d0 normalisation constant --------------------------------------
    if l_ref >= 22:
        d0 = 1.24 * (l_ref - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.5
    d0 = max(d0, 0.5)  # never let d0 go below 0.5

    # -- Superpose then compute per-residue distances -------------------
    rotated, _rmsd = superpose_kabsch(ca_pred, ca_ref)
    d = np.linalg.norm(rotated - ca_ref, axis=1)  # [N]

    # -- TM-score sum --------------------------------------------------
    tm = float(np.sum(1.0 / (1.0 + (d / d0) ** 2)) / l_ref)
    return tm

lddt(ca_pred, ca_ref, *, inclusion_radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0))

Compute per-residue lDDT scores.

lDDT measures how well the local distance geometry of a predicted structure agrees with the reference, without requiring superposition.

For each residue i, we look at all pairs (i, j) where j is within inclusion_radius A of i in the reference structure. We then count what fraction of those reference distances are preserved in the predicted structure within each threshold.

Per-residue lDDT::

lDDT_i = (1 / |T|) Sum_t  (fraction of distances within threshold t)

where |T| = len(thresholds) = 4.

Parameters

ca_pred : np.ndarray [N, 3] Predicted Calpha coordinates. ca_ref : np.ndarray [N, 3] Reference Calpha coordinates. inclusion_radius : float Only pairs within this radius in the reference contribute (default 15 A). thresholds : tuple of float Distance preservation thresholds (A).

Returns

np.ndarray [N] Per-residue lDDT in [0, 1]. Global lDDT = mean().

Source code in synth_pdb/benchmark_metrics.py
def lddt(
    ca_pred: np.ndarray,
    ca_ref: np.ndarray,
    *,
    inclusion_radius: float = 15.0,
    thresholds: tuple[float, ...] = (0.5, 1.0, 2.0, 4.0),
) -> np.ndarray:
    """Compute per-residue lDDT scores.

    lDDT measures how well the local distance geometry of a predicted structure
    agrees with the reference, without requiring superposition.

    For each residue i, we look at all pairs (i, j) where j is within
    ``inclusion_radius`` A of i in the *reference* structure.  We then count
    what fraction of those reference distances are preserved in the predicted
    structure within each threshold.

    Per-residue lDDT::

        lDDT_i = (1 / |T|) Sum_t  (fraction of distances within threshold t)

    where |T| = len(thresholds) = 4.

    Parameters
    ----------
    ca_pred          : np.ndarray [N, 3]  Predicted Calpha coordinates.
    ca_ref           : np.ndarray [N, 3]  Reference Calpha coordinates.
    inclusion_radius : float              Only pairs within this radius in the
                                          reference contribute (default 15 A).
    thresholds       : tuple of float     Distance preservation thresholds (A).

    Returns
    -------
    np.ndarray [N]  Per-residue lDDT in [0, 1].  Global lDDT = mean().

    """
    assert ca_pred.shape == ca_ref.shape
    n = len(ca_ref)

    # Pairwise distances in reference and predicted structures - O(N^2) memory
    ref_dists = np.sqrt(((ca_ref[:, None, :] - ca_ref[None, :, :]) ** 2).sum(axis=-1))
    pred_dists = np.sqrt(((ca_pred[:, None, :] - ca_pred[None, :, :]) ** 2).sum(axis=-1))

    # Mask: pairs within inclusion radius in reference, excluding self-pairs
    mask = (ref_dists < inclusion_radius) & (ref_dists > 0)

    per_residue = np.zeros(n, dtype=np.float64)

    for i in range(n):
        neighbours = np.where(mask[i])[0]
        if len(neighbours) == 0:
            per_residue[i] = 0.0
            continue

        ref_d = ref_dists[i, neighbours]
        pred_d = pred_dists[i, neighbours]
        diff = np.abs(pred_d - ref_d)

        # Fraction within each threshold
        fractions = [float(np.mean(diff < t)) for t in thresholds]
        per_residue[i] = float(np.mean(fractions))

    return per_residue.astype(np.float32)

gdt_ts(ca_pred, ca_ref, *, cutoffs=(1.0, 2.0, 4.0, 8.0))

Compute GDT-TS (Global Distance Test - Total Score).

GDT-TS is the average fraction of Calpha atoms placed within {1, 2, 4, 8} A of the reference after optimal superposition. It is the primary ranking metric used in CASP competitions.

Parameters

ca_pred : np.ndarray [N, 3] Predicted Calpha coordinates. ca_ref : np.ndarray [N, 3] Reference Calpha coordinates. cutoffs : tuple of float Distance cutoffs in A (default CASP standard).

Returns

float GDT-TS in [0, 1]. Perfect prediction = 1.0.

Source code in synth_pdb/benchmark_metrics.py
def gdt_ts(
    ca_pred: np.ndarray,
    ca_ref: np.ndarray,
    *,
    cutoffs: tuple[float, ...] = (1.0, 2.0, 4.0, 8.0),
) -> float:
    """Compute GDT-TS (Global Distance Test - Total Score).

    GDT-TS is the average fraction of Calpha atoms placed within
    {1, 2, 4, 8} A of the reference after optimal superposition.
    It is the primary ranking metric used in CASP competitions.

    Parameters
    ----------
    ca_pred : np.ndarray [N, 3]  Predicted Calpha coordinates.
    ca_ref  : np.ndarray [N, 3]  Reference Calpha coordinates.
    cutoffs : tuple of float     Distance cutoffs in A (default CASP standard).

    Returns
    -------
    float  GDT-TS in [0, 1].  Perfect prediction = 1.0.

    """
    assert ca_pred.shape == ca_ref.shape
    rotated, _rmsd = superpose_kabsch(ca_pred, ca_ref)
    d = np.linalg.norm(rotated - ca_ref, axis=1)  # [N]

    fractions = [float(np.mean(d < c)) for c in cutoffs]
    return float(np.mean(fractions))

superpose_kabsch(mobile, reference)

Optimally superpose mobile onto reference using the Kabsch algorithm.

The Kabsch algorithm finds the rotation matrix R that minimises RMSD between two sets of paired points by SVD-decomposing their cross-covariance matrix.

-- Algorithm ------------------------------------------------------------ 1. Translate both sets to their centroids. 2. Compute cross-covariance H = mobile_centred.T @ ref_centred 3. SVD: H = U Sum V^T 4. R = V diag(1, 1, det(V U^T)) U^T <- det term fixes reflections 5. Apply R to mobile.


Parameters

mobile : np.ndarray [N, 3] Coordinates to rotate. reference : np.ndarray [N, 3] Target coordinates.

Returns

rotated : np.ndarray [N, 3] Mobile after optimal superposition. rmsd : float Calpha-RMSD after superposition (A).

Source code in synth_pdb/benchmark_metrics.py
def superpose_kabsch(
    mobile: np.ndarray,
    reference: np.ndarray,
) -> tuple[np.ndarray, float]:
    """Optimally superpose *mobile* onto *reference* using the Kabsch algorithm.

    The Kabsch algorithm finds the rotation matrix R that minimises RMSD
    between two sets of paired points by SVD-decomposing their cross-covariance
    matrix.

    -- Algorithm ------------------------------------------------------------
    1. Translate both sets to their centroids.
    2. Compute cross-covariance  H = mobile_centred.T @ ref_centred
    3. SVD: H = U Sum V^T
    4. R = V diag(1, 1, det(V U^T)) U^T   <- det term fixes reflections
    5. Apply R to mobile.
    -------------------------------------------------------------------------

    Parameters
    ----------
    mobile    : np.ndarray [N, 3]   Coordinates to rotate.
    reference : np.ndarray [N, 3]   Target coordinates.

    Returns
    -------
    rotated  : np.ndarray [N, 3]  Mobile after optimal superposition.
    rmsd     : float              Calpha-RMSD after superposition (A).

    """
    assert mobile.shape == reference.shape, "mobile and reference must have the same shape"
    n = len(mobile)

    # -- Step 1: Centre both sets ---------------------------------------
    mob_c = mobile - mobile.mean(axis=0)
    ref_c = reference - reference.mean(axis=0)

    # -- Step 2: Cross-covariance matrix -------------------------------
    h = mob_c.T @ ref_c  # [3, 3]

    # -- Step 3: SVD ---------------------------------------------------
    u, _s, vt = np.linalg.svd(h)

    # -- Step 4: Rotation matrix (with reflection correction) ----------
    # det(V U^T) is +1 for a proper rotation, -1 for a reflection.
    # We force a proper rotation by flipping the last singular vector.
    d = np.linalg.det(vt.T @ u.T)
    diag = np.diag([1.0, 1.0, d])
    rot = vt.T @ diag @ u.T  # [3, 3]

    # -- Step 5: Apply rotation -----------------------------------------
    rotated = mob_c @ rot.T + reference.mean(axis=0)

    # -- RMSD ----------------------------------------------------------
    diff = rotated - reference
    rmsd = float(math.sqrt(np.sum(diff**2) / n))

    return rotated, rmsd

shift_rmsd(pred_shifts, ref_shifts, *, nucleus_weights=None)

Compute weighted chemical shift RMSD between predicted and reference shifts.

Parameters

pred_shifts, ref_shifts : dict mapping nucleus -> np.ndarray [N] Nucleus keys are typically "H", "C", "N" (matching BMRB convention). Arrays must be the same length (one entry per residue). nucleus_weights : dict, optional Per-nucleus weighting (default: H=1.0, C=0.25, N=0.1, matching the CamSol / SPARTA+ convention for weighted shift RMSD).

Returns

float Weighted shift RMSD in ppm. Lower is better.

Examples

rmsd = shift_rmsd( ... {"H": np.array([8.1, 8.2, 8.3])}, ... {"H": np.array([8.0, 8.1, 8.4])}, ... ) print(f"{rmsd:.4f} ppm") 0.1155 ppm

Source code in synth_pdb/benchmark_metrics.py
def shift_rmsd(
    pred_shifts: dict[str, np.ndarray],
    ref_shifts: dict[str, np.ndarray],
    *,
    nucleus_weights: dict[str, float] | None = None,
) -> float:
    """Compute weighted chemical shift RMSD between predicted and reference shifts.

    Parameters
    ----------
    pred_shifts, ref_shifts : dict mapping nucleus -> np.ndarray [N]
        Nucleus keys are typically "H", "C", "N" (matching BMRB convention).
        Arrays must be the same length (one entry per residue).
    nucleus_weights : dict, optional
        Per-nucleus weighting (default: H=1.0, C=0.25, N=0.1, matching
        the CamSol / SPARTA+ convention for weighted shift RMSD).

    Returns
    -------
    float  Weighted shift RMSD in ppm.  Lower is better.

    Examples
    --------
    >>> rmsd = shift_rmsd(
    ...     {"H": np.array([8.1, 8.2, 8.3])},
    ...     {"H": np.array([8.0, 8.1, 8.4])},
    ... )
    >>> print(f"{rmsd:.4f} ppm")
    0.1155 ppm

    """
    # Default SPARTA+-style nucleus weights
    if nucleus_weights is None:
        nucleus_weights = {"H": 1.0, "C": 0.25, "N": 0.1}

    total_sq = 0.0
    total_weight = 0.0

    for nucleus, pred_arr in pred_shifts.items():
        ref_arr = ref_shifts.get(nucleus)
        if ref_arr is None:
            logger.warning("shift_rmsd: nucleus '%s' not in ref_shifts, skipping", nucleus)
            continue

        w = nucleus_weights.get(nucleus, 1.0)
        diff = np.asarray(pred_arr, dtype=np.float64) - np.asarray(ref_arr, dtype=np.float64)

        # Filter NaNs (missing assignments are common in experimental spectra)
        valid = np.isfinite(diff)
        if not valid.any():
            continue

        total_sq += w * float(np.sum(diff[valid] ** 2))
        total_weight += w * float(np.sum(valid))

    if total_weight < 1e-10:
        logger.warning("shift_rmsd: no valid (nucleus, residue) pairs found; returning nan")
        return float("nan")

    return float(math.sqrt(total_sq / total_weight))

extract_ca_coords(pdb_content)

Extract Calpha coordinates from a PDB string in residue order.

A lightweight parser - uses only the standard library, no biotite.

Parameters

pdb_content : str Raw PDB-format string.

Returns

np.ndarray [N, 3] Calpha coordinates in A.

Raises

ValueError If fewer than 2 Calpha atoms are found.

Source code in synth_pdb/benchmark_metrics.py
def extract_ca_coords(pdb_content: str) -> np.ndarray:
    """Extract Calpha coordinates from a PDB string in residue order.

    A lightweight parser - uses only the standard library, no biotite.

    Parameters
    ----------
    pdb_content : str  Raw PDB-format string.

    Returns
    -------
    np.ndarray [N, 3]  Calpha coordinates in A.

    Raises
    ------
    ValueError  If fewer than 2 Calpha atoms are found.

    """
    coords = []
    seen: set[tuple[str, int]] = set()

    for line in pdb_content.splitlines():
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()
        if atom_name != "CA":
            continue
        chain = line[21].strip() or "A"
        try:
            res_num = int(line[22:26].strip())
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
        except ValueError:
            continue

        key = (chain, res_num)
        if key not in seen:
            seen.add(key)
            coords.append([x, y, z])

    if len(coords) < 2:
        raise ValueError(f"Only {len(coords)} Calpha atoms found in PDB - need at least 2.")

    return np.array(coords, dtype=np.float32)

See Also