Scientific Benchmarking & Defensibility

To ensure that the synthetic data generated by synth-pdb is suitable for training AI models and validating structural hypotheses, we perform rigorous benchmarking against peer-reviewed experimental datasets.

The Physical Ground Truth Suite

We use Human Ubiquitin (PDB: 1D3Z / BMRB: 6457) as our primary "Gold Standard" benchmark. Ubiquitin is a small, stable protein with extremely high-quality experimental data, making it the ideal subject for testing the accuracy of our physics engines.

1. NMR Chemical Shift Correlation

Chemical shifts are among the most sensitive reporters of local protein geometry. We compare the shifts predicted by the SPARTA+ engine against the experimental values deposited in the BMRB.

Atom Type	Pearson Correlation (R)	Target Accuracy
Cα (Alpha Carbon)	> 0.95	High
Cβ (Beta Carbon)	> 0.98	High
N (Nitrogen)	> 0.75	Moderate
Hα (Alpha Proton)	> 0.65	Moderate

Note: Correlations for N and Hα are lower because these nuclei are highly sensitive to hydrogen-bond dynamics and solvent effects, which a static synthetic model captures poorly.
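The per-atom correlations in the table can be reproduced with a few lines of standard-library Python. The following is a minimal sketch, not the code used by scripts/experimental_benchmark.py: the function names and the dictionary layout `(residue_number, atom_name) -> shift in ppm` are assumptions made for illustration, and the shift values in the usage example are made-up placeholders rather than BMRB data.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def correlation_by_atom(predicted, experimental):
    """Correlate predicted and experimental shifts separately per atom name.

    Both arguments map (residue_number, atom_name) -> shift in ppm
    (hypothetical layout). Only keys present in both sets are compared.
    """
    paired = defaultdict(lambda: ([], []))
    for key in predicted.keys() & experimental.keys():
        _, atom = key
        paired[atom][0].append(predicted[key])
        paired[atom][1].append(experimental[key])
    # Skip atom types with fewer than two shared assignments.
    return {atom: pearson_r(p, e) for atom, (p, e) in paired.items() if len(p) >= 2}


if __name__ == "__main__":
    # Placeholder values for illustration only, not real ubiquitin shifts.
    predicted = {(1, "CA"): 54.8, (2, "CA"): 55.9, (3, "CA"): 60.1, (4, "CA"): 57.2}
    experimental = {(1, "CA"): 54.5, (2, "CA"): 55.1, (3, "CA"): 59.6, (4, "CA"): 57.6}
    for atom, r in correlation_by_atom(predicted, experimental).items():
        print(f"{atom}: R = {r:.3f}")
```

Restricting the comparison to keys present in both sets matters in practice, because experimental depositions often lack assignments for some residues, and a mismatch in pairing would silently distort R.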

2. Global Biophysical Parameters

We validate global structural dimensions using the Radius of Gyration (\(R_g\)), a primary observable in small-angle X-ray scattering (SAXS) experiments.

Experimental (1D3Z): ~11.55 Å
Synthetic (synth-pdb): 11.96 Å
Error: ≈ 3.5% (< 4.0%)

This indicates that the synth-pdb physics engine generates structures with realistic global compaction and overall size.
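For reference, the comparison above can be sketched as follows. This is an illustrative, standard-library calculation of the geometric \(R_g\) from atomic coordinates, \(R_g = \sqrt{\sum_i m_i \lVert \mathbf{r}_i - \mathbf{r}_{cm} \rVert^2 / \sum_i m_i}\), together with the relative error between the two values quoted above. The function names are hypothetical and are not part of the synth-pdb API. Note also that an \(R_g\) measured by SAXS reflects electron density and the hydration shell, so it is not strictly identical to a coordinate-based value.

```python
import math


def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration of a set of 3D points (same unit as input).

    coords: sequence of (x, y, z) tuples; masses: optional per-atom weights
    (equal weights are assumed when omitted).
    """
    n = len(coords)
    if n == 0:
        raise ValueError("at least one coordinate is required")
    if masses is None:
        masses = [1.0] * n
    if len(masses) != n:
        raise ValueError("masses and coords must have the same length")
    total = sum(masses)
    # Center of mass, one component at a time.
    center = [sum(m * c[i] for m, c in zip(masses, coords)) / total for i in range(3)]
    # Mass-weighted mean squared distance from the center of mass.
    weighted_sq = sum(
        m * sum((c[i] - center[i]) ** 2 for i in range(3))
        for m, c in zip(masses, coords)
    )
    return math.sqrt(weighted_sq / total)


def percent_error(predicted, reference):
    """Absolute relative error of predicted versus reference, in percent."""
    return abs(predicted - reference) / reference * 100.0


if __name__ == "__main__":
    # Values quoted in this section: synthetic 11.96 Å versus reference 11.55 Å.
    print(f"R_g error: {percent_error(11.96, 11.55):.2f}%")  # prints "R_g error: 3.55%"
```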

How to Run the Benchmark

You can generate a defensibility report for your own local installation by running the following script:

python scripts/experimental_benchmark.py --pdb 1D3Z --bmrb 6457

This writes a correlation report and a scatter plot to the artifacts/benchmarks/ directory.

Scientific Conclusion

The synth-pdb software demonstrates high physical fidelity for the \(C_\alpha\) and \(C_\beta\) carbons, whose chemical shifts closely track backbone conformation and secondary structure. While solvent-sensitive nuclei such as N and \(H_\alpha\) show more variance, the overall structural ensemble is scientifically defensible for use in multimodal AI training pipelines.