synth-pdb

Generate realistic PDB files with mixed secondary structures for bioinformatics testing, education, and tool development.

⚠️ Important: The generated structures use idealized geometries and may contain violations of standard structural constraints. These files are intended for testing computational tools and educational demonstrations, not for simulation or experimental validation.

Why synth-pdb?

In structural biology and bioinformatics, researchers frequently require datasets of protein structures to test algorithms, train machine learning models, or validate analytical pipelines. While the Protein Data Bank (PDB) contains over 200,000 experimental structures, relying solely on experimental data has limitations:

Bias: PDB data is biased toward crystallizable or stable proteins.
Complexity: Experimental files often contain artifacts, missing atoms, or non-standard residues.
Lack of Ground Truth: Unit tests for NMR assignment or structure calculation need "perfect" synthetic data whose correct answer is known in advance, which experimental structures cannot provide.

synth-pdb fills this gap by providing a lightweight, deterministic generator that produces chemically valid, full-atom PDB files with user-defined secondary structures (helices, sheets) in seconds.

Educational Philosophy: Code as Textbook 🎓

synth-pdb is built on the principle that scientific software should be readable and educational.

Code as Textbook: We reject "black box" algorithms. Our source code (e.g., generator.py, physics.py) is heavily annotated with the biophysical rationale behind each step, explaining concepts such as Boltzmann weighting, order parameters (\(S^2\)), and the NOE distance dependence (\(r^{-6}\)).
Visual Learning: With the --visualize flag, students can instantly see how abstract concepts manifest in 3D, bridging the gap between equations and biology.
Integrity: Specialized tests ensure educational notes remain in the codebase, preventing refactoring from stripping away the scientific context.
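As an illustration of two of the concepts named above, the following is a minimal sketch of Boltzmann weighting and the \(r^{-6}\) NOE distance dependence. It is not taken from generator.py or physics.py; the function names, the 2.5 Å reference distance, and the example energies are assumptions chosen for illustration.

```python
import math

# Gas constant in kcal/(mol*K); plays the role of k_B for molar energies.
R_KCAL = 0.0019872041


def boltzmann_weights(energies_kcal, temperature_k=298.15):
    """Return normalized Boltzmann populations for a list of state energies."""
    beta = 1.0 / (R_KCAL * temperature_k)
    e_min = min(energies_kcal)  # shift by the minimum for numerical stability
    factors = [math.exp(-beta * (e - e_min)) for e in energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]


def relative_noe_intensity(distance, reference_distance=2.5):
    """NOE intensity relative to a reference proton pair, assuming I ~ r^-6.

    The reference distance is a hypothetical calibration value.
    """
    return (reference_distance / distance) ** 6


# Hypothetical rotamer energies (kcal/mol): lower energy gets a higher population.
print(boltzmann_weights([0.0, 0.5, 1.5]))

# Doubling the distance reduces the relative intensity by a factor of 2**6 = 64.
print(relative_noe_intensity(5.0))
```

The steep sixth-power falloff is why NOE signals are usually only observable for proton pairs within a few ångströms of each other.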

Key Features

✨ Structure Generation

Full atomic representation: backbone and side-chain heavy atoms plus hydrogens.
Customizable sequence (1-letter or 3-letter amino acid codes).
Conformational diversity: Generate alpha helices, beta sheets, extended chains, or random conformations.
Rotamer-based side-chain placement for all 20 standard amino acids using the Dunbrack library.
Advanced Chemistry: Metal coordination (Zn²⁺), disulfide bonds (SSBOND records), and post-translational modification (PTM) support for phosphorylated residues (SEP, TPO, PTR).

🔬 Validation Suite

Geometric Checks: Bond length, bond angle (Engh & Huber Z-scores), and peptide plane planarity.
Ramachandran Checking: Uses the Top2018 high-resolution reference datasets.
Physical Validation: Steric clash detection and SASA-based burial ratios.
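The geometric and steric checks above can be sketched in a few lines. This is an illustrative sketch, not synth-pdb's validation code: the function names and the 2.0 Å clash cutoff are assumptions, and the Engh & Huber target values are quoted from memory and should be verified against the published tables before use.

```python
import math

# Approximate Engh & Huber backbone targets: (ideal length, sigma) in angstroms.
# Quoted from memory for illustration; verify against the original tables.
IDEAL_BONDS = {
    ("N", "CA"): (1.458, 0.019),
    ("CA", "C"): (1.525, 0.021),
    ("C", "N"): (1.329, 0.014),
}


def bond_z_score(atom_a, atom_b, coord_a, coord_b):
    """Deviation of an observed bond length from its ideal value, in sigmas."""
    ideal, sigma = IDEAL_BONDS[(atom_a, atom_b)]
    return (math.dist(coord_a, coord_b) - ideal) / sigma


def find_clashes(coords, min_distance=2.0, bonded=frozenset()):
    """Return index pairs (i, j), i < j, of non-bonded atoms closer than the cutoff.

    A brute-force O(n^2) scan; the cutoff is a hypothetical simplification
    (real checks typically use per-element van der Waals radii).
    """
    clashes = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if (i, j) in bonded:
                continue
            if math.dist(coords[i], coords[j]) < min_distance:
                clashes.append((i, j))
    return clashes


# An N-CA bond stretched to 1.496 angstroms sits roughly two sigmas above ideal.
print(bond_z_score("N", "CA", (0.0, 0.0, 0.0), (1.496, 0.0, 0.0)))

# Atoms 0 and 1 are bonded and skipped; atoms 0 and 2 are too close.
print(find_clashes([(0, 0, 0), (1.5, 0, 0), (0, 1.0, 0), (9, 9, 9)],
                   bonded=frozenset({(0, 1)})))
```

For structures with many atoms, a spatial grid or k-d tree would replace the pairwise loop.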

⚙️ Quality Control & Physics

--best-of-N: Generate multiple structures and select the one with the fewest violations.
Energy Minimization: Relax structures using OpenMM (implicit solvent / AMBER force field).
Quality Filtering: Integrated Random Forest and GNN classifiers for structural plausibility.
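The best-of-N idea is simple to express: generate N candidates and keep the one with the fewest violations. The sketch below assumes two hypothetical callables, `generate` and `count_violations`, which stand in for the real generator and validator; it does not reflect how synth-pdb implements the option internally.

```python
import random


def best_of_n(generate, count_violations, n, seed=0):
    """Generate n candidates and return (best_candidate, its_violation_count).

    `generate` takes a random.Random instance and returns a candidate;
    `count_violations` maps a candidate to a number (lower is better).
    Both are placeholders for illustration.
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    best, best_score = None, None
    for _ in range(n):
        candidate = generate(rng)
        score = count_violations(candidate)
        if best_score is None or score < best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins: a "structure" is a list of phi angles, and a "violation"
# is any positive phi (a deliberately crude criterion for demonstration).
def toy_generate(rng):
    return [rng.uniform(-180.0, 180.0) for _ in range(10)]


def toy_violations(phis):
    return sum(1 for phi in phis if phi > 0.0)


structure, violations = best_of_n(toy_generate, toy_violations, n=20)
print(violations)
```

Ties are resolved in favor of the earliest candidate, since a later candidate must be strictly better to replace the current best.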

📚 Interactive Tutorial Catalog

Explore synth-pdb through our curated interactive tutorials. Each notebook can be opened directly in Google Colab.

🔬 Core Biophysics & NMR

Tutorial	Difficulty	Time
The Virtual NMR Spectrometer	⭐⭐	25 min
Cryo-EM & SAXS Lab	⭐	20 min
BMRB Validation Pipeline	⭐⭐	25 min
Ubiquitin Validation Suite	⭐⭐⭐	45 min
RDC Alignment Tensor Explorer	⭐⭐	30 min
RPF Score Validation	⭐⭐	25 min
NeRF Geometry Lab	⭐⭐	25 min
Modern Formats: mmCIF & BCIF	⭐⭐	15 min
The GFP Molecular Forge	⭐⭐	30 min
IDP Conformational Ensembles	⭐⭐⭐	30 min
AlphaFold pLDDT vs NMR S²	⭐⭐⭐	35 min
GNN pLDDT Explorer	⭐⭐	30 min

🤖 ML & AI Integration

Tutorial	Difficulty	Time
Bulk Dataset Factory	⭐	15 min
Hard Decoy Challenge	⭐⭐⭐	35 min
PLM Embeddings (ESM-2)	⭐⭐	30 min
Co-evolution Factory	⭐⭐⭐	35 min
6D Orientogram Lab	⭐⭐⭐	30 min
Drug Discovery Pipeline	⭐⭐⭐	35 min

🎓 Learning Paths

Choose a path based on your background and goals:

🤖 For ML Engineers

Build AI models with synthetic protein data

AI Protein Data Factory (15 min) - Learn zero-copy data handover to PyTorch/JAX.
Bulk Dataset Factory (15 min) - Generate thousands of training samples.
Hard Decoy Challenge (35 min) - Create negative samples for robust training.
PLM Embeddings (ESM-2) (30 min) - Add evolutionary context as per-residue node features.

🔬 For Biophysicists

Understand structure, dynamics, and spectroscopy

NeRF Geometry Lab (25 min) - Learn internal coordinate systems.
Virtual NMR Spectrometer (25 min) - Predict relaxation rates and chemical shifts.
Protein Quality Assessment (25 min) - Validate structure quality and geometry.
GNN pLDDT Explorer (30 min) - Score with a GNN; interpret per-residue pLDDT; compute TM-score and lDDT.
GFP Molecular Forge (30 min) - Explore chromophore chemistry.
AlphaFold pLDDT vs NMR S² (35 min) - Contrast AlphaFold's predicted confidence with experimentally measured NMR dynamics.

💊 For Drug Designers

Design and optimize therapeutic peptides

Drug Discovery Pipeline (35 min) - End-to-end peptide library to lead selection.
Macrocycle Design Lab (20 min) - Create head-to-tail cyclic peptides.
Bio-Active Hormone Lab (20 min) - Model bioactive peptide hormones.
Hard Decoy Challenge (35 min) - Generate decoys for docking validation.

Quick Visual Demo

Run this command to generate a leucine zipper, minimize its energy using OpenMM, and visualize it in your browser:

synth-pdb --sequence "LKELEKELEKELEKELEKELEKEL" --conformation alpha --minimize --visualize
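To script the same invocation for several sequences, one option is to assemble the argument list in Python. The sketch below uses only the flags shown in the command above; the helper name `build_command` and the second sequence are assumptions for illustration, and the commands are printed rather than executed.

```python
import shlex


def build_command(sequence, conformation="alpha", minimize=False, visualize=False):
    """Assemble a synth-pdb argument list from the flags shown in this README.

    Only "alpha" is used as a conformation value here because it is the
    only value that appears in the documented command.
    """
    args = ["synth-pdb", "--sequence", sequence, "--conformation", conformation]
    if minimize:
        args.append("--minimize")
    if visualize:
        args.append("--visualize")
    return args


# The first sequence is the demo above; the second is a made-up example.
sequences = ["LKELEKELEKELEKELEKELEKEL", "AAAAAAAAAAAA"]
for seq in sequences:
    print(shlex.join(build_command(seq, minimize=True)))
```

Passing the resulting list to `subprocess.run(args, check=True)` would execute each command, assuming synth-pdb is installed and on the PATH.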

Citation

If you use synth-pdb in your research, please cite it:

@software{elkins_synth_pdb_2026,
  author = {Elkins, George},
  title = {synth-pdb: High-Performance Protein Structure Generator},
  url = {https://github.com/elkins/synth-pdb},
  version = {1.37.0},
  year = {2026}
}