synth-pdb
Generate realistic PDB files with mixed secondary structures for bioinformatics testing, education, and tool development.
⚠️ Important: The generated structures use idealized geometries and may contain violations of standard structural constraints. These files are intended for testing computational tools and educational demonstrations, not for simulation or experimental validation.
Why synth-pdb?
In structural biology and bioinformatics, researchers frequently require datasets of protein structures to test algorithms, train machine learning models, or validate analytical pipelines. While the Protein Data Bank (PDB) contains over 200,000 experimental structures, relying solely on experimental data has limitations:
- Bias: PDB data is biased toward crystallizable or stable proteins.
- Complexity: Experimental files often contain artifacts, missing atoms, or non-standard residues.
- Lack of Ground Truth: For NMR assignment or structure calculation, "perfect" synthetic data is essential for unit testing.
synth-pdb fills this gap by providing a lightweight, deterministic generator that produces chemically valid, full-atom PDB files with user-defined secondary structures (helices, sheets) in seconds.
Educational Philosophy: Code as Textbook
synth-pdb is built on the principle that scientific software should be readable and educational.
- Code as Textbook: We reject "black box" algorithms. Our source code (e.g., `generator.py`, `physics.py`) is heavily annotated with biophysical reasons, explaining concepts like Boltzmann weighting, order parameters (\(S^2\)), and NOE distance dependence (\(r^{-6}\)).
- Visual Learning: With the `--visualize` flag, students can instantly see how abstract concepts manifest in 3D, bridging the gap between equations and biology.
- Integrity: Specialized tests ensure educational notes remain in the codebase, preventing refactoring from stripping away the scientific context.
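To make the \(r^{-6}\) dependence mentioned above concrete, here is a minimal sketch. It is not taken from synth-pdb's source: the function name `relative_noe_intensity` and the 2.5 Å reference distance are illustrative assumptions, and the model is the simple isolated spin pair approximation.

```python
def relative_noe_intensity(distance: float, ref_distance: float = 2.5) -> float:
    """Return NOE intensity relative to a reference proton pair.

    Assumes the isolated spin pair approximation, where intensity
    scales as r^-6. Both distances are in angstroms.
    """
    if distance <= 0 or ref_distance <= 0:
        raise ValueError("distances must be positive")
    return (ref_distance / distance) ** 6


# Compare a few proton-proton distances against the 2.5 angstrom reference.
for r in (2.5, 3.5, 5.0):
    print(f"{r:.1f} angstrom -> relative intensity {relative_noe_intensity(r):.4f}")
```

Doubling the distance from 2.5 Å to 5.0 Å cuts the intensity by a factor of 64, which is why NOE restraints are only informative for protons that are close in space.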
Key Features
✨ Structure Generation
- Full atomic representation with backbone and side-chain heavy atoms + hydrogens.
- Customizable sequence (1-letter or 3-letter amino acid codes).
- Conformational diversity: Generate alpha helices, beta sheets, extended chains, or random conformations.
- Rotamer-based side-chain placement for all 20 standard amino acids using the Dunbrack library.
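- Advanced Chemistry: Metal coordination (Zn²⁺), disulfide bonds (SSBOND records), and post-translational modification (PTM) support for phosphoserine (SEP), phosphothreonine (TPO), and phosphotyrosine (PTR).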
🔬 Validation Suite
- Geometric Checks: Bond length, bond angle (Engh & Huber Z-scores), and peptide plane planarity.
- Ramachandran Checking: Backbone dihedrals are checked against the Top2018 high-resolution datasets.
- Physical Validation: Steric clash detection and SASA-based burial ratios.
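As a rough illustration of what a bond-length Z-score check involves, the sketch below compares a measured backbone bond against a target mean and standard deviation. The function `bond_z_score` and the `IDEAL_BONDS` table are hypothetical, not synth-pdb's API. The numbers are the commonly quoted Engh & Huber (1991) backbone values and should be treated as assumptions; the project's own reference table may differ.

```python
import math

# Commonly quoted Engh & Huber (1991) backbone targets, in angstroms: (mean, sigma).
# Illustrative values only; synth-pdb's own reference table may differ.
IDEAL_BONDS = {
    ("N", "CA"): (1.458, 0.019),
    ("CA", "C"): (1.525, 0.021),
    ("C", "N"): (1.329, 0.014),
}


def bond_z_score(atom_a, atom_b, coord_a, coord_b):
    """Return how many standard deviations a bond length lies from its target."""
    mean, sigma = IDEAL_BONDS[(atom_a, atom_b)]
    length = math.dist(coord_a, coord_b)  # Euclidean distance (Python 3.8+)
    return (length - mean) / sigma


# A slightly stretched N-CA bond placed along the x axis.
z = bond_z_score("N", "CA", (0.0, 0.0, 0.0), (1.496, 0.0, 0.0))
print(f"N-CA Z-score: {z:+.2f}")
```

A validator would typically flag bonds whose absolute Z-score exceeds some cutoff (4 is a common choice in structure validation); the example bond sits about two standard deviations above its target.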
⚙️ Quality Control & Physics
- `--best-of-N`: Generate multiple structures and select the one with the fewest violations.
- Energy Minimization: Relax structures using OpenMM (implicit solvent, AMBER force field).
- Quality Filtering: Integrated Random Forest and GNN classifiers for structural plausibility.
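The best-of-N idea described above can be sketched in a few lines. Everything here is a toy stand-in rather than synth-pdb's implementation: `generate_candidate`, `count_violations`, and `best_of_n` are hypothetical names, a "structure" is reduced to a list of phi angles, and the only "violation" counted is a positive phi angle (uncommon outside glycine).

```python
import random


def generate_candidate(rng: random.Random, n_residues: int) -> list[float]:
    """Hypothetical generator: one random backbone phi angle (degrees) per residue."""
    return [rng.uniform(-180.0, 180.0) for _ in range(n_residues)]


def count_violations(phis: list[float]) -> int:
    """Toy validator: count residues whose phi angle is positive."""
    return sum(1 for phi in phis if phi > 0.0)


def best_of_n(n: int, n_residues: int, seed: int = 0) -> tuple[list[float], int]:
    """Generate n candidates and return the one with the fewest violations."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    candidates = [generate_candidate(rng, n_residues) for _ in range(n)]
    best = min(candidates, key=count_violations)
    return best, count_violations(best)


structure, violations = best_of_n(n=10, n_residues=20)
print(f"best of 10 candidates has {violations} violations")
```

The cost grows linearly with N, so this approach trades generation time for quality without changing the generator itself.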
Interactive Tutorial Catalog
Explore synth-pdb through our curated interactive tutorials. Each notebook can be opened directly in Google Colab.
🔬 Core Biophysics & NMR
| Tutorial | Difficulty | Time | Action |
|---|---|---|---|
| The Virtual NMR Spectrometer | ⭐⭐ | 25 min | |
| Cryo-EM & SAXS Lab | ⭐ | 20 min | |
| BMRB Validation Pipeline | ⭐⭐ | 25 min | |
| Ubiquitin Validation Suite | ⭐⭐⭐ | 45 min | |
| RDC Alignment Tensor Explorer | ⭐⭐ | 30 min | |
| RPF Score Validation | ⭐⭐ | 25 min | |
| NeRF Geometry Lab | ⭐⭐ | 25 min | |
| Modern Formats: mmCIF & BCIF | ⭐⭐ | 15 min | |
| The GFP Molecular Forge | ⭐⭐ | 30 min | |
| IDP Conformational Ensembles | ⭐⭐⭐ | 30 min | |
| AlphaFold pLDDT vs NMR S² | ⭐⭐⭐ | 35 min | |
| GNN pLDDT Explorer | ⭐⭐ | 30 min | |
🤖 ML & AI Integration
| Tutorial | Difficulty | Time | Action |
|---|---|---|---|
| Bulk Dataset Factory | ⭐ | 15 min | |
| Hard Decoy Challenge | ⭐⭐⭐ | 35 min | |
| PLM Embeddings (ESM-2) | ⭐⭐ | 30 min | |
| Co-evolution Factory | ⭐⭐⭐ | 35 min | |
| 6D Orientogram Lab | ⭐⭐⭐ | 30 min | |
| Drug Discovery Pipeline | ⭐⭐⭐ | 35 min | |
Learning Paths
Choose a path based on your background and goals:
🤖 For ML Engineers
Build AI models with synthetic protein data
- AI Protein Data Factory (15 min) - Learn zero-copy data handover to PyTorch/JAX.
- Bulk Dataset Factory (15 min) - Generate thousands of training samples.
- Hard Decoy Challenge (35 min) - Create negative samples for robust training.
- PLM Embeddings (ESM-2) (30 min) - Add evolutionary context as per-residue node features.
🔬 For Biophysicists
Understand structure, dynamics, and spectroscopy
- NeRF Geometry Lab (25 min) - Learn internal coordinate systems.
- Virtual NMR Spectrometer (25 min) - Predict relaxation rates and chemical shifts.
- Protein Quality Assessment (25 min) - Validate structure quality and geometry.
- GNN pLDDT Explorer (30 min) - Score with a GNN; interpret per-residue pLDDT; compute TM-score and lDDT.
- GFP Molecular Forge (30 min) - Explore chromophore chemistry.
- AlphaFold pLDDT vs NMR S² (35 min) - Contrast AI rigidity with physical dynamics.
For Drug Designers
Design and optimize therapeutic peptides
- Drug Discovery Pipeline (35 min) - End-to-end peptide library to lead selection.
- Macrocycle Design Lab (20 min) - Create head-to-tail cyclic peptides.
- Bio-Active Hormone Lab (20 min) - Model bioactive peptide hormones.
- Hard Decoy Challenge (35 min) - Generate decoys for docking validation.
Quick Visual Demo
Run this command to generate a Leucine Zipper, minimize its energy using OpenMM, and visualize it in your browser:
Citation
If you use synth-pdb in your research, please cite it: