Project Roadmap & Future Horizons
Project: synth-pdb
Focus: Biophysical Realism, Computational Efficiency, AI Research Support
Strategy: Weighted by Highest Reward / Lowest Risk
โ Recently Completed
The following features have been successfully implemented and verified: - Batched Generation (GPU-First): Vectorized SIMD/GPU-ready data factory. - D-Amino Acid Support: Mirror-image peptide generation and validation. - Hard Decoy Generation: Gaussian drift and sequence threading for AI training. - Numba JIT Support: Core geometry kernels optimized for high-performance. - Cyclic Peptide Support: Head-to-tail cyclization for therapeutic modeling.
๐ Next Horizons (Prioritized)
Sorted by High-Value / Low-Risk first.
1. ML Integration Examples ๐ค (High Value / Low Risk)
- Concept: Provide Jupyter notebooks/scripts demonstrating zero-copy handover from
BatchedPeptideto PyTorch/JAX/MLX. - Why: Closes the loop for AI researchers. Shows how to use the "contiguous tensor" output in real training pipelines.
- Implementation: Create an
examples/ml_loading/directory with clear demonstrations.
2. Hardware Benchmarking Suite โฑ๏ธ (Medium-High Value / Low Risk)
- Concept: A formal tool to compare Serial vs. Batched performance across hardware (CPU vs. Apple Silicon M-series).
- Why: Quantifies the efficiency gains of the "GPU-First" architecture and justifies its use for large-scale data production.
- Implementation: New
benchmarks/module with automated performance plotting.
3. "Chromophore" Educational Case (GFP) ๐ (Medium-High Value / Medium Risk)
- Concept: Specialized test case for a Green Fluorescent Protein (GFP) fragment, focusing on the Ser-Tyr-Gly cyclization.
- Why: Adds coordination chemistry and post-translational modification (PTM) education.
- Implementation: Extend
cofactors.pyor createspecial_chemistry.pyto handle the unique SYG ring closure.
๐ฌ Section 1: Biophysical Realism
Goal: Generate structures that are indistinguishable from experimental data.
1. Explicit Solvent "Water Box" ๐ง
- Concept: Simulate the protein in a box of explicit water molecules (TIP3P model) rather than implicit solvent.
- Use Case: Capturing "water-bridging" interactions that stabilize specific conformations.
- Risk: Medium. Increases minimization time significantly.
2. โ Macrocycle Design Lab (Cyclic Peptides)
- Result: Implemented modeling for Oxytocin (disulfide) and Cyclosporine A (cyclic/D-amino).
- Demo: hormone_lab.ipynb
- Value: Macrocycles are at the forefront of drug discovery. An AI model that can design cyclic peptides needs a training set of realistic closed loops, which standard PDBs lack in sufficient quantity.
๐ค Section 2: AI Research Support
Goal: Provide data that trains better models (AlphaFold, RFDiffusion, etc.).
1. Co-Evolutionary Constraints (MSA) ๐งฌ
- Concept: Generate a family of sequences that imply a specific fold, simulating evolution.
- Implementation: Enhance
dataset.pyto generate correlated mutations based on the contact map.
2. โ Full Orientogram Factory (\(\omega, \theta, \phi\) angles)
- Result: Implemented vectorized 6D inter-residue orientations with virtual GLY \(C\beta\) support.
- Demo: orientogram_lab.ipynb
3. โ Bulk Dataset Factory (NPZ Pipeline)
- Result: Implemented high-performance NPZ export and PyTorch integration.
- Demo: dataset_factory.ipynb
4. โ Multi-Modal Transformer Training (Structure + NMR)
- Result: Integrated synchronized structural generation with synthetic NMR observables.
- Demo: neural_nmr_pipeline.ipynb
- Value: Demonstrates how synth-pdb can train models to predict experimental observables (like NMR shifts) directly from sequence or 3D geometry.
5. ๐ Torsion Angle Drift & Distribution Analysis
- Objective: Statistically analyze how adding "Drift" to Ramachandran angles affects the overall fold and the radius of gyration.
- Key Features:
--drift,dihedral_backbone, andBatchedGenerator. - Value: Useful for researchers testing the robustness of their GNNs (Graph Neural Networks) to small backbone perturbations.
๐ Long-Term / High Risk
- Protein-Protein Docking: Complex multi-chain physics.
- Metadynamics Simulation: Better suited for dedicated MD engines.
- Fragment Assembly (Rosetta-style): Massive database dependencies.