generator Module
The generator module is the core of synth-pdb, responsible for creating protein structures from amino acid sequences.
Overview
The generator uses the NeRF (Natural Extension Reference Frame) algorithm to build 3D protein structures from internal coordinates (bond lengths, angles, and dihedrals).
Main Classes
PeptideGenerator
Object-oriented wrapper for protein structure generation.
This class provides a stateful interface for generating synthetic protein structures. It allows users to pre-configure generation parameters (like forcefields or PTM rates) and then generate multiple structures from the same configuration.
EDUCATIONAL RATIONALE: Encapsulating the generation logic in a class makes it easier to manage complex experiments, such as generating an ensemble of decoys with varying levels of torsion drift.
Source code in synth_pdb/generator.py
Functions
__init__(sequence='ALA-GLY-SER', **kwargs)
Initialize the generator with a target sequence and config.
generate(**overrides)
Generates the protein structure and returns a Result object.
Supports on-the-fly overrides for any configuration parameter.
Main Functions
generate_pdb_content(length: int | None = None, sequence_str: str | None = None, use_plausible_frequencies: bool = False, conformation: str = 'alpha', structure: str | None = None, optimize_sidechains: bool = False, minimize_energy: bool = False, forcefield: str = 'amber14-all.xml', solvent_model: str = 'obc2', solvent_padding: float = 1.0, keep_solvent: bool = False, seed: int | None = None, ph: float = 7.4, cap_termini: bool = False, equilibrate: bool = False, equilibrate_steps: int = 1000, metal_ions: str = 'auto', minimization_k: float = 10.0, minimization_max_iter: int = 0, cis_proline_frequency: float = 0.05, phosphorylation_rate: float = 0.0, cyclic: bool = False, drift: float = 0.0, phi_list: list[float] | None = None, psi_list: list[float] | None = None, omega_list: list[float] | None = None, platform: str | None = None, precision: str | None = None, output_format: str = 'pdb') -> str | bytes
Generates a realistic protein structure in PDB, mmCIF, or BinaryCIF format.
EDUCATIONAL NOTE - New Feature: Cyclic Peptides
Cyclic peptides have their N-terminus bonded to their C-terminus. This modification increases metabolic stability and is common in therapeutic peptides (e.g., Cyclosporine).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| length | int \| None | Number of residues (ignored if sequence_str provided) | None |
| sequence_str | str \| None | Explicit amino acid sequence (1-letter or 3-letter codes) | None |
| use_plausible_frequencies | bool | Use biologically realistic amino acid frequencies | False |
| conformation | str | Default secondary structure conformation. Options: 'alpha', 'beta', 'ppii', 'extended', 'random'. Default: 'alpha' (alpha helix). Used for all residues if structure is not provided, or for residues not specified in the structure parameter. | 'alpha' |
| structure | str \| None | Per-region conformation specification (NEW!). Format: "start-end:conformation,start-end:conformation,...". Example: "1-10:alpha,11-15:random,16-30:beta". If provided, overrides conformation for the specified regions. Unspecified residues use the default conformation parameter. | None |
| drift | float | Maximum random perturbation applied to phi/psi angles (degrees). Used for "hard decoy" generation to create near-native conformations. | 0.0 |
| optimize_sidechains | bool | Run Monte Carlo side-chain optimization | False |
| minimize_energy | bool | Run OpenMM energy minimization (REQUIRED for cyclic closure) | False |
| forcefield | str | Forcefield to use for minimization | 'amber14-all.xml' |
| seed | int \| None | Random seed for reproducible generation | None |
| ph | float | pH for titration | 7.4 |
| cap_termini | bool | Add ACE/NME caps (Disabled if cyclic=True) | False |
| equilibrate | bool | Run MD equilibration | False |
| equilibrate_steps | int | Number of MD steps | 1000 |
| metal_ions | str | Handle metal ions | 'auto' |
| minimization_k | float | Tolerance | 10.0 |
| minimization_max_iter | int | Max iterations | 0 |
| cis_proline_frequency | float | Frequency of cis-proline | 0.05 |
| phosphorylation_rate | float | Frequency of phosphorylation | 0.0 |
| cyclic | bool | Whether to generate a cyclic peptide (Head-to-Tail) | False |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str \| bytes | Complete file content in the requested output_format (PDB by default) |
Raises:
| Type | Description |
|---|---|
| ValueError | If invalid conformation name or structure syntax provided |
EDUCATIONAL NOTE - Why Per-Region Conformations Matter:
Real proteins have mixed secondary structures. For example:
- Zinc fingers: beta sheets + alpha helices
- Immunoglobulins: multiple beta sheets connected by loops
- Helix-turn-helix motifs: two alpha helices connected by a turn
This feature allows users to create these realistic structures.
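The structure string format ("start-end:conformation,...") can be expanded into one conformation per residue in a few lines. The following is a minimal sketch, not synth-pdb's own parser: the function name parse_structure_spec and the 1-based, inclusive residue numbering are assumptions made for illustration. It raises ValueError on an unknown conformation name or malformed region, mirroring the documented error behavior.

```python
VALID_CONFORMATIONS = {"alpha", "beta", "ppii", "extended", "random"}


def parse_structure_spec(spec, n_residues, default="alpha"):
    """Expand 'start-end:conformation,...' into one conformation per residue.

    Hypothetical helper: assumes 1-based, inclusive residue ranges.
    """
    if default not in VALID_CONFORMATIONS:
        raise ValueError(f"unknown conformation: {default!r}")
    # Start with the default everywhere, then overwrite the listed regions.
    per_residue = [default] * n_residues
    for region in filter(None, (part.strip() for part in spec.split(","))):
        try:
            span, name = region.split(":")
            start_text, end_text = span.split("-")
            start, end = int(start_text), int(end_text)
        except ValueError:
            raise ValueError(f"bad region syntax: {region!r}") from None
        name = name.strip()
        if name not in VALID_CONFORMATIONS:
            raise ValueError(f"unknown conformation: {name!r}")
        if not 1 <= start <= end <= n_residues:
            raise ValueError(f"region out of range: {region!r}")
        for index in range(start - 1, end):
            per_residue[index] = name
    return per_residue


# Residues 1-3 are helical, 4-5 random, and 6-7 fall back to the default.
layout = parse_structure_spec("1-3:alpha,4-5:random", 7, default="beta")
```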
EDUCATIONAL NOTE - Macrocyclization (Cyclic Peptides):
Cyclic peptides (macrocycles) are chains where the N-terminus and C-terminus are covalently linked. This has profound biological implications:
1. Metabolic Stability: Resistance to exopeptidases that chew protein ends.
2. Binding Affinity: By "locking" the molecule into a specific shape, the entropic penalty of binding to a target is greatly reduced.
3. Bioavailability: Many legendary drugs (like Cyclosporine A) are macrocycles.
EDUCATIONAL NOTE - Multi-Chain Complex Generation (Phase 16):
A major frontier in structural biology is the study of the "Interactome": how individual proteins assemble into complexes. This generator supports the creation of dimers, trimers, and larger multimers by accepting multiple sequences separated by a colon (':').
Spatial Rationale: When generating complexes, each chain is placed in its own coordinate frame. We apply a deterministic spatial offset to each chain based on its index to ensure that they do not overlap initially. This allows users to then perform energy minimization or manual docking to find the native interface.
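The idea of splitting a colon-separated input and giving each chain a deterministic, index-based offset can be sketched as follows. This is an illustration only: the helper names, the 50 Å spacing, and the choice to translate along the x-axis are assumptions, not the values or functions synth-pdb uses.

```python
def split_chains(sequence_str):
    """Split a colon-separated multi-chain string into per-chain sequences."""
    return [chain for chain in sequence_str.split(":") if chain]


def chain_offset(chain_index, spacing=50.0):
    """Deterministic translation (in Å) for a chain.

    Assumption: chains are spread along x; the spacing value is hypothetical.
    """
    return (chain_index * spacing, 0.0, 0.0)


def translate(coords, offset):
    """Shift every (x, y, z) coordinate by the given offset."""
    dx, dy, dz = offset
    return [(x + dx, y + dy, z + dz) for x, y, z in coords]


# Each chain gets its own offset, so the starting frames do not coincide.
chains = split_chains("ACDE:FGHI:KLMN")
offsets = [chain_offset(index) for index in range(len(chains))]
```

Because the offset depends only on the chain index, the same input always produces the same starting arrangement.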
EDUCATIONAL NOTE - Hard Decoy Support (AI Training):
This generator includes specialized parameters for "Hard Decoy" generation:
1. Torsion Drift (drift): Adds controlled Gaussian noise to ideal \(\phi/\psi\) angles. This simulates "near-native" local structural errors that challenge the resolution of AI scoring functions.
2. Threading (phi_list, psi_list, omega_list): Allows constructing one sequence using the backbone torsion angles of another. This maps a "wrong" sequence to a "right" fold, a key test for discriminative models.
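Torsion drift can be pictured as noise added to a list of ideal backbone angles. The sketch below is an assumption-laden illustration rather than synth-pdb's implementation: it treats drift as the standard deviation of zero-mean Gaussian noise (the documentation also describes it as a maximum perturbation, so the real code may clip or scale differently), and the function names are hypothetical.

```python
import random


def wrap_angle(angle):
    """Wrap an angle in degrees into the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def apply_torsion_drift(phi_list, psi_list, drift, seed=None):
    """Return perturbed copies of phi/psi angles (degrees).

    Assumption: `drift` is used as the Gaussian standard deviation.
    """
    rng = random.Random(seed)  # local generator keeps results reproducible

    def perturb(angles):
        return [wrap_angle(a + rng.gauss(0.0, drift)) for a in angles]

    return perturb(phi_list), perturb(psi_list)


# Ideal alpha-helix torsions for four residues, perturbed by ~5 degrees.
ideal_phi = [-57.0] * 4
ideal_psi = [-47.0] * 4
decoy_phi, decoy_psi = apply_torsion_drift(ideal_phi, ideal_psi, drift=5.0, seed=42)
```

The perturbed lists could then be supplied through phi_list and psi_list, which is also how threading reuses the torsions of another structure.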
Usage Examples
Basic Generation
from synth_pdb.generator import PeptideGenerator
# Create generator
gen = PeptideGenerator("ALA-GLY-SER-LEU-VAL")
# Generate structure
peptide = gen.generate(conformation="alpha")
# Get PDB content
pdb_content = peptide.to_pdb()
# Save to file
with open("output.pdb", "w") as f:
    f.write(pdb_content)
Mixed Secondary Structures
# Helix-turn-helix motif
gen = PeptideGenerator("ACDEFGHIKLMNPQRSTVWY")
peptide = gen.generate(
    structure="1-5:alpha,6-10:random,11-15:alpha"
)
Random Sequence Generation
from synth_pdb.generator import generate_pdb_content
# Generate random 20-residue peptide
pdb_content = generate_pdb_content(
    length=20,
    conformation="random",
    use_plausible_frequencies=True  # Use biologically realistic frequencies
)
With Energy Minimization
pdb_content = generate_pdb_content(
    sequence_str="LKELEKELEKELEKEL",  # Leucine zipper
    conformation="alpha",
    minimize_energy=True,
    cap_termini=True
)
Helper Functions
_resolve_sequence = _get_sequence
module-attribute
_sample_ramachandran_angles(res_name, next_res_name=None, rng=None)
Sample phi/psi angles from Ramachandran probability distribution.
Uses residue-specific distributions for GLY and PRO, general distribution for all other amino acids. Samples from favored regions using weighted Gaussian distributions.
New Feature: Pre-Proline Bias
If next_res_name is 'PRO' and the current residue is not GLY or PRO, a specific 'PRE_PRO' distribution is used (favors beta/extended).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| res_name | str | Three-letter amino acid code | required |
| next_res_name | str \| None | (Optional) Code of the next residue | None |
| rng | Random \| None | Optional local random generator. | None |
Returns:
| Type | Description |
|---|---|
| tuple[float, float] | Tuple of (phi, psi) angles in degrees |
Reference
Lovell et al. (2003) Proteins: Structure, Function, and Bioinformatics
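Sampling from favored regions with weighted Gaussian distributions can be sketched as a two-step draw: pick a basin according to its weight, then draw phi and psi around that basin's center. The basin centers, widths, and weights below are illustrative placeholders, not the residue-specific tables synth-pdb uses, and sample_phi_psi is a hypothetical name.

```python
import random

# Illustrative basins only: (weight, phi_mean, psi_mean, sigma), in degrees.
GENERAL_BASINS = [
    (0.55, -63.0, -43.0, 15.0),   # right-handed alpha-helical region
    (0.40, -120.0, 130.0, 20.0),  # beta / extended region
    (0.05, 57.0, 47.0, 15.0),     # left-handed helical region
]


def wrap_angle(angle):
    """Wrap an angle in degrees into the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def sample_phi_psi(basins=GENERAL_BASINS, rng=None):
    """Draw one (phi, psi) pair from a weighted mixture of Gaussians."""
    if rng is None:
        rng = random.Random()
    weights = [basin[0] for basin in basins]
    # Step 1: choose a basin in proportion to its weight.
    _, phi_mean, psi_mean, sigma = rng.choices(basins, weights=weights, k=1)[0]
    # Step 2: sample around the basin center and wrap into range.
    phi = wrap_angle(rng.gauss(phi_mean, sigma))
    psi = wrap_angle(rng.gauss(psi_mean, sigma))
    return phi, psi


# Passing a seeded local generator makes the draw reproducible.
phi, psi = sample_phi_psi(rng=random.Random(7))
```

A residue-specific version would swap in a different basin table for GLY, PRO, or pre-proline positions while keeping the same sampling logic.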
_detect_disulfide_bonds(peptide)
Detect potential disulfide bonds between cysteine residues.
EDUCATIONAL NOTE - Disulfide Bond Detection:
Disulfide bonds form between two cysteine (CYS) residues when their sulfur atoms (SG) are close enough to form a covalent S-S bond.
Detection Criteria:
- Both residues must be CYS
- SG-SG distance: 2.0-2.2 Å (slightly relaxed from ideal 2.0-2.1 Å)
- Only report each pair once (avoid duplicates)
Why Distance Matters:
- < 2.0 Å: Too close (steric clash, not realistic)
- 2.0-2.1 Å: Ideal disulfide bond distance
- 2.1-2.2 Å: Acceptable (allows for flexibility)
- > 2.2 Å: Too far (no covalent bond possible)
Biological Context:
- Disulfides stabilize protein structure
- Common in extracellular proteins
- Rare in cytoplasm (reducing environment)
- Important for protein folding and stability
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| peptide | AtomArray | Biotite AtomArray structure | required |
Returns:
| Type | Description |
|---|---|
| list | List of tuples (res_id1, res_id2) representing disulfide bonds |
Example
disulfides = _detect_disulfide_bonds(structure)
print(disulfides)
# Output: [(3, 8), (12, 20)]  (CYS 3-8 and CYS 12-20 are bonded)
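The detection criteria above reduce to a pairwise distance check over cysteine SG atoms. The sketch below illustrates that logic on plain tuples instead of a Biotite AtomArray, so the input layout and the name find_disulfides are assumptions for illustration; only the CYS/SG filter and the 2.0-2.2 Å window come from the description.

```python
import math
from itertools import combinations


def find_disulfides(atoms, min_dist=2.0, max_dist=2.2):
    """Return (res_id1, res_id2) pairs whose SG atoms are within bonding range.

    `atoms` is an iterable of (res_id, res_name, atom_name, (x, y, z)) tuples,
    a simplified stand-in for a real structure object.
    """
    # Keep only the sulfur atoms of cysteine residues.
    sg_atoms = [
        (res_id, xyz)
        for res_id, res_name, atom_name, xyz in atoms
        if res_name == "CYS" and atom_name == "SG"
    ]
    bonds = []
    # combinations() visits each unordered pair exactly once (no duplicates).
    for (id1, xyz1), (id2, xyz2) in combinations(sg_atoms, 2):
        if min_dist <= math.dist(xyz1, xyz2) <= max_dist:
            bonds.append((id1, id2))
    return bonds


toy_atoms = [
    (3, "CYS", "SG", (0.0, 0.0, 0.0)),
    (8, "CYS", "SG", (2.05, 0.0, 0.0)),   # 2.05 Å from residue 3
    (12, "CYS", "SG", (10.0, 0.0, 0.0)),  # too far from both
    (5, "SER", "OG", (1.0, 1.0, 0.0)),    # not a cysteine sulfur
]
bonded = find_disulfides(toy_atoms)
```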
Educational Notes
NeRF Algorithm
The NeRF (Natural Extension Reference Frame) algorithm builds 3D structures from internal coordinates:
- Bond Length: Distance between consecutive atoms (e.g., N-CA = 1.46 Å)
- Bond Angle: Angle formed by three consecutive atoms (e.g., N-CA-C = 111°)
- Dihedral Angle: Torsion angle formed by four consecutive atoms (e.g., phi, psi)
Mathematical Foundation:
Given three atoms (A, B, C) and internal coordinates (bond_length, bond_angle, dihedral), the position of a new atom D is calculated by:
- Creating a local coordinate system at C
- Rotating by the dihedral angle
- Placing D at the specified bond length and angle
This allows building complex 3D structures from simple 1D sequences.
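The placement step can be written out directly. The following is a minimal pure-Python sketch of the general NeRF construction, not the code in synth_pdb/generator.py: the function names are hypothetical, and the example values (a 1.52 Å bond, a 111° angle, a -60° dihedral) are illustrative. A dihedral helper is included so the result can be checked against the requested torsion.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(v):
    norm = math.sqrt(dot(v, v))
    return tuple(x / norm for x in v)


def place_atom(a, b, c, bond_length, bond_angle_deg, dihedral_deg):
    """Place atom D given atoms A, B, C and the internal coordinates
    |CD| (bond length), angle B-C-D, and dihedral A-B-C-D (degrees)."""
    theta = math.radians(bond_angle_deg)
    chi = math.radians(dihedral_deg)
    # Local orthonormal frame at C: bc along the B->C bond,
    # n normal to the A-B-C plane, m completing the frame.
    bc = unit(sub(c, b))
    n = unit(cross(sub(b, a), bc))
    m = cross(n, bc)
    # Coordinates of D in the local frame, rotated by the dihedral.
    local = (-bond_length * math.cos(theta),
             bond_length * math.sin(theta) * math.cos(chi),
             bond_length * math.sin(theta) * math.sin(chi))
    return tuple(c[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i]
                 for i in range(3))


def dihedral(a, b, c, d):
    """Signed dihedral angle A-B-C-D in degrees."""
    b1, b2, b3 = sub(b, a), sub(c, b), sub(d, c)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = dot(b2, cross(n1, n2))
    x = math.sqrt(dot(b2, b2)) * dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Three arbitrary, non-collinear starting atoms and one placed atom.
atom_a = (-0.5, 1.3, 0.2)
atom_b = (0.0, 0.0, 0.0)
atom_c = (1.46, 0.1, 0.0)
atom_d = place_atom(atom_a, atom_b, atom_c, 1.52, 111.0, -60.0)
```

Repeating this step with the last three placed atoms as the new A, B, C extends the chain one atom at a time.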
B-factor Calculation
B-factors (temperature factors) represent atomic mobility:
\[ B = 8\pi^2 \langle u^2 \rangle \]
Where \(\langle u^2 \rangle\) is the mean square displacement.
synth-pdb calculates B-factors from Order Parameters (\(S^2\)) using the Lipari-Szabo formalism:
Realistic Ranges:
- Backbone atoms: 15-25 Ų
- Side-chain atoms: 20-35 Ų
- Terminal residues: 30-50 Ų
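The standard crystallographic relation between a B-factor and the mean square displacement, \(B = 8\pi^2 \langle u^2 \rangle\), is easy to check numerically. The sketch below covers only that conversion; it does not reproduce the mapping from order parameters (\(S^2\)) that synth-pdb applies, and the function names are hypothetical.

```python
import math

EIGHT_PI_SQUARED = 8.0 * math.pi ** 2


def b_factor_from_msd(mean_square_displacement):
    """Convert <u^2> (in Å^2) to an isotropic B-factor (in Å^2)."""
    return EIGHT_PI_SQUARED * mean_square_displacement


def msd_from_b_factor(b_factor):
    """Convert an isotropic B-factor (in Å^2) back to <u^2> (in Å^2)."""
    return b_factor / EIGHT_PI_SQUARED


# A mean square displacement of 0.25 Å^2 (RMS displacement 0.5 Å)
# corresponds to a B-factor of roughly 19.7 Å^2.
backbone_b = b_factor_from_msd(0.25)
```

By this relation, the 15-25 Ų backbone range listed above corresponds to mean square displacements of roughly 0.19-0.32 Ų.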
See Also
- geometry Module - 3D coordinate calculations
- physics Module - Energy minimization
- validator Module - Structure validation
- Scientific Background: NeRF Geometry