msa Module
The msa module implements a physical sequence-level evolutionary simulator to generate Multiple Sequence Alignments (MSAs) with co-evolutionary constraints.
Overview
Based on Direct Coupling Analysis (DCA) theory, this module models sequence probability using a Potts Energy Model. It uses a Metropolis-Hastings Markov Chain Monte Carlo (MCMC) algorithm to simulate evolutionary drift, ensuring that produced sequences respect the native 3D fold (Contact Map).
Main Classes
CoevolutionModel
Defines the statistical Potts Model energy landscape for a given Protein Fold.
The energy function is defined as
E(S) = sum_i(h_i(S_i)) + sum_{i<j}(J_ij(S_i, S_j))
Where: - h_i is the local site preference (Fields) - J_ij is the pairwise coupling constraint between contacting residues.
Source code in synth_pdb/msa.py
40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 | |
Functions
__init__(base_sequence, contact_map, rel_sasa=None)
Initialize the Coevolution Model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
base_sequence
|
str
|
The native, starting sequence. |
required |
contact_map
|
ndarray
|
N x N boolean matrix where True indicates physical contact. |
required |
Source code in synth_pdb/msa.py
calculate_energy(sequence)
Calculate the pseudo-energy of a given sequence on this fold's energetic landscape. Lower energy indicates a more stable, evolutionarily favored protein.
Source code in synth_pdb/msa.py
calculate_delta_energy(current_seq, site1, new_aa1, site2=None, new_aa2=None)
Calculate the change in energy (Delta E) for 1 or 2 simultaneous mutations in O(L) time.
Educational Note: Big-O Performance Breakthrough
A full evaluation of the Potts Model energy function requires summing the J_ij matrix across all N residues, an O(N^2) operation. For a 200-residue sequence, that's 40,000 lookups per step. Over 1 million generations, that's 40 billion operations!
Because the MCMC sampler only ever mutates 1 or 2 residues at a time, 99% of the pairwise interactions remain completely unchanged. Instead of brute-force calculating the new system energy entirely from scratch, we can simply calculate the difference (Delta):
- Subtract the old energy of the mutated sites.
- Add the new energy of the mutated sites.
This reduces the computational complexity to strictly O(L) (where L is the number of contacts for the mutated site), saving over 500x computation time and allowing for blisteringly fast synthetic MSA generation architectures.
Source code in synth_pdb/msa.py
MetropolisHastingsSampler
Simulates the evolutionary drift of a sequence using Markov Chain Monte Carlo. A mutation is proposed: - If Energy decreases: It is always accepted (favorable). - If Energy increases: It is accepted with probability P = exp(-DeltaE / T).
Source code in synth_pdb/msa.py
224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 | |
Functions
__init__(model, temperature=1.0, coupled_mutation_prob=0.2)
Source code in synth_pdb/msa.py
start(base_sequence)
step()
Propose exactly ONE or TWO point mutations and probabilistically accept or reject them. Returns True if the mutation was accepted.
Source code in synth_pdb/msa.py
Main Functions
generate_msa(base_sequence, contact_map, num_sequences=100, temperature=1.0, steps_between_samples=20, rel_sasa=None, coupled_mutation_prob=0.2)
Generates a Synthetic Multiple Sequence Alignment (MSA) imbued with Co-Evolutionary constraints.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
base_sequence
|
str
|
Starting 'wild type' amino acid sequence string. |
required |
contact_map
|
ndarray
|
N x N valid binary contact map bounding the 3D structure. |
required |
num_sequences
|
int
|
Number of homologous sequences to scrape from the simulation. |
100
|
temperature
|
float
|
"Thermal Noise" of evolution. Higher = more divergence, lower contact fidelity. |
1.0
|
steps_between_samples
|
int
|
Number of MCMC mutation proposals to try between saving a sequence. |
20
|
rel_sasa
|
ndarray | None
|
(Optional) Array of lengths N indicating solvent accessibility. Core residues will heavily favor aliphatics. |
None
|
coupled_mutation_prob
|
float
|
(Optional) Probability of simultaneously mutating two contacting residues to traverse steric gaps (The Magic Step). |
0.2
|
Returns:
| Type | Description |
|---|---|
list[str]
|
List of strings representing the aligned MSA. |
Source code in synth_pdb/msa.py
Usage Examples
Generating a Synthetic MSA
import numpy as np
from synth_pdb.msa import generate_msa
# base_sequence: str
# contact_map: np.ndarray (N x N boolean)
msa = generate_msa(
base_sequence="ACDEFGHIKL",
contact_map=my_contact_map,
num_sequences=50,
temperature=1.0
)
for seq in msa:
print(seq)
Educational Notes
Hydrophobic Core Collapse
Solvent Accessible Surface Area (SASA) is the physical mechanism mapping 3D structure back to 1D sequence constraints. If a residue is "buried" deep inside the protein core (low SASA), evolutionary drift must strictly eliminate charged/polar mutations. Placing a hydrophilic amino acid in the water-free hydrophobic core would rupture the hydrogen-bond network and unfold the protein. The msa module enforces this by penalizing hydrophilic mutations at buried positions.
Electrostatic Compatibility
Proteins use localized regions of electrical charge (Salt Bridges) to lock their tertiary folds into stable, lower-energy states. Conversely, placing two like-charges in close proximity causes strong Coulombic repulsion. The Potts model in this module rewards opposite-charge pairs while aggressively penalizing like-charge complexes in contacting residues.
The "Magic Step" Coupled Mutation
In traditional MCMC, only one site is mutated at a time. However, in evolution, getting from a [Large:Small] pair to a [Small:Large] pair is impossible if the intermediate [Large:Large] state causes a massive steric clash. The "Magic Step" proposes mutations at two contacting residues simultaneously, allowing the simulation to traverse these steric gaps and capture true Direct Coupling covariance.
References
- Direct Coupling Analysis (DCA): Morcos, F., et al. (2011). "Direct-coupling analysis of residue coevolution captures native contacts across many protein families." Proceedings of the National Academy of Sciences (PNAS). DOI: 10.1073/pnas.1111471108
- Potts Models in Evolution: Weigt, M., et al. (2009). "Identification of direct residue contacts in protein-protein interaction by message passing." PNAS. DOI: 10.1073/pnas.0805923106
See Also
- generator Module - Generating the initial 3D fold
- Scientific Background: Co-Evolution & MSAs