Skip to content

msa Module

The msa module implements a physical sequence-level evolutionary simulator to generate Multiple Sequence Alignments (MSAs) with co-evolutionary constraints.

Overview

Based on Direct Coupling Analysis (DCA) theory, this module models sequence probability using a Potts Energy Model. It uses a Metropolis-Hastings Markov Chain Monte Carlo (MCMC) algorithm to simulate evolutionary drift, ensuring that produced sequences respect the native 3D fold (Contact Map).

Main Classes

CoevolutionModel

Defines the statistical Potts Model energy landscape for a given Protein Fold.

The energy function is defined as

E(S) = sum_i(h_i(S_i)) + sum_{i<j}(J_ij(S_i, S_j))

Where: - h_i is the local site preference (Fields) - J_ij is the pairwise coupling constraint between contacting residues.

Source code in synth_pdb/msa.py
class CoevolutionModel:
    """Defines the statistical Potts Model energy landscape for a given Protein Fold.

    The energy function is defined as:
        E(S) = sum_i(h_i(S_i)) + sum_{i<j}(J_ij(S_i, S_j))
    Where:
        - h_i is the local site preference (Fields)
        - J_ij is the pairwise coupling constraint between contacting residues.
    """

    def __init__(self, base_sequence: str, contact_map: np.ndarray, rel_sasa: np.ndarray | None = None):
        """Initialize the Coevolution Model.

        Args:
            base_sequence (str): The native, starting sequence.
            contact_map (np.ndarray): N x N boolean matrix where True indicates physical contact.

        """
        self.length = len(base_sequence)
        self.contact_map = contact_map
        self.vocab_size = len(AMINO_ACIDS)
        self.aa_to_idx = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

        # Initialize empty Fields and Couplings
        self.fields = np.zeros((self.length, self.vocab_size))
        self.couplings = np.zeros((self.length, self.length, self.vocab_size, self.vocab_size))

        self._build_local_fields(rel_sasa)
        self._build_physical_couplings()

    def _build_local_fields(self, rel_sasa: np.ndarray | None) -> None:
        """Constructs the h_i local fields to enforce SASA constraints.
        Buried residues (rel_sasa < 0.2) receive massive energy penalties for Polar/Charged AAs.

        Educational Note: Hydrophobic Core Collapse
        -------------------------------------------
        Solvent Accessible Surface Area (SASA) is the physical mechanism mapping
        the 3D structural core back to 1D sequence constraints.
        If a residue is statistically "buried" deep inside the protein core
        with extreme desolvation, evolutionary drift must strictly eliminate
        charged/polar mutations. Placing an Arginine or Aspartate in the
        water-free hydrophobic core will rupture the hydrogen-bond network and
        unfold the protein. We enforce this geometrically by massively penalizing
        the h_i local fields for Hydrophilic amino acids at any position where
        rel_sasa < 0.2.
        """
        if rel_sasa is None:
            return  # No structural pressure

        for i in range(self.length):
            # Check if buried
            if rel_sasa[i] < 0.2:
                for aa in POLAR_CHARGED_AMINO_ACIDS:
                    aa_idx = self.aa_to_idx[aa]
                    self.fields[i, aa_idx] += 100.0  # Massive penalty

    def _build_physical_couplings(self) -> None:
        """Constructs the J_ij coupling matrix based on simplified sterics and electrostatics.
        If residues closely pack in 3D space, massive-massive combinations are penalized (steric clash).
        To lower the energy back down, one residue must mutate to a Tiny volume (compensatory mutation).
        We also strictly penalize like-charges and reward opposite-charges (Salt Bridges).
        """
        for i in range(self.length):
            for j in range(i + 1, self.length):
                if self.contact_map[i, j]:
                    # They are in 3D contact! They must co-evolve.
                    for aa1_idx, aa1 in enumerate(AMINO_ACIDS):
                        vol1 = RESIDUE_VOLUMES[aa1]
                        for aa2_idx, aa2 in enumerate(AMINO_ACIDS):
                            vol2 = RESIDUE_VOLUMES[aa2]

                            # Simple Steric Potential:
                            total_vol = vol1 + vol2

                            if total_vol > 5:
                                penalty = (total_vol - 5) * 10.0  # Extreme steric clash penalty
                            elif total_vol < 2:
                                penalty = (2 - total_vol) * 5.0  # Moderate void penalty
                            else:
                                penalty = 0.0  # Perfect packing

                            # Educational Note: Electrostatic Compatibility
                            # ---------------------------------------------
                            # Proteins utilize localized regions of electrical charge (Salt Bridges)
                            # to lock their tertiary folds into favorable lower-energy states.
                            # Conversely, slamming two like-charges together completely blows up
                            # the stability of the construct via Coloumbic repulsion.
                            # We deeply embed these physical phenomena directly into the J_ij
                            # interaction couplings, rewarding +/- pairs while aggressively
                            # destroying +/+ or -/- complexes.

                            # Charge Compatibility
                            if aa1 in POSITIVE_AMINO_ACIDS and aa2 in NEGATIVE_AMINO_ACIDS:
                                penalty -= 10.0  # Favorable Salt Bridge
                            elif aa1 in NEGATIVE_AMINO_ACIDS and aa2 in POSITIVE_AMINO_ACIDS:
                                penalty -= 10.0  # Favorable Salt Bridge
                            elif aa1 in POSITIVE_AMINO_ACIDS and aa2 in POSITIVE_AMINO_ACIDS:
                                penalty += 15.0  # Charge Repulsion
                            elif aa1 in NEGATIVE_AMINO_ACIDS and aa2 in NEGATIVE_AMINO_ACIDS:
                                penalty += 15.0  # Charge Repulsion

                            # Symmetric coupling
                            self.couplings[i, j, aa1_idx, aa2_idx] = penalty
                            self.couplings[j, i, aa2_idx, aa1_idx] = penalty

    def calculate_energy(self, sequence: str) -> float:
        """Calculate the pseudo-energy of a given sequence on this fold's energetic landscape.
        Lower energy indicates a more stable, evolutionarily favored protein.
        """
        energy = 0.0

        # Add local fields (h_i) - currently zeroed out for pure co-evolution focus
        for i, aa in enumerate(sequence):
            aa_idx = self.aa_to_idx[aa]
            energy += self.fields[i, aa_idx]

        # Add pairwise couplings (J_ij)
        for i in range(self.length):
            aa1_idx = self.aa_to_idx[sequence[i]]
            for j in range(i + 1, self.length):
                if self.contact_map[i, j]:
                    aa2_idx = self.aa_to_idx[sequence[j]]
                    energy += self.couplings[i, j, aa1_idx, aa2_idx]

        return energy

    def calculate_delta_energy(self, current_seq: str, site1: int, new_aa1: str, site2: int | None = None, new_aa2: str | None = None) -> float:
        """Calculate the change in energy (Delta E) for 1 or 2 simultaneous mutations in O(L) time.

        Educational Note: Big-O Performance Breakthrough
        ------------------------------------------------
        A full evaluation of the Potts Model energy function requires summing the J_ij
        matrix across all N residues, an O(N^2) operation. For a 200-residue sequence,
        that's 40,000 lookups per step. Over 1 million generations, that's 40 billion operations!

        Because the MCMC sampler only ever mutates 1 or 2 residues at a time, 99% of the
        pairwise interactions remain completely unchanged. Instead of brute-force calculating
        the new system energy entirely from scratch, we can simply calculate the difference (Delta):

        1. Subtract the old energy of the mutated sites.
        2. Add the new energy of the mutated sites.

        This reduces the computational complexity to strictly O(L) (where L is the number of
        contacts for the mutated site), saving over 500x computation time and allowing for
        blisteringly fast synthetic MSA generation architectures.
        """
        delta = 0.0

        old_aa1 = current_seq[site1]
        old_idx1 = self.aa_to_idx[old_aa1]
        new_idx1 = self.aa_to_idx[new_aa1]

        delta += self.fields[site1, new_idx1] - self.fields[site1, old_idx1]

        if site2 is not None and new_aa2 is not None:
            old_aa2 = current_seq[site2]
            old_idx2 = self.aa_to_idx[old_aa2]
            new_idx2 = self.aa_to_idx[new_aa2]
            delta += self.fields[site2, new_idx2] - self.fields[site2, old_idx2]

            # Coupling between the two mutating sites themselves
            if self.contact_map[site1, site2]:
                old_coup = self.couplings[site1, site2, old_idx1, old_idx2]
                new_coup = self.couplings[site1, site2, new_idx1, new_idx2]
                delta += (new_coup - old_coup)
        else:
            site2 = -1
            old_idx2 = -1
            new_idx2 = -1

        # Coupling with the rest of the sequence
        for k in range(self.length):
            if k == site1 or k == site2:
                continue
            seq_idx_k = self.aa_to_idx[current_seq[k]]
            if self.contact_map[site1, k]:
                delta += self.couplings[site1, k, new_idx1, seq_idx_k] - self.couplings[site1, k, old_idx1, seq_idx_k]

            if site2 != -1 and self.contact_map[site2, k]:
                delta += self.couplings[site2, k, new_idx2, seq_idx_k] - self.couplings[site2, k, old_idx2, seq_idx_k]

        return float(delta)

Functions

__init__(base_sequence, contact_map, rel_sasa=None)

Initialize the Coevolution Model.

Parameters:

Name Type Description Default
base_sequence str

The native, starting sequence.

required
contact_map ndarray

N x N boolean matrix where True indicates physical contact.

required
Source code in synth_pdb/msa.py
def __init__(self, base_sequence: str, contact_map: np.ndarray, rel_sasa: np.ndarray | None = None):
    """Initialize the Coevolution Model.

    Args:
        base_sequence (str): The native, starting sequence.
        contact_map (np.ndarray): N x N boolean matrix where True indicates physical contact.

    """
    self.length = len(base_sequence)
    self.contact_map = contact_map
    self.vocab_size = len(AMINO_ACIDS)
    self.aa_to_idx = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

    # Initialize empty Fields and Couplings
    self.fields = np.zeros((self.length, self.vocab_size))
    self.couplings = np.zeros((self.length, self.length, self.vocab_size, self.vocab_size))

    self._build_local_fields(rel_sasa)
    self._build_physical_couplings()

calculate_energy(sequence)

Calculate the pseudo-energy of a given sequence on this fold's energetic landscape. Lower energy indicates a more stable, evolutionarily favored protein.

Source code in synth_pdb/msa.py
def calculate_energy(self, sequence: str) -> float:
    """Calculate the pseudo-energy of a given sequence on this fold's energetic landscape.
    Lower energy indicates a more stable, evolutionarily favored protein.
    """
    energy = 0.0

    # Add local fields (h_i) - currently zeroed out for pure co-evolution focus
    for i, aa in enumerate(sequence):
        aa_idx = self.aa_to_idx[aa]
        energy += self.fields[i, aa_idx]

    # Add pairwise couplings (J_ij)
    for i in range(self.length):
        aa1_idx = self.aa_to_idx[sequence[i]]
        for j in range(i + 1, self.length):
            if self.contact_map[i, j]:
                aa2_idx = self.aa_to_idx[sequence[j]]
                energy += self.couplings[i, j, aa1_idx, aa2_idx]

    return energy

calculate_delta_energy(current_seq, site1, new_aa1, site2=None, new_aa2=None)

Calculate the change in energy (Delta E) for 1 or 2 simultaneous mutations in O(L) time.

Educational Note: Big-O Performance Breakthrough

A full evaluation of the Potts Model energy function requires summing the J_ij matrix across all N residues, an O(N^2) operation. For a 200-residue sequence, that's 40,000 lookups per step. Over 1 million generations, that's 40 billion operations!

Because the MCMC sampler only ever mutates 1 or 2 residues at a time, 99% of the pairwise interactions remain completely unchanged. Instead of brute-force calculating the new system energy entirely from scratch, we can simply calculate the difference (Delta):

  1. Subtract the old energy of the mutated sites.
  2. Add the new energy of the mutated sites.

This reduces the computational complexity to strictly O(L) (where L is the number of contacts for the mutated site), saving over 500x computation time and allowing for blisteringly fast synthetic MSA generation architectures.

Source code in synth_pdb/msa.py
def calculate_delta_energy(self, current_seq: str, site1: int, new_aa1: str, site2: int | None = None, new_aa2: str | None = None) -> float:
    """Calculate the change in energy (Delta E) for 1 or 2 simultaneous mutations in O(L) time.

    Educational Note: Big-O Performance Breakthrough
    ------------------------------------------------
    A full evaluation of the Potts Model energy function requires summing the J_ij
    matrix across all N residues, an O(N^2) operation. For a 200-residue sequence,
    that's 40,000 lookups per step. Over 1 million generations, that's 40 billion operations!

    Because the MCMC sampler only ever mutates 1 or 2 residues at a time, 99% of the
    pairwise interactions remain completely unchanged. Instead of brute-force calculating
    the new system energy entirely from scratch, we can simply calculate the difference (Delta):

    1. Subtract the old energy of the mutated sites.
    2. Add the new energy of the mutated sites.

    This reduces the computational complexity to strictly O(L) (where L is the number of
    contacts for the mutated site), saving over 500x computation time and allowing for
    blisteringly fast synthetic MSA generation architectures.
    """
    delta = 0.0

    old_aa1 = current_seq[site1]
    old_idx1 = self.aa_to_idx[old_aa1]
    new_idx1 = self.aa_to_idx[new_aa1]

    delta += self.fields[site1, new_idx1] - self.fields[site1, old_idx1]

    if site2 is not None and new_aa2 is not None:
        old_aa2 = current_seq[site2]
        old_idx2 = self.aa_to_idx[old_aa2]
        new_idx2 = self.aa_to_idx[new_aa2]
        delta += self.fields[site2, new_idx2] - self.fields[site2, old_idx2]

        # Coupling between the two mutating sites themselves
        if self.contact_map[site1, site2]:
            old_coup = self.couplings[site1, site2, old_idx1, old_idx2]
            new_coup = self.couplings[site1, site2, new_idx1, new_idx2]
            delta += (new_coup - old_coup)
    else:
        site2 = -1
        old_idx2 = -1
        new_idx2 = -1

    # Coupling with the rest of the sequence
    for k in range(self.length):
        if k == site1 or k == site2:
            continue
        seq_idx_k = self.aa_to_idx[current_seq[k]]
        if self.contact_map[site1, k]:
            delta += self.couplings[site1, k, new_idx1, seq_idx_k] - self.couplings[site1, k, old_idx1, seq_idx_k]

        if site2 != -1 and self.contact_map[site2, k]:
            delta += self.couplings[site2, k, new_idx2, seq_idx_k] - self.couplings[site2, k, old_idx2, seq_idx_k]

    return float(delta)

MetropolisHastingsSampler

Simulates the evolutionary drift of a sequence using Markov Chain Monte Carlo. A mutation is proposed: - If Energy decreases: It is always accepted (favorable). - If Energy increases: It is accepted with probability P = exp(-DeltaE / T).

Source code in synth_pdb/msa.py
class MetropolisHastingsSampler:
    """Simulates the evolutionary drift of a sequence using Markov Chain Monte Carlo.
    A mutation is proposed:
        - If Energy decreases: It is always accepted (favorable).
        - If Energy increases: It is accepted with probability P = exp(-DeltaE / T).
    """

    def __init__(self, model: CoevolutionModel, temperature: float = 1.0, coupled_mutation_prob: float = 0.20):
        self.model = model
        self.temperature = temperature
        self.coupled_mutation_prob = coupled_mutation_prob
        # State
        self.current_sequence = ""
        self.current_energy = 0.0

        # Cache contact list for fast random sampling
        self.contact_list = []
        for i in range(model.length):
            for j in range(i + 1, model.length):
                if model.contact_map[i, j]:
                    self.contact_list.append((i, j))

    def start(self, base_sequence: str) -> None:
        """Initialize the MCMC chain."""
        self.current_sequence = base_sequence
        self.current_energy = self.model.calculate_energy(base_sequence)

    def step(self) -> bool:
        """Propose exactly ONE or TWO point mutations and probabilistically accept or reject them.
        Returns True if the mutation was accepted.
        """
        if not self.current_sequence:
            return False

        # Decide if we do a single or double mutation taking advantage of "Magic Steps"
        #
        # Educational Note: The "Magic Step" coupled mutation
        # ---------------------------------------------------
        # Traditional MCMC walkers only jump one dimension at a time (e.g., mutating only X or Y).
        # In evolutionary landscapes, getting from a massive [Tryptophan:Tiny] pair to a
        # complementary [Tiny:Tryptophan] pair is impossible if you can only make one
        # mutation at a time. The intermediate state [Tryptophan:Tryptophan] involves a massive
        # steric clash, imposing an astronomical energy penalty that a low-temperature
        # simulation can never mathematically cross.
        #
        # By proposing a "Coupled Mutation" across two contacting residues simultaneously,
        # the simulation evaluates the Delta Energy of the final state exactly. The clash
        # never physically occurs, capturing the hallmark of true Direct Coupling covariance.
        do_coupled = (random.random() < self.coupled_mutation_prob) and len(self.contact_list) > 0

        site1 = -1
        new_aa1 = ""
        site2 = None
        new_aa2 = None

        if do_coupled:
            site1, site2 = random.choice(self.contact_list)
            current_aa1 = self.current_sequence[site1]
            current_aa2 = self.current_sequence[site2]

            new_aa1 = random.choice(AMINO_ACIDS)
            new_aa2 = random.choice(AMINO_ACIDS)

            if current_aa1 == new_aa1 and current_aa2 == new_aa2:
                return False
        else:
            site1 = random.randint(0, self.model.length - 1)
            current_aa = self.current_sequence[site1]
            new_aa1 = random.choice(AMINO_ACIDS)

            if current_aa == new_aa1:
                return False

        # 2. Evaluate energetic change efficiently using O(L) delta evaluation
        delta_e = self.model.calculate_delta_energy(self.current_sequence, site1, new_aa1, site2, new_aa2)
        proposed_energy = self.current_energy + delta_e

        # 3. Metropolis Criterion
        accept = False
        if delta_e <= 0:
            accept = True  # Favorable drift
        else:
            # Unfavorable drift (thermal noise allows escaping local minima)
            if self.temperature <= 0.0001:
                prob = 0.0
            else:
                prob = np.exp(-delta_e / self.temperature)

            if random.random() < prob:
                accept = True

        if accept:
            proposed_seq_list = list(self.current_sequence)
            proposed_seq_list[site1] = new_aa1
            if site2 is not None and new_aa2 is not None:
                proposed_seq_list[site2] = new_aa2

            self.current_sequence = "".join(proposed_seq_list)
            self.current_energy = proposed_energy
            return True

        return False

Functions

__init__(model, temperature=1.0, coupled_mutation_prob=0.2)

Source code in synth_pdb/msa.py
def __init__(self, model: CoevolutionModel, temperature: float = 1.0, coupled_mutation_prob: float = 0.20):
    self.model = model
    self.temperature = temperature
    self.coupled_mutation_prob = coupled_mutation_prob
    # State
    self.current_sequence = ""
    self.current_energy = 0.0

    # Cache contact list for fast random sampling
    self.contact_list = []
    for i in range(model.length):
        for j in range(i + 1, model.length):
            if model.contact_map[i, j]:
                self.contact_list.append((i, j))

start(base_sequence)

Initialize the MCMC chain.

Source code in synth_pdb/msa.py
def start(self, base_sequence: str) -> None:
    """Initialize the MCMC chain."""
    self.current_sequence = base_sequence
    self.current_energy = self.model.calculate_energy(base_sequence)

step()

Propose exactly ONE or TWO point mutations and probabilistically accept or reject them. Returns True if the mutation was accepted.

Source code in synth_pdb/msa.py
def step(self) -> bool:
    """Propose exactly ONE or TWO point mutations and probabilistically accept or reject them.
    Returns True if the mutation was accepted.
    """
    if not self.current_sequence:
        return False

    # Decide if we do a single or double mutation taking advantage of "Magic Steps"
    #
    # Educational Note: The "Magic Step" coupled mutation
    # ---------------------------------------------------
    # Traditional MCMC walkers only jump one dimension at a time (e.g., mutating only X or Y).
    # In evolutionary landscapes, getting from a massive [Tryptophan:Tiny] pair to a
    # complementary [Tiny:Tryptophan] pair is impossible if you can only make one
    # mutation at a time. The intermediate state [Tryptophan:Tryptophan] involves a massive
    # steric clash, imposing an astronomical energy penalty that a low-temperature
    # simulation can never mathematically cross.
    #
    # By proposing a "Coupled Mutation" across two contacting residues simultaneously,
    # the simulation evaluates the Delta Energy of the final state exactly. The clash
    # never physically occurs, capturing the hallmark of true Direct Coupling covariance.
    do_coupled = (random.random() < self.coupled_mutation_prob) and len(self.contact_list) > 0

    site1 = -1
    new_aa1 = ""
    site2 = None
    new_aa2 = None

    if do_coupled:
        site1, site2 = random.choice(self.contact_list)
        current_aa1 = self.current_sequence[site1]
        current_aa2 = self.current_sequence[site2]

        new_aa1 = random.choice(AMINO_ACIDS)
        new_aa2 = random.choice(AMINO_ACIDS)

        if current_aa1 == new_aa1 and current_aa2 == new_aa2:
            return False
    else:
        site1 = random.randint(0, self.model.length - 1)
        current_aa = self.current_sequence[site1]
        new_aa1 = random.choice(AMINO_ACIDS)

        if current_aa == new_aa1:
            return False

    # 2. Evaluate energetic change efficiently using O(L) delta evaluation
    delta_e = self.model.calculate_delta_energy(self.current_sequence, site1, new_aa1, site2, new_aa2)
    proposed_energy = self.current_energy + delta_e

    # 3. Metropolis Criterion
    accept = False
    if delta_e <= 0:
        accept = True  # Favorable drift
    else:
        # Unfavorable drift (thermal noise allows escaping local minima)
        if self.temperature <= 0.0001:
            prob = 0.0
        else:
            prob = np.exp(-delta_e / self.temperature)

        if random.random() < prob:
            accept = True

    if accept:
        proposed_seq_list = list(self.current_sequence)
        proposed_seq_list[site1] = new_aa1
        if site2 is not None and new_aa2 is not None:
            proposed_seq_list[site2] = new_aa2

        self.current_sequence = "".join(proposed_seq_list)
        self.current_energy = proposed_energy
        return True

    return False

Main Functions

generate_msa(base_sequence, contact_map, num_sequences=100, temperature=1.0, steps_between_samples=20, rel_sasa=None, coupled_mutation_prob=0.2)

Generates a Synthetic Multiple Sequence Alignment (MSA) imbued with Co-Evolutionary constraints.

Parameters:

Name Type Description Default
base_sequence str

Starting 'wild type' amino acid sequence string.

required
contact_map ndarray

N x N valid binary contact map bounding the 3D structure.

required
num_sequences int

Number of homologous sequences to scrape from the simulation.

100
temperature float

"Thermal Noise" of evolution. Higher = more divergence, lower contact fidelity.

1.0
steps_between_samples int

Number of MCMC mutation proposals to try between saving a sequence.

20
rel_sasa ndarray | None

(Optional) Array of lengths N indicating solvent accessibility. Core residues will heavily favor aliphatics.

None
coupled_mutation_prob float

(Optional) Probability of simultaneously mutating two contacting residues to traverse steric gaps (The Magic Step).

0.2

Returns:

Type Description
list[str]

List of strings representing the aligned MSA.

Source code in synth_pdb/msa.py
def generate_msa(
    base_sequence: str,
    contact_map: np.ndarray,
    num_sequences: int = 100,
    temperature: float = 1.0,
    steps_between_samples: int = 20,
    rel_sasa: np.ndarray | None = None,
    coupled_mutation_prob: float = 0.20,
) -> list[str]:
    """Generates a Synthetic Multiple Sequence Alignment (MSA) imbued with Co-Evolutionary constraints.

    Args:
        base_sequence: Starting 'wild type' amino acid sequence string.
        contact_map: N x N valid binary contact map bounding the 3D structure.
        num_sequences: Number of homologous sequences to scrape from the simulation.
        temperature: "Thermal Noise" of evolution. Higher = more divergence, lower contact fidelity.
        steps_between_samples: Number of MCMC mutation proposals to try between saving a sequence.
        rel_sasa: (Optional) Array of lengths N indicating solvent accessibility. Core residues will heavily favor aliphatics.
        coupled_mutation_prob: (Optional) Probability of simultaneously mutating two contacting residues to traverse steric gaps (The Magic Step).

    Returns:
        List of strings representing the aligned MSA.

    """
    # 1. Initialize Physics Engine
    model = CoevolutionModel(base_sequence, contact_map, rel_sasa=rel_sasa)
    sampler = MetropolisHastingsSampler(model, temperature=temperature, coupled_mutation_prob=coupled_mutation_prob)
    sampler.start(base_sequence)

    msa = []

    # 2. Burn-in Phase
    burn_in_steps = len(base_sequence) * 10
    for _ in range(burn_in_steps):
        sampler.step()

    # 3. Generation Phase
    for _ in range(num_sequences):
        for _ in range(steps_between_samples):
            sampler.step()
        msa.append(sampler.current_sequence)

    return msa

Usage Examples

Generating a Synthetic MSA

import numpy as np
from synth_pdb.msa import generate_msa

# base_sequence: str
# contact_map: np.ndarray (N x N boolean)

msa = generate_msa(
    base_sequence="ACDEFGHIKL",
    contact_map=my_contact_map,
    num_sequences=50,
    temperature=1.0
)

for seq in msa:
    print(seq)

Educational Notes

Hydrophobic Core Collapse

Solvent Accessible Surface Area (SASA) is the physical mechanism mapping 3D structure back to 1D sequence constraints. If a residue is "buried" deep inside the protein core (low SASA), evolutionary drift must strictly eliminate charged/polar mutations. Placing a hydrophilic amino acid in the water-free hydrophobic core would rupture the hydrogen-bond network and unfold the protein. The msa module enforces this by penalizing hydrophilic mutations at buried positions.

Electrostatic Compatibility

Proteins use localized regions of electrical charge (Salt Bridges) to lock their tertiary folds into stable, lower-energy states. Conversely, placing two like-charges in close proximity causes strong Coulombic repulsion. The Potts model in this module rewards opposite-charge pairs while aggressively penalizing like-charge complexes in contacting residues.

The "Magic Step" Coupled Mutation

In traditional MCMC, only one site is mutated at a time. However, in evolution, getting from a [Large:Small] pair to a [Small:Large] pair is impossible if the intermediate [Large:Large] state causes a massive steric clash. The "Magic Step" proposes mutations at two contacting residues simultaneously, allowing the simulation to traverse these steric gaps and capture true Direct Coupling covariance.

References

  • Direct Coupling Analysis (DCA): Morcos, F., et al. (2011). "Direct-coupling analysis of residue coevolution captures native contacts across many protein families." Proceedings of the National Academy of Sciences (PNAS). DOI: 10.1073/pnas.1111471108
  • Potts Models in Evolution: Weigt, M., et al. (2009). "Identification of direct residue contacts in protein-protein interaction by message passing." PNAS. DOI: 10.1073/pnas.0805923106

See Also