Skip to content

cofactors Module

The cofactors module handles the detection and modeling of inorganic cofactors and metal ions within protein structures.

Overview

Many proteins require metal ions (like Zn²⁺, Ca²⁺, or Mg²⁺) for structural stability or catalytic function. This module automatically identifies coordination motifs—such as Zinc Fingers—and inserts the appropriate ions with correct geometric coordination.

Key Features

  • Motif Detection: Scans the structure for residues capable of coordinating metals (Cys, His, Asp, Glu).
  • Geometric Coordination: Places metal ions at the geometric centroid of their coordinating ligands.
  • Automated Insertion: Integrates with the generation pipeline to produce holoproteins (proteins with cofactors).

API Reference

cofactors

Cofactor and Metal Ion Coordination Module.

THE "AI TRINITY" PHASE 15: Inorganic Coordination.

Biological proteins aren't just chains of amino acids; they often require inorganic "cofactors" or metal ions to function. Zinc (Zn2+) is one of the most common, found in "Zinc Finger" motifs where it stabilizes the fold by coordinating with Cysteine (C) and Histidine (H) residues.

Educational Note - Coordination Chemistry:

  1. Ligands: The atoms that donate electrons to the metal (e.g., Cys Sulfur, His Nitrogen).
  2. Coordination Number: The number of ligands (Zinc is usually 4 - Tetrahedral).
  3. Geometric Centroid: The ideal position for the metal is the center of its ligands.

This module automatically detects these motifs and inserts the appropriate ions.

Functions

find_metal_binding_sites(structure, distance_threshold=20.0)

Scans the structure for clusters of residues that could coordinate a metal ion.

Parameters:

Name Type Description Default
structure AtomArray

Biotite AtomArray.

required
distance_threshold float

Max distance between any two coordinating atoms in a cluster.

20.0

Returns:

Type Description
list[dict]

List[Dict]: A list of detected sites, each with 'type' and 'ligand_indices'.

Source code in synth_pdb/cofactors.py
def find_metal_binding_sites(
    structure: struc.AtomArray, distance_threshold: float = 20.0
) -> list[dict]:
    """Scans the structure for clusters of residues that could coordinate a metal ion.

    Args:
        structure: Biotite AtomArray.
        distance_threshold: Max distance between any two coordinating atoms in a cluster.

    Returns:
        List[Dict]: A list of detected sites, each with 'type' and 'ligand_indices'.

    """
    logger.info("Scanning for metal binding motifs (Zinc Fingers)...")

    # Standard ligands for Zinc
    # CYS (SG), HIS (NE2 or ND1) - including HID/HIE/HIP variants
    ligand_mask = ((structure.res_name == "CYS") & (structure.atom_name == "SG")) | (
        (np.isin(structure.res_name, ["HIS", "HID", "HIE", "HIP"]))
        & ((structure.atom_name == "NE2") | (structure.atom_name == "ND1"))
    )

    candidate_indices = np.where(ligand_mask)[0]
    if len(candidate_indices) < 3:
        return []

    candidate_coords = structure.coord[candidate_indices]
    sites = []
    assigned_residues: set[tuple[str, int]] = set()

    # -- Iterative Motif Detection (Tightest-Cluster First) ------------------
    # Unlike a simple greedy search which might pick arbitrary nearby atoms,
    # we iteratively identify the "best" coordination sites.
    #
    # The algorithm:
    # 1. For each unassigned ligand, find all neighbors within threshold.
    # 2. If 4+ ligands are found, calculate the 'spread' (average pair-wise
    #    distance) of the tightest sub-group of 4.
    # 3. Identify the cluster with the global minimum spread across the whole protein.
    # 4. 'Commit' that cluster as a site and mark the entire RESIDUE as assigned.
    # 5. Repeat until no more 4-ligand clusters can be formed.
    while True:
        best_cluster = None
        min_spread = float("inf")

        # Outer loop: Evaluate every possible ligand as a potential cluster 'seed'
        for i in range(len(candidate_indices)):
            idx_i = candidate_indices[i]
            c_id_i = structure.chain_id[idx_i]
            r_id_i = structure.res_id[idx_i]

            # Skip if this residue was already claimed by a previous site
            if (c_id_i, r_id_i) in assigned_residues:
                continue

            # Calculate distances from this seed to all other candidate ligands
            diffs = candidate_coords - candidate_coords[i]
            dists = np.sqrt(np.sum(diffs**2, axis=-1))

            # Filter unassigned neighbors that fall within the threshold
            neighbor_mask = dists < distance_threshold

            # UNIQUE RESIDUE SELECTION:
            # To avoid false positives (e.g. counting NE2 and ND1 of the same HIS),
            # we must ensure each ligand comes from a unique residue.
            # We map (chain, res_id) -> (distance, candidate_index)
            res_to_best_atom: dict[tuple[str, int], tuple[float, int]] = {}
            for j in range(len(candidate_indices)):
                idx_j = candidate_indices[j]
                c_id_j = structure.chain_id[idx_j]
                r_id_j = structure.res_id[idx_j]
                rid = (c_id_j, r_id_j)

                if neighbor_mask[j] and rid not in assigned_residues:
                    d = float(dists[j])
                    if rid not in res_to_best_atom or d < res_to_best_atom[rid][0]:
                        res_to_best_atom[rid] = (d, j)

            # Zinc Fingers traditionally coordinate with exactly 4 ligands (Cys4 or Cys2His2)
            if len(res_to_best_atom) >= 4:
                # EDUCATIONAL NOTE: Local Optimization
                # We sort residues by distance and take the 4 closest to the seed
                sorted_rids = sorted(res_to_best_atom.keys(), key=lambda r: res_to_best_atom[r][0])
                cluster_indices = [res_to_best_atom[rid][1] for rid in sorted_rids[:4]]

                # EDUCATIONAL NOTE: Geometric Realism
                # We calculate 'spread' as the average distance between ALL atoms in the group.
                # This prevents picking a string of atoms that happen to be near the threshold
                # but are far from each other.
                cluster_coords = candidate_coords[cluster_indices]
                spread = 0.0
                count = 0
                for ci in range(4):
                    for cj in range(ci + 1, 4):
                        spread += float(np.linalg.norm(cluster_coords[ci] - cluster_coords[cj]))
                        count += 1
                avg_spread = float(spread / count)

                # Track the tightest cluster found in this iteration
                if avg_spread < min_spread:
                    min_spread = avg_spread
                    best_cluster = [candidate_indices[idx] for idx in cluster_indices]

        # If a valid cluster was found, register it and continue scanning
        if best_cluster:
            sites.append({"type": "ZN", "ligand_indices": best_cluster})
            for idx in best_cluster:
                assigned_residues.add((structure.chain_id[idx], structure.res_id[idx]))
        else:
            # Termination condition: No more unassigned groups of 4 found
            break

    if sites:
        logger.info(f"Found {len(sites)} potential metal binding sites.")
    return sites

add_metal_ion(structure, site)

Adds a metal ion (HETATM) to the structure at the centroid of its ligands.

Parameters:

Name Type Description Default
structure AtomArray

Original AtomArray.

required
site dict

Identification of the site (from find_metal_binding_sites).

required

Returns:

Type Description
AtomArray

struc.AtomArray: New AtomArray with the ion appended.

Source code in synth_pdb/cofactors.py
def add_metal_ion(structure: struc.AtomArray, site: dict) -> struc.AtomArray:
    """Adds a metal ion (HETATM) to the structure at the centroid of its ligands.

    Args:
        structure: Original AtomArray.
        site: Identification of the site (from find_metal_binding_sites).

    Returns:
        struc.AtomArray: New AtomArray with the ion appended.

    """
    ion_type = site["type"]
    ligand_indices = site["ligand_indices"]

    # Calculate Centroid
    ligand_coords = structure.coord[ligand_indices]
    centroid = np.mean(ligand_coords, axis=0)

    # Create the Ion Atom
    ion = struc.AtomArray(1)
    ion.res_name = np.array([ion_type])
    ion.atom_name = np.array([ion_type])
    ion.element = np.array([ion_type])  # "ZN" not "Z" for Zinc
    ion.coord = np.array([centroid])

    # Metadata
    # Pick a residue ID higher than existing
    max_res_id = np.max(structure.res_id)
    ion.res_id = np.array([max_res_id + 1])
    ion.chain_id = np.array([structure.chain_id[0]])
    ion.hetero = np.array([True])

    logger.info(f"Injected {ion_type} ion at coordinated site {centroid}.")

    return structure + ion

Scientific Principles

Coordination Chemistry

In biological systems, metal ions are coordinated by specific amino acid side chains (ligands). For example, a Zinc Finger typically coordinates a Zn²⁺ ion using two Cysteines and two Histidines (\(C_2H_2\)).

The module calculates the ideal position for the metal ion as the centroid (\(\mathbf{C}\)) of the coordinating atoms:

\[\mathbf{C} = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i\]

where \(\mathbf{x}_i\) are the coordinates of the ligand atoms (e.g., SG for Cys, NE2/ND1 for His).

Usage Example

from synth_pdb.generator import PeptideGenerator
from synth_pdb.cofactors import find_metal_binding_sites, add_metal_ion

# 1. Generate a sequence known to bind Zinc (e.g., a simple Zinc Finger)
seq = "CPYCGKSFSDSSALRNHVRTH"
gen = PeptideGenerator(seq)
structure = gen.generate(conformation="alpha")

# 2. Detect potential binding sites
sites = find_metal_binding_sites(structure)

# 3. Add the metal ion if a site is found
if sites:
    structure_with_metal = add_metal_ion(structure, sites[0])