export Module

The export module provides tools for converting internal data structures (like contact maps and distance matrices) into standard text formats for AI/ML modeling and structural competition benchmarks.

Overview

Generating 3D structures is only half the battle for ML researchers; the other half is exporting the ground-truth data in a format that training pipelines can ingest. This module handles the conversion of \(N \times N\) matrices into CASP-style residue-residue (RR) files or simple CSVs.

Key Features

  • CASP RR Support: Export contact maps in the standard format used by the Critical Assessment of Structure Prediction (CASP).
  • CSV Export: Simple, human-readable comma-separated values for rapid prototyping in Python or Excel.
  • Separation Cutoffs: Filter out short-range contacts (neighbors) to focus on the long-range interactions that define the protein fold.
  • Probability Handling: Supports both binary contacts (0/1) and continuous probability values.

API Reference

export

Functions

export_constraints(contact_map, sequence, fmt='casp', separation_cutoff=0, threshold=8.0)

Export a Contact Map or Distance Matrix to text format for AI modeling.

Parameters

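contact_map : np.ndarray
    NxN matrix. Values can be binary (0/1), probabilities (0.0-1.0), or raw distances (Angstroms). The input is treated as a distance matrix if any value exceeds 1.0.
sequence : str
    The protein sequence (required for CASP header).
fmt : str
    "casp" (CASP RR format) or "csv" (simple list).
separation_cutoff : int
    Sequence-separation filter: only pairs with |i-j| > separation_cutoff are exported. The default 0 includes adjacent residues.
threshold : float
    Distance cutoff in Angstroms. For distance input, a pair is exported when its distance is <= threshold. For binary or probability input, every nonzero pair is exported and threshold is only written as d_major in CASP output.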

Returns

content : str The textual content of the file.

Source code in synth_pdb/export.py
def export_constraints(
    contact_map: np.ndarray,
    sequence: str,
    fmt: str = "casp",
    separation_cutoff: int = 0,
    threshold: float = 8.0,
) -> str:
    """Export a Contact Map or Distance Matrix to text format for AI modeling.

    Parameters
    ----------
    contact_map : np.ndarray
        NxN matrix. Values can be Binary (0/1), Probabilities (0.0-1.0),
        or raw distances (Angstroms).
    sequence : str
        The protein sequence (required for CASP header).
    fmt : str
        "casp" (CASP RR format) or "csv" (Simple list).
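    separation_cutoff : int
        Only pairs with |i-j| > separation_cutoff are exported.
        Default 0 includes neighbors.
    threshold : float
        Distance cutoff (Angstroms). Distance input: export pairs with
        distance <= threshold. Binary/probability input: used only as
        d_major in CASP output.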

    Returns
    -------
    content : str
        The textual content of the file.

    """
    n_res = contact_map.shape[0]
    lines = []

    # Heuristic: Is this a distance matrix or a probability map?
    # If values are > 1.0, it's almost certainly distances.
    is_distance_matrix = np.any(contact_map > 1.0)

    if fmt == "casp":
        # CASP RR Format: i j d_minor d_major prob
        lines.append(sequence)

        for i in range(n_res):
            for j in range(i + 1 + separation_cutoff, n_res):
                val = contact_map[i, j]

                if is_distance_matrix:
                    # Input is raw distances
                    if val <= threshold:
                        res_i, res_j = i + 1, j + 1
                        # Use actual distance as the upper bound for the bin
                        lines.append(f"{res_i} {res_j} 0.0 {val:.1f} 1.00000")
                else:
                    # Input is binary or probabilities
                    if val > 0.0:
                        res_i, res_j = i + 1, j + 1
                        lines.append(f"{res_i} {res_j} 0.0 {threshold:.1f} {val:.5f}")

    elif fmt == "csv":
        # CSV Format: Res1,Res2,Distance_or_Prob
        lines.append("Res1,Res2,Value")
        for i in range(n_res):
            for j in range(i + 1 + separation_cutoff, n_res):
                val = contact_map[i, j]
                if is_distance_matrix:
                    if val <= threshold:
                        lines.append(f"{i + 1},{j + 1},{val:.5f}")
                else:
                    if val > 0.0:
                        lines.append(f"{i + 1},{j + 1},{val:.5f}")

    else:
        raise ValueError(f"Unknown format: {fmt}")

    return "\n".join(lines)
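The loop bounds in the source above decide which residue pairs are even considered. As a minimal sketch (the helper name `candidate_pairs` is hypothetical and not part of synth_pdb), the same bounds can be reproduced on their own to see the effect of `separation_cutoff`:

```python
from typing import List, Tuple


def candidate_pairs(n_res: int, separation_cutoff: int = 0) -> List[Tuple[int, int]]:
    """Return the 1-based (i, j) pairs visited by the nested loops.

    Hypothetical helper for illustration only; it mirrors
    range(i + 1 + separation_cutoff, n_res) from export_constraints.
    """
    return [
        (i + 1, j + 1)
        for i in range(n_res)
        for j in range(i + 1 + separation_cutoff, n_res)
    ]


pairs = candidate_pairs(n_res=6, separation_cutoff=3)
print(pairs)                          # [(1, 5), (1, 6), (2, 6)]
print(min(j - i for i, j in pairs))   # 4
```

With a cutoff of 3, the smallest separation that survives is 4, so the filter behaves as |i-j| > separation_cutoff rather than |i-j| >= separation_cutoff.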

Scientific Background

The CASP RR Format

The CASP Residue-Residue (RR) format is the text format CASP defines for inter-residue contact predictions, and it is widely used for exchanging them. A typical line looks like:

i j d_minor d_major prob

  • i, j: Residue indices (1-based in this module's output, with i < j).
  • d_minor, d_major: The distance bin (e.g., 0.0 8.0 for a standard contact).
  • prob: The confidence or probability (1.0 for ground-truth data).

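This module fills in these fields automatically. For distance-matrix input, each exported pair is written with d_minor = 0.0, d_major set to the measured distance, and prob = 1.0. For binary or probability input, d_major is set to threshold and prob is the matrix value. This allows synthetic structures from synth-pdb to be used as ground-truth targets for benchmarking structure prediction algorithms.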
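To illustrate the line layout described above, here is a minimal parsing sketch. It assumes the simplified layout that `export_constraints` emits (sequence on the first line, then one five-field contact per line); official CASP submissions typically carry additional header records, which this sketch does not handle. The names `RRContact` and `parse_rr` are hypothetical and not part of synth_pdb.

```python
from typing import List, NamedTuple, Tuple


class RRContact(NamedTuple):
    i: int          # 1-based index of the first residue
    j: int          # 1-based index of the second residue
    d_minor: float  # lower bound of the distance bin (Angstroms)
    d_major: float  # upper bound of the distance bin (Angstroms)
    prob: float     # confidence or probability


def parse_rr(text: str) -> Tuple[str, List[RRContact]]:
    """Parse RR text whose first line is the sequence (hypothetical helper)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty RR content")
    sequence = lines[0].strip()
    contacts = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {ln!r}")
        contacts.append(
            RRContact(
                int(fields[0]),
                int(fields[1]),
                float(fields[2]),
                float(fields[3]),
                float(fields[4]),
            )
        )
    return sequence, contacts


# Hand-written sample in the same layout; not real output from synth_pdb
sample = "AAAAAAAA\n1 7 0.0 8.0 0.93000\n2 8 0.0 6.4 1.00000"
seq, contacts = parse_rr(sample)
print(seq, len(contacts), contacts[0].prob)
```

Returning a named tuple keeps the five fields self-describing, which helps when comparing predicted contacts against the exported ground truth.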

Usage Example

from synth_pdb.batch_generator import BatchedGenerator
from synth_pdb.export import export_constraints

# 1. Generate a batch of data (including contact maps)
gen = BatchedGenerator(batch_size=1, length=50)
batch = gen.generate_batch()
contact_map = batch["contacts"][0] # 50x50 matrix
sequence = "A" * 50 # Example sequence

# 2. Export to CASP RR format (pairs with |i-j| <= 5 are skipped)
casp_content = export_constraints(
    contact_map,
    sequence,
    fmt="casp",
    threshold=8.0,
    separation_cutoff=5,
)

with open("contacts.rr", "w") as f:
    f.write(casp_content)

# 3. Export to simple CSV
csv_content = export_constraints(
    contact_map,
    sequence,
    fmt="csv",
    threshold=12.0,
)

with open("contacts.csv", "w") as f:
    f.write(csv_content)
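The CSV output is easy to read back with Python's standard `csv` module. The sketch below uses a hand-written sample string in the `Res1,Res2,Value` layout rather than real generator output, and the 0.5 cutoff is an arbitrary value chosen for illustration:

```python
import csv
import io

# Hand-written sample in the layout produced by fmt="csv"
csv_content = "Res1,Res2,Value\n1,7,0.93000\n2,9,1.00000\n3,12,0.41000"

# DictReader uses the header row to name each column
reader = csv.DictReader(io.StringIO(csv_content))
rows = [(int(r["Res1"]), int(r["Res2"]), float(r["Value"])) for r in reader]

# Keep only pairs at or above the illustrative 0.5 cutoff
confident = [(i, j) for i, j, value in rows if value >= 0.5]
print(confident)  # [(1, 7), (2, 9)]
```

Note that the Value column holds distances when the input was a distance matrix and probabilities otherwise, so any downstream filter needs to know which kind of matrix was exported.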