
📁 Bulk Dataset Factory: Beyond the Memory Wall 🤖¶
Objective: Master the transition from "Single-Protein" bioinformatics to "Tensor-Driven" AI research.
🧠 The Educational Mindset Shift¶
Traditional structural biology focuses on the PDB File—a static, human-readable text record. Modern AI (like AlphaFold-3 or ESM-Fold) requires a Tensor—a massive, multi-dimensional array of numbers.
In this lab, we break through the "Memory Wall": the bottleneck where AI models spend more time reading files than actually learning biology.
We will cover:
- Vectorized Generation: Producing 10,000 unique structures in milliseconds.
- The Tensor Envelope: Visualizing the structural diversity of your dataset.
- Zero-Copy NPZ Pipelines: Feeding binary data directly into high-performance GPUs.
- PyTorch Integration: Building a production-ready `DataLoader`.
# @title Setup & Installation { display-mode: "form" }
import os
import sys
from pathlib import Path

# Ensure the local synth_pdb source code is prioritized if running from the repo
try:
    current_path = Path(".").resolve()
    repo_root = current_path.parent.parent
    if (repo_root / "synth_pdb").exists():
        if str(repo_root) not in sys.path:
            sys.path.insert(0, str(repo_root))
            print(f"📌 Added local library to path: {repo_root}")
except Exception:
    pass

if 'google.colab' in str(get_ipython()):
    if not os.path.exists("installed.marker"):
        print("Running on Google Colab. Installing dependencies...")
        get_ipython().run_line_magic('pip', 'install synth-pdb torch numpy matplotlib py3Dmol')
        with open("installed.marker", "w") as f:
            f.write("done")
        print("🔄 Installation complete. KERNEL RESTARTING AUTOMATICALLY...")
        print("⚠️ Please wait 10 seconds, then Run All Cells again.")
        os.kill(os.getpid(), 9)
    else:
        print("✅ Dependencies Ready.")
else:
    import synth_pdb
    print(f"✅ Running locally. Using synth-pdb version: {synth_pdb.__version__} from {synth_pdb.__file__}")

import time
import matplotlib.pyplot as plt
import numpy as np
import py3Dmol
import torch
from torch.utils.data import DataLoader, Dataset
from synth_pdb.batch_generator import BatchedGenerator, BatchedPeptide

print("Libraries Loaded. Accelerating with PyTorch! 🚀")
1. High-Speed Generation: 10,000 Structures¶
We leverage Numba-optimized vectorization. Instead of generating one CA atom at a time, we treat the entire batch as a single 3D tensor operation.
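To see why treating the batch as one tensor matters, here is a minimal, self-contained sketch (plain NumPy, not the synth_pdb internals): a toy "CA trace" generator written first as a Python loop, then as a single batched operation over a `(batch, residue, xyz)` array.

```python
import numpy as np

# Toy illustration only: random-walk CA traces with ~3.8 Å steps.
rng = np.random.default_rng(0)
n_batch, n_res = 1000, 28

def generate_looped():
    # One structure at a time: the Python loop dominates runtime.
    out = np.empty((n_batch, n_res, 3))
    for i in range(n_batch):
        steps = rng.normal(scale=3.8, size=(n_res, 3))
        out[i] = np.cumsum(steps, axis=0)
    return out

def generate_vectorized():
    # The whole (batch, residue, xyz) tensor in one call:
    # the loop moves into compiled NumPy code.
    steps = rng.normal(scale=3.8, size=(n_batch, n_res, 3))
    return np.cumsum(steps, axis=1)

coords = generate_vectorized()
print(coords.shape)  # (1000, 28, 3)
```

The same idea underlies the batched generator below: one array allocation and one compiled kernel per batch, instead of 10,000 Python-level iterations.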
n_samples = 10000
# FIX: Use explicit hyphenation for the whole sequence to avoid 'METALA' merging errors
sequence = "-".join(["ALA-GLY-SER-LEU-VAL-ILE-MET"] * 4) # 28 residues
print(f"🚀 Generating {n_samples} structures...")
start = time.time()
generator = BatchedGenerator(sequence, n_batch=n_samples, full_atom=False)
batch = generator.generate_batch(drift=5.0)
elapsed = time.time() - start
print(f"✅ Done! {n_samples} structures generated in {elapsed:.3f}s")
print(f"Throughput: {n_samples/elapsed:.0f} structures/sec")
2. Visualizing Structural Diversity (Statistical Plot)¶
A dataset is only as good as its diversity. If all 10,000 structures look the same, the model learns nothing. Let's visualize the "Atomic Variance" across our batch.
# Calculate the variance of CA positions across the batch
variance = np.var(batch.coords, axis=0).mean(axis=1)
plt.figure(figsize=(10, 5))
plt.plot(variance, color='#667eea', linewidth=3, label="Positional Variance")
plt.fill_between(range(len(variance)), variance, alpha=0.2, color='#667eea')
plt.title("The Entropy Profile: Data Diversity across the Chain")
plt.xlabel("Residue Number")
plt.ylabel("Variance (Ų)")
plt.grid(alpha=0.3)
plt.legend()
plt.show()
print("Educational Insight: Notice how variance typically increases at the 'tail' of the peptide?")
print("This is the 'Propagating Error' of structural drift—a key feature for generating negative samples.")
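The propagating-error claim can be checked with a standalone random-walk sketch (toy data, independent of the batch generated above): when each residue adds independent noise to the previous position, the positional variance at residue k grows roughly linearly in k, so the tail is always noisier than the head.

```python
import numpy as np

# Toy random walk: variance at step k is ~k * step_variance.
rng = np.random.default_rng(42)
steps = rng.normal(scale=1.0, size=(10000, 28, 3))
walks = np.cumsum(steps, axis=1)               # (batch, residue, xyz)
var_profile = walks.var(axis=0).mean(axis=1)   # per-residue variance
print(var_profile[0] < var_profile[-1])        # True: tail > head
```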
3. Interactive 3D Ensemble View¶
Let's overlay the first 5 structures in the batch to see the "Envelope" of noise we've created.
try:
    view = py3Dmol.view(width=800, height=400)
    view.setBackgroundColor("#fdfdfd")
    colors = ["#ff9999", "#66b3ff", "#99ff99", "#ffcc99", "#c2c2f0"]
    for i in range(5):
        # 1. Clean and mask coordinates with strict zero-tolerance
        c = batch.coords[i].copy()
        mask = np.any(np.abs(c) > 1e-4, axis=1)  # Strip zeros and ghost atoms
        c_clean = c[mask]
        if len(c_clean) == 0:
            continue

        # 2. Individual centering (per-model anchor)
        # Using the CA centroid for much better stability than min/max
        ca_idxs = [j for j, name in enumerate(batch.atom_names) if name == "CA"]
        valid_ca = [idx for idx in ca_idxs if mask[idx]]
        if valid_ca:
            center = c[valid_ca].mean(axis=0)
        else:
            center = c_clean.mean(axis=0)
        c_centered = c_clean - center

        p_tmp = BatchedPeptide(
            c_centered[np.newaxis, ...],
            batch.sequence,
            np.array(batch.atom_names)[mask].tolist(),
            np.array(batch.residue_indices)[mask].tolist()
        )
        view.addModel(p_tmp.to_pdb(0), 'pdb')

        # High-visibility style: large spheres (radius 0.3) + thick sticks
        view.setStyle({'model': i}, {
            "cartoon": {"color": colors[i], "opacity": 0.5},
            "stick": {"color": colors[i], "radius": 0.3},
            "sphere": {"color": colors[i], "scale": 0.3}
        })

    # 3. Aggressive manual zoom targeting model 0 to ensure the viewport is filled
    view.zoomTo({'model': 0})
    view.zoom(2.0)
    view.center()
    view.show()

    # Diagnostic info to prove sanity
    print("✅ Ensemble Visualized with PDB Column-Shift Guard.")
    print(f"Residue 1 Name: '{batch.sequence[0]}' | Residue 7 Name: '{batch.sequence[6]}'")
except Exception as e:
    print(f"3D Viewer Error: {e}")
4. Binary Export (NPZ) vs. Legacy Text (PDB)¶
Why save to NPZ? It's not just about size: binary arrays skip text parsing entirely, and uncompressed arrays can even be memory-mapped for true zero-copy loading. (Note that `savez_compressed` trades CPU for disk space—loading it must decompress the full array.)
os.makedirs("dataset_factory", exist_ok=True)
dataset_path = "dataset_factory/batch_001.npz"
print("Saving to compressed NPZ...")
np.savez_compressed(
dataset_path,
coords=batch.coords,
sequence=np.array([sequence] * n_samples)
)
# Benchmark Loading
start_npz = time.time()
tensor_npz = torch.from_numpy(np.load(dataset_path)['coords'])
npz_time = time.time() - start_npz
print(f"✅ NPZ Load (10k samples): {npz_time:.4f}s")
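For genuinely zero-copy reads, an uncompressed `.npy` file can be memory-mapped so that pages are faulted in only when slices are touched. A minimal sketch with toy data (the file path here is illustrative, not part of the notebook's pipeline):

```python
import os
import tempfile
import numpy as np

# Toy array standing in for batch.coords.
coords = np.random.rand(1000, 28, 3).astype(np.float32)
path = os.path.join(tempfile.mkdtemp(), "coords.npy")
np.save(path, coords)

# mmap_mode='r' opens the file without reading the data up front;
# the OS pages slices in lazily as they are accessed.
mm = np.load(path, mmap_mode="r")
print(mm.shape, mm.dtype)
print(np.allclose(mm[0], coords[0]))
```

This is the usual trade-off: compressed NPZ for archival and transfer, memory-mapped `.npy` for hot training data.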
5. Production PyTorch DataLoader¶
The final piece of the pipeline is the DataLoader, which handles batching, shuffling, and multi-threaded loading.
class SyntheticProteinDataset(Dataset):
    def __init__(self, npz_path):
        data = np.load(npz_path)
        self.coords = torch.from_numpy(data['coords']).float()

    def __len__(self):
        return len(self.coords)

    def __getitem__(self, idx):
        return self.coords[idx]
ds = SyntheticProteinDataset(dataset_path)
loader = DataLoader(ds, batch_size=64, shuffle=True)
sample_batch = next(iter(loader))
print(f"Success! Batch Shape: {sample_batch.shape} (Ready for Neural Network training)")
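In a real training run, the loader is consumed inside an epoch loop. A self-contained sketch using toy tensors (via `TensorDataset`, so it runs without the NPZ file above):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for SyntheticProteinDataset: 256 toy structures of 28 CAs.
coords = torch.randn(256, 28, 3)
loader = DataLoader(TensorDataset(coords), batch_size=64, shuffle=True)

n_batches = 0
for (xb,) in loader:               # TensorDataset yields 1-tuples
    # A training step would run the model on xb here.
    assert xb.shape == (64, 28, 3)
    n_batches += 1
print(n_batches)  # 256 / 64 = 4
```

For larger datasets, `num_workers > 0` and `pin_memory=True` let the DataLoader prefetch batches on background workers while the GPU trains.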
🏆 Next Steps¶
- Modify the `drift` parameter in Section 1. How does it change the variance plot in Section 2?
- Try generating a batch with `full_atom=True`. How does it affect the NPZ file size?
Mastering the Data Plane is 80% of successful AI engineering. Now go build some biology! 🧬🤖