💊 Drug Discovery Pipeline¶
Welcome to the Peptide Drug Discovery Pipeline!
In this tutorial, you'll build an end-to-end computational workflow for discovering therapeutic peptides.
🎯 Goal¶
Design, screen, and optimize peptide drug candidates using computational methods.
📚 What You'll Learn¶
- Library Generation: Create diverse peptide libraries
- Property Prediction: Calculate ADME (Absorption, Distribution, Metabolism, Excretion)
- Binding Scoring: Estimate target affinity
- Cyclization: Improve stability with macrocycles
- Lead Selection: Rank and filter candidates
💡 Real-World Context¶
This pipeline mirrors workflows used in biotech/pharma for:
- Peptide therapeutics (e.g., GLP-1 agonists like Ozempic)
- Cyclic peptide antibiotics
- Protein-protein interaction inhibitors
# 🛠️ SETUP: Install dependencies
try:
import google.colab
IN_COLAB = True
print("🌐 Running in Google Colab")
!pip install -q synth-pdb matplotlib numpy biotite py3Dmol pandas seaborn
except ImportError:
IN_COLAB = False
print("💻 Running locally")
print("✅ Setup complete!")
import random
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from synth_pdb.generator import generate_pdb_content
# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100
print("📦 All imports successful!")
Step 1: Peptide Library Generation¶
We'll create a focused library of peptides targeting a hypothetical receptor.
🧬 Design Strategy¶
- Length: 8-12 residues (typical for bioactive peptides)
- Composition: Biased toward drug-like amino acids
- Diversity: Random sequences with structural constraints
🎲 Amino Acid Preferences¶
We'll favor residues common in successful peptide drugs:
- Hydrophobic: L, V, I, F, W (binding pockets)
- Polar: S, T, N, Q (H-bonding)
- Charged: K, R, D, E (electrostatic interactions)
- Special: P (turns), G (flexibility), C (disulfides)
# Define amino acid library (drug-like bias)
AA_LIBRARY = {
'hydrophobic': ['L', 'V', 'I', 'F', 'W', 'A'],
'polar': ['S', 'T', 'N', 'Q'],
'positive': ['K', 'R'],
'negative': ['D', 'E'],
'special': ['P', 'G', 'C']
}
def generate_peptide_sequence(length=10, seed=None):
"""Generate a random peptide sequence with drug-like composition."""
if seed is not None:
random.seed(seed)
# Composition bias (percentages)
composition = (
AA_LIBRARY['hydrophobic'] * 4 + # 40%
AA_LIBRARY['polar'] * 3 + # 30%
AA_LIBRARY['positive'] * 1 + # 10%
AA_LIBRARY['negative'] * 1 + # 10%
AA_LIBRARY['special'] * 1 # 10%
)
sequence = ''.join(random.choices(composition, k=length))
return sequence
# Generate library
print("🧬 Generating peptide library...\n")
LIBRARY_SIZE = 50
library = []
for i in range(LIBRARY_SIZE):
length = random.randint(8, 12)
seq = generate_peptide_sequence(length, seed=i)
library.append({
'id': f"PEP{i+1:03d}",
'sequence': seq,
'length': length
})
# Convert to DataFrame
df = pd.DataFrame(library)
print(f"✅ Generated {len(df)} peptides")
print(f" Length range: {df['length'].min()}-{df['length'].max()} residues")
print("\nFirst 5 sequences:")
for idx, row in df.head().iterrows():
print(f" {row['id']}: {row['sequence']}")
Step 2: Property Prediction (ADME)¶
We'll calculate key physicochemical properties that affect drug-likeness.
📊 Properties to Calculate¶
- Molecular Weight - Affects bioavailability
- Net Charge - Affects solubility and membrane permeability
- Hydrophobicity (GRAVY) - Affects membrane binding
- Instability Index - Predicts degradation
- Aromatic Content - Affects binding affinity
✅ Drug-Like Criteria¶
- MW: 800-1500 Da (peptide range)
- Net charge: -2 to +3 (soluble but membrane-permeable)
- GRAVY: -0.5 to +0.5 (balanced)
- Instability: <40 (stable)
# Amino acid properties
AA_MW = {'A': 89, 'C': 121, 'D': 133, 'E': 147, 'F': 165, 'G': 75, 'H': 155,
'I': 131, 'K': 146, 'L': 131, 'M': 149, 'N': 132, 'P': 115, 'Q': 146,
'R': 174, 'S': 105, 'T': 119, 'V': 117, 'W': 204, 'Y': 181}
AA_HYDRO = {'A': 1.8, 'C': 2.5, 'D': -3.5, 'E': -3.5, 'F': 2.8, 'G': -0.4,
'H': -3.2, 'I': 4.5, 'K': -3.9, 'L': 3.8, 'M': 1.9, 'N': -3.5,
'P': -1.6, 'Q': -3.5, 'R': -4.5, 'S': -0.8, 'T': -0.7, 'V': 4.2,
'W': -0.9, 'Y': -1.3}
AA_CHARGE = {'D': -1, 'E': -1, 'K': 1, 'R': 1} # at pH 7
def calculate_properties(sequence):
"""Calculate physicochemical properties."""
# Molecular weight
mw = sum(AA_MW.get(aa, 110) for aa in sequence) - 18 * (len(sequence) - 1)
# Net charge at pH 7
charge = sum(AA_CHARGE.get(aa, 0) for aa in sequence)
# GRAVY (Grand Average of Hydropathy)
gravy = sum(AA_HYDRO.get(aa, 0) for aa in sequence) / len(sequence)
# Aromatic content
aromatic = sum(1 for aa in sequence if aa in 'FWY') / len(sequence)
# Instability index (simplified)
# Real calculation uses dipeptide instability weights
unstable_pairs = sum(1 for i in range(len(sequence)-1)
if sequence[i:i+2] in ['DP', 'PD', 'DD', 'EE'])
instability = (unstable_pairs / len(sequence)) * 100
return {
'mw': mw,
'charge': charge,
'gravy': gravy,
'aromatic_pct': aromatic * 100,
'instability': instability
}
# Calculate for all peptides
print("📊 Calculating ADME properties...\n")
for idx, row in df.iterrows():
props = calculate_properties(row['sequence'])
for key, value in props.items():
df.at[idx, key] = value
print("✅ Properties calculated!")
print("\nProperty ranges:")
print(f" MW: {df['mw'].min():.0f} - {df['mw'].max():.0f} Da")
print(f" Charge: {df['charge'].min():.0f} to {df['charge'].max():.0f}")
print(f" GRAVY: {df['gravy'].min():.2f} to {df['gravy'].max():.2f}")
print(f" Instability: {df['instability'].min():.1f} - {df['instability'].max():.1f}")
Step 3: Binding Score Estimation¶
We'll use a simplified scoring function to estimate binding affinity.
🎯 Scoring Components¶
- Hydrophobic contacts - Favorable binding
- Electrostatic interactions - Charge complementarity
- Aromatic stacking - π-π interactions
- Size penalty - Too large = entropic cost
📝 Note¶
This is a toy model! Real docking uses:
- AutoDock Vina
- Rosetta
- Schrödinger Glide
- AlphaFold-Multimer
def estimate_binding_score(sequence, target_profile=None):
"""Simplified binding score estimation.
Lower score = better binding (like docking scores).
"""
if target_profile is None:
# Default: hydrophobic pocket with some charged residues
target_profile = {
'prefers_hydrophobic': True,
'prefers_aromatic': True,
'charge_preference': +1 # Prefers positive ligands
}
score = 0.0
# Hydrophobic contribution
hydrophobic_count = sum(1 for aa in sequence if aa in 'LVIFWA')
if target_profile['prefers_hydrophobic']:
score -= hydrophobic_count * 0.5 # Favorable
# Aromatic contribution
aromatic_count = sum(1 for aa in sequence if aa in 'FWY')
if target_profile['prefers_aromatic']:
score -= aromatic_count * 0.8 # Very favorable
# Charge complementarity
net_charge = sum(AA_CHARGE.get(aa, 0) for aa in sequence)
charge_match = abs(net_charge - target_profile['charge_preference'])
score += charge_match * 0.3 # Penalty for mismatch
# Size penalty (entropic cost)
if len(sequence) > 10:
score += (len(sequence) - 10) * 0.2
# Add some noise (real binding is complex!)
score += random.gauss(0, 0.5)
return score
# Calculate binding scores
print("🎯 Estimating binding affinities...\n")
for idx, row in df.iterrows():
score = estimate_binding_score(row['sequence'])
df.at[idx, 'binding_score'] = score
print("✅ Binding scores calculated!")
print(f" Best score: {df['binding_score'].min():.2f}")
print(f" Worst score: {df['binding_score'].max():.2f}")
print("\nTop 5 binders:")
top5 = df.nsmallest(5, 'binding_score')[['id', 'sequence', 'binding_score']]
for idx, row in top5.iterrows():
print(f" {row['id']}: {row['sequence']} (score: {row['binding_score']:.2f})")
Step 4: Drug-Likeness Filtering¶
Apply filters to identify candidates with good ADME properties.
# Define drug-like criteria
def is_druglike(row):
"""Check if peptide meets drug-like criteria."""
checks = {
'mw_ok': 800 <= row['mw'] <= 1500,
'charge_ok': -2 <= row['charge'] <= 3,
'gravy_ok': -0.5 <= row['gravy'] <= 0.5,
'stable': row['instability'] < 40
}
return all(checks.values()), checks
# Apply filters
druglike_flags = []
for idx, row in df.iterrows():
is_dl, checks = is_druglike(row)
druglike_flags.append(is_dl)
df['druglike'] = druglike_flags
# Summary
n_druglike = df['druglike'].sum()
print("📊 Drug-Likeness Filtering Results")
print("=" * 50)
print(f"Total library: {len(df)} peptides")
print(f"Drug-like: {n_druglike} ({100*n_druglike/len(df):.1f}%)")
print(f"Filtered out: {len(df) - n_druglike}")
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# MW distribution
axes[0,0].hist(df['mw'], bins=20, alpha=0.7, edgecolor='black')
axes[0,0].axvline(800, color='red', linestyle='--', label='Min threshold')
axes[0,0].axvline(1500, color='red', linestyle='--', label='Max threshold')
axes[0,0].set_xlabel('Molecular Weight (Da)')
axes[0,0].set_ylabel('Count')
axes[0,0].set_title('Molecular Weight Distribution')
axes[0,0].legend()
# Charge distribution
axes[0,1].hist(df['charge'], bins=range(-3, 6), alpha=0.7, edgecolor='black')
axes[0,1].axvline(-2, color='red', linestyle='--')
axes[0,1].axvline(3, color='red', linestyle='--')
axes[0,1].set_xlabel('Net Charge')
axes[0,1].set_ylabel('Count')
axes[0,1].set_title('Charge Distribution')
# GRAVY distribution
axes[1,0].hist(df['gravy'], bins=20, alpha=0.7, edgecolor='black')
axes[1,0].axvline(-0.5, color='red', linestyle='--')
axes[1,0].axvline(0.5, color='red', linestyle='--')
axes[1,0].set_xlabel('GRAVY Score')
axes[1,0].set_ylabel('Count')
axes[1,0].set_title('Hydrophobicity Distribution')
# Binding score vs MW
colors = ['green' if dl else 'red' for dl in df['druglike']]
axes[1,1].scatter(df['mw'], df['binding_score'], c=colors, alpha=0.6, edgecolors='black')
axes[1,1].set_xlabel('Molecular Weight (Da)')
axes[1,1].set_ylabel('Binding Score')
axes[1,1].set_title('Binding vs Size (Green=Drug-like)')
plt.tight_layout()
plt.show()
Step 5: Lead Selection & Ranking¶
Combine all criteria to select top candidates.
# Filter to drug-like candidates
candidates = df[df['druglike']].copy()
# Rank by binding score (lower = better)
candidates = candidates.sort_values('binding_score')
print("🏆 TOP 10 LEAD CANDIDATES")
print("=" * 80)
print(f"{'Rank':<6} {'ID':<8} {'Sequence':<15} {'MW':<8} {'Charge':<8} {'Score':<8}")
print("=" * 80)
for rank, (idx, row) in enumerate(candidates.head(10).iterrows(), 1):
print(f"{rank:<6} {row['id']:<8} {row['sequence']:<15} "
f"{row['mw']:<8.0f} {row['charge']:<8.0f} {row['binding_score']:<8.2f}")
# Save top candidates
top_candidates = candidates.head(10)
print(f"\n✅ Selected {len(top_candidates)} leads for further optimization")
Step 6: Cyclization Strategy¶
Convert top linear peptides to cyclic forms for improved stability.
🔗 Why Cyclize?¶
- Protease resistance - No free termini to cleave
- Conformational rigidity - Better binding specificity
- Improved bioavailability - Longer half-life
🧪 Cyclization Methods¶
- Head-to-tail - Backbone cyclization (most common)
- Disulfide - Cys-Cys bridge
- Side-chain - Lys-Asp/Glu lactam bridge
# Generate 3D structure for top candidate
top_peptide = top_candidates.iloc[0]
print(f"🧬 Generating 3D structure for top candidate: {top_peptide['id']}")
print(f" Sequence: {top_peptide['sequence']}")
print(f" Binding score: {top_peptide['binding_score']:.2f}")
print("\n⏳ This may take a minute...\n")
# Generate linear structure
linear_pdb = generate_pdb_content(
sequence_str=top_peptide['sequence'],
conformation="random", # Random coil
optimize_sidechains=True,
minimize_energy=True
)
# Generate cyclic structure
cyclic_pdb = generate_pdb_content(
sequence_str=top_peptide['sequence'],
conformation="random",
cyclic=True, # Head-to-tail cyclization
optimize_sidechains=True,
minimize_energy=True
)
print("✅ Structures generated!")
# Visualize linear vs cyclic
import py3Dmol
print("🔬 3D Structure Comparison\n")
# Linear structure
print("Linear Peptide:")
view1 = py3Dmol.view(width=400, height=300)
view1.addModel(linear_pdb, 'pdb')
view1.setStyle({'cartoon': {'color': 'spectrum'}})
view1.setStyle({'resn': top_peptide['sequence'][0]},
{'sphere': {'color': 'red', 'radius': 0.5}}) # N-terminus
view1.setStyle({'resn': top_peptide['sequence'][-1]},
{'sphere': {'color': 'blue', 'radius': 0.5}}) # C-terminus
view1.zoomTo()
view1.show()
print("\nCyclic Peptide (head-to-tail):")
view2 = py3Dmol.view(width=400, height=300)
view2.addModel(cyclic_pdb, 'pdb')
view2.setStyle({'cartoon': {'color': 'spectrum'}})
view2.zoomTo()
view2.show()
print("\n💡 Notice: The cyclic form has no free termini (red/blue spheres in linear).")
print(" This makes it resistant to exopeptidases!")
Step 7: Export Results¶
Save top candidates for experimental validation.
# Export summary
export_df = top_candidates[['id', 'sequence', 'length', 'mw', 'charge',
'gravy', 'binding_score']].copy()
print("📄 Exportable Results\n")
print(export_df.to_string(index=False))
# Save to CSV (if running locally)
if not IN_COLAB:
export_df.to_csv('lead_peptides.csv', index=False)
print("\n✅ Saved to lead_peptides.csv")
else:
print("\n💡 In Colab: Copy the table above for your records")
# Save top structure
print("\n💾 Top candidate PDB (cyclic):")
print(f" ID: {top_peptide['id']}")
print(f" Sequence: {top_peptide['sequence']}")
print(" Ready for molecular dynamics or docking studies!")
🎓 Key Takeaways¶
Drug Discovery Pipeline Steps¶
- Library Design - Diversity + drug-like bias
- Property Prediction - Filter early (fail fast!)
- Binding Estimation - Prioritize candidates
- Cyclization - Improve stability
- Experimental Validation - Synthesize and test top hits
Real-World Considerations¶
- Synthesis cost - Longer peptides = more expensive
- Solubility - Must dissolve for assays
- Cell permeability - Cyclic peptides can cross membranes
- Off-target effects - Test selectivity
Next Steps in Real Projects¶
- Molecular Dynamics - Simulate binding
- Docking - Predict binding pose
- Synthesis - Order top 5-10 candidates
- Biochemical assays - IC50, EC50 measurements
- Cell-based assays - Efficacy and toxicity
- Lead optimization - Iterate based on SAR
🚀 Challenge Exercises¶
Try these extensions:
- Modify target profile - Change binding preferences and re-rank
- Add disulfide bridges - Include Cys pairs for additional cyclization
- Larger library - Generate 500+ peptides and analyze trends
- Multi-objective optimization - Balance binding, stability, and cost
Remember: Computational predictions guide experiments, but nothing beats real data! 🧪