Prompt-to-Protein Interface
The synth-pdb tool features a powerful, natural language interface that allows you to generate complex protein structures and simulated data by simply describing what you want in plain English.
Instead of memorizing dozens of complex command-line flags, you can pass a --prompt argument and the system will automatically configure the generation engine for you.
Quick Start
1. Choose a Backend
Option A: Cloud API (OpenAI)
By default, if you have an OpenAI API key, you can use the ultra-fast cloud backend.
- Set your API key in your environment:
Option B: Local SLM (Offline & Private)
For total privacy and zero API costs, synth-pdb supports downloading and running a highly compressed Small Language Model (SLM) locally on your CPU or GPU.
-
Install the optional local AI dependencies:
-
Use the
--llm-backend localflag in your commands.
Note: The first time you use the local backend, it will automatically download a ~2.4GB GGUF model file (like Phi-3-Mini) to your HuggingFace cache. Subsequent runs will load it instantly.
2. Provide your Prompt
There are three ways to pass a natural language request to synth-pdb.
A. Inline Argument (Standard)
Pass the prompt directly on the command line:
B. Piped Input
Pipe the output of another tool or a file into synth-pdb:
C. Interactive Mode
Run --prompt without a value to open an interactive buffer. This is useful for complex, multi-line instructions:
synth-pdb --prompt
# Enter your natural language prompt (type 'exit' on a new line or press Ctrl+D to finish):
# > Build an alpha helix with
# > a zinc binding motif.
# > exit
Example Prompts
The natural language interface is designed to understand complex, multi-step biological requirements. Here are some examples of what you can ask it to do:
1. Specific Sequences and Secondary Structure
"Generate the sequence ALAGLYVAL in a beta sheet conformation." (Automatically parses
--sequence ALAGLYVAL --conformation beta)
2. Post-Translational Modifications & Capping
"Build a 50-residue random coil, but make sure to cap the N and C termini and minimize it so there are no clashes." (Automatically parses
--length 50 --conformation random --cap-termini --minimize)
3. Advanced Biophysics (Cyclization & Metals)
"Create a 15-mer alpha helix, fold it into a cyclic peptide, and automatically insert a zinc metal ion if there's a binding motif." (Automatically parses
--length 15 --conformation alpha --cyclic --metal-ions auto)
4. Synthetic Data Generation (NMR / CD)
"I need a 40-residue polyproline II extended structure. Please simulate its Circular Dichroism spectrum and generate its expected NMR chemical shifts." (Automatically parses
--length 40 --conformation ppii --gen-cd --gen-shifts)
Tips for Reliable Generation
While the Prompt-to-Protein interface is designed to be intuitive, performance varies depending on the LLM backend you choose.
1. Verify the Interpretation
When you run a prompt, synth-pdb will log exactly how it interpreted your request. Always check this line to ensure the model understood your structural requirements:
2. Handling "SLM Hallucinations" (Local Backend)
The local backend uses Small Language Models (SLMs) like Phi-3. These models are efficient but can occasionally hallucinate long sequences or "forget" to set structural flags if the prompt is too conversational.
- Avoid Conversation Filler: Instead of "Hey there, would you please generate me a...", try clinical phrasing: "Length 30, conformation beta, minimize".
- Sequence Over-Generation: If the model starts generating an extremely long sequence (e.g., "Successfully predicted S2 for 225 residues") when you didn't ask for one, try being more explicit about the length: "Generate exactly 20 residues".
- Defaulting to Alpha: If you don't see
--conformation beta(or your requested type) in the translation logs, the system will default to an Alpha Helix.
3. Use the Cloud for Precision
If you need high-precision mapping of multiple complex requirements (e.g., specific solvent models, rdc tensors, and PTM rates all in one prompt), the openai backend (using GPT-4o-mini) is significantly more robust and less prone to hallucinations than the smaller local models.
4. Nonsense Prompts
If you provide a prompt that contains no biological or structural instructions (e.g., "The quick brown fox..."), the system will log a warning:
In this case, it will fall back to a standard 10-residue linear peptide.How it Works (Under the Hood)
The LLM interface is strictly isolated from the core biophysics engine. It uses a Zero-Shot JSON Translation technique:
- The text of your
--promptis combined with a strict system prompt containing thesynth-pdbCLI schema. - The LLM translates your English sentence into a valid JSON object.
- The interface intercepts this JSON, safely parses it, and dynamically injects those keys directly into the execution flow.
- The generation engine runs normally, unaware that an LLM was used.
Educational Example: Fine-Tuning Your Own Model
If you wish to train your own specialized model (so it deeply understands synth-pdb and runs even faster), we have provided a complete educational script to generate the necessary training corpus.
You can find the SLM Fine-Tuning Pipeline script in:
examples/ml_integration/slm_finetuning_pipeline.py
This script demonstrates how to:
- Procedurally generate (Prompt -> JSON) pairs.
- Format the data into the strict JSONL schema required for HuggingFace/LoRA training.
- Learn the concepts of GGUF Quantization and Low-Rank Adaptation (LoRA) through inline pedagogical documentation.