Theory
The Ensemble Problem for Disordered Proteins
Intrinsically Disordered Proteins (IDPs) do not adopt a single stable three-dimensional structure. Instead, they interconvert among an astronomically large number of conformations in solution — a structural ensemble. Classical structure-determination tools (X-ray crystallography, single-particle cryo-EM) are designed for ordered proteins and cannot capture this conformational heterogeneity. Solution-state techniques such as SAXS, NMR, and smFRET are sensitive to ensemble-averaged properties, but converting those measurements back into atomic coordinates requires solving an underdetermined inverse problem.
DiffEnsemble addresses this by learning a probabilistic generative model that maps sequence features directly to an ensemble of conformations, and training it end-to-end against experimental observables.
Variational Autoencoder
DiffEnsemble uses a Variational Autoencoder (VAE) (Kingma & Welling, 2013) as its generative backbone.
Encoder
The encoder \(q_\phi(\mathbf{z} \mid \mathbf{x})\) maps sequence features \(\mathbf{x} \in \mathbb{R}^{L \times F}\) to the parameters of a Gaussian distribution over a low-dimensional latent space \(\mathbf{z} \in \mathbb{R}^d\):
Reparameterisation Trick
To allow gradient flow through the stochastic sampling step, we write:
Drawing $K = $ ensemble_size samples gives the structural ensemble directly.
Decoder
The decoder \(p_\theta(\boldsymbol{\tau} \mid \mathbf{z})\) maps each latent sample to a vector of backbone torsion angles:
Differentiable Physics Engine
NeRF: Torsions → Cartesian Coordinates
The Natural Extension Reference Frame (NeRF) algorithm (Parsons et al., 2005)
converts internal coordinates (bond lengths, bond angles, dihedral angles) into
Cartesian coordinates in \(O(N)\) time and — crucially — is fully differentiable
with respect to all dihedral angles. DiffEnsemble uses the implementation
provided by diff_biophys.geometry.nerf.chain_nerf.
For each residue \(i\) we place three backbone heavy atoms (N, Cα, C) using fixed idealised bond lengths and angles, and the predicted \((\phi_i, \psi_i)\) torsions. Peptide bonds are assumed to be trans (\(\omega = \pi\)).
SAXS: Debye Formula
The small-angle X-ray scattering intensity of a single conformation is calculated via the Debye formula:
where \(f_i(q)\) are atomic form factors and \(r_{ij}\) is the distance between atoms \(i\) and \(j\). The ensemble-averaged profile is:
NMR Observables (planned)
Support for Residual Dipolar Couplings (RDCs) and backbone chemical shifts
via the diff_biophys.nmr kernels is planned for a future release.
Training Objective
DiffEnsemble minimises an Evidence Lower BOund (ELBO) that combines a biophysical reconstruction term with a KL regulariser:
The \(\beta\) coefficient (default 0.1) controls the trade-off between fitting the data and maintaining a well-regularised latent space (analogous to \(\beta\)-VAE).
Scientific Validation
| Benchmark | Metric | Status |
|---|---|---|
| Flory scaling (\(R_g \propto N^{0.588}\)) | Exponent ν ∈ [0.35, 0.75] | ✅ Implemented |
| Sic1 Rg (Gomes et al., JACS 2020) | \(R_g \approx 30.5\) Å | ✅ Implemented |
| CASP16 T1200 SpA RDC Q-factor | Q < 0.3 | 🔄 Pending NMR kernels |
| PED database parity | RMSD of observables | 🔄 Planned |
References
- Kingma & Welling (2013). Auto-Encoding Variational Bayes. arXiv:1312.6114
- Gomes et al. (2020). Conformational Ensembles of an IDP Consistent with NMR, SAXS, and smFRET (Forman-Kay Lab). JACS.
- McBride et al. (2025). Predicting Pose Distribution of Protein Domains (Montelione Lab).
- Parsons et al. (2005). Practical conversion from torsion space to Cartesian space for in silico protein synthesis. J. Comput. Chem. 26(10), 1063–1068.