Machine-guided path sampling to discover mechanisms of molecular self-organization

Maximum likelihood estimation of the committor function

The committor p_B(x) is the probability that a trajectory initiated at configuration X with Maxwell–Boltzmann velocities reaches the (meta)stable state B before reaching A. Trajectory shooting thus constitutes a Bernoulli process. We expect to observe n_A and n_B trajectories to end in A and B, respectively, with binomial probability $p({n}_{{\mathrm{A}}},{n}_{{\mathrm{B}}}| {{{\bf{x}}}})=\binom{{{n}_{{\mathrm{A}}}+{n}_{{\mathrm{B}}}}}{{{n}_{{\mathrm{A}}}}}{(1-{p}_{{\mathrm{B}}}({{{\bf{x}}}}))}^{{n}_{{\mathrm{A}}}}{p}_{{\mathrm{B}}}{({{{\bf{x}}}})}^{{n}_{{\mathrm{B}}}}$. For k shooting points x_i, the combined probability defines the likelihood ${{{\mathcal{L}}}}=\mathop{\prod }\nolimits_{i = 1}^{k}p({n}_{{\mathrm{A}}}(i),{n}_{{\mathrm{B}}}(i)| {{{{\bf{x}}}}}_{i})$. Here we ignore the correlations that arise in fast inertia-dominated transitions for trajectories shot off with opposite initial velocities^11,18. We model the unknown committor with a parametric function and estimate its parameters w by maximizing the likelihood ${{{\mathcal{L}}}}$ (refs. ^5,15). We ensure that 0 ≤ p_B(x) ≤ 1 by writing the committor in terms of a sigmoidal activation function, ${p}_{{\mathrm{B}}}[q({{{\bf{x}}}}| {{{\bf{w}}}})]=1/(1+\exp [-q({{{\bf{x}}}}| {{{\bf{w}}}})])$. Here we model the log-probability q(x|w) using a neural network¹⁵ and represent the configuration with a vector x of features. For N > 2 states S, the multinomial distribution provides a model for p(n₁, n₂, . . . , n_N|x), and writing the committors to states S in terms of the softmax activation function ensures normalization, $\mathop{\sum }\nolimits_{S = 1}^{N}{p}_{S}=1$. The loss function l(w|θ) used in the training is the negative logarithm of the likelihood ${{{\mathcal{L}}}}$.

Training points from transition path sampling

TPS^4,12 is a powerful Markov chain MC method in path space to sample TPs. The two-way shooting move is an efficient proposal move in TPS⁴. It consists of randomly selecting a shooting point X_SP on the current TP χ according to probability p_sel(X_SP|χ), drawing random Maxwell–Boltzmann velocities, and propagating two trial trajectories from X_SP until they reach either one of the states. Because one of the trial trajectories is propagated after first inverting all momenta at the starting point, that is, it is propagated backward in time, a continuous TP can be constructed if both trials end in different states. Given a TP χ, a new TP χ′ generated by two-way shooting is accepted into the Markov chain with probability³⁷${p}_{{{{\rm{acc}}}}}({\chi }^{{\prime} }| \chi )=\min (1,{p}_{{{{\rm{sel}}}}}({{{{\bf{X}}}}}_{{{{\rm{SP}}}}}| {\chi }^{{\prime} })/{p}_{{{{\rm{sel}}}}}({{{{\bf{X}}}}}_{{{{\rm{SP}}}}}| \chi ))$. If the new path is rejected, χ is repeated.

Knowing the committor, it is possible to increase the rate at which TPs are generated by biasing the selection of shooting points towards the transition state ensemble³⁷, that is, regions with high reactive probability p(TP|X). For the two-state case, this is equivalent to biasing towards the p_B(x) = 1/2 isosurface defining the transition states with q(x) = 0. To construct an algorithm that selects new shooting points biased toward the current best guess for the transition state ensemble and that iteratively learns to improve its guess based on every newly observed shooting outcome, we need to balance exploration with exploitation. To this end, we select the new shooting point X from the current TP χ using a Lorentzian distribution centered around the transition state ensemble, ${p}_{{{{\rm{sel}}}}}({{{\bf{X}}}}| \chi )=1/\mathop{\sum}\limits_{{{{{\bf{x}}}}}^{{\prime} }\in \chi }[(q{({{{\bf{x}}}})}^{2}+{\gamma }^{2})/(q{({{{{\bf{x}}}}}^{{\prime} })}^{2}+{\gamma }^{2})],$ where larger values of γ lead to an increase of exploration. The Lorentzian distribution provides a trade-off between production efficiency and the occasional exploration away from the transition state, which is necessary to sample alternative reaction channels.

With the learned committor function, one can optimize the definition of the state boundaries. An initially tight state definition can be softened by moving the boundaries outward to, say, p_B(x) = 0.1 and p_B(x) = 0.9. This loosening leads to shorter TPs and speeds up the sampling.

Real-time validation of committor model prediction

The relation between the committor and the transition probability¹¹ enables us to calculate the expected number of TPs generated by shooting from a configuration X. We validate the learned committor on-the-fly by estimating the expected number of transitions before shooting from a configuration and comparing it with the observed shooting result. The expected number of transitions ${n}_{{{{\rm{TP}}}}}^{\exp }$ calculated over a window containing the k most recent two-way shooting⁴ attempts is ${n}_{{{{\rm{TP}}}}}^{\exp }=\mathop{\sum }\nolimits_{i = 1}^{k}2(1-{p}_{{\mathrm{B}}}({{{{\bf{x}}}}}_{i},i)){p}_{{\mathrm{B}}}({{{{\bf{x}}}}}_{i},i)$, where p_B(x_i, i) is the committor estimate for trial shooting point X_i at step i before observing the shooting result. We initiate learning when the predicted (${n}_{{{{\rm{TP}}}}}^{\exp }$) and actually generated (${n}_{{{{\rm{TP}}}}}^{{{{\rm{gen}}}}}$) number of TPs differ. We define an efficiency factor, ${\alpha }_{{{{\rm{eff}}}}}=\min (1,{(1-{n}_{{{{\rm{TP}}}}}^{{\mathrm{gen}} }/{n}_{{{{\rm{TP}}}}}^{{\mathrm{exp}} })}^{2})$, where a value of zero indicates perfect prediction (Extended Data Fig. 9). By training only when necessary, we avoid overfitting. Here we used α_eff to scale the learning rate in the gradient descent algorithm. In addition, no training takes place if α_eff is below a certain threshold (specified further below for each system).

Neural network architectures

Molecular mechanisms can be described at different levels of resolution. One can use many high-resolution features that quantify local properties or fewer low-resolution features that measure global properties. While high-resolution features tend to be readily available, the choice of meaningful low-resolution features relies on physical understanding. With a focus on rare molecular events, we aimed to arrange features in a resolution hierarchy, going from Cartesian coordinates of atomic positions—the highest possible resolution—to a single quantity, the committor.

We designed the neural networks in this study to encourage them to learn the resolution hierarchy of features. Neural networks have shown the ability to learn low-resolution features from high-resolution ones, for example, when used in image recognition. From a practical point of view, the layer width of our networks is constantly decreasing, moving from inputs to output. In addition, as the learned features become increasingly global (and therefore less redundant) while going to deeper layers, we decrease the dropout probability moving up in the network. This is also reflected in the different architecture used for the clathrate formation where, due to the already coarse-grained and system-specific features, we used a comparatively simple pyramidal feed-forward network.

Distilling explicit expressions for the committor

In any specific molecular process, we expect that only a few of the many degrees of freedom actually control the transition dynamics. We identify the inputs to the committor model that have the largest role in determining its output after training. To this end, we first calculate a reference loss, l_ref = l(w, θ), over the unperturbed training set to compare with the values obtained by perturbing every input one by one³⁸. We then average the loss $l({{{\bf{w}}}},{\widetilde{{{{\bf{\uptheta }}}}}}_{i})$ over ≥100 perturbed training sets ${\widetilde{{{{\bf{\uptheta }}}}}}_{i}$ with randomly permuted values of the input coordinate i in the batch dimension. The average loss difference ${{\Delta }}{l}_{i}=\left\langle l({{{\bf{w}}}},{\widetilde{{{{\bf{\uptheta }}}}}}_{i})\right\rangle -{l}_{{{{\rm{ref}}}}}$ is large if the ith input strongly influences the output of the trained model, that is, it is relevant for predicting the committor.

In the low-dimensional subset consisting of only the most relevant inputs z (the ones with the highest Δl_i), symbolic regression generates compact mathematical expressions that approximate the full committor. Our implementation of symbolic regression is based on the Python package dcgpy³⁹ and uses a genetic algorithm with a (N + 1) evolution strategy. In every generation, N new expressions are generated through random changes to the mathematical structure of the fittest expression of the parent generation. A gradient-based optimization is subsequently used to find the best parameters for every expression. The fittest expression is then chosen as parent for the next generation. The fitness of each trial expression p_B(z) is measured by ${l}_{{{{\rm{sr}}}}}({{{{\bf{w}}}}}_{{{{\rm{sr}}}}}| {{{\bf{\uptheta }}}})\equiv -\log {{{\mathcal{L}}}}[{p}_{{\mathrm{B}}}({{{{\bf{z}}}}}_{{{{\rm{sp}}}}})]+\lambda C$, where we added the regularization term λC to the log-likelihood (see ‘Maximum likelihood estimation of the committor function’) to keep expressions simple and avoid overfitting. Here λ > 0 and C is a measure of the complexity of the trial expression, estimated in our case by the number of mathematical operations.

Symbolic regression will produce expressions of differing complexity depending on the regularization strength. We select expressions with a reasonable trade-off between simplicity and accuracy using a Pareto plot (Fig. 2c), in which we plot the complexity (measured as the number of mathematical operations) against the accuracy (measured as the loss on validation data). By scanning a range of λ values, we can identify models at the Pareto front for further analysis.

Assembly of ion pairs in water

We investigated the formation of monovalent ion pairs in water to assess the ability of the algorithm to accurately learn the committor for transitions that are strongly influenced by solvent degrees of freedom. We used six different system set-ups (LiCl, LiI, NaCl, NaI, CsCl and CsI), each consisting of one cation and one anion in water.

All MD simulations were carried out in cubic simulation boxes using the Joung and Cheatham force field⁴⁰ together with TIP3P⁴¹ water. Each simulation box contained a single ion pair solvated with 370 TIP3P water molecules. We used the openMM MD engine⁴² to propagate the equations of motion in time steps of Δt = 2 fs with a velocity Verlet integrator with velocity randomization⁴³ from the Python package openmmtools. After an initial NPT equilibration at constant pressure P = 1 bar and constant temperature T = 300 K, all production simulations were performed in the NVT ensemble at a constant volume V and a constant temperature of T = 300 K. The friction was set to 1 ps⁻¹. Non-bonded interactions were calculated using a particle mesh Ewald scheme⁴⁴ with a real-space cut-off of 1 nm and an error tolerance of 0.0005. Each TPS simulation (consisting of MD simulations and neural network training) was carried out on half a node using one Xeon Gold 6248 central processing unit (CPU) in conjunction with one RTX5000 GPU. In TPS, the associated and disassociated states were defined according to interionic distances (see Supplementary Table 4 for the values for each ionic species).

The committor of a configuration is invariant under global translations and rotations in the absence of external fields, and it is additionally invariant with respect to permutations of identical particles. We therefore chose to transform the system coordinates from the Cartesian space to a representation that incorporates the physical symmetries of the committor. To achieve an almost lossless transformation, we used the interionic distance to describe the solute and we adapted symmetry functions to describe the solvent configuration⁴⁵. Symmetry functions were developed originally to describe molecular structures in neural network potentials^19,46, but have also been successfully used to detect and describe different phases of ice in atomistic simulations⁴⁷. These functions describe the environment surrounding a central atom by summing over all identical particles at a given radial distance. The ${G}_{i}^{2}$ type of symmetry function quantifies the density of solvent molecules around a solute atom i in a shell centered at r_s

$${G}_{i}^{2}=\mathop{\sum}\limits_{j}{{\mathrm{e}}}^{-\eta {({r}_{ij}-{r}_{{\mathrm{s}}})}^{2}}{f}_{{\mathrm{c}}}({r}_{ij}),$$

where the sum runs over all solvent atoms j of a specific atom type, r_ij is the distance between the central atom i and atom j, and η controls the width of the shell. The function f_c(r) is a Fermi cut-off defined as:

$${f}_{{\mathrm{c}}}(r)=\left\{\begin{array}{ll}{\left[1+\exp ({\alpha }_{{\mathrm{c}}}(r-{r}_{{{{\rm{cut}}}}}-1/\sqrt{{\alpha }_{{\mathrm{c}}}}))\right]}^{-1}\quad &r\le {r}_{{{{\rm{cut}}}}}\\ 0\quad &r > {r}_{{{{\rm{cut}}}}}\end{array}\right.,$$

which ensures that the contribution of distant solvent atoms vanishes. The scalar parameter α_c controls the steepness of the cut-off. The ${G}_{i}^{5}$ type of symmetry function additionally probes the angular distribution of the solvent around the central atom i

$${G}_{i}^{5}=\mathop{\sum}\limits_{j,k > j}{\left(1+\lambda \cos {\vartheta }_{ijk}\right)}^{\zeta }{{\mathrm{e}}}^{-\eta \left[{({r}_{ij}-{r}_{{\mathrm{s}}})}^{2}+{({r}_{ik}-{r}_{{\mathrm{s}}})}^{2}\right]}{f}_{{\mathrm{c}}}({r}_{ik}){f}_{{\mathrm{c}}}({r}_{ij}),$$

where the sum runs over all distinct solvent atom pairs, ϑ_ijk is the angle spanned between the two solvent atoms and the central solute atom, the parameter ζ is an even number that controls the sharpness of the angular distribution, and λ = ±1 sets the location of the minimum with respect to ϑ_ijk at π and 0, respectively. See Supplementary Table 5 for the parameter combinations used. We scaled all inputs to lie approximately in the range [0, 1] to increase the numerical stability of the training. In particular, we normalized the symmetry functions by dividing them by the expected average number of atoms (or atom pairs) for an isotropic distribution in the probing volume.

Type G
²

The symmetry functions of type G² count the number of solvent atoms in the probing volume; the normalization constant ${\langle N[{G}_{i}^{2}]\rangle }_{{{{\rm{iso}}}}}$ is therefore the expected number of atoms in the probing volume ${V}_{{{{\rm{probe}}}}}^{(2)}$

$${\langle N[{G}_{i}^{2}]\rangle }_{{{{\rm{iso}}}}}={\rho }_{{\mathrm{N}}}{V}_{{{{\rm{probe}}}}}^{\,(2)},$$

where ρ_N is the average number density of the probed solvent atom type. The exact probing volume for the G² type can be approximated as

$$\begin{array}{lll}{V}_{{{{\rm{probe}}}}}^{(2)}&=&\int\nolimits_{0}^{\infty }{\mathrm{d}}r\int\nolimits_{0}^{\uppi }{\mathrm{d}}\theta \int\nolimits_{0}^{2\uppi }{\mathrm{d}}\phi \,{r}^{2}\sin (\theta )\exp (-\eta {(r-{r}_{{\mathrm{s}}})}^{2})\,{f}_{{\mathrm{c}}}(r)\\ &\approx &8\uppi {r}_{{\mathrm{s}}}^{2}\sqrt{2/\eta }.\end{array}$$

for small η and r_cut > r_s.

Type G
⁵

The functions of type G⁵ include an additional angular term and count the number of solvent atom pairs located on opposite sides of the central solute atom. The expected number of pairs ${\langle {N}_{{{{\rm{pairs}}}}}\rangle }_{{{{\rm{iso}}}}}$ can be calculated from the expected number of atoms in the probed volume ${\langle {N}_{{{{\rm{atoms}}}}}\rangle }_{{{{\rm{iso}}}}}$ as ${\langle {N}_{{{{\rm{atoms}}}}}\rangle }_{{{{\rm{iso}}}}}({\langle {N}_{{{{\rm{atoms}}}}}\rangle }_{{{{\rm{iso}}}}}-1)/2$. This expression is only exact for integer values of ${\langle {N}_{{{{\rm{atoms}}}}}\rangle }_{{{{\rm{iso}}}}}$ and can even become negative if ${\langle {N}_{{{{\rm{atoms}}}}}\rangle }_{{{{\rm{iso}}}}} < 1$. We therefore used an approximation which is guaranteed to be non-negative

$${\langle {N}_{{{{\rm{pairs}}}}}\rangle }_{{{{\rm{iso}}}}}\approx \frac{{\langle {N}_{{{{\rm{atoms}}}}}\rangle }_{{{{\rm{iso}}}}}^{2}}{2}.$$

The expected number of atoms ${\langle {N}_{{{{\rm{atoms}}}}}\rangle }_{{{{\rm{iso}}}}}$ can be calculated from the volume that is probed for a fixed solute atom and with one fixed solvent atom

$$\begin{array}{lll}{V}_{{{{\rm{probe}}}}}^{(5)}&=&{2}^{1-\zeta }\int\nolimits_{0}^{\infty }{\mathrm{d}}r\int\nolimits_{0}^{\uppi }{\mathrm{d}}\theta \int\nolimits_{0}^{2\uppi }{\mathrm{d}}\phi \,{r}^{2}\sin (\theta ){(1\pm \cos (\phi ))}^{\zeta }\exp (-\eta {(r-{r}_{{\mathrm{s}}})}^{2})\,{f}_{{\mathrm{c}}}(r)\\ &=&{2}^{1-\zeta }{V}_{{{{\rm{probe}}}}}^{(2)}\frac{(2\zeta -1)!!}{\zeta !}\end{array}$$

With the expectation that most degrees of freedom of the system do not control the transition, we designed neural networks that progressively filter out irrelevant inputs and build a highly nonlinear function of the remaining ones. We tested three different pyramidal neural network architectures ‘ResNet I’, ‘ResNet II’ and ‘SNN’, where names containing ‘ResNet’ indicate the use of residual units^48,49 and ‘SNN’ a self normalizing neural network architecture⁵⁰ (see Supplementary Tables 6–8 for the exact architectures used). The best performing architecture is ‘ResNet I’ (see Supplementary Data File 1 for performance comparison of the different architectures for all ionic systems). ResNet I used a pyramidal stack of five pre-activation residual units, each with four hidden layers. The number of hidden units per layer is reduced by a constant factor $f={(10/{n}_{{\mathrm{in}}})}^{1/4}$ after every residual unit block and decreases from n_in = 221 in the first unit to 10 in the last. In addition, a dropout of 0.1fⁱ, where i is the residual unit index ranging from 0 to 4, is applied after every residual block. Optimization of the network weights is performed using the Adam gradient descent⁵¹. For all architectures, training was performed after every third TPS MC step for one epoch with a learning rate of lr = α_eff10⁻³, if lr ≥ 10⁻⁴. The expected efficiency factor α_eff was calculated over a window of k = 100 TPS steps. We performed all deep learning with custom written code based on Keras⁵². The TPS simulations were carried out using a customized version of openpathsampling^53,54 together with our own Python module.

For the transfer training, the last layer with a single neuron (that is, the log predictor) of a model originally trained on LiCl was randomized and all other weights were kept fixed during the subsequent training on the shooting data for the other ionic species (LiI, NaCl, NaI, CsCl and CsI). Training was performed using the Adam optimizer with a learning rate of lr = 2.5 × 10⁻⁵. The test loss was calculated after every epoch on 20% of the data used as test set. The training was stopped when no decrease in the test loss was observed for more than 1,000 epochs. The model with the best test loss was then used.

For the extrapolation in chemical space (Extended Data Fig. 3), we set up a multi-ion neural network of architecture ‘ResNet I’. The model was trained on the shooting results for different pairs of ionic species simultaneously, as specified. It used the coordinates from the set ‘SF shortranged’ together with the ϵ and σ Lennard-Jones parameters of the force field to distinguish the different ionic species. Training was performed with the Adam optimizer (lr = 10⁻³) using 10% of the data as test set. The training was terminated if the test loss did not decrease for 1,000 epochs and the model with the best test loss was then used.

We selected the seven most relevant coordinates identified by the multi-ion neural network as inputs for the multi-ion symbolic regressions (Supplementary Tables 9–12). We used between 3 and 7 of these most relevant coordinates for independent symbolic regression runs using the regularization values λ = 0.001, λ = 0.0001 and λ = 0.00001. We then selected the expression reported in Fig. 2d using the Pareto plot in Fig. 2c.

We also selected the five most relevant coordinates identified from a neural network trained on LiCl for symbolic regression runs (Extended Data Fig. 4b–d). We regularized the produced expressions by penalizing the total number of elementary mathematical operations with λ = 10⁻⁶ and λ = 10⁻⁷.

The contributions of each atom to the committor in a particular system X (Fig. 1b) was calculated as the magnitude of the gradient of the reaction coordinate q(x) with respect to its Cartesian coordinates. All gradient magnitudes were scaled with the inverse atom mass.

Nucleation of methane clathrates

The learning algorithm was applied to an existing TPS dataset of methane-clathrate nucleation initially produced for ref. ²¹. It contains data for simulations carried out at four different temperatures T = 270 K, 275 K, 280 K and 285 K (see Supplementary Table 14 for details). New simulations were performed to obtain the sampled committor values used in the validation. All committor simulations were performed with OpenMM 7.1.1⁴² on NVIDIA GeForce GTX TITAN 1080Ti GPUs, shooting between 6 and 18 trajectories per configuration using the same simulation protocol as in ref. ²¹.

We used 22 different features to describe size, crystallinity, structure and composition of the growing methane-hydrate crystal nucleus (Supplementary Table 1). In addition to the features describing molecular configurations, we used temperature as an input to the neural networks and the symbolic regression. In a pyramidal feed-forward network with 9 layers, we reduced the number of units per layer from 23 at the input to one in the last layer (Supplementary Table 15). The network was trained with the Adam optimizer with learning rate lr = 10⁻³ on the existing TPS data for all temperatures, leaving out 10% of the shooting points as test data. We stopped the training after the loss on the test set did not decrease for 10,000 epochs and used the model with the best test loss. All neural network training was performed on a RTX6000 GPU. We used the three most relevant coordinates as inputs for symbolic regression runs with a penalty on the total number of elementary mathematical operations using λ = 10⁻⁵.

Polymer folding

We applied our machine learning algorithm on existing shooting data of polymer crystallization^24,25. We used two different sets of features to describe the transition, a set of 35 low-resolution (coarse-grained) features that has also been used in previous work and a set of high-resolution features describing each polymer bead on its own. The low-resolution features contain a number of global measures such as the potential energy U and the Steinhardt bond-order parameters Q₄ and Q₆, descriptions of the local environment of selected polymer particles, various measures describing the structure of the polymer by counting chains and loops, and some selected distances (see Supplementary Table 16 for an exhaustive list). The high-resolution feature set consists of the number of connections, neighbors and the neighbor-averaged Lechner–Dellago Steinhardt bond-order parameters⁵⁵ for each polymer bead, that is, each configuration corresponds to a feature vector with 3 × 128 = 384 entries.

For both the high-resolution and the low-resolution description, we used pyramid shaped neural networks (Supplementary Tables 17 and 18). In both cases, training was performed using the Adam gradient descent method with a learning rate lr = 10⁻³ using 20% of the data as test data. The models were saved and the test loss was calculated after every epoch. The training was stopped if the test loss did not decrease for 10,000 epochs. The model with the lowest test loss was then used as the final trained model. All neural network training was performed on an RTX6000 GPU.

We used between two and five of the five most relevant low-resolution features as inputs in symbolic regression runs (Supplementary Tables 19–22). We regularized by penalizing the number of elementary mathematical operations with λ = 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵ and 10⁻⁶.

Mga2 transmembrane dimer assembly in lipid membrane

We used the coarse-grained Martini force field (v2.2)^56,57,58,59 to describe the assembly of the alpha-helical transmembrane homodimer Mga2. All MD simulations were carried out with GROMACS v4.6.7^60,61,62,63 with an integration timestep of Δt = 0.02 ps, using a cubic simulation box containing the two identical 30-amino-acid-long alpha helices in a lipid membrane made of 313 1-palmitoyl-2-oleoyl-glycero-3-phosphocholine (POPC) molecules. The membrane spans the box in the x–y plane and was solvated with water (5,348 water beads) and NaCl ions corresponding to a concentration of 150 mM (58 Na⁺, 60 Cl⁻). A reference temperature of T = 300 K was enforced using the v-rescale thermostat⁶⁴ with a coupling constant of 1 ps separately on the protein, the membrane and the solvent. A pressure of 1 bar was enforced separately in the x–y plane and in z using a semiisotropic Parrinello–Rahman barostat⁶⁵ with a time constant of 12 ps and compressibility of 3 × 10⁻⁴ bar⁻¹. Each MD simulation was carried out on a single compute node with two E5-2680-v3 CPUs and 64 GB memory. All neural network training was performed on an RTX6000 GPU.

To describe the assembly of the Mga2 homodimer, we used 28 interhelical pairwise distances between the backbone beads of the two helices together with the total number of interhelical contacts, the distance between the helix centers of mass and a number of hand-tailored features describing the organization of lipids around the two helices (Supplementary Table 2). To ensure that all network inputs lie approximately in [0, 1], we used the sigmoidal function $f(r)={(1-(r/{R}_{0})^{6})}/{(1-(r/{R}_{0})^{12})}$ with R₀ = 2 nm for all pairwise distances, while we scaled all lipid features using the minimal and maximal values taken along the transition. The assembled and disassembled states are defined as configurations with ≥130 interhelical contacts and with helix–helix center-of-mass distances d_CoM ≥ 3 nm, respectively.

The neural network used to fit the committor was implemented using Keras⁵² and consisted of an initial 3-layer pyramidal part in which the number of units decreases from the 36 inputs to 6 in the last layer using a constant factor of (6/36)^1/2 followed by 6 residual units^48,49, each with 4 layers and 6 neurons per layer (Supplementary Table 23). A dropout of 0.01 is applied to the inputs and the network is trained using the Adam gradient descent protocol with a learning rate of lr = 0.0001 (ref. ⁵¹).

To investigate the assembly mechanism of Mga2, we performed machine-guided sampling in parallel on multiple nodes of a high-performance computer cluster. We ran 500 independent TPS chains guided by the current committor model. The 500 TPS simulations were initialized with random initial TPs. The neural network used to select the initial shooting points was trained on preliminary shooting attempts (8,044 independent shots from 1,160 different points). After two rounds (two steps in each of the 500 independent TPS chains), we updated the committor model by training on all new data. We retrained again after the sixth round. No further training was required, as indicated by consistent numbers of expected and observed counts of TPs. We performed another 14 rounds for all 500 TPS chains to collect TPs (Fig. 5b). Shooting point selection, TPS set-up and neural network training were fully automated in Python code using MDAnalysis^66,67, numpy⁶⁸ and our custom Python package.

The input importance analysis revealed the total number of contacts n_contacts as the single most important input (Extended Data Fig. 7). However, no expression generated by symbolic regression as a function of n_contacts alone was accurate in reproducing the committor. It is likely that n_contacts is used by the trained network only as a binary switch to distinguish the two different regimes—close to the bound or to the unbound states. By restricting the input importance analysis to training points close to the unbound state, we found that the network uses various interhelical contacts that approximately retrace a helical pattern (Extended Data Fig. 7). We performed symbolic regression on all possible combinations made by one, two or three of the seven most important input coordinates (Supplementary Table 3). The best expressions in terms of the loss were selected using validation committor data that had not been used during the optimization. This validation set consisted of committor data for 516 configurations with 30 trial shots each and 32 configurations with 10 trial shots.

To asses the variability in the observed reaction mechanisms, we performed a hierarchical clustering of all TPs projected into the plane defined by the contacts 9 and 22, which enter the most accurate parametrization generated by symbolic regression. We then used dynamic time warping⁶⁹ to calculate the pairwise similarity between all TPs for the clustering, which we performed using the scipy clustering module^70,71. The path density plots (Fig. 5f,g) were histogrammed according to the number of paths, not the number of configurations, that is, the counter of each cell visited by a particular path was incremented by one for this path.

Source link