Scalable 3D reconstruction for X-ray single particle imaging with online machine learning

Image formation and reconstruction

In X-ray SPI, when a laser pulse intersects a molecule, the light diffracts to form an intensity image on the detector. As described by diffraction theory³⁰, this intensity pattern samples the Fourier transform of the electron density along a geometric construct known as the Ewald sphere³². The resulting image on the detector represents the squared amplitude of the Fourier transform, losing phase information. We assume that the particle in each image is oriented randomly with respect to the laser and that the light incident on the detector is subject to Poisson-distributed photon noise. As such, the intensity measured at pixel (u, v) of the diffraction image can be written as

$${{{{\bf{I}}}}}(u,v;V,{R}_{\phi }) \sim {{{{\rm{Pois}}}}}\left[V\left({R}_{\phi }\cdot E(u,v)\right)\right],$$

(1)

where $V:{{\mathbb{R}}}^{3}\to {{\mathbb{R}}}_{+}$ represents the intensity volume, i.e., the squared amplitude of the Fourier transform of the 3D electron density volume, ${R}_{\phi }\in \,{\mbox{SO(3)}}\,\subset {{\mathbb{R}}}^{3\times 3}$ is the orientation of the particle, Pois[η] is a homogeneous Poisson process with rate η, and $E:{{\mathbb{R}}}^{2}\to {{\mathbb{R}}}^{3}$ is the Ewald sphere. A detector defines a set of N pixels ${\{({u}_{k},{v}_{k})\}}_{k\in \{1,\ldots,N\}}$.

To estimate the 3D electron density from a set of diffraction patterns in an online fashion, X-RAI uses an encoder–decoder network architecture (Fig. 1). As images are collected, our method optimizes a pair of neural networks: the convolutional encoder ${f}_{\psi }:{{\mathbb{R}}}^{{N}_{u}\times {N}_{v}}\to {{\mathbb{R}}}^{6},I\mapsto {f}_{\psi }\left(I\right)$, mapping images with a resolution of N_u × N_v pixels to poses defined by six scalars, and the implicit neural volume representation ${V}_{\theta }:{{\mathbb{R}}}^{3}\to {{\mathbb{R}}}_{+},({k}_{x},{k}_{y},{k}_{z})\mapsto {V}_{\theta }({k}_{x},{k}_{y},{k}_{z})$, mapping coordinates in Fourier space to intensities (see Methods for additional details). Similar to existing algorithms^25,27, we perform reconstruction in reciprocal space in order to operate with the diffraction patterns directly. Since the noise model is Poissonian, averaging the contributions of many diffraction patterns is sufficient to construct a noise-free estimate of the diffraction volume. We input the inverse Fourier transform of the square root of the diffraction images to the encoder, finding that this transformation helps pose estimation converge (as shown in the supplement). Instead of independently optimizing the pose of each diffraction image, X-RAI uses the same convolutional encoder for all the images and effectively learns to estimate poses on-the-fly. Based on Eq. (1), a physics-based decoder Γ reconstructs noise-free estimates of the input diffraction images using the known geometry of the detector, the predicted orientations R_ϕ, and an implicit neural representation of the intensity volume V_θ, a process that can be formally written as:

$$\Gamma :{V}_{\theta },{R}_{\phi } \, \mapsto \, {\{{V}_{\theta }\left({R}_{\phi }\cdot E({u}_{k},{v}_{k})\right)\}}_{k\in \{1,\ldots,N\}}.$$

(2)

Unlike a voxel-based representation ($\in {{\mathbb{R}}}^{N\times N\times N}$), the implicit neural representation maps 3D coordinates to intensities and can be queried continuously in Fourier space without the need to explicitly define an interpolation scheme. Furthermore, in a scenario where view directions are optimized jointly with the 3D density, implicit representations have been shown to be more adapted to gradient-based optimization than voxel arrays as they naturally learn low-frequency details first, thereby avoiding local minima³³.

**Fig. 1: Illustration of XFEL SPI image acquisition process and subsequent online processing by our algorithm.**

In the online setup, collected images are sequentially processed in small batches. X-RAI successively updates the parameters of the neural networks, ψ and θ, so as to maximally increase the likelihood of the observed batch ${{{{{\mathcal{B}}}}}}_{t}$. Specifically, the update rule takes a step opposed to the gradient of

$${{{{{\mathcal{L}}}}}}_{{{{{\rm{sym}}}}}}(\psi,\theta ;t)={\sum}_{{{{{\bf{I}}}}}\in {{{{{\mathcal{B}}}}}}_{t}}\min \left\{{{{{\rm{PL}}}}}(\Gamma ({V}_{\theta },{f}_{\psi }({{{{\bf{Y}}}}})),{{{{\bf{I}}}}}),{{{{\bf{Y}}}}}\in {{{{\mathcal{O}}}}}({{{{\rm{IFT}}}}}(\sqrt{{{{{\bf{I}}}}}}))\right\}$$

(3)

with respect to ψ and θ. Here, PL represents the Poisson negative log-likelihood of observing I, assuming that the measurement is a Poisson random variable of the noiseless intensity Γ(V_θ, f_ψ(Y)). The negative log-likelihood is implemented in PyTorch as the PoissonNLLLoss³⁴ as follows:

$${{{{\rm{PL}}}}}(i,t)=i-t\log (i)+\log (t!)$$

(4)

Due to quasisymmetries in the ground truth 3D density, diffraction images that appear similar but are associated with distant poses tend to be mapped to the same pose by the encoder, causing spurious symmetries to appear in the reconstructed volume when using a simple loss function of this type. To avoid this behavior, two variants of each diffraction image, the original image and a flipped version, are run through the pipeline in parallel, as indicated by the ${{{{\mathcal{O}}}}}(\,{\mbox{IFT}}\,(\sqrt{{{{{\bf{I}}}}}}))$ term in equation (3). ${{{{\mathcal{O}}}}}({{{{\bf{Y}}}}})$ corresponds to the pair ({Y(u_k, v_k)}, {Y(v_k, u_k)}), and gradients are only computed for the run that produces the lower reconstruction error. Our particular choice of ${{{{\mathcal{O}}}}}({{{{\bf{Y}}}}})$ ensures that X-RAI converges even in the presence of quasi-planar symmetries. In the online learning setup, images can be processed on the fly at a speed of 161 images per second on a single GPU. At any point during the experiment, the volume V_θ can be phased to obtain an estimate of the electron density.

Comparison to state-of-the-art offline reconstruction

We first validate our method on the problem of offline reconstruction, demonstrating reconstruction quality that matches or exceeds that of prior work. X-RAI can be trained offline on smaller datasets by conducting several epochs over the data until convergence. To compare our method’s performance against that of M-TIP²⁷ and Dragonfly²⁵, we simulate several synthetic datasets containing 50,000 images each using the software package Skopi³⁵. Two of these, the 40S ribosomal subunit (PDB: 7R4X)³⁶ and the ATP synthase molecule (PDB: 6J5I)³⁷, are shown in Fig. 2. X-RAI is able to reconstruct both particles to sub-2 nanometer resolution, surpassing the quality of the other two methods. All volumes shown in the figure represent the reconstructed densities after running the algorithms to convergence. Although we run M-TIP ten times independently and select the best reconstruction, it fails to reconstruct the ATP synthase protein (6J5I) during all the attempts, resulting in the degenerate electron density volume shown. In all simulations, we use a photon energy of 4600 eV and focus the beam to a spot with a diameter of one micron.

**Fig. 2: Comparison of X-RAI with M-TIP²⁷ and Dragonfly²⁵ on reconstructing datasets with 50,000 images in an offline setting.**

Dataset size and resolution

The benefits of using larger datasets can be seen when the data are corrupted with experimental artifacts such as background. In Fig. 3, we demonstrate the advantage of performing reconstruction with hundreds of thousands to millions of images. We synthetically generate datasets of three sizes of the 50S ribosomal subunit (PDB: 5O60)³⁸, containing 50,000, 500,000, and 5,000,000 images. The diffraction images are simulated at a fixed fluence of 0.02 mJ/μm² and a photon energy of 1 keV. For each dataset size, we sample a set of background intensities (shown in Fig. 3) that are then added uniformly to the images with the following formation model:

$${{{{\bf{I}}}}}(u,v;V,{R}_{\phi },B) \sim {{{{\rm{Pois}}}}}\left[V\left({R}_{\phi }\cdot E(u,v)\right)+B\right],$$

(5)

where B, the background intensity level, is constant across each dataset. For each dataset size and background level, we run reconstruction with X-RAI using fixed, ground truth poses and only optimize the neural implicit representation in order to separate the effects of pose estimation. Here, we employ the Poisson negative log-likelihood loss from equation (4) for optimization. Phase retrieval is performed with equivalent settings for all the reconstructed intensity volumes, using alternating iterations of the error reduction and difference-map algorithms as well as a procedure to fit an estimated background to the data. As shown in Fig. 3, when there is no background, the implicit representation can reconstruct the particle equally well at near-detector-limited resolution with 50K, 500K, or 5M images. However, as the background intensity increases, the resolution obtained using 500K or 5M images is higher than that using 50K images.

**Fig. 3: Benefit of more data in the presence of background.**

Online single particle reconstruction

For large datasets containing millions of images, X-RAI can be run online by processing data in small batches, each of which contributes a gradient update to the model. This amounts to running stochastic gradient descent over the entire dataset for one epoch, and X-RAI is able to process data at a speed of 161 images per second on a single GPU.

Figure 4 illustrates how reconstruction quality improves as X-RAI acquires synthetic data of the 50S ribosomal subunit (PDB: 5O60)³⁸ in an online fashion for two levels of beam fluence. For the higher fluence level corresponding to 10¹³ photons per pulse (9.38 mJ/μm²), the diffraction images have a higher level of signal and converge quickly to a resolution of 1.7 nm, compared to a resolution of 1.9 nm when using a fluence of 6 × 10¹² photons per pulse (5.63 mJ/μm²). In this experiment, all datasets are simulated with a photon energy of 4600 eV and a beam focus spot that is one micron in diameter.

**Fig. 4: Online reconstruction of the 50S ribosomal subunit (PDB: 5O60)³⁸.**

Online reconstruction is fundamentally more difficult than its offline counterpart because when reconstruction is run for more than one epoch, the encoder can potentially memorize the mapping of input images to poses (given a sufficiently small dataset) rather than learning the true function that relates these two quantities. Our method’s success at online reconstruction shows that the model determines the true association between diffraction patterns and orientations without relying on memorization.

Table 1 contains quantitative results for nine synthetic datasets consisting of three proteins simulated at three dataset sizes. Remarkably, X-RAI is able to process all the datasets containing 5 million images in an online manner. Each method is run for a maximum of 48 hours and terminated if it exceeds the time or memory budget. M-TIP is unable to process datasets with 500 thousand or 5 million images, and Dragonfly exceeds the time limit for 5 million images, as signified by the empty rows in the table. While Dragonfly is able to process the smallest datasets with 50K images more efficiently than X-RAI, the benefit of our method lies in its ability to reconstruct large datasets with millions of images in a serial fashion. It does so with relatively low memory usage (equivalent to the batch size), providing experimentalists with live reconstruction results during data collection that inform decision-making regarding the quality of the acquired data. For instance, users can opt to cease data acquisition early if the data achieve a desired resolution threshold.

Table 1 Quantitative comparison of X-RAI, M-TIP, and Dragonfly

Visual reconstruction results for all the synthetic datasets can be found in the supplement.

Reconstruction of the PR772 virus from experimental data

In addition to validating X-RAI on synthetic datasets in the online and offline settings, we also test its performance on a real experimental dataset captured at an XFEL. Consisting of 14,772 diffraction images of the PR772 coliphage virus, this dataset³⁹ is the largest biological collection of its kind from an SPI experiment. Owing to the virus’s size, the sample scatters hundreds of thousands of photons per image, which is significantly larger than our simulated proteins.

We reconstruct PR772 by extending the loss function in equation (3) to enforce the known icosahedral symmetry of the virus. During training, we randomly rotate the poses output by the encoder by a symmetry rotation of the icosahedron, resulting in a variant of the original loss function (3) that we refer to as the randomized symmetric loss:

$${{{{{\mathcal{L}}}}}}_{{{{{\rm{rand}}}}}\_{{{{\rm{sym}}}}}}(\psi,\theta ;t)={\sum}_{{{{{\bf{I}}}}}\in {{{{{\mathcal{B}}}}}}_{t}}\min \left\{{{{{\rm{PL}}}}}(\Gamma ({V}_{\theta },{R}_{i,t}\cdot {f}_{\psi }({{{{\bf{Y}}}}})),{{{{\bf{I}}}}}),{{{{\bf{Y}}}}}=\root 10 \of {{{{{\bf{I}}}}}}\right\}$$

(6)

where R_i,t represents a randomly-selected symmetry rotation. We note that while this customized loss function can support online reconstruction if the particle’s symmetry group is known prior to data capture, if the symmetry is unknown, then the analysis must be performed offline. For this dataset, we apply the tenth root operator to the diffraction images in order to reduce the high image contrast caused by the abundance of photon hits at low frequencies. We find that inputting the gamma-corrected diffraction image directly to the encoder helps it perform pose estimation more accurately. Due to the symmetries of the virus, the encoder can only map diffraction patterns to a subset of SO(3) (see supplement). To resolve this issue, we rotate the encoder’s pose estimate by R_i,t, a member of the icosahedral rotation group, which in its orientation-preserving form consists of 60 rotation matrices that produce symmetric views when applied to the virus’s density and intensity volumes. After the encoder outputs a pose estimate, the system rotates this estimate by a randomly-chosen rotation matrix from the icosahedral group, and we jointly optimize the orientation of the three axes defining this symmetry group during reconstruction. Since the rotation matrix is selected randomly for each image in the batch, icosahedral symmetry is enforced in a soft manner over all the batches.

This replication scheme effectively spreads out the encoder’s pose estimates over SO(3) and enables the intensity volume to be supervised on the entirety of ${{\mathbb{R}}}^{3}$, resulting in the reconstructions shown in Fig. 5. Optimization is run for 100 epochs on this dataset. After conducting phase retrieval, we achieve a resolution of 18.95 nm on the density volume. By comparison, the best previously reported reconstruction reaches a resolution of 9 nm on a filtered version of the dataset⁴⁰. The absolute resolution limit of this dataset, as determined by the coordinates of the detector corner, is 8.3 nm³⁹.

**Fig. 5: Reconstruction of the PR772 virus from experimental data collected at an XFEL facility³⁹.**

Reconstruction of a gold nanoparticle from experimental data

We successfully apply X-RAI to reconstruct a real dataset of a gold nanoparticle published in ref. ⁸, thereby validating the ability of our method to handle experimental challenges such as lower signal, panel gaps, bad pixels, and incident fluence variation. The sample we reconstruct from this dataset, named cub42, consists of around 2.4 million images collected in an SPI experiment at the European XFEL. This initial set is filtered down to a set of around 566,000 images containing hits with strong signal via the 2D classification procedure implemented in Dragonfly²⁵, with similar settings as found in ref. ⁸.

Similar to our reconstruction for the PR772 virus, we employ the randomized symmetric loss in equation (6) to handle the particle’s cubic symmetry, except that we transform the input images in the same manner as is done for the simulated experiments, with the inverse Fourier transform of the square root of the diffraction images:

(7)

We select R_i,t randomly from the set of 24 rotation matrices belonging to the chiral octahedral symmetry group. Additionally, we model experimental fluence fluctuation by scaling the images output by the Γ operator by incident fluences estimated with Dragonfly. Optimization is run for 8 epochs using only the low-resolution panels of the detector for efficiency. We then use the pose estimates output by this step along with the full-resolution diffraction patterns to assemble the final intensity volume via trilinear interpolation on a voxel grid. Finally, we run phase retrieval on this volume, averaging the reuslt of 64 independent runs. The resolution of X-RAI’s reconstruction, as determined by the phase retrieval transfer function (PRTF), is 4.56 nm, compared to the 4.89 nm value reported in ref. ⁸. Figure 6 visualizes X-RAI’s reconstruction as well as sample diffraction patterns from the filtered dataset and the PRTF.