Image formation and reconstruction
In X-ray SPI, when a laser pulse intersects a molecule, the light diffracts to form an intensity image on the detector. As described by diffraction theory30, this intensity pattern samples the Fourier transform of the electron density along a geometric construct known as the Ewald sphere32. The resulting image on the detector represents the squared amplitude of the Fourier transform, losing phase information. We assume that the particle in each image is oriented randomly with respect to the laser and that the light incident on the detector is subject to Poisson-distributed photon noise. As such, the intensity measured at pixel (u, v) of the diffraction image can be written as
$${{{{\bf{I}}}}}(u,v;V,{R}_{\phi }) \sim {{{{\rm{Pois}}}}}\left[V\left({R}_{\phi }\cdot E(u,v)\right)\right],$$
(1)
where \(V:{{\mathbb{R}}}^{3}\to {{\mathbb{R}}}_{+}\) represents the intensity volume, i.e., the squared amplitude of the Fourier transform of the 3D electron density volume, \({R}_{\phi }\in \,{\mbox{SO(3)}}\,\subset {{\mathbb{R}}}^{3\times 3}\) is the orientation of the particle, Pois[η] is a homogeneous Poisson process with rate η, and \(E:{{\mathbb{R}}}^{2}\to {{\mathbb{R}}}^{3}\) is the Ewald sphere. A detector defines a set of N pixels \({\{({u}_{k},{v}_{k})\}}_{k\in \{1,\ldots,N\}}\).
To estimate the 3D electron density from a set of diffraction patterns in an online fashion, X-RAI uses an encoder–decoder network architecture (Fig. 1). As images are collected, our method optimizes a pair of neural networks: the convolutional encoder \({f}_{\psi }:{{\mathbb{R}}}^{{N}_{u}\times {N}_{v}}\to {{\mathbb{R}}}^{6},I\mapsto {f}_{\psi }\left(I\right)\), mapping images with a resolution of Nu × Nv pixels to poses defined by six scalars, and the implicit neural volume representation \({V}_{\theta }:{{\mathbb{R}}}^{3}\to {{\mathbb{R}}}_{+},({k}_{x},{k}_{y},{k}_{z})\mapsto {V}_{\theta }({k}_{x},{k}_{y},{k}_{z})\), mapping coordinates in Fourier space to intensities (see Methods for additional details). Similar to existing algorithms25,27, we perform reconstruction in reciprocal space in order to operate with the diffraction patterns directly. Since the noise model is Poissonian, averaging the contributions of many diffraction patterns is sufficient to construct a noise-free estimate of the diffraction volume. We input the inverse Fourier transform of the square root of the diffraction images to the encoder, finding that this transformation helps pose estimation converge (as shown in the supplement). Instead of independently optimizing the pose of each diffraction image, X-RAI uses the same convolutional encoder for all the images and effectively learns to estimate poses on-the-fly. Based on Eq. (1), a physics-based decoder Γ reconstructs noise-free estimates of the input diffraction images using the known geometry of the detector, the predicted orientations Rϕ, and an implicit neural representation of the intensity volume Vθ, a process that can be formally written as:
$$\Gamma :{V}_{\theta },{R}_{\phi } \, \mapsto \, {\{{V}_{\theta }\left({R}_{\phi }\cdot E({u}_{k},{v}_{k})\right)\}}_{k\in \{1,\ldots,N\}}.$$
(2)
Unlike a voxel-based representation (\(\in {{\mathbb{R}}}^{N\times N\times N}\)), the implicit neural representation maps 3D coordinates to intensities and can be queried continuously in Fourier space without the need to explicitly define an interpolation scheme. Furthermore, in a scenario where view directions are optimized jointly with the 3D density, implicit representations have been shown to be more adapted to gradient-based optimization than voxel arrays as they naturally learn low-frequency details first, thereby avoiding local minima33.

A femtosecond-resolution X-ray pulse intersects a single, hydrated molecule, creating a diffraction image. This process is repeated at high rates to collect millions of diffraction images, each observing the molecule with an unknown orientation, or pose. Our algorithm, X-RAI, employs a CNN-based encoder fψ to efficiently estimate the pose Rϕ of the molecule in each image. A physically accurate decoder Γ (in Fourier space, shown here in real space) produces a noise-free estimate of the diffraction pattern using the molecule’s 3D structure, represented as a neural field Vθ. The symmetric loss \({{{{{\mathcal{L}}}}}}_{{{{{\rm{sym}}}}}}\) is applied to optimize the parameters ψ and θ in an online fashion using self-supervision. At any point during the experiment, the volume Vθ can be phased to obtain an estimate of the electron density.
In the online setup, collected images are sequentially processed in small batches. X-RAI successively updates the parameters of the neural networks, ψ and θ, so as to maximally increase the likelihood of the observed batch \({{{{{\mathcal{B}}}}}}_{t}\). Specifically, the update rule takes a step opposed to the gradient of
$${{{{{\mathcal{L}}}}}}_{{{{{\rm{sym}}}}}}(\psi,\theta ;t)={\sum}_{{{{{\bf{I}}}}}\in {{{{{\mathcal{B}}}}}}_{t}}\min \left\{{{{{\rm{PL}}}}}(\Gamma ({V}_{\theta },{f}_{\psi }({{{{\bf{Y}}}}})),{{{{\bf{I}}}}}),{{{{\bf{Y}}}}}\in {{{{\mathcal{O}}}}}({{{{\rm{IFT}}}}}(\sqrt{{{{{\bf{I}}}}}}))\right\}$$
(3)
with respect to ψ and θ. Here, PL represents the Poisson negative log-likelihood of observing I, assuming that the measurement is a Poisson random variable of the noiseless intensity Γ(Vθ, fψ(Y)). The negative log-likelihood is implemented in PyTorch as the PoissonNLLLoss34 as follows:
$${{{{\rm{PL}}}}}(i,t)=i-t\log (i)+\log (t!)$$
(4)
Due to quasisymmetries in the ground truth 3D density, diffraction images that appear similar but are associated with distant poses tend to be mapped to the same pose by the encoder, causing spurious symmetries to appear in the reconstructed volume when using a simple loss function of this type. To avoid this behavior, two variants of each diffraction image, the original image and a flipped version, are run through the pipeline in parallel, as indicated by the \({{{{\mathcal{O}}}}}(\,{\mbox{IFT}}\,(\sqrt{{{{{\bf{I}}}}}}))\) term in equation (3). \({{{{\mathcal{O}}}}}({{{{\bf{Y}}}}})\) corresponds to the pair ({Y(uk, vk)}, {Y(vk, uk)}), and gradients are only computed for the run that produces the lower reconstruction error. Our particular choice of \({{{{\mathcal{O}}}}}({{{{\bf{Y}}}}})\) ensures that X-RAI converges even in the presence of quasi-planar symmetries. In the online learning setup, images can be processed on the fly at a speed of 161 images per second on a single GPU. At any point during the experiment, the volume Vθ can be phased to obtain an estimate of the electron density.
Comparison to state-of-the-art offline reconstruction
We first validate our method on the problem of offline reconstruction, demonstrating reconstruction quality that matches or exceeds that of prior work. X-RAI can be trained offline on smaller datasets by conducting several epochs over the data until convergence. To compare our method’s performance against that of M-TIP27 and Dragonfly25, we simulate several synthetic datasets containing 50,000 images each using the software package Skopi35. Two of these, the 40S ribosomal subunit (PDB: 7R4X)36 and the ATP synthase molecule (PDB: 6J5I)37, are shown in Fig. 2. X-RAI is able to reconstruct both particles to sub-2 nanometer resolution, surpassing the quality of the other two methods. All volumes shown in the figure represent the reconstructed densities after running the algorithms to convergence. Although we run M-TIP ten times independently and select the best reconstruction, it fails to reconstruct the ATP synthase protein (6J5I) during all the attempts, resulting in the degenerate electron density volume shown. In all simulations, we use a photon energy of 4600 eV and focus the beam to a spot with a diameter of one micron.

Each row corresponds to a separate dataset simulated using Skopi35 with the protein data bank (PDB) code specified in the leftmost column. For both datasets, X-RAI is able to reconstruct each particle to within 1.5 nanometers of resolution, exceeding the quality of the other two methods. For the protein 6J5I, M-TIP fails to reconstruct the intensity volume, resulting in the degenerate density volume shown.
Dataset size and resolution
The benefits of using larger datasets can be seen when the data are corrupted with experimental artifacts such as background. In Fig. 3, we demonstrate the advantage of performing reconstruction with hundreds of thousands to millions of images. We synthetically generate datasets of three sizes of the 50S ribosomal subunit (PDB: 5O60)38, containing 50,000, 500,000, and 5,000,000 images. The diffraction images are simulated at a fixed fluence of 0.02 mJ/μm2 and a photon energy of 1 keV. For each dataset size, we sample a set of background intensities (shown in Fig. 3) that are then added uniformly to the images with the following formation model:
$${{{{\bf{I}}}}}(u,v;V,{R}_{\phi },B) \sim {{{{\rm{Pois}}}}}\left[V\left({R}_{\phi }\cdot E(u,v)\right)+B\right],$$
(5)
where B, the background intensity level, is constant across each dataset. For each dataset size and background level, we run reconstruction with X-RAI using fixed, ground truth poses and only optimize the neural implicit representation in order to separate the effects of pose estimation. Here, we employ the Poisson negative log-likelihood loss from equation (4) for optimization. Phase retrieval is performed with equivalent settings for all the reconstructed intensity volumes, using alternating iterations of the error reduction and difference-map algorithms as well as a procedure to fit an estimated background to the data. As shown in Fig. 3, when there is no background, the implicit representation can reconstruct the particle equally well at near-detector-limited resolution with 50K, 500K, or 5M images. However, as the background intensity increases, the resolution obtained using 500K or 5M images is higher than that using 50K images.

We simulate synthetic data of the 50S ribosomal subunit (PDB: 5O60)38 at a fixed fluence level of 0.02 mJ/μm2 and a photon energy of 1 keV at a resolution of 1282 per diffraction image. We generate three datasets with 50K, 500K, and 5M images, respectively, and test the effect of a uniform background intensity on the resolution of reconstruction. Here, only the implicit representation is tested with ground truth poses in order to separate the effects of inaccurate pose estimation. Reconstruction is performed for 100, 10, and 1 epoch(s) for the 50K-sized, 500K-sized, and 5M-sized datasets, respectively, such that each run performs the same number of gradient updates. As seen in the plot, the implicit representation achieves near-detector-limited resolution when there is zero background. With nonzero background, the resolution achieved using 50K images is lower than that using 500K images. The difference in resolution between using 500K and 5M images is negligible, a trend that is seen at all nonzero background levels. We perform 5 runs for each dataset size and background intensity to account for the stochasticity of reconstruction and phasing, and we plot the standard deviation as shaded error bars.
Online single particle reconstruction
For large datasets containing millions of images, X-RAI can be run online by processing data in small batches, each of which contributes a gradient update to the model. This amounts to running stochastic gradient descent over the entire dataset for one epoch, and X-RAI is able to process data at a speed of 161 images per second on a single GPU.
Figure 4 illustrates how reconstruction quality improves as X-RAI acquires synthetic data of the 50S ribosomal subunit (PDB: 5O60)38 in an online fashion for two levels of beam fluence. For the higher fluence level corresponding to 1013 photons per pulse (9.38 mJ/μm2), the diffraction images have a higher level of signal and converge quickly to a resolution of 1.7 nm, compared to a resolution of 1.9 nm when using a fluence of 6 × 1012 photons per pulse (5.63 mJ/μm2). In this experiment, all datasets are simulated with a photon energy of 4600 eV and a beam focus spot that is one micron in diameter.

Remarkably, X-RAI is able to reconstruct the protein even when acquiring the data sequentially in batches of 64 images. In the plot, resolution improves steadily as the model processes more images. Two synthetic datasets with varying levels of beam fluence (photons per pulse) are shown in order to illustrate the effect of signal level on reconstruction quality. Various time points on each of the fluence curves are labeled from 1–7, and the reconstructions corresponding to these points are visualized by the numbered structures shown in blue and red. Our method converges to a resolution of about 1.7 nm after processing 5 million images when handling images with 1013 photons per pulse (9.38 mJ/μm2). On the other hand, when operating under a lower fluence level of 6 × 1012 photons per pulse (5.63 mJ/μm2), X-RAI only converges to a resolution of 1.9 nm, demonstrating the effect of fluence on resolution. The data points for 1.28M, 2.56M, 3.84M, and 5M images correspond to processing times of approximately 2, 4, 6, and 8 hours, respectively. Since our method is an online algorithm that processes data serially, it can help experimentalists decide when to cease data collection via the results from live reconstruction.
Online reconstruction is fundamentally more difficult than its offline counterpart because when reconstruction is run for more than one epoch, the encoder can potentially memorize the mapping of input images to poses (given a sufficiently small dataset) rather than learning the true function that relates these two quantities. Our method’s success at online reconstruction shows that the model determines the true association between diffraction patterns and orientations without relying on memorization.
Table 1 contains quantitative results for nine synthetic datasets consisting of three proteins simulated at three dataset sizes. Remarkably, X-RAI is able to process all the datasets containing 5 million images in an online manner. Each method is run for a maximum of 48 hours and terminated if it exceeds the time or memory budget. M-TIP is unable to process datasets with 500 thousand or 5 million images, and Dragonfly exceeds the time limit for 5 million images, as signified by the empty rows in the table. While Dragonfly is able to process the smallest datasets with 50K images more efficiently than X-RAI, the benefit of our method lies in its ability to reconstruct large datasets with millions of images in a serial fashion. It does so with relatively low memory usage (equivalent to the batch size), providing experimentalists with live reconstruction results during data collection that inform decision-making regarding the quality of the acquired data. For instance, users can opt to cease data acquisition early if the data achieve a desired resolution threshold.
Visual reconstruction results for all the synthetic datasets can be found in the supplement.
Reconstruction of the PR772 virus from experimental data
In addition to validating X-RAI on synthetic datasets in the online and offline settings, we also test its performance on a real experimental dataset captured at an XFEL. Consisting of 14,772 diffraction images of the PR772 coliphage virus, this dataset39 is the largest biological collection of its kind from an SPI experiment. Owing to the virus’s size, the sample scatters hundreds of thousands of photons per image, which is significantly larger than our simulated proteins.
We reconstruct PR772 by extending the loss function in equation (3) to enforce the known icosahedral symmetry of the virus. During training, we randomly rotate the poses output by the encoder by a symmetry rotation of the icosahedron, resulting in a variant of the original loss function (3) that we refer to as the randomized symmetric loss:
$${{{{{\mathcal{L}}}}}}_{{{{{\rm{rand}}}}}\_{{{{\rm{sym}}}}}}(\psi,\theta ;t)={\sum}_{{{{{\bf{I}}}}}\in {{{{{\mathcal{B}}}}}}_{t}}\min \left\{{{{{\rm{PL}}}}}(\Gamma ({V}_{\theta },{R}_{i,t}\cdot {f}_{\psi }({{{{\bf{Y}}}}})),{{{{\bf{I}}}}}),{{{{\bf{Y}}}}}=\root 10 \of {{{{{\bf{I}}}}}}\right\}$$
(6)
where Ri,t represents a randomly-selected symmetry rotation. We note that while this customized loss function can support online reconstruction if the particle’s symmetry group is known prior to data capture, if the symmetry is unknown, then the analysis must be performed offline. For this dataset, we apply the tenth root operator to the diffraction images in order to reduce the high image contrast caused by the abundance of photon hits at low frequencies. We find that inputting the gamma-corrected diffraction image directly to the encoder helps it perform pose estimation more accurately. Due to the symmetries of the virus, the encoder can only map diffraction patterns to a subset of SO(3) (see supplement). To resolve this issue, we rotate the encoder’s pose estimate by Ri,t, a member of the icosahedral rotation group, which in its orientation-preserving form consists of 60 rotation matrices that produce symmetric views when applied to the virus’s density and intensity volumes. After the encoder outputs a pose estimate, the system rotates this estimate by a randomly-chosen rotation matrix from the icosahedral group, and we jointly optimize the orientation of the three axes defining this symmetry group during reconstruction. Since the rotation matrix is selected randomly for each image in the batch, icosahedral symmetry is enforced in a soft manner over all the batches.
This replication scheme effectively spreads out the encoder’s pose estimates over SO(3) and enables the intensity volume to be supervised on the entirety of \({{\mathbb{R}}}^{3}\), resulting in the reconstructions shown in Fig. 5. Optimization is run for 100 epochs on this dataset. After conducting phase retrieval, we achieve a resolution of 18.95 nm on the density volume. By comparison, the best previously reported reconstruction reaches a resolution of 9 nm on a filtered version of the dataset40. The absolute resolution limit of this dataset, as determined by the coordinates of the detector corner, is 8.3 nm39.

To enforce the known icosahedral symmetry of the virus, we augment the encoder’s orientation estimates with the symmetry rotations of the icosahedron, effectively spreading out the pose estimates over SO(3). The resulting density reconstruction is of resolution 18.95 nanometers, as determined by the phase retrieval transfer function. The diffraction patterns are displayed with reduced image contrast.
Reconstruction of a gold nanoparticle from experimental data
We successfully apply X-RAI to reconstruct a real dataset of a gold nanoparticle published in ref. 8, thereby validating the ability of our method to handle experimental challenges such as lower signal, panel gaps, bad pixels, and incident fluence variation. The sample we reconstruct from this dataset, named cub42, consists of around 2.4 million images collected in an SPI experiment at the European XFEL. This initial set is filtered down to a set of around 566,000 images containing hits with strong signal via the 2D classification procedure implemented in Dragonfly25, with similar settings as found in ref. 8.
Similar to our reconstruction for the PR772 virus, we employ the randomized symmetric loss in equation (6) to handle the particle’s cubic symmetry, except that we transform the input images in the same manner as is done for the simulated experiments, with the inverse Fourier transform of the square root of the diffraction images:
$${{{{{\mathcal{L}}}}}}_{{{{{\rm{rand}}}}}\_{{{{\rm{sym}}}}}}(\psi,\theta ;t)={\sum}_{{{{{\bf{I}}}}}\in {{{{{\mathcal{B}}}}}}_{t}}\min \left\{{{{{\rm{PL}}}}}(\Gamma ({V}_{\theta },{R}_{i,t}\cdot {f}_{\psi }({{{{\bf{Y}}}}})),{{{{\bf{I}}}}}),{{{{\bf{Y}}}}}={{{{\rm{IFT}}}}}(\sqrt{{{{{\bf{I}}}}}})\right\}$$
(7)
We select Ri,t randomly from the set of 24 rotation matrices belonging to the chiral octahedral symmetry group. Additionally, we model experimental fluence fluctuation by scaling the images output by the Γ operator by incident fluences estimated with Dragonfly. Optimization is run for 8 epochs using only the low-resolution panels of the detector for efficiency. We then use the pose estimates output by this step along with the full-resolution diffraction patterns to assemble the final intensity volume via trilinear interpolation on a voxel grid. Finally, we run phase retrieval on this volume, averaging the reuslt of 64 independent runs. The resolution of X-RAI’s reconstruction, as determined by the phase retrieval transfer function (PRTF), is 4.56 nm, compared to the 4.89 nm value reported in ref. 8. Figure 6 visualizes X-RAI’s reconstruction as well as sample diffraction patterns from the filtered dataset and the PRTF.

To enforce the known cubic symmetry of the nanoparticle, we augment the encoder’s orientation estimates with the symmetry rotations of the cube (equivalent to the octahedral symmetry group), effectively spreading out the pose estimates over SO(3). The low-resolution panels of the diffraction patterns shown in the figure are used to train X-RAI’s autoencoder end-to-end in order to determine the pose estimates of all the images. These estimates are then used to merge the full-resolution diffraction images into an explicit voxel representation. After phasing this full-resolution intensity volume, the resulting density reconstruction is of resolution 4.56 nanometers. The patterns on the left are displayed with reduced image contrast, and the diffraction volume shown only contains the low-resolution detector content.
