Fourier-based three-dimensional multistage transformer for aberration correction in multicellular specimens



AO-LLS microscope

Imaging was performed using an AO-LLS microscope similar to one described previously3 (Supplementary Fig. 14 and Supplementary Table 7). Briefly, 488-nm and 560-nm lasers (500 mW 2RU-VFL-P-500-488-B1R and 1,000 mW 2RU-VFL-P-1000-560-B1R, MPB Communications Inc.) were modulated using an acousto-optical tunable filter (Quanta-Tech, AA OptoElectronic, AOTFnC-400.650-CPCh-TN) and shaped into a stripe by a Powell lens (Laserline Optics Canada, LOCP-8.9R20-2.0) and a pair of 50- and 250-mm cylindrical lenses (25-mm diameter; Thorlabs, ACY254-050 and LJ1267RM-A). The stripe illuminated a reflective, phase-only, gray-scale spatial light modulator (Meadowlark Optics, AVR Optics, P1920-0635-HDMI; 1,920 × 1,152 pixels) located at a sample conjugate plane. An eight-bit phase pattern written to the spatial light modulator generated the desired light-sheet pattern in the sample, and an annular mask (Thorlabs Imaging) at a pupil conjugate plane blocked unwanted diffraction orders before the light passed through the excitation objective (Thorlabs, TL20X-MPL). A pair of pupil conjugate galvanometer mirrors (Cambridge Technology, Novanta Photonics, 6SD11226 and 6SD11587) scanned the light sheet at the sample plane. The sample was positioned at the common foci of the excitation and detection objectives by a three-axis XYZ stage (Smaract; MLS-3252-S, SLS-5252-S and SLS-5252-S). Fluorescence emission from the sample was collected by a detection objective (Zeiss, ×20 1.0 numerical aperture (NA), 421452-9800-000), reflected off a pupil conjugate DM (ALPAO, DM69) that applied aberration corrections, and then recorded on two sample conjugate cameras (Hamamatsu ORCA Fusion).

SH measurements (Supplementary Fig. 18) were performed on the same microscope by localizing the intensity maxima (on a Hamamatsu ORCA Fusion) formed by the emitted light after passage through a pupil conjugate lenslet array (Edmund Optics, 64-479). The positional shifts of these maxima relative to those seen with no specimen present encode the pupil wavefront phase2, which can then be reconstructed.

Integration with microscope

AOViFT inference is performed routinely on the microscope acquisition PC (Intel Xeon, W5-3425, Windows 11, 512 GB RAM, NVIDIA A6000 with 48 GB VRAM). Inferences are made in an Ubuntu Docker container based on the TensorFlow NGC Container (24.02-tf2-py3) running in parallel with the microscope control software. Data communication between AOViFT and the microscope control software is handled through the computer’s file system. Image files and command-line parameters are passed to the model, and an output text file reports the resultant DM actuator values (Supplementary Fig. 19). When a volume is large enough to require tiling and dozens of volumes need to be processed, model inferences are parallelized and run on a SLURM compute cluster consisting of four nodes, each containing four NVIDIA A100 80 GB GPUs.

Fluorescent beads and cells expressing fluorescent endocytic adapter AP2

The 25-mm coverslips (Thorlabs, CG15XH) used for imaging beads, cells and zebrafish embryos were first cleaned by sonication in 70% ethanol followed by Milli-Q water, each for at least 30 min. They were then stored in Milli-Q water until use. Gene-edited SUM159-AP2-eGFP cells14 were grown in Dulbecco’s modified Eagle’s medium (DMEM)/F12 with GlutaMAX (Gibco, 10565018) supplemented with 5% fetal bovine serum (FBS; Avantor Seradigm, 89510-186), 10 mM HEPES (Gibco, 15630080), 1 μg ml−1 hydrocortisone (Sigma, H0888) and 5 μg ml−1 insulin (Sigma, I9278). Fluorescent beads (0.2-μm diameter, Invitrogen FluoSpheres Carboxylate-Modified Microspheres, 505/515 nm, cat. no. F8811 or 0.2-μm diameter Tetraspeck, Thermo Fisher Scientific Invitrogen, T7280) alone or with cells at 30–50% confluency were deposited onto plasma-treated and poly-d-lysine (Sigma-Aldrich, P0899)-treated 25-mm coverslips. Cells were cultured under standard conditions (37 °C, 5% CO2, 100% humidity) with twice weekly passaging. The SUM159-AP2-eGFP cells were imaged in Leibovitz’s L-15 medium without phenol red (Gibco, 21083027) with 5% FBS (American Type Culture Collection, SCRR-30-2020), 100 μM Trolox (Tocris, 6002) and 100 μg ml−1 Primocin (InvivoGen, ant-pm-1) at 37 °C. Aberrations of approximately 1λ P–V were induced using a DM in ten configurations of Zernike modes (\({Z}_{2}^{2}\), \({Z}_{3}^{-3}\), \({Z}_{3}^{-1}\), \({Z}_{4}^{\,0}\) and their pairwise combinations). Widefield PSFs were collected from 0.2-μm fluorescent beads to confirm the aberrations applied and residual aberrations after correction (Supplementary Table 7).

Zebrafish embryos expressing fluorescent AP2 and mitochondria

Genome-edited ap2s1-expressing zebrafish (genome editing of ap2s1, ap2s1:ap2s1-mNeonGreenbk800; Supplementary Note D) were injected with cox8-mChilada mRNA for two-color experiments. The N-terminal 34 amino acids of Cox8a were cloned into a pMTB backbone with a linker and mChilada coding sequence on the C terminus (unpublished, gift from N. Shaner). The plasmid was linearized, and mRNA was synthesized using a SP6 mMessage mMachine transcription kit (Thermo Fisher). RNA was purified using an RNeasy kit (Qiagen) and embryos were injected with 2 nl of 10 ng μl−1 Cox8a-mChilada, 100 mM KCl, 0.1% phenol red, 0.1 mM EDTA and 1 mM Tris, pH 7.5. Zebrafish embryos were first nanoinjected with 3 nl of a solution containing 0.86 ng μl−1 α-bungarotoxin protein, 1.43 × PBS and 0.14% phenol red. The injected embryos were mounted for imaging using a custom, volcano-shaped agarose mount. Each mount was constructed by solidifying a few drops of 1.2% (w/w) high-melting agarose (Invitrogen UltraPure Agarose, 16500–100, in 1× Danieau buffer) between a 25-mm glass coverslip and a 3D-printed mold (Formlabs Form 3+, printed in clear v.4 resin). This created ridges that formed a narrow groove. A hair-loop was used to orient the embryo within the agarose groove, positioning the left lateral side upward. Subsequently, 10–20 μl of 0.5% (w/w) low-melt agarose (Invitrogen UltraPure LMP Agarose, 16520–100, in 1× Danieau buffer) preheated to 40 °C, containing 0.2 μm Tetraspeck microspheres, was added on top of the embryo. This layer solidified around the embryo to secure it while providing fiducial beads for sample finding. Once the low-melt agarose solidified, the volcano-shaped mount was held by a custom sample holder for imaging. The embryo was oriented so that its anterior–posterior axis lay parallel to the sample x axis, with the anterior end facing the excitation objective and the posterior end facing the detection objective. The microscope objectives and the sample were fully submerged in a bath of ~50 ml of Danieau buffer, ensuring the embryo remained in buffered medium. Measurements for AOViFT and SH were done serially on the same FOV to compare the aberration corrections of both methods (Supplementary Table 7).

Spatially varying deconvolution

To compensate for sample-induced aberrations postacquisition, we performed a tile-based spatially varying deconvolution on each 3D volume. Each volume was first subdivided into several 3D tiles approximating isoplanatic patches. An AOViFT-predicted PSF (for compensation) or an ideal PSF (for no compensation) was assigned to each tile, and aberrations were corrected using OTF masked Wiener (OMW) deconvolution15. To minimize boundary artifacts during deconvolution, the tile size was extended by half the PSF width at each boundary (32 pixels); after deconvolution, these overlaps were removed and the deconvolved core regions were stitched together to form the final corrected volume. All computations were done in MATLAB v.2024a (Mathworks).
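The tile bookkeeping described above can be sketched along one axis as follows. This is only an illustration in Python (the original computations were done in MATLAB); the function name `tile_slices` and the core size of 128 pixels are hypothetical, and only the 32-pixel overlap comes from the text.

```python
def tile_slices(length, core, overlap=32):
    """Split one axis into core tiles, each padded by `overlap` pixels per side.

    Returns a list of ((lo, hi), (core_lo, core_hi)) pairs, where (lo, hi) is
    the padded extent in the full volume and (core_lo, core_hi) locates the
    core region inside the padded tile (the part kept after deconvolution).
    """
    tiles = []
    for start in range(0, length, core):
        stop = min(start + core, length)
        lo = max(start - overlap, 0)          # clamp padding at the volume edge
        hi = min(stop + overlap, length)
        tiles.append(((lo, hi), (start - lo, stop - lo)))
    return tiles


# Hypothetical example: a 256-pixel axis split into 128-pixel cores.
tiles = tile_slices(256, 128, overlap=32)
```

Applying the same function to each of the three axes gives the padded 3D tiles; after deconvolution, only the core slice of each tile would be copied into the stitched output.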

Synthetic training/testing datasets

To train a model for predicting optical aberrations from images of subdiffractive objects in biological samples, we generated synthetic datasets encompassing a range of relevant variables (for example, aberration modes and amplitudes, number and density of puncta, SNR). This synthetic dataset generation procedure is as follows.

For a single subdiffractive punctum, the electric field in the rear pupil of the detection objective is given by:

$$E({k}_{x},{k}_{y})=A({k}_{x},{k}_{y})\exp ({\mathrm{i}}\phi ({k}_{x},{k}_{y}))$$

(1)

where A(kx, ky) is the pupil amplitude with coordinates kx, ky, and ϕ(kx, ky) is the pupil phase. Under aberration-free conditions, ϕ(kx, ky) is a constant. We can empirically determine A(kx, ky) by acquiring a widefield image of an isolated subdiffractive object (100-nm fluorescent bead), performing phase retrieval12,16 and applying the opposite of the retrieved phase using a pupil conjugate DM so that ϕ(kx, ky) becomes a constant.

The electric field for the image of a single aberrated punctum is:

$${E}_{{\rm{abb}}}({k}_{x},{k}_{y})=A({k}_{x},{k}_{y})\exp ({\mathrm{i}}{\phi }_{{\rm{abb}}}({k}_{x},{k}_{y}))$$

(2)

where ϕabb(kx, ky) is described as a weighted sum of Zernike modes of unique amplitudes:

$${\phi }_{{\rm{abb}}}({k}_{x},{k}_{y})=\sum _{m,n}{\alpha }_{n}^{m}{Z}_{n}^{\,m}({k}_{x},{k}_{y})$$

(3)

Empirically, zebrafish-induced aberrations for the microscopes used here are well described by combinations of 11 of the first 15 Zernike modes17 (Supplementary Fig. 1), for which n ≤ 4, excluding piston (\({Z}_{0}^{\,0}\)), tip (\({Z}_{1}^{-1}\)), tilt (\({Z}_{1}^{1}\)) and defocus (\({Z}_{2}^{\,0}\)) (as these represent phase offsets or sample translation). The distributions and amplitudes of the remainder are used to build the training set as discussed below.

The aberrated 3D detection PSF of a subdiffractive punctum is approximated by:

$${{\rm{PSF}}}_{{\rm{abb}}}^{\det }(x,y,z)={\left\vert {\iint }_{{\rm{pupil}}}{E}_{{\rm{abb}}}({k}_{x},{k}_{y})\exp [{\mathrm{i}}({k}_{x}x+{k}_{y}y+{k}_{z}z)]{\rm{d}}{k}_{x}{\rm{d}}{k}_{y}\right\vert }^{2}$$

(4)

where \({k}_{z}=\sqrt{{(\frac{2\pi \eta }{\lambda })}^{2}-{k}_{x}^{2}-{k}_{y}^{2}}\), η is the refractive index of the imaging medium and λ is the free-space wavelength of the fluorescence emission.
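A discretized version of Eqs. 2 to 4 can be sketched with NumPy as below. This is a minimal illustration, not the simulator used for training: the flat circular pupil, the water refractive index, the 0.1 λ RMS astigmatism and the conversion of Zernike amplitudes from waves to radians are all assumptions, and `detection_psf` is a hypothetical helper name.

```python
import numpy as np

def detection_psf(amplitude, phase, dk, n_medium, wavelength, z_planes):
    """Discretized Eq. 4: |FT of the pupil field with a defocus term|^2 per z plane."""
    n = amplitude.shape[0]
    k = (np.arange(n) - n // 2) * dk                 # centered kx, ky samples (rad/um)
    kx, ky = np.meshgrid(k, k, indexing="xy")
    k0 = 2 * np.pi * n_medium / wavelength
    kz = np.sqrt(np.maximum(k0**2 - kx**2 - ky**2, 0.0))
    field = amplitude * np.exp(1j * phase)           # Eq. 2
    psf = np.empty((len(z_planes), n, n))
    for i, z in enumerate(z_planes):
        pupil_z = field * np.exp(1j * kz * z)
        # ifft2 matches the exp[+i(kx x + ky y)] sign convention of Eq. 4
        e_xy = np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(pupil_z)))
        psf[i] = np.abs(e_xy) ** 2
    return psf

# Example values; only the 125-nm voxel, 510-nm wavelength and NA 1.0 come from the text.
n, voxel = 64, 0.125                                 # grid size, lateral voxel (um)
wavelength, n_medium, na = 0.510, 1.33, 1.0          # water index is an assumption
dk = 2 * np.pi / (n * voxel)
k = (np.arange(n) - n // 2) * dk
kx, ky = np.meshgrid(k, k, indexing="xy")
rho = np.hypot(kx, ky) / (2 * np.pi * na / wavelength)
amplitude = (rho <= 1.0).astype(float)               # assumed flat, circular pupil
theta = np.arctan2(ky, kx)
z22 = np.sqrt(6) * rho**2 * np.cos(2 * theta)        # astigmatism, unit-RMS normalization
phase_abb = 2 * np.pi * 0.1 * z22 * amplitude        # 0.1 waves RMS, in radians

z_planes = np.array([-0.4, 0.0, 0.4])                # um
psf_ideal = detection_psf(amplitude, np.zeros((n, n)), dk, n_medium, wavelength, z_planes)
psf_abb = detection_psf(amplitude, phase_abb, dk, n_medium, wavelength, z_planes)
```

In this sketch the aberrated peak intensity is lower than the ideal one, which is the loss of Strehl ratio that the correction aims to recover.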

For light sheet microscopy, the aberrated 3D overall PSF is:

$${{\rm{PSF}}}_{{\rm{abb}}}^{{\rm{overall}}}(x,y,z)={{\rm{PSF}}}^{{\rm{exc}}}(z)\cdot {{\rm{PSF}}}_{{\rm{abb}}}^{\det }(x,y,z)$$

(5)

where PSFexc(z) is given by the cross-section of the swept light sheet used for imaging. Examples of these PSFs are shown in Supplementary Fig. 20 I–V, with MBSq-35 in Supplementary Table 6 used for training and imaging (see ref. 18 for additional information on these light sheets).

Each synthetic training volume sample V is 64 × 64 × 64 voxels in size spanning 8 × 8 × 12.8 μm3 (with 125 × 125 × 200 nm3 voxels) and containing between J = 1 and J = 5 puncta chosen from a uniform distribution and located randomly at points (xj, yj, zj) within the volume. Each punctum is modeled as a Gaussian of full width at half maximum wj chosen randomly from the set [100, 200, 300, 400] nm, allowing for features slightly larger than the diffraction limit. The image of each punctum is generated by its convolution with the aberrated PSF:

$${I}_{j}^{{\,\rm{bead}}}(x,y,z)={{\rm{PSF}}}_{{\rm{abb}}}^{{\rm{overall}}}(x,y,z)\otimes \exp \left[-4\ln (2)\frac{{x}^{2}+{y}^{2}+{z}^{2}}{w_{j}^{2}}\right]$$

(6)

The integrated photons No per punctum were selected from a uniform distribution of 1 to 200,000 photons. The total intensity distribution is:

$${I}_{{\rm{photon}}}(x,y,z)=\varUpsilon \cdot \mathop{\sum }\limits_{j=1}^{J}{I}_{j}^{\,{\rm{bead}}}(x-{x}_{j},y-{y}_{j},z-{z}_{j})$$

(7)

where

$$\varUpsilon =\frac{{N}_{o}}{{\iiint }_{-\infty }^{\infty }{I}_{j}^{\,{\rm{bead}}}(x,y,z)\,{\rm{d}}x\,{\rm{d}}y\,{\rm{d}}z}$$

(8)

As the signal from each aberrated punctum can exceed the boundary of V, the total signal SV within V is:

$${S}_{V}={\iiint }_{V}{I}_{{\rm{photon}}}(x,y,z)\,{\rm{d}}x\,{\rm{d}}y\,{\rm{d}}z\le J{N}_{o}$$

(9)

After accounting for partial signal contributions (SV), the photons per voxel were converted to camera counts by applying the quantum efficiency QE, Poisson shot noise η and camera read noise ϵ to arrive at the final synthetic training set example:

$${I}_{{\rm{camera}}}(x,y,z)={\rm{QE}}\cdot {I}_{{\rm{photon}}}(x,y,z)+\eta [{\rm{QE}}\cdot {I}_{{\rm{photon}}}(x,y,z)]+\epsilon$$

(10)
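The noise model of Eq. 10 can be sketched as below. The quantum efficiency of 0.82, the Gaussian read noise with a standard deviation of 1.6 and the function name `photons_to_camera` are placeholder assumptions for illustration, not values reported here; the shot-noise term is written as the deviation of a Poisson draw from its mean so that the three terms of Eq. 10 stay visible.

```python
import numpy as np

def photons_to_camera(i_photon, qe=0.82, read_sigma=1.6, rng=None):
    """Apply QE, Poisson shot noise and read noise to a photon image (Eq. 10)."""
    rng = np.random.default_rng() if rng is None else rng
    expected = qe * i_photon                          # QE * I_photon
    shot = rng.poisson(expected) - expected           # zero-mean shot-noise term
    read = rng.normal(0.0, read_sigma, size=i_photon.shape)  # assumed Gaussian read noise
    return expected + shot + read

# Hypothetical example: a uniform volume of 100 photons per voxel.
rng = np.random.default_rng(0)
camera = photons_to_camera(np.full((32, 32, 32), 100.0), rng=rng)
```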

Zernike distributions

To ensure that the training set is diverse enough to cover potential aberrations, the amplitudes of the 11 included aberration modes (shown in color in Supplementary Fig. 1) were drawn for each training example, with equal probability, from one of four different distributions:

  1. Single mode (Supplementary Fig. 21b). One mode is randomly chosen, with amplitude α chosen randomly from 0 ≤ α ≤ 0.5 λ RMS.

  2. Bimodal (Supplementary Fig. 21c). An initial target for the total amplitude αt is chosen randomly from 0 ≤ αt ≤ 0.5 λ RMS. A second partitioning factor ϵ is chosen randomly from 0 ≤ ϵ ≤ 1. The amplitudes of the two modes are then α1 = ϵαt and α2 = (1 − ϵ)αt.

  3. Powerlaw (Supplementary Fig. 21d). An initial target for the total amplitude αt is chosen randomly from 0 ≤ αt ≤ 0.5 λ RMS. The initial partitioning factors ϵn for the modes are chosen randomly from a Lomax (that is, Pareto II) distribution19:

    $${\epsilon }_{n}=\frac{\gamma }{{({x}_{n}+1)}^{\gamma +1}}\quad {\rm{where}}\quad \gamma =0.75$$

    (11)

    where each xn is chosen randomly from 0 ≤ xn ≤ 1. They are then renormalized:

    $${\epsilon }_{n}^{{\prime} }=\frac{{\epsilon }_{n}}{\mathop{\sum }\nolimits_{n = 1}^{11}{\epsilon }_{n}}$$

    (12)

    and the final amplitudes of the modes are \({\alpha }_{n}={\epsilon }_{n}^{{\prime} }{\alpha }_{t}\).

  4. Dirichlet (Supplementary Fig. 21e). An initial target for the total amplitude αt is chosen randomly from 0 ≤ αt ≤ 0.5 λ RMS. The initial partitioning factors ϵn for the modes are chosen randomly from 0 ≤ ϵn ≤ 1. They are then renormalized:

    $${\epsilon }_{n}^{{\prime} }=\frac{{\epsilon }_{n}}{\mathop{\sum }\nolimits_{n = 1}^{11}{\epsilon }_{n}}$$

    (13)

    and the final amplitudes of the modes are \({\alpha }_{n}={\epsilon }_{n}^{{\prime} }{\alpha }_{t}\).

Together, the training examples from these four distributions create a diverse set of overall aberration amplitudes and number of significant modes in the training data, with all 11 modes contributing equally across the dataset (Supplementary Fig. 21a).
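The four sampling schemes above can be sketched as follows. This is a minimal reading of the text: how modes are selected in the single-mode and bimodal cases (uniformly, without replacement) and the absence of sign randomization are assumptions, and `sample_amplitudes` is a hypothetical name.

```python
import numpy as np

N_MODES = 11      # modes retained after excluding piston, tip, tilt and defocus
MAX_AMP = 0.5     # upper bound on the total amplitude, in waves RMS
GAMMA = 0.75      # shape parameter in Eq. 11
KINDS = ("single", "bimodal", "powerlaw", "dirichlet")

def sample_amplitudes(kind, rng):
    """Return 11 mode amplitudes drawn according to one of the four schemes."""
    amps = np.zeros(N_MODES)
    total = rng.uniform(0.0, MAX_AMP)                 # alpha (single) or alpha_t
    if kind == "single":
        amps[rng.integers(N_MODES)] = total
    elif kind == "bimodal":
        i, j = rng.choice(N_MODES, size=2, replace=False)
        eps = rng.uniform(0.0, 1.0)
        amps[i], amps[j] = eps * total, (1.0 - eps) * total
    elif kind == "powerlaw":
        x = rng.uniform(0.0, 1.0, N_MODES)
        eps = GAMMA / (x + 1.0) ** (GAMMA + 1.0)      # Eq. 11
        amps = total * eps / eps.sum()                # Eq. 12
    elif kind == "dirichlet":
        eps = rng.uniform(0.0, 1.0, N_MODES)
        amps = total * eps / eps.sum()                # Eq. 13
    else:
        raise ValueError(f"unknown distribution: {kind}")
    return amps

# Each training example picks one of the four schemes with equal probability.
rng = np.random.default_rng(0)
example = sample_amplitudes(KINDS[rng.integers(len(KINDS))], rng)
```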

Training dataset

For the model training, a dataset of 2 million synthetic 3D volumes was created, with aberration magnitude uniformly sampled from 0.0 to 0.5 λ RMS (at wavelength λ = 510 nm), a uniform distribution of the number of objects between 1 and 5, and between 1 and 200,000 integrated photons per object.

Test dataset

To evaluate our models, we created a test dataset with 100,000 3D volumes. The parameter distribution was the same as training, but extended the aberration magnitude up to 1.0 λ RMS, and up to 500,000 integrated photons. To test the operational limit of our models, this test dataset included up to 150 objects in any given volume.

Fourier embedding

Most ML vision models operate on real-space representations of the data, which lack clearly defined limits on image size or feature descriptors of their content. Instead, we used Fourier domain embeddings (Supplementary Fig. 24). These are bound by the microscope’s OTF. Aberrations within an isoplanatic patch globally affect all photons within that patch, producing a unique, learnable ‘fingerprint’ pattern in the FFT amplitude and phase (Supplementary Note A.1 and Supplementary Figs. 5–7).

Preprocessing

To create Fourier embeddings (Fig. 1b) for our model, we preprocess the input 3D image stack W of CCPs within an isoplanatic region to suppress noise and edge artifacts (Fig. 1a),

$$V=\varUpsilon (W\,).$$

(14)

The preprocessing module (ϒ) begins with a set of filters to extract sharp-edged objects that reveal the aberration signatures: a Gaussian high-pass filter to remove inhomogeneous background and a low-pass filter through a Fourier frequency filter, with cutoff set at the detection NA limit (σ = 3 voxels). A Tukey window (Tukey cosine fraction = 0.5, in \(\hat{x}\hat{y}\) only) is applied to remove FFT edge artifacts from the volume borders. No windowing is applied along the axial direction, \(\hat{z}\), because embeddings are constructed near kz = 0 where aberration information is maximized.
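A rough NumPy sketch of such a preprocessing chain is given below. It is an interpretation rather than the actual module: σ = 3 voxels is read here as the width of the Gaussian used for background removal, the low-pass filter is reduced to a hard cutoff on lateral frequency with a hypothetical value of 0.49 cycles per voxel, and the names `tukey` and `preprocess` are illustrative.

```python
import numpy as np

def tukey(m, alpha=0.5):
    """1D Tukey (tapered cosine) window of length m with cosine fraction alpha."""
    n = np.arange(m)
    w = np.ones(m)
    edge = alpha * (m - 1) / 2.0
    rising = n < edge
    falling = n > (m - 1) - edge
    w[rising] = 0.5 * (1 + np.cos(np.pi * (n[rising] / edge - 1)))
    w[falling] = 0.5 * (1 + np.cos(np.pi * ((m - 1 - n[falling]) / edge - 1)))
    return w

def preprocess(volume, sigma=3.0, alpha=0.5, lateral_cutoff=0.49):
    """High-pass, low-pass and xy-window a (z, y, x) volume."""
    nz, ny, nx = volume.shape
    fz = np.fft.fftfreq(nz)[:, None, None]            # cycles per voxel
    fy = np.fft.fftfreq(ny)[None, :, None]
    fx = np.fft.fftfreq(nx)[None, None, :]
    # Fourier transform of a Gaussian blur; 1 - blur acts as a Gaussian high-pass
    gauss = np.exp(-2 * np.pi**2 * sigma**2 * (fx**2 + fy**2 + fz**2))
    highpass = 1.0 - gauss
    lowpass = np.sqrt(fx**2 + fy**2) <= lateral_cutoff   # assumed lateral cutoff
    filtered = np.fft.ifftn(np.fft.fftn(volume) * highpass * lowpass).real
    # Tukey window in x and y only; no windowing along z
    window = tukey(ny, alpha)[:, None] * tukey(nx, alpha)[None, :]
    return filtered * window[None, :, :]

rng = np.random.default_rng(0)
vol = rng.random((16, 32, 32))
out = preprocess(vol)
```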

Embedding

Once preprocessed, the ratio of the resultant 3D FFT amplitude to the 3D FFT amplitude of the ideal PSF (undergoing identical preprocessing steps) is used to generate the amplitude embedding, α(kz), at each kz plane:

$${V}_{\rm{ideal}}=\varUpsilon ({\rm{PSF}}_{\rm{ideal}})$$

(15)

$$\alpha =\frac{| {\mathcal{F}}(V)| }{| {\mathcal{F}}({V}_{\rm{ideal}})| }$$

(16)

where \({\mathcal{F}}\) denotes the 3D Fourier transform. The most useful information content is located at kz = 0, the principal plane located at the midpoint of the \(\hat{{k}_{z}}\)-axis. Three 2D planes, α1, α2 and α3, taken along the \(\hat{{k}_{z}}\)-axis to capture axial information, are used as inputs to the model as follows:

$${\alpha }_{1}={\alpha }_{{k}_{z = 0}}$$

(17)

$${\alpha }_{2}=\frac{1}{5}\mathop{\sum }\limits_{i=0}^{4}{\alpha }_{{k}_{z = i}}$$

(18)

$${\alpha }_{3}=\frac{1}{5}\mathop{\sum }\limits_{i=5}^{9}{\alpha }_{{k}_{z = i}}$$

(19)

where α1 is the principal plane along the kx-axis and ky-axis, α2 is the mean of five consecutive 2D planes starting from the principal plane and α3 is the mean of five consecutive 2D planes starting from the kz = 5 plane (Supplementary Figs. 7 and 24a,c).
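Eqs. 16 to 19 can be sketched as below for volumes that have already been preprocessed. The (z, y, x) axis order, the small regularizer `eps` in the denominator and the function name `amplitude_embedding` are assumptions made for this illustration.

```python
import numpy as np

def amplitude_embedding(v, v_ideal, eps=1e-8):
    """Return the three amplitude planes (alpha_1, alpha_2, alpha_3) of Eqs. 17-19."""
    amp = np.abs(np.fft.fftshift(np.fft.fftn(v)))
    amp_ideal = np.abs(np.fft.fftshift(np.fft.fftn(v_ideal)))
    alpha = amp / (amp_ideal + eps)                   # Eq. 16, regularized (assumption)
    z0 = v.shape[0] // 2                              # kz = 0 plane after fftshift
    a1 = alpha[z0]                                    # Eq. 17
    a2 = alpha[z0:z0 + 5].mean(axis=0)                # Eq. 18: planes kz = 0..4
    a3 = alpha[z0 + 5:z0 + 10].mean(axis=0)           # Eq. 19: planes kz = 5..9
    return np.stack([a1, a2, a3])

# Sanity check on synthetic data: doubling the signal doubles the ratio everywhere.
rng = np.random.default_rng(0)
v_ideal = rng.random((32, 16, 16))
emb = amplitude_embedding(2.0 * v_ideal, v_ideal)
```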

For the phase embedding, φ, we first remove interference from several puncta in the FOV that may obscure the aberration signature in the phase image. The interference patterns are removed by detecting peaks in real space with peak local maxima (PLM; https://scikit-image.org/docs/stable/auto_examples/segmentation/plot_peak_local_max.html) applied to the normalized cross-correlation (NCC; https://scikit-image.org/docs/stable/auto_examples/registration/plot_masked_register_translation.html) of V with a kernel cropped from the highest peak in V. The neighboring voxels around the detected puncta peaks are masked off, creating a volume, \({\mathcal{S}}\). The OTF with interference removed, \(\tau\), can now be obtained, as well as a real-space reconstructed volume, \({V}^{{\prime} }\), through inverse FFT,

$$M={\rm{PLM}}({\rm{NCC}}(V\,))$$

(20)

$${\mathcal{S}}=V\times M$$

(21)

$$\tau =\frac{{\mathcal{F}}(V)}{{\mathcal{F}}({\mathcal{S}})}$$

(22)

$${V}^{{\prime} }={{\mathcal{F}}}^{-1}(\tau )$$

(23)
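Eqs. 21 to 23 can be illustrated on a toy volume as below. In this sketch the peak positions are supplied directly rather than found with PLM and NCC, the mask M keeps only the single voxel at each peak, and a small guard `eps` protects the division; all three choices, along with the name `remove_interference`, are assumptions made for the example.

```python
import numpy as np

def remove_interference(v, peaks, eps=1e-12):
    """Divide out the puncta positions so that one punctum remains (Eqs. 21-23)."""
    mask = np.zeros(v.shape)                          # M: 1 at peak voxels, 0 elsewhere
    for p in peaks:
        mask[tuple(p)] = 1.0
    s = v * mask                                      # Eq. 21
    fs = np.fft.fftn(s)
    fs = np.where(np.abs(fs) < eps, eps, fs)          # guard against division by zero
    tau = np.fft.fftn(v) / fs                         # Eq. 22
    v_prime = np.fft.ifftn(tau).real                  # Eq. 23
    return tau, v_prime

# Toy volume: one compact 3 x 3 x 3 kernel copied to two positions with different brightness.
size = 16
d = np.minimum(np.arange(size), size - np.arange(size))   # circular distance from index 0
w = np.where(d <= 1, 0.5 ** d, 0.0)
kernel = w[:, None, None] * w[None, :, None] * w[None, None, :]
peaks = [(4, 5, 6), (10, 9, 11)]
v = (1.0 * np.roll(kernel, peaks[0], axis=(0, 1, 2))
     + 0.5 * np.roll(kernel, peaks[1], axis=(0, 1, 2)))

tau, v_prime = remove_interference(v, peaks)
phase = np.angle(tau)                                 # wrapped phase; unwrapping not shown
```

For this toy case the reconstructed volume is a single copy of the kernel at the origin, so the phase of τ no longer carries the fringes produced by the two displaced copies.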

The phase φ(kz) at each kz plane is then given by the unwrapped phase of τ at that plane (Supplementary Fig. 24b,d). We calculate the three phase embeddings in the same manner as our amplitude embedding such that:

$${\varphi }_{1}={\varphi }_{{k}_{z = 0}}$$

(24)

$${\varphi }_{2}=\frac{1}{5}\mathop{\sum }\limits_{i=0}^{4}{\varphi }_{{k}_{z = i}}$$

(25)

$${\varphi }_{3}=\frac{1}{5}\mathop{\sum }\limits_{i=5}^{9}{\varphi }_{{k}_{z = i}}$$

(26)

Combining the six planes together, we define the input to the model as a Fourier embedding,

$${\mathcal{E}}=\{{\alpha }_{1},{\alpha }_{2},{\alpha }_{3},{\varphi }_{1},{\varphi }_{2},{\varphi }_{3}\}$$

(27)

A notable advantage of this approach is that, although the signal from each individual CCP is weak, those in the same isoplanatic region contain near-identical spatial frequency distributions that add together to yield Fourier embeddings of high SNR suitable for accurate inference of the underlying aberration (Supplementary Fig. 6).

AO vision Fourier transformer

Below, we outline the key components of AOViFT, which uses a 3D multistage vision transformer architecture. This model efficiently captures Fourier domain features at several spatial scales, enabling robust aberration prediction.

Multistage

Recent advances in attention-based transformers have demonstrated scalability, generalizability and multi-modality for a range of computer vision applications20,21,22,23,24.

Multiscale (or hierarchical) vision transformers, such as Swin25 and MViT26, are designed with specialized modules (for example, shifted-window partitioning25 and hybrid window attention27) to excel at a variety of detection tasks for 2D natural images using supervised training on ImageNet28. Although these variants are more efficient than their ViT counterparts in terms of FLOPs and number of parameters, they often incorporate specialized modules as noted above. Hiera29 showed that these designs can be streamlined without performance loss by leveraging large-scale self-supervised pretraining.

Current multiscale architectures use a feature pyramid network scheme30—downsampling the spatial resolution of the image for each stage while expanding the embedding size for deeper layers. Instead, in our work, we use Ω stages and do not downsample during any of the stages, but rather select different patch sizes for each stage (Fig. 1). This allows the embedding dimension within each stage to be fixed to the number of voxels in the patch of that stage, rather than expanding with increasing depth as in some hierarchical models.

Patch encoding

The input to the model is the Fourier embedding, a 3D tensor \({\mathcal{E}}\in {{\mathbb{R}}}^{\ell \times d\times d}\), where ℓ = 6 is the number of 2D planes each with a height and width of d. For each model stage, i, patchifying begins by dividing the input tensor \({\mathcal{E}}\) into nonoverlapping 2D tiles (each pi × pi) that are each flattened into a one-dimensional patch for a total of ki patches in a plane. After patchifying, the input tensor is transformed into \({x}_{p}\in {{\mathbb{R}}}^{\ell \times {k}_{i}\times {p}_{i}^{2}}\) (Fig. 1b).
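The patchifying step can be written as a pair of reshapes, as in the sketch below. The plane size d = 64 used in the example is an assumption (d is not specified in this section), and `patchify` is a hypothetical name.

```python
import numpy as np

def patchify(embedding, p):
    """Split each (d, d) plane into non-overlapping p x p tiles and flatten them.

    Input shape (l, d, d); output shape (l, k, p * p) with k = (d // p) ** 2.
    """
    l, d, _ = embedding.shape
    g = d // p                                        # tiles per side
    x = embedding.reshape(l, g, p, g, p)              # (plane, tile row, y, tile col, x)
    x = x.transpose(0, 1, 3, 2, 4)                    # (plane, tile row, tile col, y, x)
    return x.reshape(l, g * g, p * p)

# Example with six planes of an assumed size d = 64.
e = np.arange(6 * 64 * 64, dtype=float).reshape(6, 64, 64)
patches_32 = patchify(e, 32)                          # shape (6, 4, 1024)
patches_16 = patchify(e, 16)                          # shape (6, 16, 256)
```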

The initial ViT model uses a set of consecutive transformer layers with a fixed patch size for all transformers, where each transformer layer can capture local and global dependencies between patches through self-attention20. The computation needed for the self-attention layers scales quadratically with reference to the number of patches (that is, sequence length). Although using a smaller patch size could be useful to capture visual patterns at a finer resolution, using a large patch size is computationally cheaper.

Our baseline model uses a two-stage design with patch sizes of 32 and 16 pixels, respectively (Fig. 1c). Supplementary Note A shows an ablation study using several stages with patch sizes ranging between 8 and 32 pixels.

Positional encoding

Rather than adopting the Cartesian positional encoding of ViT20, we use a polar coordinate system (r, θ) to encode the position of each patch. This choice is motivated by the radial symmetries of the Zernike polynomials and the efficiencies gained in NeRF31, coordinate-based MLPs32 and RoFormer33. For a given plane in \({\mathcal{E}}\) (Eq. 27), the radial positional encoding vector (RPE) is calculated for every patch,

$${\rm{RPE}}(r,\theta )=[r,\sin \theta ,\cos \theta ,\ldots ,\sin m\theta ,\cos m\theta ]$$

(28)

where (r, θ) are the polar coordinates for the center of each patch, and m = 16. All patches and their positional encoding are then mapped into a sequence of learnable linear projections \(\zeta \in {{\mathbb{R}}}^{\ell \times {k}_{i}\times {p}_{i}^{2}}\) that we use as our input to the transformer layers in the model.
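Eq. 28 can be sketched for the patch centers of one plane as follows. Placing the origin at the plane center and normalizing r by half the plane width are assumptions, as is the plane size d = 64 in the example; `radial_positional_encoding` is a hypothetical name.

```python
import numpy as np

def radial_positional_encoding(d, p, m=16):
    """Return an array of shape (k, 2 * m + 1) with one RPE vector per patch."""
    g = d // p                                        # patches per side
    centers = (np.arange(g) + 0.5) * p - d / 2.0      # patch centers relative to plane center
    yy, xx = np.meshgrid(centers, centers, indexing="ij")
    r = np.hypot(xx, yy).ravel() / (d / 2.0)          # normalized radius (assumption)
    theta = np.arctan2(yy, xx).ravel()
    feats = [r]
    for k in range(1, m + 1):                         # harmonics 1..m of the polar angle
        feats += [np.sin(k * theta), np.cos(k * theta)]
    return np.stack(feats, axis=1)

rpe_32 = radial_positional_encoding(64, 32)           # 4 patches, 33 features each
rpe_16 = radial_positional_encoding(64, 16)           # 16 patches, 33 features each
```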

Transformer building blocks

Each stage has n transformer layers, where each layer has h multihead attention (MHA) layers that map the interdependencies between patches, followed by a multilayer perceptron block (MLP) that learns the relationship between pixels within a patch. The stage’s embedding size, \({\epsilon }_{i}={p}_{i}^{2}\), is set to match the number of voxels in a patch for that stage. The MLP block is four times wider than the embedding size (Supplementary Fig. 2c). Layer normalization (LN)34 is applied before each step, and a skip/residual connection35 is added after each step:

$${\zeta }_{1}={\rm{LN}}({\rm{MHA}}(\zeta\;))+\zeta$$

(29)

$${\zeta }_{2}={\rm{LN}}({\rm{MLP}}({\zeta }_{1}))+{\zeta }_{1}$$

(30)
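Eqs. 29 and 30 can be traced in plain NumPy as below. This sketch follows the equations as written and simplifies heavily: a single attention head stands in for the h-head MHA, ReLU stands in for the unspecified MLP activation, layer normalization has no learnable scale or offset, dropout and stochastic depth are omitted, and the weights are random placeholders.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token over its embedding dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over tokens (patches)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def mlp(x, w1, w2):
    """Two-layer MLP, four times wider than the embedding; ReLU is an assumption."""
    return np.maximum(x @ w1, 0.0) @ w2

def transformer_layer(z, wq, wk, wv, w1, w2):
    z1 = layer_norm(self_attention(z, wq, wk, wv)) + z      # Eq. 29
    z2 = layer_norm(mlp(z1, w1, w2)) + z1                   # Eq. 30
    return z2

# Stage with 16-pixel patches: embedding size 16 ** 2 = 256; 16 tokens is an assumption.
rng = np.random.default_rng(0)
tokens, e = 16, 256
z = rng.normal(size=(tokens, e))
wq, wk, wv = (rng.normal(scale=e ** -0.5, size=(e, e)) for _ in range(3))
w1 = rng.normal(scale=e ** -0.5, size=(e, 4 * e))
w2 = rng.normal(scale=(4 * e) ** -0.5, size=(4 * e, e))
out = transformer_layer(z, wq, wk, wv, w1, w2)
```

How the six planes are combined into one token sequence is not covered by this sketch; it processes the tokens of a single plane.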

In addition to the skip connections in each transformer layer, we also add a skip connection between the input and output of each stage. We use a dropout rate of 0.1 for each dense layer36 and stochastic depth rate of 0.1 (ref. 37). The patches from the final stage are pooled using a global average along the last dimension and passed to a fully connected layer to output z Zernike coefficients.

Attention modules

We use self-attention38 as our default attention module for all transformer layers in our model. Complementary to our approach, recent studies have looked into alternative attention methods to reduce the quadratic scaling of self-attention23,39,40. Our architecture is compatible with these attention mechanisms, which would further improve our model’s efficiency.

In silico evaluations

Supplementary Note A shows an ablation study of our synthetic data simulator (Supplementary Note A.2), our multistage design (Supplementary Note A.3), our training dataset size (Supplementary Note A.4 and Supplementary Fig. 2) and details of our training hyperparameters (Supplementary Note A.5, Supplementary Table 2 and Supplementary Table 8). We also introduce a new way of measuring prediction confidence of our model using digital rotations in Supplementary Note A.6 (Supplementary Fig. 23).

We present a detailed cost analysis benchmark comparing our architecture with other widely used models such as ConvNeXt41 and ViT20 in Supplementary Note B. To further diagnose our model’s performance, we carried out a series of experiments to understand our model’s sensitivity to SNR (Supplementary Note C.1 and Supplementary Figs. 24–25), generalizability to other light sheets (Supplementary Note C.2), number of objects in the FOV (Supplementary Note C.3) and object size (Supplementary Note C.4 and Supplementary Fig. 26).

Ethics approval and consent to participate

All husbandry and experiments with zebrafish were done in accordance with protocols approved by the University of California, Berkeley’s Animal Care and Use Committee and following standard protocols (animal use protocol numbers AUP-2019-09-12560-1 (Upadhyayula laboratory), AUP-2020-10-13737-1 (Swinburne laboratory) and AUP-2021-05-14347-1 (Zebrafish Facility Core Protocol)). All zebrafish used in this study were embryos younger than 72 h postfertilization. Sex determination was not a factor in our experiments.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


