Metrics
Figure 1 shows representative MB localizations for the simulation and in vivo datasets, together with the ground truth where it is available. To quantify the performance of DE-DETR in MB localization, the precision, recall, and average Root Mean Square Error (RMSE) of each network are presented in Tables 1 and 2. These metrics are calculated as follows:
$$\begin{aligned} \text {Precision} = \frac{TP}{FP + TP} \end{aligned}$$
(1)
$$\begin{aligned} \text {Recall} = \frac{TP}{FN + TP} \end{aligned}$$
(2)
$$\begin{aligned} \text {RMSE} = \frac{1}{N_{TP}} \sum _{i \in TP} \sqrt{\left( x_i-x'_i\right) ^2+\left( z_i-z'_i\right) ^2} \end{aligned}$$
(3)
Here, \(\left( z'; x'\right)\) and \(\left( z; x\right)\) represent the predicted and ground truth MB positions, respectively, and \(N_{TP}\) is the number of true positives. TP is the number of MBs correctly predicted by the network, FP is the number of samples erroneously classified as MBs, and FN is the number of MBs incorrectly classified as background. A prediction is counted as correct when it lies within \(\frac{\lambda }{2}\) of a ground truth MB, with \(\lambda\) being the ultrasound wavelength.
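The definitions in Eqs. (1) to (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how predictions are paired with ground truth MBs, so the greedy nearest-neighbour pairing below, along with the function names `match_localizations` and `localization_metrics`, is hypothetical. The error term follows Eq. (3) as written, i.e. the Euclidean distance averaged over the true positives.

```python
import math


def match_localizations(pred, truth, tol):
    """Greedily pair predictions with ground truth points (assumed strategy).

    pred, truth: lists of (x, z) positions in the same units as tol.
    Returns the distances of the accepted one-to-one pairs.
    """
    # Collect every candidate pair that lies within the tolerance.
    pairs = []
    for i, (xp, zp) in enumerate(pred):
        for j, (xt, zt) in enumerate(truth):
            d = math.hypot(xp - xt, zp - zt)
            if d <= tol:
                pairs.append((d, i, j))
    pairs.sort()  # closest pairs are accepted first

    used_pred, used_truth, dists = set(), set(), []
    for d, i, j in pairs:
        if i not in used_pred and j not in used_truth:
            used_pred.add(i)
            used_truth.add(j)
            dists.append(d)
    return dists


def localization_metrics(pred, truth, wavelength):
    """Precision, recall and mean localization error with a lambda/2 tolerance."""
    dists = match_localizations(pred, truth, wavelength / 2.0)
    tp = len(dists)
    fp = len(pred) - tp   # predictions without a ground truth partner
    fn = len(truth) - tp  # ground truth MBs that were missed
    precision = tp / (tp + fp) if pred else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    error = sum(dists) / tp if tp else float("nan")
    return precision, recall, error


truth = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
pred = [(0.1, 0.0), (1.0, 0.2), (3.0, 3.0), (9.0, 9.0)]
precision, recall, error = localization_metrics(pred, truth, wavelength=1.0)
```

In this toy example two of the four predictions fall within \(\frac{\lambda }{2}\) of a ground truth MB, so TP = 2, FP = 2 and FN = 1. An optimal assignment (for example the Hungarian algorithm) could replace the greedy pairing without changing the metric definitions.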

Simulation (a) and in vivo (b) sample frame with localizations. Red: ground truth, blue: predictions. The ground truth is unknown in (b).
As shown in Tables 1 and 2, increasing the number of patches generally improves the metrics. For DETR, precision ranges from 50.01 to 80.53% and recall from 48.32 to 67.07%, with the best case being a patch size of 256 by 171. For DE-DETR, precision ranges from 62.12 to 81.95% and recall from 58.13 to 74.37%, and the network performs best with a patch size of 256 by 256. The improvement seen by increasing the number of patches can be attributed to the fact that attention modules struggle to capture smaller spatial details; patching helps capture fine-grained details more effectively, as each patch can be processed at a higher resolution. However, the number of patches must be chosen with care, as increasing it leads to a corresponding increase in the number of parameters and computations needed for the model to process each patch individually and then combine the results. Furthermore, when the patches become too small, the network might struggle to capture the global structure of the frames, resulting in poor performance. The training was done on a single NVIDIA GeForce RTX 3080 GPU in a Docker container.
Super-resolution maps
Figure 2 shows the results of our method on the last 100 frames of the simulation video (test dataset) after placing a fixed Gaussian around each localized MB to produce the super-resolution localization maps. The results are also compared with DETR, a morphological-based31 method, temporal mean, Gaussian fit32, the Deep-ULM or UNet-based solution18, and the ground truth. As illustrated in Fig. 2, the vessels in our super-resolution localization maps are well-defined and closely resemble the ground truth (g), showcasing the method's efficacy in accurately representing complex vascular networks. Using DE-DETR’s bounding-box-based localization, the fine details of the vessel branches and intersections are preserved, indicating high precision in MB detection. This level of detail is less evident in the temporal mean, where vessels appear blurred, and in the morphological method, where the structures look less precise. The Gaussian fit method seems to perform well in the central area but less well in the deeper regions, which is consistent with the fact that MB PSFs look different in deeper regions of an ultrasound B-mode image. UNet seems to be missing some contrast between vessel and background regions, indicating a higher number of false negatives. Finally, compared to its predecessor DETR (e), DE-DETR shows enhanced delineation of smaller vessels and improved overall network connectivity by reducing the number of false positives.
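The fixed-Gaussian rendering step can be illustrated with a minimal sketch. The grid size, the standard deviation `sigma`, the truncation `radius` and the function name `render_gaussian_map` are all assumptions chosen for illustration, since the text does not specify them. Localizations are taken as (x, z) pairs in super-resolved pixel units, with x indexing columns (lateral) and z indexing rows (depth).

```python
import math


def render_gaussian_map(locs, height, width, sigma=1.0, radius=3):
    """Accumulate one fixed isotropic Gaussian per localized MB.

    locs: iterable of (x, z) positions in super-resolved pixel units.
    Returns the map as a list of rows (height x width).
    """
    img = [[0.0] * width for _ in range(height)]
    for x, z in locs:
        cx, cz = int(round(x)), int(round(z))
        # Only touch pixels inside a (2 * radius + 1) window, clipped to the image.
        for r in range(max(0, cz - radius), min(height, cz + radius + 1)):
            for c in range(max(0, cx - radius), min(width, cx + radius + 1)):
                d2 = (c - x) ** 2 + (r - z) ** 2
                img[r][c] += math.exp(-d2 / (2.0 * sigma ** 2))
    return img


single = render_gaussian_map([(5.0, 5.0)], height=11, width=11)
double = render_gaussian_map([(5.0, 5.0), (5.0, 5.0)], height=11, width=11)
```

Summing the contributions of all localizations over the 100 test frames yields one density map per method. Because the kernel is identical for every MB, differences between the resulting maps reflect the localizations alone.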

SR images of the test dataset, i.e., the last 100 frames of the first simulation video of the challenge dataset obtained from (a) UNet18 (b) Morphological-based31 (c) temporal mean (d) Gaussian fit32 (e) DETR27 (f) DE-DETR (g) ground truth, respectively.
The super-resolution maps based on different localization methods with tracking are displayed in Fig. 3. To assess our localization method, Fig. 3 combines each method’s localizations with our tracking algorithm, leaving localization as the only variable. Compared with Fig. 2, Fig. 3 shows the effectiveness of using tracks to visualize vessel maps, as the vessels have a smoother and more natural appearance.
The comparison between different methods in Fig. 3 demonstrates the improved performance of DE-DETR, which generates super-resolution maps closely matching the ground truth and providing a powerful tool for detailed vascular imaging. The UNet-based method (a) captures the general structure of the vessels but introduces significant artifacts and noise, particularly in the dense regions. The vessels appear fragmented and less continuous, which can be attributed to the limitations of the UNet in handling complex vessel patterns and the high density of localized MBs. The morphological method (b) provides more continuous vessel structures. However, it still struggles with accurately delineating smaller vessels and fine details. The method tends to over-smooth the vessel boundaries, which leads to a loss of intricate structural information. The Gaussian fit method (c) provides a better approximation of the vessel structures, capturing both large and small vessels with reasonable accuracy. However, some regions still exhibit blurring, particularly in areas with high vessel density, indicating that the Gaussian fit may not fully resolve closely spaced MBs. The DETR (d) and DE-DETR (e) methods show substantial improvements over the previous methods and both provide a high level of detail and continuity in the vessel structures. The DE-DETR method (e), in particular, stands out by more closely approximating the ground truth (f) both at the denser parts of the map in the middle and at the deeper areas in the bottom left of the super-resolution maps.

SR images of the test dataset, i.e. the last 100 frames of the first simulation video of the challenge dataset obtained from (a) UNet-based18 (b) Morphological-based31 (c) Gaussian fit32 (d) DETR27 (e) DE-DETR (f) ground truth.
To evaluate our results further, Tables 3 and 4 present the Dice score between each compared localization method and the ground truth. Here, the Dice score provides a quantitative measure of the similarity between predicted and ground-truth vessel delineations. In Table 3, which covers the fixed Gaussian-based maps, DE-DETR achieves the highest Dice score at 83.19%, followed closely by DETR at 83.00%. The temporal mean method has the lowest score at 78.14%. This indicates that DE-DETR and DETR achieve better segmentation accuracy for fixed Gaussian-based maps than the other methods.
Table 4 displays the Dice scores for tracking-based super-resolution maps using the same set of methods. Here, DE-DETR again performs the best with a Dice score of 92.00%, slightly outperforming DETR, which has a score of 91.38%. The Morph method also shows a strong performance with an 84.83% score. These results suggest that tracking-based super-resolution maps generally yield higher Dice scores across all methods compared to fixed Gaussian-based maps, indicating that tracking-based approaches might be more effective for enhancing segmentation accuracy. In summary, DE-DETR consistently outperforms other methods in both scenarios, while tracking-based super-resolution maps tend to produce better segmentation results than fixed Gaussian-based maps.
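For reference, the Dice score between two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The sketch below assumes that both super-resolution maps are binarized with a simple intensity threshold before comparison. The text does not state how the maps were binarized, so the `threshold` parameter and the function name `dice_score` are illustrative assumptions.

```python
def dice_score(map_a, map_b, threshold=0.0):
    """Dice coefficient between two maps binarized at `threshold`.

    map_a, map_b: equally sized lists of rows with pixel intensities.
    """
    inter = size_a = size_b = 0
    for row_a, row_b in zip(map_a, map_b):
        for a, b in zip(row_a, row_b):
            in_a, in_b = a > threshold, b > threshold
            inter += int(in_a and in_b)
            size_a += int(in_a)
            size_b += int(in_b)
    total = size_a + size_b
    # Two empty masks are treated as a perfect match by convention.
    return 2.0 * inter / total if total else 1.0


predicted = [[1, 1, 0],
             [0, 0, 0]]
reference = [[1, 0, 0],
             [0, 1, 0]]
score = dice_score(predicted, reference)
```

Here each mask has two foreground pixels and they share one, giving a Dice score of 0.5. Multiplying by 100 gives the percentage form used in Tables 3 and 4.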
DE-DETR also demonstrates superior performance in maintaining precision and recall across varying depths of ultrasound images. Figure 4, which segments the image into three distinct depths, illustrates a smaller reduction in both precision and recall for DE-DETR compared to other methods. This performance can be attributed to the fact that our approach does not rely on PSFs, which often vary with depth. By circumventing the dependency on PSFs, our method ensures consistent and robust image segmentation, thereby achieving better precision and recall metrics at different image depths.

Precision and Recall for different methods in 3 different depths of imaging.
Next, we assessed the results of our method on the in vivo rat brain dataset from the UltraSR challenge. Figure 5 shows the results of localization and tracking for Deep-ULM18, temporal mean, Gaussian fit, DETR, and DE-DETR. The tracking applied to each of these SR maps is the same as the tracking used in the rest of the paper. The UNet-based method (a) shows a reasonable attempt at capturing the vessel structures but suffers from notable artifacts and noise. In regions marked by blue arrows, the vessels appear fragmented and less continuous. This discontinuity and noise can obscure the fine vascular details, making it challenging to distinguish between true vessel structures and artifacts. We note that the UNet results are taken from ref. 18, where the network was trained on Field-II simulations specifically derived from the in vivo dataset. In contrast, our DETR and DE-DETR models were trained exclusively on the challenge-provided simulation data, which is entirely independent of the in vivo data. This distinction underscores the robustness of our approach and its ability to generalize across different data domains without relying on dataset-specific priors. The temporal mean method (b) results in a highly blurred output, losing much of the fine detail necessary for accurate vessel visualization. The vessels are poorly defined, and the overall map lacks the sharpness required for precise analysis. This method fails to capture dynamic changes and fine structures, leading to a loss of critical information in the super-resolution map. The Gaussian fit method (c) provides a clearer representation of the vessel structures. However, in regions indicated by the blue arrows, the vessels still exhibit some blurring and loss of fine detail.
The DETR method (d) shows a significant improvement over the previous methods, providing a high level of detail and continuity in the vessel structures. The vessels are more accurately represented, with fewer artifacts and less noise. However, there are still some regions, marked by blue arrows, where the method could improve in capturing the finest details and resolving closely spaced MBs. The DE-DETR method (e) provides the most detailed and continuous representation of the vessel structures. The regions indicated by the blue arrows show that DE-DETR effectively captures intricate vessel details that other methods miss. Since the tracking is identical across methods, this improvement stems from more precise localization and better handling of overlapping MBs, resulting in a high-quality super-resolution map.
To further analyse how well our method visualizes smaller vessels, denser areas, and deeper regions, we selected three areas to zoom in on and compared the results of each method in Fig. 6. Temporal mean has been excluded since Fig. 5 already shows the poor detail provided by this method. In Fig. 6, the image on the left is the DE-DETR map from Fig. 5, this time indicating the zoomed-in boxes. Boxes (a) to (c) are selected from different regions of the super-resolution map. Each row shows the results for one of the boxes, and each column indicates the method used to obtain the super-resolved maps. As illustrated by the subfigures in Fig. 6, DE-DETR provides more detailed and smoother vessel maps in different regions of the image, showcasing its ability to provide high-quality microvasculature maps.

SR images of the in vivo test dataset, obtained using Gaussians based on tracking from (a) UNet-based solution [9] (b) temporal mean (c) Gaussian Fit (d) DETR [19] (e) DE-DETR, respectively.

Zoomed-in boxes from different in vivo SR maps, showing the positions of the boxes in the DE-DETR super-resolution map on the left and the results from each method (indicated by columns) for each of the (a), (b) and (c) boxes.
To quantify spatial resolution, we computed the Modulation Transfer Function (MTF) from the in vivo super-resolution images using an edge-based method (Fig. 7). Specifically, a region of interest (ROI) containing tightly packed vessels was selected, with the aim of evaluating each method’s ability to resolve fine spatial details. Within this ROI, a horizontal line profile was extracted across the vessels.
The intensity profile along this line represents a one-dimensional cross-section of the structural detail in the image. This profile was first smoothed using a Gaussian filter to reduce high-frequency noise and mitigate local fluctuations that could bias the frequency analysis. The smoothed signal was then mean-centered and windowed (e.g., using a Hamming window) to minimize spectral leakage during Fourier transformation.
The one-dimensional Fourier transform of the line profile was computed, and the magnitude of the resulting spectrum was normalized with respect to its DC component. This yielded the normalized MTF curve, which characterizes the image’s ability to transmit spatial frequencies.
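The processing chain described above (smoothing, mean-centering, windowing, Fourier transform, normalization) can be sketched as follows. This is a hedged illustration, not the authors' implementation: the Gaussian width `sigma`, the edge handling and the function names are assumptions. One deliberate substitution is made and should be read as such: because mean-centering drives the DC bin toward zero, the sketch normalizes the spectrum by its peak magnitude rather than by the DC component.

```python
import cmath
import math


def gaussian_smooth(signal, sigma=1.0):
    """Smooth a 1D profile with a truncated Gaussian kernel (edges replicated)."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in offsets]
    norm = sum(kernel)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, kernel):
            j = min(max(i + k, 0), n - 1)  # clamp indices at the borders
            acc += w * signal[j]
        out.append(acc / norm)
    return out


def line_profile_mtf(profile, sigma=1.0):
    """Normalized magnitude spectrum of a line profile (bins 0 .. n // 2)."""
    n = len(profile)
    if n < 2:
        raise ValueError("profile needs at least two samples")
    smoothed = gaussian_smooth(profile, sigma)
    mean = sum(smoothed) / n
    # Hamming window to limit spectral leakage.
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    x = [(s - mean) * w for s, w in zip(smoothed, window)]
    # Direct DFT for the non-negative frequencies (fine for short profiles).
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc))
    peak = max(spectrum)
    # Assumption: peak normalization stands in for DC normalization (see lead-in).
    return [m / peak for m in spectrum] if peak > 0 else spectrum


# Synthetic profile: four vessel-like cycles across 64 samples.
profile = [1.0 + math.cos(2.0 * math.pi * 4 * i / 64) for i in range(64)]
mtf = line_profile_mtf(profile)
```

For the synthetic profile the curve peaks at frequency bin 4, matching the four cycles in the input. For real profiles, an FFT routine such as NumPy's `numpy.fft.rfft` would be the usual choice over the direct DFT used here to keep the sketch dependency-free.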
As illustrated in Fig. 7, the MTF measurements support the superior performance of DE-DETR over the alternative approaches in its ability to detect, localize, and separate closely spaced vessels.

MTF calculated across a horizontal line (color: cyan) in an ROI with tightly packed vessels (color: green) for (a) DE-DETR, (b) DETR, (c) UNet and (d) Gaussian fit. (e) Comparison of the normalized MTF of all methods.
