Deep learning for tooth detection and segmentation in panoramic radiographs: a systematic review and meta-analysis

Artificial intelligence-based models are increasingly being investigated in dentistry to improve diagnostic accuracy, particularly on panoramic radiographs (orthopantomograms, OPGs), a common basis for diagnosing pathology. Nevertheless, these images frequently present superimpositions and distortions that pose unavoidable challenges when training neural networks to perform tasks such as object detection (OD) and object segmentation (OS). Therefore, although human diagnostic capabilities may still outweigh computer-based ones, deep learning (DL) models are believed to be a potential way of overcoming these inherent limitations, especially for inexperienced operators. However, the diagnostic power of these models should be interpreted with caution, as bias and performance variations across architectures and datasets can affect generalization [9].

When referring to image interpretation and computer vision (CV), it is important to distinguish the various tasks related to object identification. OD and OS are the main terms used in the current literature, and OS is further categorized into semantic segmentation (SS) and instance segmentation (IS). SS creates pixel-level boundaries around objects of a particular category (for example, dental structures are segmented, but all of them are labeled simply as teeth), whereas IS also distinguishes individual instances within the category (for example, each dental structure is not only classified as a tooth but also segmented as a separate object) [1].
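The difference between the two outputs can be illustrated on a toy mask. The sketch below is only an illustration, not how the reviewed networks work: it assumes a tiny hand-made grid, and it uses a plain 4-connected flood fill as a stand-in for instance separation. The names `semantic_mask` and `label_instances` are hypothetical.

```python
from collections import deque

# Toy semantic mask (hypothetical data): 1 = "tooth" pixel, 0 = background.
# Every tooth pixel carries the same class value, as in semantic segmentation.
semantic_mask = [
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0],
]


def label_instances(mask):
    """Give each 4-connected foreground region its own positive id."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 1 and labels[r][c] == 0:
                next_id += 1
                labels[r][c] = next_id
                queue = deque([(r, c)])
                while queue:  # breadth-first flood fill of one region
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] == 1 and labels[ny][nx] == 0):
                            labels[ny][nx] = next_id
                            queue.append((ny, nx))
    return labels, next_id


# Instance-style output: same pixels, but each region has its own id.
instance_mask, n_instances = label_instances(semantic_mask)
```

In a real radiograph, neighbouring teeth touch or overlap, so connected components alone would merge them into one region; this is one reason dedicated IS models are used instead.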

Of the 20 studies included in the current review, only two did not assess OD performance [18, 19]. Among those that did, four also dealt with tooth labeling but did not specify performance metrics for this task [16, 22,23,24]. Most studies employing both OD and labeling used two-stage convolutional neural networks (CNNs), except for three studies that used single-stage CNNs [16, 22, 24]. OS methodologies varied: one study used SS [18], three used IS [9, 19, 25], and another incorporated segmentation [4] to improve the diagnostic performance of the CNN. Two other studies did the same, but reported no results [17, 26].

Establishing a reliable ground truth (GT) is important for evaluating deep learning models. Seven studies reported that multiple operators performed the manual annotation and labeling [15, 24, 25, 27,28,29,30], eight relied on a single clinician, and five did not specify the number of practitioners involved. Operator experience was often unreported, but where stated it ranged from 5 to 30 years. Notably, the studies focusing on mesiodens detection that relied on a single operator used cone beam computed tomography (CBCT) as GT [21, 26, 31, 32], which could be a potential source of bias.

Dataset splitting is another important factor in model evaluation. Eleven studies followed the recommended protocol of creating three independent image sets for training (TRS), validation (VS), and testing (TES) [4, 9, 19, 21,22,23,24,25, 29,30,31]. However, eight studies did not follow this protocol [16,17,18, 20, 26,27,28, 32], and one included only a TES because it analyzed the diagnostic performance of a pre-trained and validated CNN [15]. Only one study used a publicly available dataset [4]. Concerns about generalizability arise for studies whose image sets did not come from different populations and institutions or were not acquired with different X-ray machines. In this respect, only four studies reported using diverse test sets [9, 18, 22, 25]. Nevertheless, studies with limited datasets sought to reduce overfitting by implementing cross-validation [26, 28, 31] or data augmentation [4, 9, 17, 19, 22, 26, 27, 31, 32].
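As an illustration of the protocol described above, the following minimal sketch splits a list of image identifiers into three disjoint sets and then generates k-fold indices for cross-validation on the training portion. The 70/15/15 proportions, the identifier format, and the function names are assumptions made for the example; none of them come from the reviewed studies.

```python
import random


def split_dataset(image_ids, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the ids once and cut them into disjoint TRS / VS / TES lists."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # seeded for a reproducible split
    n_test = round(len(ids) * test_fraction)
    n_val = round(len(ids) * val_fraction)
    tes = ids[:n_test]
    vs = ids[n_test:n_test + n_val]
    trs = ids[n_test + n_val:]
    return trs, vs, tes


def k_fold_indices(n_items, k):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    # Distribute any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


image_ids = [f"opg_{i:03d}" for i in range(100)]  # hypothetical image ids
trs, vs, tes = split_dataset(image_ids)
folds = list(k_fold_indices(len(trs), k=5))
```

In practice the split would normally be made per patient rather than per image, so that radiographs of the same person never appear in both the training and test sets.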

The metrics used to assess the performance of the DL models differed across the included studies, with precision, recall, and F1 score being reported most frequently. Pixel-based metrics such as intersection over union (IoU) were also included. However, performance must be interpreted with caution owing to variations in dataset quality and study design. Vinayahalingam et al. published exceptional results for OD, with a precision of 0.997, a recall of 0.989 and an F1 score of 0.992. OS also achieved very good results, but a limitation of this study is that blurred or incomplete OPGs were excluded from the dataset [25]. Similarly, Choi et al. achieved impressive average precision and recall results of 0.991 and 0.996, respectively, but these results may not generalize and carry a high risk of bias because images from patients with primary or mixed dentition, impacted teeth, and other atypical dentitions were excluded [22].
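For reference, the detection metrics named above can be computed from counts of true positives (TP), false positives (FP), and false negatives (FN), and IoU from two bounding boxes. The sketch below uses the standard definitions; the counts and box coordinates are made up for illustration and are not taken from any included study.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical counts: 90 teeth found correctly, 10 spurious boxes, 5 teeth missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=5)

# Hypothetical predicted and ground-truth boxes overlapping by half their width.
iou = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
```

Because F1 is the harmonic mean of precision and recall, it can be recomputed from a reported pair of values as a quick consistency check on published results.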

Reporting discrepancies were apparent when comparing studies assessing similar tasks. When evaluating the same neural network, Bonfanti-Gris et al. reported a lower sensitivity value (0.693) than Tuzoff et al. for the same task. Similarly, Tuzoff et al. reported sensitivity and specificity of 0.980 and 0.999, whereas Bonfanti-Gris et al. reported 0.500 for both [15, 27]. Other studies also reported poor results. Yüksel et al. observed a mean average precision (mAP) for object detection of 0.477 when averaged over thresholds of 0.5-0.95; only when the threshold was reduced to 0.5 did the model reach its maximum value of 0.894 [17]. Nevertheless, in contrast to the results of Bonfanti-Gris et al., Leite et al. obtained excellent results with a reduced dataset for both OD and OS (S = 0.989, P = 0.996 and P = 0.958, R = 0.975, IoU = 0.936 and F1 score = 0.966, respectively) [9]. Threshold effects were therefore observed but not explicitly discussed, leaving an information gap that must be considered when applying these models in a clinical setting, because decision thresholds can have a significant impact on both sensitivity and specificity.
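The influence of a decision threshold on sensitivity and specificity can be made concrete with a small sweep. This is a minimal sketch under stated assumptions: the confidence scores and labels are invented, and the threshold here is a confidence cut-off applied to a classifier's output, which is a different kind of threshold from the IoU thresholds used when averaging mAP.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Call a case positive when its score is >= threshold, then count outcomes."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Hypothetical model confidences: four true positives cases, four true negatives.
scores = [0.95, 0.80, 0.65, 0.40, 0.70, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Same model, three different operating points.
results = {t: sensitivity_specificity(scores, labels, t) for t in (0.25, 0.50, 0.75)}
for threshold, (sens, spec) in results.items():
    print(f"threshold={threshold:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

With these toy values, raising the threshold trades sensitivity for specificity, which is why a reported pair of values is hard to interpret without the operating point that produced it.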

Good results with a reduced sample size and without data augmentation techniques were also reported by Kilic et al., who achieved S = 0.9804, P = 0.9571 and F1 score = 0.9686 for object detection and labeling [23]. Estai et al. also reported positive results for OD (P = 0.992 and R = 0.994) and labeling tasks (P, R, F1 score = 0.980, E = 0.999, and A = 0.998). Nevertheless, using two sets of images rather than three can introduce a risk of bias [28].

In contrast to the previous studies, Bilgir et al. and Kaya et al. reported notable results using a single CNN for OD. In the first case, the authors reported favorable values for sensitivity, precision, F1 score, false detection rate, and false negative rate, while Kaya et al. obtained excellent results on mean average precision (mAP) metrics [20, 24].

Within the included studies, two different DL approaches were investigated. Mahdi et al. used a transfer learning-based optimization technique and presented positive results with CNNs such as ResNet-101 and ResNet-105. In contrast, Chandrashekar et al. introduced a collaborative learning approach in which two DL models are integrated to achieve better results. In this case, the authors compared the performance metrics of the studied CNNs individually and in collaboration, obtaining higher accuracy, F1 score, and mAP values with the latter (>0.973) for both OD and OS.

Finally, even studies focusing solely on OS tasks presented a variety of results. Sheng et al. reported an accuracy value of 0.885, an average IoU of 0.468, and an F1 score of 0.637 [18]. Nevertheless, Lee et al. achieved better performance metrics while using a significantly smaller dataset and implementing data augmentation techniques: IoU = 0.877, F1 score = 0.875, P = 0.858, and R = 0.893 [19].

When comparing results obtained from different neural networks, the depth of the CNN should be considered, because it can affect the performance of the model. Deeper architectures can improve accuracy, but have been reported to increase the risk of overfitting, especially with small datasets [4, 18]. Data augmentation techniques may reduce this risk, but increasing model complexity does not necessarily yield a proportional improvement in accuracy. Also, while some architectures may work well with certain dataset sizes, others may suffer from overfitting or diminishing returns [18]. At present, there is no standardized framework for selecting the optimal depth and learning parameters, which limits the comparability and reproducibility of the results [33].

Deep learning OD and OS models have been reported to identify impacted teeth accurately. This systematic review located six studies that addressed this objective by assessing the ability of several CNNs to identify and classify mesiodens. The results of Dai et al. were impressive overall, with reported values for A, S, E, P, and mAP of 0.94, 0.95, 0.93, 0.93, and 0.99, respectively [29]. Similarly, Ha et al. obtained results from 0.915 to 0.043 for A, S, and E with a dataset of similar sample size [21].

When comparing different CNNs, Kuwada et al. observed that DetectNet outperformed AlexNet and VGG-16, with sensitivity, specificity, and accuracy values of 0.920, 1.000, and 0.960, respectively [30]. Other studies reported similar results for architectures such as ResNet-18, ResNet-101, Inception-ResNet-V2, and SqueezeNet [31]. Variations within the results were detected by Aljabri et al., who analyzed four different DL models and studied their performance in experiments with two different sample sizes. Overall, the worst results were observed with the VGG-16 architecture [32].
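Sensitivity, specificity, and accuracy such as those quoted for DetectNet all derive from one confusion matrix. The sketch below shows the arithmetic with hypothetical counts: a balanced test set of 50 positive and 50 negative cases is assumed purely because it reproduces ratios of 0.920, 1.000, and 0.960, and these counts should not be read as the actual figures of the cited study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity (S), specificity (E) and accuracy (A) from a confusion matrix."""
    sensitivity = tp / (tp + fn)          # share of positive cases that were found
    specificity = tn / (tn + fp)          # share of negative cases correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts chosen only to illustrate the ratios (assumption, see lead-in).
s, e, a = classification_metrics(tp=46, fn=4, tn=50, fp=0)
print(f"S={s:.3f}  E={e:.3f}  A={a:.3f}")
```

Accuracy depends on the ratio of positive to negative cases in the test set, so the same sensitivity and specificity would give a different accuracy on an unbalanced sample.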

Kim et al. achieved excellent results by adopting a new OS technique to delimit the premaxillary area, which improved the detection accuracy for mesiodens. However, generalization was not guaranteed because this study excluded distorted and blurry images [34].

DL models have also been adopted in dentistry to detect ectopic eruption of the maxillary first molar [35] and to classify the position of the mandibular third molar [36, 37]. Automation of object segmentation is important for digital applications, especially in 3D images, where manual segmentation is labor-intensive and skill-dependent. This may be particularly relevant to treatment planning, managing intraoperative complications, and automated implant planning [2].

Although AI-based applications are widely studied, the clinical impact of DL models requires further discussion. The models perform well in tooth detection and segmentation, but practical challenges remain, such as standardized training data, external validation, and regulatory approval prior to implementation in clinical practice. Additionally, clinicians' trust in model interpretability and AI-generated reports must be addressed.

Despite the promising results, this systematic review has some limitations. First, focusing only on DL methods and including only research published within a limited time frame can be considered limitations.

Based on the data reviewed, future research should prioritize diverse and generalizable datasets, incorporate multicenter images, and adopt standardized reference tests and reporting guidelines such as the STARD and CLAIM checklists. These steps would increase the quality, robustness, and reliability of research on AI-based diagnostic tools for dental use.
