In this section, we review existing research on the application of machine learning and deep learning methods to the automatic detection of ASD from facial images. The review covers a wide range of approaches, including traditional machine learning classifiers, pre-trained convolutional neural networks, explainable AI techniques, and multimodal frameworks. The aim is to highlight the progress made, assess the strengths and limitations of previous work, and identify the gaps that the current study attempts to address, particularly in terms of accuracy, interpretability, and real-world applicability. Table 1 summarizes recent published research on ASD detection using facial features and deep learning, including advances in multimodal integration and the role of interpretability.
Traditional and automated machine learning for ASD detection
Early machine learning approaches to ASD detection relied on traditional classifiers such as Support Vector Machines (SVMs) and Random Forests, but were often limited by the complexity of high-dimensional data. Elshoky et al.17 contrasted traditional ML methods with AutoML based on the TPOT framework, with the AutoML approach achieving a notable 96% accuracy on the Kaggle ASD dataset. This improvement was attributed to automated hyperparameter tuning and model selection, which reduce manual effort and improve model robustness. However, while efficient, AutoML models often remain black boxes with limited interpretability. Similarly, Rashed et al.23 introduced ASDD, which uses AutoML tools such as Lazy Predict, AutoKeras, and TPOT in combination with dimensionality reduction techniques such as PCA and chi-square testing. Integrating data from multiple corpora improved generalization across age ranges, but the reliance on structured text datasets limits adaptability to image-based analysis. K. Khan and R. Katarya24 provide a comprehensive survey that categorizes previous work into data-centric, algorithm-based, and traditional ML frameworks. They highlight frequently used classifiers such as SVM, Random Forest, and logistic regression across datasets such as ABIDE, UCI, and AQ-10, with reported accuracies above 90%. Sethi et al.25 compared five ML models on the Kaggle screening dataset and identified Random Forest as the best performer (92.2% accuracy, F1 score of 0.92), but noted limitations due to the small size of the dataset, class imbalance, and the lack of multimodal data such as MRI and facial images.
Deep learning using pre-trained CNN models
Recent advances in deep learning have leveraged pre-trained CNN architectures for ASD classification from facial images. Hosseini et al.26 used a MobileNet-based deep model trained on the Kaggle dataset, reported an accuracy of 94.64%, and identified facial properties such as wide-set eyes as important markers. Similarly, Ahmed et al.20 applied MobileNet and InceptionV3 to achieve 95% accuracy, highlighting the simplicity of the models and the feasibility of real-time deployment. Arsado and Arzarani18 adopted Xception with 91% accuracy, whereas models such as NASNetMobile (78%) performed poorly, highlighting the importance of architecture choice. Lady and Andrew10 compared VGG16, VGG19, and EfficientNetB0 on facial cues, emphasizing that even similar architectures can yield different results, with accuracies of 84.66%, 80.05%, and 87.9%, respectively. Ahmad et al.9 demonstrated that ResNet50 outperforms other CNNs, achieving 92% accuracy and showing the advantages of deeper feature abstraction. However, although pre-trained models offer high performance, many lack built-in interpretability, which is an important issue in clinical applications.
Gaddala et al.12 implemented CNN models based on VGG16 and VGG19 to detect autism spectrum disorder in facial images. Using the Kaggle ASD dataset (2936 images), the models achieved accuracies of 86.33% and 84.00%, respectively. These results indicate that traditional CNN architectures remain competitive when trained on well-curated datasets. Rum et al.6 proposed a CNN-based facial image analysis framework for ASD detection using an 80:20 train-test split of the Kaggle dataset. Their model achieved 91% accuracy with a relatively high loss of 0.53. This suggests that the model has potential, but further tuning may be required to improve generalization and reduce overfitting. While the approach provides a cost-effective alternative to MRI-based diagnostics, the high loss points to limited robustness. Khan and Katarya27 present a deep learning approach using the Xception architecture. It achieves high training (97.66%) and validation (99.39%) accuracy, but the test accuracy drops to 67.35%, indicating overfitting and limited generalizability. The authors of ref. 28 used MobileNet with two dense layers to perform feature extraction and image classification for autism diagnosis. They obtained 94.6% accuracy in classifying images as either healthy or potentially autistic.
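The overfitting pattern described above for the Xception model of ref. 27 can be made concrete by comparing validation and test accuracy. The following is a minimal sketch; the function names and the 5-point tolerance are illustrative assumptions rather than anything taken from the cited study, and only the accuracy figures come from the text.

```python
def generalization_gap(val_acc: float, test_acc: float) -> float:
    """Return validation accuracy minus test accuracy, in percentage points."""
    return val_acc - test_acc


def looks_overfit(val_acc: float, test_acc: float, tolerance: float = 5.0) -> bool:
    """Flag a model whose test accuracy trails validation accuracy by more
    than `tolerance` points. The default tolerance is a hypothetical cutoff."""
    return generalization_gap(val_acc, test_acc) > tolerance


if __name__ == "__main__":
    # Validation and test accuracy as reported in the text for ref. 27.
    gap = generalization_gap(99.39, 67.35)
    print(f"validation-test gap: {gap:.2f} points")
    print("flagged as overfit:", looks_overfit(99.39, 67.35))
```

A large positive gap of this kind suggests that the validation data were not representative of the held-out test data, or that model selection was tuned too closely to the validation set.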
Explainable and interpretable AI in ASD classification
To address the need for transparency in ASD detection, researchers have integrated explainable AI (XAI) tools into deep learning frameworks. Alam et al.11 presented a data-centric approach using XAI along with XACE and ResNet50V2, achieving 98.9% and 97.1% accuracy, respectively. Data augmentation and preprocessing were important in improving model performance. Atlam et al.29 introduced a dual-component model that combines deep learning classifiers with SHAP explanations to enhance clinical interpretability. The proposed model emphasizes transparency in medical decision-making and strengthens trust between AI systems and health professionals. Ma et al.5 applied a contrastive variational autoencoder (CVAE) to MRI features and combined it with transfer learning, achieving an accuracy of over 94%. Interpretability was prioritized, but the dependence on neuroimaging limits scalability. Overall, these studies show that embedding interpretability in AI models is not only feasible but also essential for ethical and clinical adoption. Hossain et al.26 presented a novel approach using multilayer perceptrons (MLPs) trained on questionnaire-based inputs from Autism Spectrum Quotient tests. Unlike image-based models, their approach achieved 100% accuracy across all age groups using only 10 key questions. Although impressive, the reliance on self-reported or caregiver-reported inputs introduces potential subjectivity and bias, limiting standalone use without clinical supervision.
Uddin et al.30 conducted a systematic review of 130 publications from 2017 to 2023, highlighting the progression of deep learning techniques in ASD diagnosis through image and video modalities. Their study concluded that image-based DL models significantly improved the accuracy and speed of diagnosis. However, the review also points to gaps in model interpretability and in integration with real-time clinical settings, underscoring the need for explainable and reliable AI in healthcare. Atlam et al.31 introduced an explainable mental health disorder (EMHD) framework that integrates a voting ensemble model (using feature selectors such as mutual information, ANOVA, and RFE) with SHAP-based XAI to classify and explain disorders in young children. EMHD achieves perfect scores (accuracy, precision, recall, and F1 score of 1.0). Almars et al.32 proposed IIENM, an integrated IoT emotion recognition system that uses EfficientNet to detect emotions in children with autism. Trained on two facial expression datasets, the model captures real-time facial and physiological data via IoT sensors. To ensure transparency, it incorporates explainable AI methods (LIME and Grad-CAM) to highlight the image and signal regions that are important for its predictions.
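To illustrate what it means for an explanation method to highlight important image regions, the sketch below implements occlusion sensitivity, a simpler perturbation-based technique. It is not LIME, Grad-CAM, or SHAP, and it is not the method of any study cited above. The names `occlusion_map` and `toy_score` are hypothetical, and `toy_score` is a stand-in for a trained classifier.

```python
from typing import Callable, List

Image = List[List[float]]


def occlusion_map(image: Image, score: Callable[[Image], float],
                  patch: int = 2) -> List[List[float]]:
    """Return, for each patch, how much the score drops when it is zeroed."""
    height, width = len(image), len(image[0])
    baseline = score(image)
    heat = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            occluded = [r[:] for r in image]  # copy so the input is untouched
            for y in range(top, min(top + patch, height)):
                for x in range(left, min(left + patch, width)):
                    occluded[y][x] = 0.0
            row.append(baseline - score(occluded))
        heat.append(row)
    return heat


def toy_score(image: Image) -> float:
    """Hypothetical classifier stand-in: mean intensity of the top-left 2x2 block."""
    return (image[0][0] + image[0][1] + image[1][0] + image[1][1]) / 4.0


if __name__ == "__main__":
    image = [[1.0] * 4 for _ in range(4)]
    for row in occlusion_map(image, toy_score, patch=2):
        print(row)
```

Patches whose removal causes the largest drop in score are the regions the model relies on most. LIME and Grad-CAM arrive at comparable heat maps by different means: a local surrogate model and gradient-weighted activations, respectively.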
Multimodal and hybrid ASD detection approaches
Integrating multimodal data sources has emerged as a strategy to improve the accuracy and robustness of ASD diagnostics. Sellamuthu et al.7 proposed a hybrid framework that combines facial images and ADOS behavioral scores in a multimodal CNN model, achieving 97.05% accuracy. By comparison, individual models such as MobileNetV2 (78.94%) and ResNet50 (56.19%) performed considerably worse. Gutierrez et al.33 combined visual and audio signals for pain assessment in nonverbal patients, demonstrating that 92% accuracy and specificity can be achieved and highlighting the potential of multimodal AI in broader health care. Zhu et al.34 introduced behavioral indicators into ASD screening, using infant response-to-name (RTN) signals in a multimodal system that achieved an accuracy of up to 92%. These methods significantly improve classification, but in many cases they require multiple sensors or subjective annotations, which can reduce their practicality for large-scale deployment.
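One common way to combine modalities is late fusion, in which each modality produces its own probability and the probabilities are then averaged. The sketch below illustrates this under stated assumptions: the cited studies do not specify their fusion mechanisms here, and the function name, modality keys, weights, and probabilities are all hypothetical.

```python
from typing import Dict


def late_fusion(probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-modality probabilities for the positive class."""
    if set(probs) != set(weights):
        raise ValueError("probs and weights must cover the same modalities")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(probs[m] * weights[m] for m in probs) / total


if __name__ == "__main__":
    # Hypothetical outputs of a facial-image model and a behavioral-score model.
    probs = {"face": 0.80, "behavior": 0.60}
    weights = {"face": 1.0, "behavior": 1.0}  # equal trust in both modalities
    fused = late_fusion(probs, weights)
    label = "ASD" if fused >= 0.5 else "non-ASD"
    print(f"fused probability: {fused:.2f} -> {label}")
```

Late fusion keeps each modality's model independent, so a missing sensor can be handled by dropping its entry. Feature-level fusion inside a single network, by contrast, can learn interactions between modalities but requires all inputs at training time.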
Lightweight and real-time detection systems
Lightweight architectures and mobile deployments are being investigated to enable scalable and resource-efficient ASD screening. Sholikah et al.35 developed a real-time facial emotion recognition system using VGG-16 embedded in a mobile application, reaching 91% accuracy. This enabled emotional feedback for individuals with ASD, demonstrating direct real-world utility. Singh et al.16 applied transfer learning using six pre-trained models, including MobileNet, Xception, and EfficientNetB7, obtaining accuracies in the range of 82.6-88% and offering options tailored to device capabilities. Anjum et al.3 integrated five CNN models with logistic regression, reaching 88.33% accuracy and highlighting a fusion-based design for a lightweight yet effective system. Khosla et al.36 used MobileNet for facial classification and applied domain-specific adjustments such as eye spacing normalization to achieve 87% accuracy. Although these models are promising for mobile deployment, concerns remain about their lower accuracy and increased sensitivity to preprocessing.
Li et al.14 introduced a CNN-based facial expression analysis system that uses video data to classify ASD based on arousal, valence, and facial action units (AUs). Their system used a dataset of 105 children (62 ASD, 43 non-ASD), achieving F1 scores of 76% and 69% sensitivity and specificity, respectively. The strength of this approach lies in its use of emotional cues rather than static traits. Cao et al.15 developed ViTASD, a facial image-based ASD diagnostic model that uses the Vision Transformer (ViT) architecture. ViTASD achieved 94.5% accuracy using a custom dataset of 2926 images. Unlike CNNs, ViTs process spatial information globally, which makes them particularly suitable for subtle tasks such as ASD detection. Although powerful, ViTs require more computational resources and larger datasets for optimal training, which may limit adoption in low-resource settings. Gehdu et al.19 focused on perceptual differences in autistic individuals by examining how images of surrounding faces were grouped in an oddball detection task. Unlike typical classification models, this study used a private dataset with 120 participants to highlight behavioral traits and cognitive processing. It found significant performance discrepancies in facial image grouping between the autistic (65.96%) and non-autistic (74.71%) groups.
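Because the studies above report a mix of accuracy, F1 score, sensitivity, and specificity, it may help to recall how these quantities relate to the confusion matrix. The sketch below uses the standard definitions. The counts are invented for illustration and do not correspond to any cited study, and the function name is hypothetical.

```python
from typing import Dict


def screening_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Compute standard binary screening metrics from confusion-matrix counts.

    Assumes at least one actual positive, one actual negative, and one
    predicted positive, so that no denominator is zero.
    """
    sensitivity = tp / (tp + fn)   # recall on the ASD class
    specificity = tn / (tn + fp)   # recall on the non-ASD class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Invented counts: 50 ASD cases and 50 non-ASD cases.
    for name, value in screening_metrics(tp=40, fn=10, tn=30, fp=20).items():
        print(f"{name}: {value:.3f}")
```

In a screening context, sensitivity and specificity are often more informative than accuracy alone, since a model can reach high accuracy on an imbalanced dataset while missing many true cases.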
Despite these advances, ASD detection methods still face significant challenges, including limited interpretability in traditional ML, overfitting and poor generalization in deep learning, scalability problems with MRI and self-reported data, reliance on complex sensors in multimodal systems, and reduced accuracy or high resource demands in lightweight and transformer models. This study introduces a deep learning framework for automated ASD detection that uses pre-trained CNN models such as VGG16, VGG19, InceptionV3, VGGFace, and MobileNet. Explainable AI techniques such as LIME are incorporated alongside advanced preprocessing and data augmentation to improve both the accuracy and the interpretability of the diagnostic process.
