AI could help universities spot struggling students weeks before final grades are known, but new research by Chen-Chung Chi of Tamkang University in Taiwan suggests that simple machine learning may outperform deep learning when student datasets are small and unbalanced.
The study, "AI-powered sustainable transformation of education supply chains: A comparative evaluation of machine learning models for early warning systems and design-level frameworks for institutionalization and impact assessment," was published in Sustainability. It evaluates AI models for predicting student failure in a programming course. The study found that a random forest model using SMOTE detected at-risk students more reliably than GRU and LSTM models, offering a practical early warning tool for higher education institutions seeking to reduce attrition and support sustainable student success.
In higher education, student attrition poses a supply chain risk
The paper argues that universities often rely on delayed signals, such as midterms and final exams, to identify underperforming students. By that time, the optimal period for intervention may have ended. Early warning systems can change the timing by identifying risks while there is still sufficient time to tailor tutoring, advice or guidance.
Tamkang University already uses the Smart PASS platform, which integrates learning management systems, advising tools, and student success features. The research focuses on improving the platform's existing performance-and-engagement diagram, which classifies students according to fixed thresholds. This rule-based system has practical limitations. Fixed thresholds can be distorted by class size, outliers, and early-semester sparsity, when many students have not yet generated sufficient activity data.
The proposed AI system is designed to replace the fixed classification step with model-generated failure probabilities while keeping the broader dashboard and notification workflow.
Random Forest detected at-risk students earlier than deep learning models
The study used learning trajectory data for 188 students from one programming course over four semesters. The first two semesters, with 90 students, were used for training. The next two semesters, with 98 students, were used for temporal validation. This means that the model was tested on later cohorts rather than on random splits of the same pool.
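The temporal split described above can be sketched as follows. This is a minimal illustration, not the study's code: the record layout, the field names (`student`, `semester`, `failed`) and the semester indices 1 to 4 are assumptions made for the example.

```python
def temporal_split(records, train_semesters):
    """Split student records by semester instead of shuffling them."""
    train = [r for r in records if r["semester"] in train_semesters]
    valid = [r for r in records if r["semester"] not in train_semesters]
    return train, valid


# Hypothetical records: one dict per student, with a semester index from 1 to 4.
records = [
    {"student": "a", "semester": 1, "failed": 0},
    {"student": "b", "semester": 2, "failed": 1},
    {"student": "c", "semester": 3, "failed": 0},
    {"student": "d", "semester": 4, "failed": 1},
]

train, valid = temporal_split(records, train_semesters={1, 2})

# Every validation cohort must come after every training cohort.
assert max(r["semester"] for r in train) < min(r["semester"] for r in valid)
```

Splitting by cohort rather than at random keeps information from later semesters out of training, which is closer to how an early warning system would actually be used.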
The dataset included 30 students who failed, creating a substantial class imbalance. Because the purpose of the early warning system is to identify students at risk of failing, the study treated the failing category as the positive class. Recall therefore became an important measure of whether the system catches the students who need help.
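To make the role of recall concrete, the sketch below computes accuracy, precision, recall and F1 with failing students as the positive class. The labels and predictions are invented for illustration and are not data from the study.

```python
def failure_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with failing (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only: 2 failing students (1) among 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
all_pass = [0] * 10                          # predicts that everyone passes
cautious = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]    # flags both failing students plus two others

print(failure_metrics(y_true, all_pass))
print(failure_metrics(y_true, cautious))
```

In this toy case both predictors reach 80 percent accuracy, but the first catches none of the failing students and the second catches both. Accuracy alone cannot separate them on an imbalanced class.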
The study compared three models: Random Forest with SMOTE, GRU, and LSTM. SMOTE was used to address the imbalance for the random forest model by creating synthetic minority examples in the training data. The deep learning models instead used a small number of replicated minority examples with small Gaussian noise added. The results favored the random forest model. Across prediction weeks 6 to 16 and both validation semesters, Random Forest achieved 85.59 percent accuracy, 91.19 percent recall for failing students, 58.89 percent precision, and a 70.36 percent F1 score.
Most importantly for intervention, the random forest model provided usable warnings by week 6, with a failure recall of 87.86 percent. This means the system identified most of the students who ultimately failed, while leaving approximately 12 weeks for instructors and advisors to intervene.
The LSTM and GRU models performed poorly on the at-risk group, even though they had better headline accuracy in some cases. In the early weeks, both deep learning models often collapsed toward the majority class. This means that they tended to predict that students would pass and missed those who failed. LSTM only became usable around week 14, while GRU remained unreliable for most of the semester.
Although deep learning models are often thought to be well suited to time series data, this study shows that this is not necessarily the case for small and unbalanced educational datasets. In this setting, a simpler, cheaper model was more useful for the actual operational goal: catching and assisting students early enough.
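The two imbalance-handling ideas mentioned above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the study's implementation: the function names, the noise scale `sigma`, the neighbour count `k` and the toy feature vectors are all hypothetical, and a real pipeline would normally use a library implementation of SMOTE. Either technique would be applied to the training data only.

```python
import random


def gaussian_replicate(minority, n_new, sigma=0.01, seed=0):
    """Replicate minority rows and add small Gaussian noise to each feature."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        base = rng.choice(minority)
        out.append([x + rng.gauss(0.0, sigma) for x in base])
    return out


def smote_like(minority, n_new, k=3, seed=0):
    """SMOTE-style sketch: interpolate between a minority row and one of
    its k nearest minority neighbours (needs at least 2 minority rows)."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    out = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: dist2(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        out.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return out


# Hypothetical feature vectors for failing students in the training set.
minority = [[1.0, 0.0], [2.0, 1.0], [1.5, 0.5], [0.5, 0.2]]
noisy_copies = gaussian_replicate(minority, n_new=4)
interpolated = smote_like(minority, n_new=4)
```

The difference is where the new points land: noisy replication stays very close to existing failing students, while interpolation fills in the space between them.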
Weekly learning patterns matter for early intervention
The study also compared different ways of representing student activity data. The original weekly features captured what students did each week, the cumulative features captured average activity over time, and a mixed approach combined both. The original weekly features had the highest sensitivity for failing students, with a failure recall of 90.36 percent. Cumulative and mixed features improved precision and F1 score, but at the cost of some recall.
For student support, missing an at-risk student can be more costly than sending an unnecessary check-in to a student who is doing well. This makes recall especially important. As a result, the study favors the original weekly features when early detection is the priority.
The study also found that cumulative features do not dilute information. In fact, they showed a stronger mean difference between students who passed and students who failed. However, their smooth profile made them less responsive to sudden drops or changes in behavior. The original weekly features were more sensitive to sudden signals that a student was slipping.
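A small sketch of the three representations, assuming one activity count per week. The running-average definition follows the description above. Building the mixed representation by concatenating the two vectors is an assumption, and the activity values are invented.

```python
def cumulative_mean(weekly):
    """Running average of activity up to and including each week."""
    out, total = [], 0.0
    for week, value in enumerate(weekly, start=1):
        total += value
        out.append(total / week)
    return out


def mixed_features(weekly):
    """Weekly values followed by their running averages (assumed layout)."""
    return list(weekly) + cumulative_mean(weekly)


# Hypothetical activity counts: steady for five weeks, then a drop in week 6.
weekly = [4, 4, 4, 4, 4, 0]
cumulative = cumulative_mean(weekly)

print(weekly[-1])                 # raw weekly value in week 6
print(round(cumulative[-1], 2))   # running average in week 6
```

In this toy series the raw weekly value falls to 0 in week 6, while the running average only falls from 4 to about 3.33.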
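The cost argument can be illustrated with a simple decision threshold on predicted failure probabilities. Everything in this sketch is hypothetical: the probabilities, the candidate thresholds and the relative costs are chosen only to show how the preferred threshold shifts when a missed student is weighted more heavily than an unnecessary check-in.

```python
def total_cost(y_true, probs, threshold, cost_miss, cost_false_alarm):
    """Sum the cost of missed failing students and unnecessary check-ins."""
    cost = 0.0
    for truth, prob in zip(y_true, probs):
        flagged = prob >= threshold
        if truth == 1 and not flagged:
            cost += cost_miss
        elif truth == 0 and flagged:
            cost += cost_false_alarm
    return cost


def best_threshold(y_true, probs, candidates, cost_miss, cost_false_alarm):
    """Pick the candidate threshold with the lowest total cost."""
    return min(
        candidates,
        key=lambda th: total_cost(y_true, probs, th, cost_miss, cost_false_alarm),
    )


# Hypothetical outcomes (1 = failed) and predicted failure probabilities.
y_true = [1, 1, 0, 0, 0, 0]
probs = [0.70, 0.35, 0.40, 0.20, 0.10, 0.05]
candidates = [0.3, 0.5]

# A missed student costs five times as much as an unnecessary check-in.
recall_first = best_threshold(y_true, probs, candidates, cost_miss=5, cost_false_alarm=1)
# The reverse weighting, for comparison.
precision_first = best_threshold(y_true, probs, candidates, cost_miss=1, cost_false_alarm=5)
```

With missed students weighted more heavily, the lower threshold wins, because it flags the second failing student at the price of one extra check-in.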
This study examined the activity types recorded in the iClass learning management system, including homework, forum participation, exams, instructor-defined custom activities, and web links. Random Forest importance scores emphasized homework, forum, and exam-related signals, and LSTM permutation analysis emphasized exams, homework, and custom activities.
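Permutation analysis of the kind mentioned above can be sketched generically: shuffle one feature column at a time and measure how much a score drops. This is an illustrative version only. The toy rule-based model, the two features (homework submissions and web link clicks) and the data are hypothetical and do not reproduce the study's LSTM analysis.

```python
import random


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def permutation_importance(predict, X, y, metric, n_repeats=10, seed=0):
    """Mean drop in the metric when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = metric(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - metric(y, [predict(row) for row in shuffled]))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical rows: [homework submissions, web link clicks]; 1 = failed.
X = [[0, 9], [1, 3], [3, 7], [4, 1], [5, 5], [0, 2]]
y = [1, 1, 0, 0, 0, 1]


def toy_model(row):
    """Stand-in model: predicts failure from homework only, ignoring clicks."""
    return 1 if row[0] < 2 else 0


importances = permutation_importance(toy_model, X, y, accuracy)
```

Because the stand-in model never looks at the click column, shuffling it changes no predictions and its importance is exactly zero, which mirrors how an uninformative feature shows up in this kind of analysis.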
Both models agreed that web link click-through data had little predictive value. The study explains that opening a linked resource does not indicate how deeply a student has engaged with the material, and that universities should be careful not to treat all digital traces as meaningful learning signals.
The key takeaway for instructors is that consistent submission behavior, assessment performance, and course-specific activity patterns are more informative than simple clicks. An ideal early warning system should therefore focus on learning behaviors that demonstrate effort, understanding, and task completion, rather than superficial platform activity.
Impact and limitations of AI in student success systems
The findings suggest that universities do not need complex deep learning systems to begin improving early student support. Random forest models that carefully handle imbalances may be a more practical starting point for small course-level implementations. Many universities lack the large datasets, advanced AI teams, and infrastructure needed to train complex models. Low-cost, reproducible approaches that work with small datasets may be more realistic for early warning systems.
The study also proposes a three-level institutionalization pathway:
- Level 1 involves instructor pilots, with individual teachers using the system and tracking whether flagged students receive support.
- Level 2 moves into departmental use and allows faculty advisors and curriculum teams to monitor risk patterns across courses.
- Level 3 integrates the system into a broader university platform, making it more widely available.
An impact framework covering student outcomes, resource efficiency, organizational learning, and human capital outcomes is also proposed for future evaluations. These are design-level suggestions, not measured results from this study.
Rather than claiming that the system has already improved graduation rates, reduced institutional costs, or produced measurable long-term social impact, this study shows that early detection of risk is technically feasible in a single course and provides a framework for later testing broader effects.
The study also acknowledges some limitations. It is based on one programming course at one university, with 188 students and only 30 failures, so the results cannot be automatically generalized across disciplines, universities, and learning management systems. The binary pass/fail outcome also limits the analysis. Student performance is more complex than a simple pass/fail label, and future research may examine different achievement levels and types of academic risk. The study also found semester-to-semester variation, indicating that models may need to be updated regularly in response to changes in cohorts and instructional patterns.
The deep learning models were tested using a relatively simple imbalance-handling procedure. More advanced techniques such as focal loss, class-weighted training, 3D SMOTE, attention mechanisms, and Transformer-based models may change the comparison in future studies. Privacy and governance also become important as such systems move from course pilots to campus-wide use. Early warning systems handle sensitive student data and can affect how instructors, advisors, and institutions perceive learners. The proposed deployment pathway therefore calls for transparency, auditing, and careful intervention design.
