Endotracheal suctioning (ES) is an essential but invasive clinical procedure that currently lacks robust automated training tools for skill development and risk mitigation, especially in unsupervised settings. Researchers Hoang Khang Phan (Ho Chi Minh City University of Technology), Quang Vinh Dang (University of Massachusetts Amherst), Noriyo Colley (Hokkaido University), and colleagues present a novel framework that leverages large language models (LLMs) for video-based activity recognition and explainable feedback generation. The work matters because it goes beyond simple recognition to give trainees natural language guidance, turning complex technical data into actionable insights. The team's LLM-centered approach outperforms traditional machine learning and deep learning models, with improvements of approximately 15-20% in accuracy and F1 score, establishing a scalable foundation for improving nursing education and patient safety.
This study addresses significant gaps in training and assessment, particularly in settings such as home health care and education, where consistent professional supervision is limited. The team did so by building an integrated, LLM-centric system that analyzes video data to identify procedural steps and gives trainees interpretable guidance. The core innovation lies in using the LLM not only for activity recognition but also for explainable decision-making, turning complex technical assessments into accessible natural language feedback.
The study centers on a video-based approach in which the LLM acts as the central reasoning module, performing both spatiotemporal activity recognition and detailed analysis of the steps shown in the video. The researchers benchmarked this LLM-based system against traditional machine learning and deep learning techniques and reported performance improvements of approximately 15-20% in both accuracy and F1 score. Beyond identifying actions, the framework includes a pilot student support module built on anomaly detection and explainable AI (XAI) principles that automatically highlights both correct actions and areas for improvement. Experiments show that the LLM effectively identifies the constituent steps of ES, such as preparation, catheter insertion, suction application, and withdrawal, by analyzing skeletal keypoints extracted from video footage.
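To make the idea of an LLM reasoning over skeletal keypoints more concrete, the sketch below shows one way a short window of keypoint features could be serialized into a text prompt. The article does not describe the actual prompt format, so the function name `build_step_prompt`, the feature names, and the wording of the prompt are all hypothetical; no LLM is called here.

```python
from typing import Dict, List

# Step labels taken from the article; everything else below is illustrative.
STEPS = ["preparation", "catheter insertion", "suction application", "withdrawal"]


def build_step_prompt(window: Dict[str, List[float]]) -> str:
    """Serialize one time window of keypoint features into a classification prompt.

    `window` maps a (hypothetical) feature name to its values over
    consecutive frames.
    """
    lines = [
        "You are assessing an endotracheal suctioning (ES) training video.",
        "Each line gives one skeletal keypoint feature over consecutive frames.",
    ]
    # Sort feature names so the prompt is deterministic for a given window.
    for name, series in sorted(window.items()):
        values = ", ".join(f"{v:.2f}" for v in series)
        lines.append(f"- {name}: [{values}]")
    lines.append(
        "Which step is being performed? Answer with exactly one of: "
        + "; ".join(STEPS) + "."
    )
    lines.append("Then explain in one or two sentences which features support your answer.")
    return "\n".join(lines)


# Illustrative values only, not data from the study.
window = {
    "right_wrist_y": [0.62, 0.58, 0.51],
    "right_elbow_angle_deg": [95.0, 110.0, 128.0],
}
print(build_step_prompt(window))
```

Asking for the label and a short justification in the same prompt mirrors the article's point that recognition and explanation come from the same reasoning module.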
The system's ability to provide interpretable feedback is a significant advance, offering targeted suggestions that help trainees hone their skills and train more efficiently. This automated feedback mechanism offers a scalable, objective way to assess procedural performance, going beyond traditional subjective human observation. The research points toward automated skills assessment, data-driven clinical training, and real-time safety alerts designed to prevent procedural errors and reduce patient harm. The LLM-based approach not only improves recognition accuracy but also increases transparency and fosters trust in the system's evaluations and recommendations.
The research team designed a system that uses video-based pose estimation to capture the kinematics of nursing staff during ES, analyzing spatiotemporal features derived from skeletal keypoints to recognize steps such as preparation, catheter insertion, suction application, and withdrawal. To handle occluded body parts in the pose data, the study applied an interpolation technique mirroring the approach of Ngo et al. to reduce noise and fill in missing values. In that earlier work, interpolation raised the F1 score from 42% with raw skeletal data to 46%, a gain that still left a clear need for further performance improvements.
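The article does not say which interpolation scheme was used, so the following is a minimal sketch assuming simple linear interpolation over frames in which a keypoint was occluded. The function name `interpolate_track` and the convention of marking missing values with `None` are assumptions made for illustration.

```python
from typing import List, Optional


def interpolate_track(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill None gaps in one keypoint coordinate track by linear interpolation.

    Gaps at the start or end are filled with the nearest observed value.
    A track with no observations at all is returned unchanged.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return list(values)

    filled = list(values)
    # Leading and trailing gaps: hold the nearest observed value.
    for i in range(known[0]):
        filled[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        filled[i] = values[known[-1]]
    # Interior gaps: straight line between the two bracketing observations.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            t = (i - left) / span
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled


# Illustrative x-coordinates of one wrist keypoint; None marks occluded frames.
wrist_x = [None, 0.0, None, None, 3.0, None]
print(interpolate_track(wrist_x))
```

Each coordinate of each keypoint would be treated as its own track. Long occlusions are filled with a straight line, which is one reason interpolation alone may yield only modest gains.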
This work pioneers the use of LLMs not just for recognition but also for generating explainable decision analysis and natural language feedback, translating complex technical insights into accessible guidance for trainees. The researchers used the LLM as the central inference module and achieved an improvement of approximately 15-20% in both accuracy and F1 score over the baseline models. Beyond simple recognition, the team built a student support module based on anomaly detection and explainable AI (XAI) principles to provide automated, interpretable feedback that highlights correct actions and suggests targeted improvements. To augment limited training data, the study considered a technique inspired by Dobhal et al. and investigated the potential of LLMs, especially GPT-4o, as data augmentation agents driven by prompt engineering.
This approach aims to generate synthetic data to enhance model training; in the work of Dobhal et al., it raised the F1 score by one percentage point, from 55% with random sampling to 56%. Recognizing the limitations of a single viewpoint, the team also drew on a multi-angle video acquisition strategy, which reached an F1 score of 61% compared with 51% for a single-angle approach, at the cost of a complex multi-camera setup. The resulting LLM-based approach establishes a scalable, interpretable, data-driven foundation for advancing nursing education and improving patient safety.
LLM framework improves endotracheal suctioning recognition accuracy
Scientists have developed a new large language model (LLM)-centered framework for video-based activity recognition. It specifically targets endotracheal suctioning (ES), an essential yet invasive clinical procedure. The study addresses the lack of automated training and feedback systems for ES, especially in environments with limited supervision. Experimental results show that the LLM-based approach outperforms the baseline models, with an improvement of approximately 15-20% in both accuracy and F1 score. This result provides an extensible and interpretable foundation for advancing nursing education and increasing training efficiency.
The research team used the LLM as the central reasoning module to perform spatiotemporal activity recognition and explainable decision analysis on video data. The LLM also verbalizes feedback in natural language, translating complex technical insights into accessible guidance for trainees. The data demonstrate that the framework can accurately identify the constituent steps of ES, such as preparation, catheter insertion, suction application, and withdrawal, by analyzing skeletal keypoints extracted from video footage. System performance was quantified by accuracy and F1 score, two standard metrics for evaluating activity recognition models.
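For readers less familiar with the two metrics, the sketch below computes accuracy and a macro-averaged F1 score over per-segment step labels. The article does not state which F1 averaging the study used, so macro averaging is an assumption here, and the example labels are made up for illustration.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of segments whose predicted step matches the true step."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative labels only, not results from the study.
y_true = ["preparation", "insertion", "suction", "withdrawal", "suction", "insertion"]
y_pred = ["preparation", "insertion", "suction", "withdrawal", "insertion", "insertion"]
print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

Accuracy rewards getting frequent steps right, while macro F1 gives every step equal weight, so reporting both guards against a model that ignores short steps.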
The study includes a pilot student support module that goes beyond simple recognition and draws on anomaly detection and explainable AI (XAI) principles. Testing indicates that this module provides automated, interpretable feedback, highlighting both correct actions and targeted areas for improvement. The results support the system's ability to analyze how activities are executed and to generate meaningful feedback for performance evaluation and skill improvement. The framework's ability to detect and interpret procedural nuances is an important step toward objective, continuous quality monitoring in clinical training.
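The article does not detail how the support module detects anomalies, so the following is only a sketch of the general idea, assuming a simple z-score check on how long each step took. The reference statistics, the threshold `z_limit`, and the feedback wording are all hypothetical and carry no clinical meaning.

```python
from typing import Dict, List, Tuple

# Hypothetical reference statistics per step: (mean, standard deviation) in seconds.
# These are placeholders for illustration, not clinical guidance.
REFERENCE: Dict[str, Tuple[float, float]] = {
    "preparation": (30.0, 5.0),
    "catheter insertion": (8.0, 2.0),
    "suction application": (10.0, 2.0),
    "withdrawal": (5.0, 1.0),
}


def step_feedback(
    durations: Dict[str, float],
    reference: Dict[str, Tuple[float, float]] = REFERENCE,
    z_limit: float = 2.0,
) -> List[str]:
    """Return one plain-language message per reference step."""
    messages = []
    for step, (mean, std) in reference.items():
        if step not in durations:
            messages.append(f"{step}: step not detected; check that it was performed.")
            continue
        z = (durations[step] - mean) / std
        if abs(z) <= z_limit:
            messages.append(f"{step}: duration within the reference range.")
        elif z > 0:
            messages.append(
                f"{step}: took longer than the reference range "
                f"({durations[step]:.1f} s vs about {mean:.1f} s)."
            )
        else:
            messages.append(
                f"{step}: was shorter than the reference range "
                f"({durations[step]:.1f} s vs about {mean:.1f} s)."
            )
    return messages


# Illustrative trainee timings; the withdrawal step is deliberately missing.
trainee = {"preparation": 31.0, "catheter insertion": 8.5, "suction application": 18.0}
for line in step_feedback(trainee):
    print(line)
```

In the framework described here, the LLM would phrase such findings in richer natural language; the fixed templates above only stand in for that step.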
The researchers used video-based pose estimation to capture the fine-grained kinematics of nursing staff during the procedure and analyzed spatiotemporal features derived from skeletal keypoints. Previous work by Ngo et al. achieved an F1 score of 42% using raw skeletal data and 46% with interpolation, and the team's LLM-based approach exceeds these results. Whereas Dobhal et al. reported a one-point increase in F1 score, from 55% to 56%, using LLM-generated synthetic data, the current study achieved an improvement of approximately 15-20% in both accuracy and F1 score over traditional machine learning and deep learning baselines. The LLM acts as a central reasoning module that identifies spatiotemporal activities and provides explainable decision analysis from video data. The system also turns complex technical insights into natural language feedback that students can readily use, offering automated, interpretable guidance on what was done correctly and where there is room for improvement.
This work is a proof of concept for a new generation of human activity recognition (HAR) systems that leverage the contextual reasoning of LLMs for applications in human-computer interaction. By combining semantic understanding with visual data, the framework enables zero-shot learning and nuanced interpretation of behavior, with potential implications for personal healthcare, smart environments, and robotics. The authors acknowledge the limitations of the pilot study and note that further research is needed to refine subsequent prototypes. Nevertheless, the system's success in identifying and explaining student errors, combined with its ability to verbalize technical indicators as understandable feedback, establishes a scalable and interpretable foundation for advancing nursing education, increasing training efficiency, and ultimately improving patient care.
