Fine-tuned LLM achieves 99% plausible counterfactuals for health interventions

Machine Learning


Researchers are increasingly looking for ways to make artificial intelligence more interpretable and useful for real-world health interventions. Shovito Barua Soumma, Asiful Arefeen, and Stephanie M. Carpenter from Arizona State University, along with Melanie Hingle (University of Arizona) and Hassan Ghasemzadeh, demonstrate a new approach to generating “counterfactual explanations” using large language models (LLMs). A counterfactual explanation identifies the minimal changes to a model’s inputs needed to achieve a different predicted outcome. Their research, detailed in a new paper, uses clinical datasets to evaluate the performance of models such as GPT-4, BioMistral-7B, and LLaMA-3.1-8B in both standard and fine-tuned configurations, and reveals that fine-tuned LLMs, particularly LLaMA-3.1-8B, can produce highly plausible and clinically relevant interventions. Importantly, these LLM-generated counterfactuals not only provide interpretable insights but also significantly improve model performance when training data is limited, offering a flexible and model-independent path towards more robust and effective digital health technologies.
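In spirit, a counterfactual explanation is a minimal edit to an input that flips a model's prediction. A toy sketch of that idea (hypothetical model, feature names, and greedy search — not the authors' pipeline):

```python
# Toy illustration of a counterfactual explanation: find a small change to the
# input that flips the classifier's prediction. Model and features are made up.
def predict_stress(features):
    """Hypothetical classifier: 'high stress' when glucose is high and deep sleep low."""
    return int(features["glucose_mgdl"] > 200 and features["deep_sleep_pct"] < 33)

def counterfactual(features, candidate_tweaks):
    """Greedily apply candidate tweaks until the prediction flips."""
    cf = dict(features)
    changed = []
    for name, value in candidate_tweaks:
        if predict_stress(cf) == 0:     # prediction already flipped: stop early
            break
        cf[name] = value
        changed.append(name)
    return cf, changed

patient = {"glucose_mgdl": 210.8, "deep_sleep_pct": 30.1}
cf, changed = counterfactual(
    patient, [("glucose_mgdl", 180.0), ("deep_sleep_pct", 35.0)]
)
# The first tweak alone flips the toy model, so the counterfactual stays sparse:
# changed == ["glucose_mgdl"] and predict_stress(cf) == 0.
```

The point of the sketch is the objective, not the search: real generators (including LLM-based ones) must additionally keep the edited features plausible and actionable.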

The team evaluated the models in both pre-trained and fine-tuned configurations on clinical datasets, assessing their ability to identify the minimal practical changes needed to alter a machine learning model’s predictions. The fine-tuned LLMs, particularly LLaMA-3.1-8B, consistently produced counterfactuals (CFs) with validity of up to 99% and plausibility of 0.99, with realistic and behaviorally modifiable feature adjustments. In addition to providing human-centered interpretability, this work reveals new ways to augment training data and improve model performance, especially in scenarios where labeled data is limited. The method addresses the limitations of traditional approaches, which often struggle with categorical consistency and clinically relevant modifications.

Specifically, the SenseCF framework fine-tunes the LLM to generate valid and representative counterfactual explanations and supplement minority classes in imbalanced datasets, thereby improving model training and predictive performance. As shown in the accompanying figure, the classifier’s F1 score drops significantly as the training data shrinks, highlighting the weaknesses of the standard model and motivating synthetic augmentation with LLM-generated counterfactuals. This research represents an important step toward AI systems that can provide both accurate predictions and actionable insights in critical medical applications. Furthermore, this study systematically compares GPT-4 against open-source LLMs, providing a rigorous, quantitative comparison across diverse clinical settings. It contributes to the field of explainable AI and its applications in digital health by addressing gaps in the current literature, such as the lack of comprehensive evaluations on large clinical datasets and of standardized evaluation metrics. The findings suggest that LLM-driven counterfactuals hold great promise for creating more transparent, robust, and effective healthcare solutions.
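The augmentation idea can be sketched as follows — an assumed workflow, not the exact SenseCF implementation: counterfactuals of majority-class samples carry the flipped label, so they can top up the minority class before retraining.

```python
# Assumed augmentation workflow (illustrative, not the paper's exact code):
# counterfactuals generated from majority-class samples get the flipped label
# and are appended to the training set to rebalance it.
from collections import Counter

def augment_with_counterfactuals(X, y, cf_samples, cf_labels):
    """Append counterfactual samples, labelled with the flipped class."""
    return X + cf_samples, y + cf_labels

# Toy features: [glucose mg/dL, deep-sleep %]; class 0 is the minority.
X = [[210.8, 30.1], [220.0, 29.0], [150.0, 40.0]]
y = [1, 1, 0]
cf_X = [[180.0, 35.0]]      # counterfactual of a class-1 row
cf_y = [0]                  # flipped label

X_aug, y_aug = augment_with_counterfactuals(X, y, cf_X, cf_y)
# Counter(y_aug) → {1: 2, 0: 2}: the minority class is now balanced.
```

The retrained classifier then sees a balanced (or at least less skewed) label distribution, which is where the reported F1 recovery comes from.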

LLM counterfactuals for clinical data evaluation yield promising results

The experiments followed a rigorous methodology, starting with training several classifiers (support vector machines, random forests, XGBoost, and neural networks) on the AI-READI dataset to establish baseline performance under various levels of data reduction. The team then used each LLM to generate counterfactual explanations, prompting it to identify the minimal changes to input features that could flip the model’s predictions. To quantify intervention quality, the researchers assessed validity and plausibility, achieving validity of up to 99% and plausibility of 0.99 with the fine-tuned LLMs, specifically LLaMA-3.1-8B. Feature diversity was measured by analyzing the range of adjusted features within the generated counterfactuals, ensuring realistic and behaviorally modifiable changes.
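Validity, as commonly defined for counterfactuals (and assuming that reading here), is the fraction of generated CFs that actually flip the classifier's prediction — a minimal sketch:

```python
# Validity metric for counterfactuals (standard definition, assumed to match
# the paper's usage): the share of CFs whose predicted class differs from the
# prediction on the corresponding original input.
def validity(model, originals, counterfactuals):
    flipped = sum(
        model(cf) != model(x) for x, cf in zip(originals, counterfactuals)
    )
    return flipped / len(counterfactuals)

# Toy threshold classifier on a single glucose feature.
model = lambda x: int(x[0] > 200)
originals = [[210.0], [230.0], [205.0]]
cfs = [[180.0], [190.0], [215.0]]   # the last CF fails to cross the threshold
score = validity(model, originals, cfs)   # → 2/3
```

A validity of 99% means nearly every generated counterfactual succeeds in changing the model's decision.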

This work pioneered a data augmentation method that introduces LLM-generated CFs as synthetic training samples under controlled label-rarity settings. Specifically, the team reduced the training data by 10%, 20%, 30%, 40%, 50%, 60%, and 70% to simulate realistic clinical scenarios with limited labeled data, then retrained the classifier on the CFE-enriched data and measured the recovery in F1 score. The findings highlight the weaknesses of standard models under label scarcity and motivate principled synthetic augmentation via LLM-generated counterfactuals.
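The reduction protocol can be sketched with a hypothetical stdlib-only helper; the classifier training and F1 comparison are elided:

```python
# Sketch of the label-scarcity protocol (assumed setup): remove a fraction of
# the training set, then compare a classifier trained on the reduced data
# against one trained on the same data enriched with LLM-generated CFs.
import random

def reduce_training_data(X, y, fraction_removed, seed=0):
    """Simulate label scarcity by keeping a random subset of the training set."""
    rng = random.Random(seed)
    n_kept = round(len(X) * (1 - fraction_removed))
    kept = sorted(rng.sample(range(len(X)), k=n_kept))
    return [X[i] for i in kept], [y[i] for i in kept]

X = [[float(i)] for i in range(100)]
y = [i % 2 for i in range(100)]
for frac in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7):   # reductions used in the study
    X_red, y_red = reduce_training_data(X, y, frac)
    # ...train on (X_red, y_red), then on the CF-augmented version,
    # and record the difference in F1 score.
```

After the loop, the harshest setting (70% removed) leaves only 30 of the 100 samples, which is the regime where the reported ~20% F1 recovery from CF augmentation matters most.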

LLM generates valid and plausible clinical counterfactuals

The fine-tuned LLMs, especially LLaMA-3.1-8B, produced CFs with validity of up to 99% and plausibility of 0.99 at the peak, with realistic and behaviorally modifiable feature adjustments. Specifically, in scenario A, LLaMA* fine-tuned with positive-class undersampling achieved remarkable increases of 21.00% in accuracy, 20.00% in precision, 24.56% in recall, 22.41% in F1 score, and 25.37% in AUC compared to the reduced dataset. These improvements demonstrate the power of CFEs to mitigate the performance degradation caused by imbalanced data. To help users interpret the generated CFs, the team measured sparsity as (Σ_{x* ∈ CF} Σ_{i=1}^{d} 1(x*_i = x_i)) / ‖CF‖ — the average number of unchanged features per generated counterfactual.
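A direct reading of the sparsity formula (with 1(·) as the indicator function) averages the count of unchanged features over all generated counterfactuals:

```python
# Sparsity metric as defined in the text: for each counterfactual, count the
# features left equal to the original, then average over all CFs. Higher is
# sparser (fewer features were touched).
def sparsity(originals, counterfactuals):
    total_unchanged = sum(
        sum(orig_i == cf_i for orig_i, cf_i in zip(x, cf))
        for x, cf in zip(originals, counterfactuals)
    )
    return total_unchanged / len(counterfactuals)

# Toy features: [glucose mg/dL, deep-sleep %, REM %].
originals       = [[210.8, 30.1, 15.4], [150.0, 40.0, 20.0]]
counterfactuals = [[180.0, 30.1, 15.4], [150.0, 38.0, 20.0]]
print(sparsity(originals, counterfactuals))   # → 2.0 (two of three features untouched per CF)
```

In this toy case each counterfactual edits exactly one of three features, which is the kind of compact, focused change the paper reports for the fine-tuned models.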

Results show that fine-tuned BioMistral-7B and LLaMA-3.1-8B significantly improve effectiveness, sparsity, and distance compared to their pre-trained counterparts, increasing effectiveness by 20–40 percentage points and reducing feature distance by more than 50%. A counterfactual intervention example shows how the LLM can suggest clinically meaningful modifications for a high-stress patient, identifying low deep sleep (30.1%), moderate REM sleep (15.4%), high blood glucose (210.8 mg/dL), and a low activity score (5.95) as key contributors to stress. The LLM suggested lowering blood glucose to 180 mg/dL and increasing deep sleep to 35% and REM sleep to 20%, reflecting a clinically viable strategy. Table III shows that LLaMA* achieved near-perfect validity with minimal, clinically realistic modifications, while traditional methods often proposed unrealistic feature shifts. Feature diversity analysis, visualized in radar plots, highlighted that the fine-tuned LLMs focused on highly actionable variables — factors that can be readily modified through lifestyle and treatment adjustments, such as average step count, blood glucose levels, and frequency of hyperglycemia.

LLM increases data robustness and improves generalizability through counterfactuals

This study demonstrated that the counterfactuals generated by LLMs exhibit semantic consistency and clinical validity, can enhance downstream robustness when used for data augmentation, and can recover an average of 20% in F1 score under conditions of severe label scarcity. Specifically, the fine-tuned LLaMA and BioMistral models produced compact and practical CFs that outperformed pre-trained models and were competitive with existing optimization-based methods. To the authors’ knowledge, this represents the first systematic investigation of LLM-based CFs applied to sensor-driven data in both zero-shot and few-shot settings, opening a promising avenue for integrating generative AI into reliable, intervention-centric healthcare machine learning pipelines. The authors acknowledge limitations, such as the possibility of unrealistic feature changes, and suggest that future research could incorporate clinical knowledge graphs and causal structures into the fine-tuning process. Further directions include extending the approach to multimodal data such as raw sensor traces and clinical records, as well as assessing the long-term impact of early intervention and CF-based guidance on patient outcomes.

👉 More information
🗞 Counterfactual Modeling with Fine-Tuned LLM for Health Intervention Design and Sensor Data Augmentation
🧠ArXiv: https://arxiv.org/abs/2601.14590


