Machine Learning-Driven Diabetes Care Using Predictive and Prescriptive Analysis for Personalized Drug Prescriptions

Data Sources and Variables

This study analyzed electronic health record (EHR) data from 17,773 patients with type 2 diabetes (T2D) treated at VA Medical Centers (VAMC) in the United States between March 2, 1998 and December 13, 2010^30,31. The data use complies with data sharing guidelines. The dataset did not contain any personally identifiable information and did not involve interactions with human subjects, so formal ethical approval was not required. The data include patient demographics, comorbidities, medications, and mortality; see previous research for more information^30,31.

The independent variables fell into two groups: features (demographics and comorbidities) and decision variables (antidiabetic drugs). Each decision variable is binary ("taking" or "not taking" the drug), which reflects a general clinical decision-making problem³². The selection of predictors was based on previous work and data availability. Mortality served as the outcome variable, so the prognosis-based clinical decision support (CDS) provides insight into outcome-based prescribing. Mortality data were obtained from the VA Health System and strengthened by cross-referencing with Social Security records³¹. Mortality-based assessment is consistent with the contemporary trend in evidence-based medicine of prioritizing long-term, patient-centered outcomes over surrogate endpoints³³. The key point is that, although effective glucose control has long been recognized as affecting clinical outcomes in diabetes, some drugs offer benefits through mechanisms other than glucose reduction, while others that lower glucose can cause harms that may affect mortality. For example, systematic reviews and meta-analyses have shown that metformin lowers mortality independently of its glucose-lowering effect and affects other diseases such as cancer³⁴. Previous studies have also shown that certain hypoglycemic agents (such as insulin) can reduce HbA1c but are associated with an increased risk of death, which highlights that improved surrogate measures do not always translate into improved patient survival^35,36.

Our model addresses this gap by identifying treatment pathways associated with a lower risk of death in patients with type 2 diabetes.

Methodology

Our research methods are summarized in Figure 1 and consist of three main stages: data preparation, predictive analysis, and normative analysis. In stage 1, we prepared the data. In stage 2, we developed a predictive Bayesian network (BN) model, using three resampling techniques (undersampling, oversampling, and hybrid) to handle the class imbalance in the data. In stage 3, after confirming the predictive performance of the model, we optimized treatment pathway recommendations using the BN's belief-updating capability and its Markov blanket property. Finally, we used decision tree algorithms to present, visualize, and communicate the optimal policies, using the metadata obtained from the optimal drug prescription results. This helps in implementing the optimal recommendations in clinical practice. All algorithms, calculations, analyses, and visuals were implemented using the R packages bnlearn³⁷, visNetwork³⁸, and Shiny³⁹.

Data Preprocessing

Data preparation consists of two steps: discretization and data resampling. Discretization of continuous variables is an important preprocessing step for BNs and has a significant impact on network performance⁴⁰. Because the complexity of a BN's joint probability distribution increases exponentially with the number of states⁴⁰, all drug and complication variables were converted to binary variables indicating whether the medication was taken and whether the complication was present, which also simplifies interpretation. The only variable with three states was age, classified as young, middle-aged, or old. We used a classification and regression tree (CART) approach to determine the optimal discretization thresholds⁴¹.
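
As an illustration of tree-based discretization, the following minimal sketch finds the single cut point on a continuous variable that minimizes the weighted Gini impurity of a binary outcome, which is the split criterion a CART classifier applies at each node. It is written in plain Python for illustration only (the study itself was implemented in R), the function names and the toy ages are hypothetical, and it finds one threshold, whereas a three-state variable such as age needs two.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Return the cut point that minimizes the weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    best_t, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cut point between identical values
        t = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t


# Invented toy data, not taken from the VAMC dataset.
ages = [40, 45, 50, 55, 70, 75, 80, 85]
died = [0, 0, 0, 0, 1, 1, 1, 1]
threshold = best_threshold(ages, died)
```

Applying the same search again within each resulting interval would yield further thresholds.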

One of the challenges in the VAMC dataset is the imbalanced distribution of the outcome classes (mortality vs. survival). As in many real-world datasets, especially in medicine, the minority class (death) accounted for a small share of the dataset (13%) compared with the majority class (survival). Under such imbalance, prediction models are often biased toward the majority class, even though the main goal is to classify the minority class accurately⁴². One approach to this challenge is to rebalance the data with resampling techniques during ML training. Resampling methods fall into three main categories: undersampling, oversampling, and hybrid methods. Undersampling removes instances of the majority class, whereas oversampling replicates instances of the minority class or generates synthetic ones, as in the synthetic minority oversampling technique (SMOTE)^42,43. Hybrid algorithms combine undersampling and oversampling steps, applied repeatedly⁴⁴.
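
The two basic resampling ideas can be sketched as follows. This is a plain-Python illustration with hypothetical function names and toy records, not the resampling code used in the study (which was implemented in R), and it covers only random undersampling and random oversampling by replication, not SMOTE or the hybrid methods.

```python
import random


def _group_by_class(rows, label_key):
    """Split records into lists keyed by their class label."""
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row)
    return groups


def random_undersample(rows, label_key, seed=0):
    """Drop majority-class records at random until all classes are equal in size."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, label_key)
    n_min = min(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, n_min))
    return balanced


def random_oversample(rows, label_key, seed=0):
    """Replicate minority-class records at random until all classes are equal in size."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, label_key)
    n_max = max(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        extra = [rng.choice(group) for _ in range(n_max - len(group))]
        balanced.extend(group + extra)
    return balanced


# Toy records: 13 deaths among 100 patients, mirroring the 13% minority share.
patients = [{"id": i, "died": i < 13} for i in range(100)]
under = random_undersample(patients, "died")
over = random_oversample(patients, "died")
```

Undersampling shrinks the toy data to 26 records and oversampling grows it to 174, which shows the trade-off between discarding information and duplicating it.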

Predictive analysis

We developed the BN model for three purposes: (1) to predict mortality in diabetic patients (prediction); (2) to characterize the aggregate relationships among comorbidities, diabetic drugs, demographic variables, and mortality (description); and (3) building on (1) and (2), to identify the optimal pathways (i.e., sequences and combinations) of diabetic drugs that minimize mortality in diabetic patients (prescription). BNs were selected deliberately for their dual capacity to support predictive and normative analyses simultaneously. In addition to quantifying risk, BNs can simulate the effects of alternative treatment strategies through Bayesian inference, which is essential for personalized medicine and scenario-based reasoning.

To maintain the causal structure of the BN, the network was learned from the data under three constraints: (i) survival was enforced as the end node, meaning that it could not lead to any other event; (ii) demographic variables were set as root nodes, meaning that no other event could lead to them; and (iii) drugs and complications were set as intermediate nodes, meaning that they could be both causes and effects of other variables. In line with these node classes, all arcs from intermediate to root classes, all arcs from the end class to root classes, and all arcs from the end class to intermediate classes were defined as prohibited (i.e., blacklisted).
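
The three families of prohibited arcs can be enumerated mechanically from the node classes. The sketch below does this in plain Python for illustration (the study itself was implemented in R); the function name and the variable names are hypothetical placeholders, not the study's actual variable list.

```python
from itertools import product


def build_blacklist(root, intermediate, end):
    """Return the (from, to) arcs forbidden by the root/intermediate/end ordering."""
    forbidden = []
    forbidden.extend(product(intermediate, root))  # intermediate -> root
    forbidden.extend(product(end, root))           # end -> root
    forbidden.extend(product(end, intermediate))   # end -> intermediate
    return forbidden


# Hypothetical node names, chosen only to show the shape of the output.
root_nodes = ["age", "sex"]
intermediate_nodes = ["drug_A", "drug_B", "complication_X"]
end_nodes = ["mortality"]

blacklist = build_blacklist(root_nodes, intermediate_nodes, end_nodes)
```

Each pair is a directed arc that structure learning must not add, so ("mortality", "age") is excluded while ("age", "drug_A") remains allowed.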

In this study, we used seven BN structure learning algorithms to build the network: three constraint-based algorithms, Grow-Shrink (GS), Incremental Association (IAMB), and Interleaved Incremental Association (Inter-IAMB); two score-based algorithms, Hill Climbing (HC) and Tabu search; and two hybrid algorithms, Max-Min Hill Climbing (MMHC) and Restricted Maximization (RSMAX2). The Bayesian Information Criterion (BIC) was used as the score for the score-based algorithms⁴⁵. For the hybrid algorithms, we used the Max-Min Parents and Children (MMPC) heuristic to restrict the search space, and hill climbing to maximize the score and identify the best network. Because constraint-based algorithms (e.g., GS) returned partially directed acyclic graphs, we applied bootstrapping to identify the directions of the undirected arcs, subject to the enforced blacklist (available in the appendix). Once the network structure was learned, the parameters (i.e., probabilities) were estimated by maximum likelihood estimation. Finally, the network's joint probability distribution was estimated from the data⁴⁵.
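
For a discrete node, the maximum likelihood estimate of its conditional probability table is the relative frequency of each child state within each parent configuration. The sketch below shows that calculation in plain Python for illustration only (the study itself was implemented in R); the function name, the variable names, and the records are invented and are not results from the study.

```python
from collections import Counter


def mle_cpt(rows, child, parents):
    """Estimate P(child | parents) by relative frequency, the MLE for a discrete node."""
    joint = Counter()
    parent_totals = Counter()
    for row in rows:
        config = tuple(row[p] for p in parents)
        joint[(config, row[child])] += 1
        parent_totals[config] += 1
    return {key: n / parent_totals[key[0]] for key, n in joint.items()}


# Invented records: 4 patients on hypothetical drug_A, 4 not on it.
rows = (
    [{"drug_A": 1, "died": 1}] * 2
    + [{"drug_A": 1, "died": 0}] * 2
    + [{"drug_A": 0, "died": 1}] * 1
    + [{"drug_A": 0, "died": 0}] * 3
)
cpt = mle_cpt(rows, child="died", parents=["drug_A"])
```

The keys of `cpt` pair a parent configuration with a child state, so `cpt[((1,), 1)]` is the estimated probability of death given that drug_A is taken.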

As mentioned previously, several resampling techniques were used in the predictive analysis stage. In addition, 5-fold cross-validation was carried out to guard against overfitting. This involved learning the structure and parameters of each network five times, with each iteration holding out one partition as test data and using the remaining four partitions as training data. Predictive performance was assessed with several metrics: recall, precision, area under the curve (AUC), F1 score, and accuracy. Recall measures the ability of the model to correctly identify positive instances among all actual positive instances. Precision measures the share of correctly identified positive instances among all predicted positive instances, and so reflects how well the model avoids false positives. AUC measures the ability of the model to distinguish between positive and negative instances and summarizes performance across classification thresholds. The F1 score combines precision and recall into a single balanced measure. Finally, accuracy measures the overall performance of the model in predicting both positive and negative instances. After selecting the optimal resampling technique by cross-validation, the final model structure and its parameters were trained using the original VAMC data.
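
The threshold-based metrics and the fold structure can be made concrete with a short sketch. It is plain Python for illustration only (the study itself was implemented in R), the function names and the toy labels are hypothetical, and AUC is omitted because it requires predicted probabilities instead of hard class labels.

```python
def classification_metrics(y_true, y_pred):
    """Compute recall, precision, F1, and accuracy from 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"recall": recall, "precision": precision, "f1": f1, "accuracy": accuracy}


def k_fold_indices(n, k=5):
    """Yield (train, test) index lists; each fold serves as the test set exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


# Invented labels: 3 actual positives, of which the model finds 2, plus 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
splits = list(k_fold_indices(len(y_true), k=5))
```

In this toy case accuracy is 0.8 while recall is only two thirds, which illustrates why accuracy alone is a poor guide when the positive class is rare.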

Normative analysis

The normative (prescriptive) analysis in this study uses the BN and its d-separation property to optimize medication decisions for T2D patients by minimizing the risk of death³⁰. The algorithm rests on identifying the smallest subset of variables that affects the relationship between the decision variables (drugs) and the outcome (death), which increases computational efficiency. Patient demographics and diagnoses that lead to drug decisions are treated as uncontrollable variables within the Markov blanket. Conversely, drug-induced complications are treated as varying characteristics and do not need to be matched. The algorithm aims to minimize the risk of death by determining the optimal drug combination for each patient type, assessing the various combinations of decision variables and patient characteristics. It does so through a structured procedure that iterates over medication combinations and patient types, calculates the risk of death for each, and identifies the minimum risk and the corresponding combination for each scenario.
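
The iteration over patient types and medication combinations can be sketched as an exhaustive search. This is a plain-Python illustration under strong assumptions, not the study's algorithm (which was implemented in R): `toy_mortality_risk` is a hypothetical stand-in for a BN query of the form P(death | patient type, regimen), the drug names are placeholders, and every number is invented.

```python
from itertools import product

DRUGS = ["drug_A", "drug_B", "drug_C"]  # hypothetical placeholders


def toy_mortality_risk(patient_type, regimen):
    """Stand-in for a BN inference query; all values are invented for illustration."""
    base = {"young": 0.05, "middle-aged": 0.10, "old": 0.20}[patient_type]
    effect = {"drug_A": -0.03, "drug_B": 0.04, "drug_C": 0.01}
    return base + sum(effect[d] for d, taken in zip(DRUGS, regimen) if taken)


def best_regimen_per_type(patient_types, risk_fn):
    """For each patient type, return the take/not-take vector with the lowest risk."""
    best = {}
    for ptype in patient_types:
        candidates = product([0, 1], repeat=len(DRUGS))
        regimen = min(candidates, key=lambda reg: risk_fn(ptype, reg))
        best[ptype] = (regimen, risk_fn(ptype, regimen))
    return best


best = best_regimen_per_type(["young", "middle-aged", "old"], toy_mortality_risk)
```

Each regimen is a tuple of 0/1 flags aligned with `DRUGS`, so the search space doubles with every added drug, which is why restricting the query to the relevant variables matters for efficiency.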

We examined three data-driven optimization strategies to determine the most effective treatment plan for managing T2D: forward optimization, backward optimization, and optimized ADA. The forward optimization strategy starts with a single drug and adds treatments step by step based on their effectiveness in reducing mortality, which mimics the way humans optimize their decisions. Backward optimization takes a comprehensive initial approach: it assesses all potential treatment combinations, selects the most effective one, and then refines the selection by moving backward to the most complex treatment options. This mirrors how mathematical optimization is performed. Finally, the optimized ADA strategy begins with metformin, as recommended by the American Diabetes Association (ADA), and advances through the other recommended drug classes based on their ability to minimize mortality, systematically selecting medications that personalize the guidelines and enhance their effectiveness. More technical information on the optimization strategies and BN inference can be found in Appendix B.
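
A greedy reading of the forward strategy is sketched below: start from the best single drug, then keep adding whichever remaining drug lowers the estimated risk most, and stop when no addition helps. This is an assumption about how the stepwise procedure could work, written in plain Python for illustration (the study itself was implemented in R, and the actual procedure is described in Appendix B). The risk function, the drug names, and all numbers are invented.

```python
def toy_risk(regimen):
    """Invented risk model: additive drug effects plus a penalty per extra drug."""
    effect = {"drug_A": -0.03, "drug_B": -0.02, "drug_C": 0.01}
    penalty = 0.01 * max(0, len(regimen) - 1)
    return 0.15 + sum(effect[d] for d in regimen) + penalty


def forward_optimize(drugs, risk_fn):
    """Greedy forward selection: add the most helpful drug until risk stops falling."""
    chosen = [min(drugs, key=lambda d: risk_fn([d]))]  # best single drug
    best_risk = risk_fn(chosen)
    while True:
        remaining = [d for d in drugs if d not in chosen]
        if not remaining:
            break
        candidate = min(remaining, key=lambda d: risk_fn(chosen + [d]))
        candidate_risk = risk_fn(chosen + [candidate])
        if candidate_risk >= best_risk:
            break  # no remaining drug lowers the risk further
        chosen.append(candidate)
        best_risk = candidate_risk
    return chosen, best_risk


pathway, risk = forward_optimize(["drug_A", "drug_B", "drug_C"], toy_risk)
```

The order of `pathway` records the sequence in which drugs were added, which is what distinguishes a treatment pathway from an unordered combination.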

Current Treatment Policies and Guidelines for Prescribing Antidiabetic Drugs

Based on the 2017 guidelines of the ADA and the European Association for the Study of Diabetes (EASD) (summarized in Figure S1, Appendix C), antidiabetic treatment usually starts with metformin monotherapy and proceeds to combination therapy as needed to control glucose levels. If the HbA1c target is not met, injectable therapy may be added. To determine the best treatment plan for each stage of care, multiple drug therapy combinations (e.g., metformin, glyburide, glipizide) were analyzed along with the ADA/EASD recommendations, leading to 128 treatment plans: 7 single-drug therapies, 21 two-drug therapies, 35 three-drug therapies, 35 four-drug therapies, 21 five-drug therapies, 7 six-drug therapies, and 1 seven-drug therapy.
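
With seven candidate drugs, the number of k-drug combinations is the binomial coefficient C(7, k), which the short Python check below computes. The non-empty combinations total 127, so the figure of 128 matches 2^7 only if the drug-free option is counted as a plan; that reading is an assumption, not something stated in the text.

```python
from math import comb

N_DRUGS = 7  # implied by the seven single-drug therapies

# Number of distinct regimens containing exactly k drugs, for k = 0..7.
counts = {k: comb(N_DRUGS, k) for k in range(N_DRUGS + 1)}

non_empty_total = sum(n for k, n in counts.items() if k > 0)
total_with_no_drug = sum(counts.values())
```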
