Application of causal forest double machine learning (DML) approach to assess tuberculosis preventive therapy’s impact on ART adherence

Data source and study population

This study used data from the electronic medical record system (EMR-ART) of the University of Gondar Comprehensive and Specialized Hospital, Ethiopia. We analyzed routinely collected secondary data on PLHIV who initiated ART at the hospital between March 2005 and December 2024. The source population comprised all HIV-positive patients who attended the hospital’s ART clinic during this period.

Inclusion criteria were: age ≥ 15 years at ART initiation, a confirmed HIV-positive diagnosis, and complete baseline clinical and laboratory information at the start of treatment. Patients were excluded if they had missing baseline data, were referred from another facility with an incomplete clinical record, or were lost to follow-up within the first month after starting ART. After applying these criteria, a total of 4152 patients were included in the final analytic cohort. Follow-up continued until the date of death, loss to follow-up, or administrative study closure.

Variable definitions

For this causal analysis, the variables were assigned to three basic roles: treatment, outcome, and covariates. These were selected based on clinical significance, theoretical justification, and availability in the electronic medical record system.

Treatment variable

TPT Started: The treatment variable of interest was TPT initiation, indicating whether a patient had been prescribed TPT as part of their HIV care. It was coded as 1 = Yes (TPT initiated) and 0 = No (TPT not initiated). TPT initiation was determined from physician orders recorded in the ART clinic database and reflects the patient’s exposure to preventive TB treatment during the study period.

Outcome variable

ART Adherence: The outcome variable of the present study was adherence to antiretroviral therapy. It was operationalized as a binary variable derived from adherence data routinely collected during clinical follow-up visits: 1 = Good adherence (≥ 95% of doses taken or a clinician’s “good” rating), and 0 = Poor adherence (a fair or poor rating, or missed medication). This variable measures the patient’s ability to sustain adherence to the prescribed ART regimen over time and is a proxy for treatment success and the likelihood of long-term viral suppression.
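The coding rule for the outcome can be illustrated with a minimal sketch. The function name, argument names, and the handling of records with neither a dose percentage nor a rating are assumptions made for illustration and are not taken from the study’s code.

```python
from typing import Optional

GOOD_THRESHOLD = 95.0  # percent of doses taken, per the definition above


def code_adherence(pct_doses_taken: Optional[float],
                   clinician_rating: Optional[str]) -> int:
    """Return 1 for good adherence and 0 for poor adherence.

    Good = at least 95% of doses taken OR a clinician rating of "good".
    Any other recorded value (fair, poor, missed medication) is coded 0.
    """
    if pct_doses_taken is None and clinician_rating is None:
        # Assumption: records with no adherence information are not coded.
        raise ValueError("no adherence information recorded")
    if pct_doses_taken is not None and pct_doses_taken >= GOOD_THRESHOLD:
        return 1
    if clinician_rating is not None and clinician_rating.strip().lower() == "good":
        return 1
    return 0
```

For example, `code_adherence(80.0, "Good")` yields 1 because the clinician rating alone is sufficient, whereas `code_adherence(80.0, "fair")` yields 0.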

Covariates (potential confounders)

The following covariates were included in the analysis as potential confounders that may influence both TPT initiation and ART adherence. These variables were selected based on clinical practice and the results of prior research on HIV treatment processes:

Demographic characteristics

Age: Continuous variable in years at ART initiation.
Sex: Male or Female.
Marital status: Categorical variable (Single, Married, Divorced, Widowed).
Education level: Ordinal variable for the highest level of education completed.
Residence: Urban or Rural.
Religion: Self-reported religious affiliation.

Clinical characteristics

WHO clinical stage: Categorical variable (Stages I–IV) based on WHO guidelines at baseline.
Duration on ART: Measured in months from ART start to final follow-up.
Functional Status: Working, Ambulatory, or Bedridden, indicating patient capacity at baseline.
BMI (body mass index): Calculated from documented weight and height measurements.

Immunological and laboratory data

Baseline CD4 Count: Measured in cells/µL at ART start.
Recent CD4 Count: Most recent CD4 result on follow-up.

Treatment-related variables

Regimen line: Categorical variable for whether the patient was on a first-line or second-line ART regimen.
CPT use: Binary indicator for whether Cotrimoxazole Preventive Therapy was initiated (1 = Yes, 0 = No).

These covariates were included to adjust for potential selection bias in TPT initiation and to support unbiased estimation of treatment effects using causal machine learning methods. Because the covariates span sociodemographic, clinical, immunological, and treatment-related domains, they provide broad control for observed confounding when estimating the causal effect of TPT on ART adherence.

Analytical approach

To estimate the causal effect of tuberculosis preventive therapy (TPT) initiation on patient adherence, we implemented and compared three causal inference models: Adjusted Logistic Regression, Propensity Score Matching, and Causal Forest Double Machine Learning (DML). Each model was applied to the same dataset, and the Average Treatment Effect (ATE) was estimated along with its corresponding 95% confidence interval (CI). The comparison focused on the precision and reliability of the estimated effects, as reflected in the width of the confidence intervals. This multi-model approach allowed us to assess the robustness of the findings and identify the most suitable method for both average and potentially heterogeneous treatment effect estimation.

The first step in any causal inference analysis is the explicit definition of the hypothesized causal relationships between variables. This is generally defined in terms of a causal graph, or Directed Acyclic Graph (DAG), that visually defines the relationship between treatment, outcome, and confounders⁴⁴. In this work, a causal graph was constructed to guide the selection of covariates to control for in the estimation of the effect of TPT on adherence to ART (Fig. 1).

A causal graph serves several purposes. First, it provides a systematic way to integrate domain knowledge, primarily from clinicians, into the modeling process. By identifying which variables are likely to causally affect both treatment assignment and outcome, we can determine the minimum sufficient adjustment set required to block backdoor paths and obtain an unbiased estimate of the treatment effect. This is especially crucial in observational studies, where treatment is not randomly assigned.

Second, the graph helps make assumptions explicit and avoid common pitfalls such as conditioning on colliders, controlling for mediators, or omitting important confounders. The graph used in this study was developed in collaboration with clinical domain experts and represents well-documented relationships in HIV care, such as the influence of WHO clinical stage, CD4 count, and duration of ART on the likelihood of receiving TPT and of adhering to medication.

Our graph included demographic, clinical, immunological, and treatment-related variables as confounders. Arrows were drawn from the covariates to both treatment (initiation of TPT) and outcome (ART adherence), indicating their assumed causal effect. An arrow was also drawn from TPT to ART adherence, representing the causal effect that we wished to estimate. This graph provided a transparent and interpretable framework to justify our variable selection for causal modeling with the DML method.
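The structure described above can be encoded and checked programmatically. The following is a minimal sketch in plain Python; the node names are hypothetical shorthand for a subset of the covariates, and the edge list reproduces only the arrows described in the text, not the full graph in Fig. 1.

```python
from collections import deque

# Hypothetical node names standing in for a subset of the study covariates.
COVARIATES = ["age", "sex", "who_stage", "baseline_cd4", "art_duration", "cpt_use"]

# Arrows: every covariate -> treatment, every covariate -> outcome, treatment -> outcome.
EDGES = [(c, "tpt") for c in COVARIATES] + [(c, "adherence") for c in COVARIATES]
EDGES.append(("tpt", "adherence"))


def is_acyclic(edges):
    """Kahn's algorithm: True if every node can be placed in a topological order."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return visited == len(nodes)


def common_causes(edges, treatment, outcome):
    """Nodes with a direct arrow into both the treatment and the outcome."""
    parents_t = {src for src, dst in edges if dst == treatment}
    parents_y = {src for src, dst in edges if dst == outcome}
    return sorted(parents_t & parents_y)
```

In a graph this simple, the common causes coincide with the adjustment set. For graphs with mediators or colliders, a full backdoor-criterion check would be needed instead of this shortcut.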

Upon confirmation of the covariate adjustment set, we performed thorough data preprocessing. This included label encoding of categorical features and missing value processing by appropriate imputation or exclusion.
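The two preprocessing operations mentioned above can be sketched as follows. Median imputation is an assumption chosen for illustration; the text does not specify which imputation method was applied, and the helper names are hypothetical.

```python
from statistics import median


def label_encode(values):
    """Map each distinct category to an integer code, in sorted order.

    Missing entries (None) are left as None so they can be handled separately.
    """
    categories = sorted({v for v in values if v is not None})
    mapping = {category: code for code, category in enumerate(categories)}
    codes = [mapping[v] if v is not None else None for v in values]
    return codes, mapping


def impute_median(values):
    """Replace missing numeric entries (None) with the median of observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]
```

Note that label encoding imposes an arbitrary ordering on nominal categories such as residence or religion. Tree-based learners tolerate this reasonably well, but one-hot encoding is a common alternative.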

This method relies on several key assumptions, including unconfoundedness, positivity, and the Stable Unit Treatment Value Assumption (SUTVA)⁴⁵. We adjusted for a comprehensive set of demographic and clinical covariates to make the ignorability condition plausible, though unmeasured confounding cannot be entirely ruled out. Positivity was assessed by examining the distribution of estimated treatment probabilities. The treatment (TPT initiation) was consistently coded across patients, and we assume no interference between individuals (SUTVA) (Table 1). Finally, the DML framework’s use of flexible non-parametric nuisance models together with orthogonalization reduces the sensitivity of the effect estimate to misspecification of those nuisance models.
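One way to examine the distribution of treatment probabilities is to summarize how many estimated propensity scores fall near 0 or 1 and to compare their range between arms. The sketch below assumes NumPy, uses simulated scores, and takes 0.05 and 0.95 as illustrative cut-offs; none of these choices are taken from the study.

```python
import numpy as np


def overlap_summary(propensity, treated, lower=0.05, upper=0.95):
    """Summarize overlap of propensity scores between treated and control units."""
    propensity = np.asarray(propensity, dtype=float)
    treated = np.asarray(treated).astype(bool)
    outside = (propensity < lower) | (propensity > upper)
    return {
        # Share of units whose score is close to 0 or 1 (possible positivity problem).
        "share_outside": float(outside.mean()),
        "treated_range": (float(propensity[treated].min()),
                          float(propensity[treated].max())),
        "control_range": (float(propensity[~treated].min()),
                          float(propensity[~treated].max())),
    }


# Simulated example: scores generated from a single covariate.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
simulated_propensity = 1.0 / (1.0 + np.exp(-0.8 * x))
simulated_treated = rng.random(2000) < simulated_propensity
summary = overlap_summary(simulated_propensity, simulated_treated)
```

A large `share_outside`, or treated and control ranges that barely intersect, would indicate regions of the covariate space where the treatment effect is not well identified.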

Table 1 Causal assumptions checklist for causal forest DML

Lastly, we split the dataset into training and test subsets in an 80/20 ratio so that effects were estimated on data not used to fit the nuisance models (honest estimation), in line with the sample-splitting requirement of double machine learning.

The Causal Forest DML model used two base learners: a Random Forest Regressor to model the outcome as a function of the covariates, and a Random Forest Classifier to model treatment assignment as a function of the covariates. These were trained separately, and both the treatment and the outcome were then residualized against the covariates (orthogonalization) to satisfy the Neyman orthogonality condition. This step reduces the sensitivity of the final treatment effect estimator to errors in the nuisance models.
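The residualization step can be sketched with scikit-learn on simulated data. This sketch makes several assumptions: it uses 5-fold cross-fitting via `cross_val_predict` in place of the single 80/20 split, and the data-generating process, sample size, and forest settings are all invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Simulated stand-in data: 4 covariates, binary treatment, binary outcome.
rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 4))
propensity = 1.0 / (1.0 + np.exp(-X[:, 0]))
T = (rng.random(n) < propensity).astype(int)
p_outcome = np.clip(0.6 + 0.1 * T + 0.1 * X[:, 1], 0.05, 0.95)
Y = (rng.random(n) < p_outcome).astype(int)

# Nuisance learners: regressor for E[Y | X], classifier for P(T = 1 | X).
outcome_model = RandomForestRegressor(
    n_estimators=100, min_samples_leaf=20, random_state=0)
treatment_model = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=20, random_state=0)

# Out-of-fold predictions: each unit is predicted by models that never saw it.
m_hat = cross_val_predict(outcome_model, X, Y, cv=5)
e_hat = cross_val_predict(treatment_model, X, T, cv=5, method="predict_proba")[:, 1]

# Residuals carry the variation in Y and T not explained by the covariates.
Y_res = Y - m_hat
T_res = T - e_hat
```

Using out-of-fold predictions avoids overfitting bias in the residuals, which is the purpose of sample splitting in double machine learning.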

To ensure robust model performance and prevent overfitting, hyperparameters for the Random Forest base learners were optimized using cross-validation on the training set. Key parameters such as the number of trees, maximum tree depth, and minimum samples per leaf were selected based on minimizing out-of-sample prediction error.
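The hyperparameter search described above can be illustrated with scikit-learn’s `GridSearchCV`. The grid values, the 3-fold setting, and the simulated training data are assumptions for this sketch; the actual search space used in the study is not reported in this section.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Simulated training data standing in for the 80% training subset.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 4))
y_train = (rng.random(300) < 0.7).astype(float)

# Illustrative grid over the three parameters named in the text.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "min_samples_leaf": [5, 20],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # lower out-of-sample MSE is better
    cv=3,
)
search.fit(X_train, y_train)
best_params = search.best_params_
```

The same pattern applies to the treatment model, with a `RandomForestClassifier` and a classification score such as `neg_log_loss`.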

For enhanced interpretability, we supplemented feature importance analysis with SHAP value estimation, allowing us to visualize and quantify the marginal effect of each covariate on the individualized treatment effect predictions. This approach provided a more nuanced understanding of how specific patient characteristics modulate the impact of TPT on ART adherence.

After fitting the models, we estimated the ATE, which summarizes the overall causal effect of TPT on ART adherence in the population, and the CATEs, which capture individualized treatment effects across patient profiles. Overall, this causal estimation pipeline combines domain knowledge, causal inference, and modern machine learning to derive interpretable and tailored causal estimates from real-world clinical data. With these methods, we aimed to generate robust and actionable evidence on the effect of TPT on ART adherence among PLHIV.

Mathematical modeling of the causal forest DML

To formally describe the estimation framework used in this study, we present the mathematical formulation of the DML approach. The objective is to estimate the causal effect of TPT on adherence to ART, adjusting for observed covariates using modern causal inference techniques.

Let Y ∈ {0,1} denote the binary outcome variable representing ART adherence, where Y = 1 indicates good adherence and Y = 0 indicates poor adherence. The treatment variable is denoted by T ∈ {0,1}, where T = 1 corresponds to patients who initiated TPT and T = 0 corresponds to those who did not. Let X ∈ ℝ^p be a vector of p observed pre-treatment covariates, including demographic, clinical, immunological, and treatment-related characteristics.

The parameter of interest is the CATE, defined as:

$$\tau(X) = \mathbb{E}[Y(1) - Y(0) \mid X]$$

where Y(1) and Y(0) are the potential outcomes under treatment and control, respectively. This function captures heterogeneity in the treatment effect across different patient profiles.

To estimate τ(X), we employ the Causal Forest DML algorithm, which proceeds through the following orthogonalization steps:

1. Outcome model: Estimate the expected outcome given covariates:

$$\hat{m}(X) = \mathbb{E}[Y \mid X]$$

2. Treatment model (propensity score): Estimate the probability of treatment assignment given covariates:

$$\hat{e}(X) = \mathbb{P}(T = 1 \mid X)$$

3. Residualization: Compute residuals for both outcome and treatment to remove the influence of confounders:

$$\tilde{Y} = Y - \hat{m}(X),\quad \tilde{T} = T - \hat{e}(X)$$

4. Treatment effect estimation: Fit a non-parametric model (e.g., a random forest) for τ(X) by regressing $\tilde{Y}$ on $\tilde{T}$ with a coefficient that varies with X, so that:

$$\mathbb{E}[\tilde{Y} \mid \tilde{T}, X] = \tau(X)\,\tilde{T}$$

The final output is an estimate of τ(X) for each individual in the dataset, allowing for both population-level and subgroup-level causal inference.

In addition to estimating the conditional treatment effects, we compute the Average Treatment Effect (ATE) by averaging τ(X) over all individuals:

$$\text{ATE} = \mathbb{E}[\tau(X)]$$
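The four steps and the averaging over individuals can be traced in a small worked sketch using NumPy only. This is not a causal forest: stratum means over a single discrete covariate stand in for the random forest nuisance models, the final stage is a per-stratum least-squares slope, and no cross-fitting is used. The simulated data and the true effect of 0.10 are assumptions chosen so that the result can be checked.

```python
import numpy as np

# Simulated data with a known constant treatment effect.
rng = np.random.default_rng(1)
n = 20000
X = rng.integers(0, 4, size=n)               # one discrete covariate, strata 0..3
e_true = 0.3 + 0.1 * X                       # P(T = 1 | X), between 0.3 and 0.6
T = (rng.random(n) < e_true).astype(float)
TRUE_EFFECT = 0.10
p_outcome = 0.5 + TRUE_EFFECT * T + 0.05 * X  # P(Y = 1 | T, X), between 0.5 and 0.75
Y = (rng.random(n) < p_outcome).astype(float)

# Steps 1 and 2: nuisance estimates m_hat(X) and e_hat(X) as stratum means.
m_hat = np.zeros(n)
e_hat = np.zeros(n)
for x in range(4):
    mask = X == x
    m_hat[mask] = Y[mask].mean()
    e_hat[mask] = T[mask].mean()

# Step 3: residualize outcome and treatment.
Y_res = Y - m_hat
T_res = T - e_hat

# Step 4: tau_hat(x) minimizes sum((Y_res - tau * T_res)^2) within each stratum.
tau_hat = np.zeros(n)
for x in range(4):
    mask = X == x
    tau_hat[mask] = (T_res[mask] @ Y_res[mask]) / (T_res[mask] @ T_res[mask])

# ATE: average of the individual-level estimates tau_hat(X_i).
ate_hat = float(tau_hat.mean())
```

Within each stratum, the residual-on-residual slope equals the difference in mean outcome between treated and untreated patients, which makes the link between the orthogonalized regression and a stratified comparison explicit.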

We also examine feature importance in treatment effect estimation to understand which covariates contribute most to explaining heterogeneity in treatment effects. This is assessed through permutation-based importance scores derived from the causal forest model and visualized using SHAP-style and waterfall plots.
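The idea behind permutation-based importance for effect heterogeneity can be sketched generically: shuffle one covariate at a time and measure how much the predicted treatment effects change. The function below is a hypothetical NumPy helper, not the scoring routine of any particular causal forest library, and `toy_cate` is an invented effect function used only to exercise it.

```python
import numpy as np


def permutation_importance(predict, X, reference, n_repeats=10, seed=0):
    """Mean increase in squared error against `reference` when a column is shuffled."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    baseline = np.mean((predict(X) - reference) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((predict(X_perm) - reference) ** 2) - baseline)
        scores[j] = np.mean(increases)
    return scores


def toy_cate(X):
    """Hypothetical effect function: the effect depends on column 0 only."""
    return 0.05 + 0.10 * (X[:, 0] > 0)


rng = np.random.default_rng(3)
X_demo = rng.normal(size=(500, 3))
scores = permutation_importance(toy_cate, X_demo, toy_cate(X_demo))
```

Here only the first covariate receives a positive score, since shuffling the other two leaves the predicted effects unchanged.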

This mathematical framework is designed to limit bias in the estimated causal effect of TPT on ART adherence, leveraging the strength of machine learning in modeling complex relationships while remaining within the causal assumptions stated above.
