A moment kernel machine for clinical data mining to inform medical decision making

In this section, we first illustrate how clinical decision making can be assisted through machine learning algorithms to give informed predictions based on the patient’s clinical data. In particular, we formulate this task as a classification problem. Next, as the core of this section, we develop a novel moment kernel to extract features from the clinical data for the classification problem. Our method uniquely integrates the HSIC Lasso algorithm to select informative features, thereby improving computational efficiency without sacrificing classification performance. To demonstrate the applicability of the developed medical decision making machine, we also present case studies using both synthetic data and real clinical data.

Event prediction as classification problems

Making decisions regarding a major medical intervention for a patient, e.g., deciding whether the patient should have surgery, generally requires (1) collecting sufficient data on possible post-intervention outcomes; (2) monitoring the current health condition and reviewing the medical history of the patient; and (3) assessing the significant factors influencing the possible outcomes based on the patient’s current health condition and medical history. This pipeline closely follows the formulation of a classification task in machine learning, where each class represents one possible post-intervention outcome. Then, the classification outcome obtained by the ML procedure using the patient’s clinical data can inform the medical decision.

In ML, features play a critical role in its performance. To ably assist with the medical decision making process, it is crucial to extract features from clinical data which reflect the patients’ medical conditions as well as ones that affect post-intervention outcomes, which is the main focus of the following sections.

Moment kernel for clinical data

When working with datasets comprising both numerical and categorical values, we apply one-hot encoding²⁰, a technique that converts categorical predictors to non-negative binary values. Each category is given a unique numerical representation as a feature vector consisting of entries of 0 and 1. We then pre-normalize each feature (predictor) within the training and testing datasets to ensure that all of them are in the interval [0, 1] before further normalizing the predictor vector of each patient to a probability distribution. Let ${\textbf{x}}_i=(x_{i1},\dots ,x_{iM})$ denote the predictor vector of the $i^{\textrm{th}}$ patient for $i=1,\dots ,N$, where N is the total number of patients and M is the number of predictors for each patient, then we normalize each ${\textbf{x}}_j$ as

$$\begin{aligned} p_{ij}=\frac{x_{ij}}{\sum _{j=1}^Mx_{ij}}, \end{aligned}$$

(2)

which yields a vector ${\textbf{p}}_i=(p_{i1},\dots ,p_{iM})$ satisfying $\sum _{j=1}^Mp_{ij}=1$. In addition, every component $p_{ij}$ of ${\textbf{p}}_i$ takes values in the same interval [0, 1], and this resolves any heterogeneity in the data, that is, different components of the predictor vector ${\textbf{x}}_i$ are drawn from different ranges. The property $\sum _{j=1}^Mp_{ij}=1$ then reveals that ${\textbf{p}}_i$ is a probability vector, and hence each ${\textbf{p}}_i$ represents the probability distribution of some random variable ${\textbf{A}}$. In particular, if ${\textbf{A}}$ takes values on the set $\Omega =\{\alpha _1,\dots ,\alpha _M\}$ containing M distinct elements, then the probability of the event $\{{\textbf{A}}=\alpha _j\}$ is given by ${\mathbb {P}}({\textbf{A}}=\alpha _j)=p_{ij}$ for each $j=1,\dots ,M$.

By the Hausdorff moment problem²¹, the probability distribution ${\textbf{p}}_i$ of ${\textbf{A}}$ is uniquely determined by the moment vector ${\textbf{m}}_i=(m_{i0},\dots ,m_{i,M-1})\in {\mathbb {R}}^M$, whose $k^{\textrm{th}}$-component is given by

$$\begin{aligned} m_{ik}={\mathbb {E}}({\textbf{A}}^k)=\sum _{j=1}^M\alpha _{j}^kp_{ij}, \end{aligned}$$

(3)

and referred to as the $k^{\textrm{th}}$–moment of the random variable ${\textbf{A}}$ with respect to the probability distribution ${\textbf{p}}_i$. Computationally, the moment vector ${\textbf{m}}_i$ can be easily obtained by ${\textbf{m}}_i={\textbf{p}}_i{\mathbb {M}}$, where

$$\begin{aligned} {\mathbb {M}}=\left[ \begin{array}{ccccc} 1 &{} \alpha _{1} &{} \alpha _{1}^2 &{} \cdots &{} \alpha _{1}^{M-1} \\ 1 &{} \alpha _{2} &{} \alpha _{2}^2 &{} \cdots &{} \alpha _{2}^{M-1} \\ 1 &{} \alpha _{3} &{} \alpha _{3}^2 &{} \cdots &{} \alpha _{3}^{M-1} \\ \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots \\ 1 &{} \alpha _{M} &{} \alpha _{M}^2 &{} \cdots &{} \alpha _{M}^{M-1} \end{array}\right] \in {\mathbb {R}}^{M\times M} \end{aligned}$$

(4)

is the $M\times M$ Vandermonde matrix generated by the vector $\alpha =(\alpha _1,\dots ,\alpha _M)$ consisting of the possible values of the random variable ${\textbf{A}}$. The assumption that $\alpha _i$ are distinct guarantees $\det ({\mathbb {M}})=\prod _{1\le k< l\le M}(\alpha _l-\alpha _k)\ne 0$, equivalently, the invertibility of ${\mathbb {M}}$. Therefore, as a map from ${\mathbb {R}}^M$ to ${\mathbb {R}}^M$ assigning a probability distribution to a moment vector, ${\mathbb {M}}$ is bijective, i.e., different probability distributions must associate with different moment vectors, which also verifies the aforementioned Hausdorff moment problem that ${\textbf{p}}_i$ is uniquely determined by ${\textbf{m}}_i$ and vice versa from the perspective of linear algebra. Together with the fact that ${\textbf{p}}_i$ is the normalization of the predictor vector ${\textbf{x}}_i$ containing the medical records of the $i^{\textrm{th}}$ patient, this observation firmly declares the candidacy of the moment vector ${\textbf{m}}_i$ as the feature for medical decision making tasks.

Moreover, because the normalization procedure illustrated in (2) endows each ${\textbf{p}}_i$ with the property $\sum _{j=1}^Mp_{ij}=1$, if $M-1$ components of ${\textbf{p}}_i$ are known to us, say the first $M-1$ components $p_{i1}$, $\dots$, $p_{i,M-1}$, then the remaining component can be explicitly calculated as $p_{iM}=1-\sum _{j=1}^{M-1}p_{ij}$, i.e., the freedom to determine an M-dimensional probability vector is of degree $M-1$. Consequently, the normalized clinical dataset $\{{\textbf{p}}_1,\dots ,{\textbf{p}}_N\}$ lies on a proper subspace of ${\mathbb {R}}^M$ of dimension at most $M-1$. Then, in practice, the use of moments up to some order $M'<M-1$ may be sufficient to make a strategic medical decision. In this case, the moment kernel as defined in (4) becomes an M-by-$M’$ matrix, which transforms high-dimensional predictor vectors to low-dimensional moment vectors while retaining all the information required for making a medical decision at a lower computational cost.

In addition, the moment kernel in (4) is independent of the data, and hence ${\textbf{A}}$ is a dummy random variable so that the sample space $\Omega$, containing the M outcomes of ${\textbf{A}}$, is completely free to choose. To inform strategic medical decisions, we seek a construction of $\Omega$ suitable for ranking the relative contribution of feature vectors, which we discuss in the next section.

HSIC lasso

Because each moment $m_{ik}$ in (3) is a weighted sum of the normalized predictors $p_{ij}$ with the weights $\alpha _j^k$, the choice of the sample space $\Omega =\{\alpha _1,\dots ,\alpha _M\}$ boils down to the determination of the weights. Naturally, predictors with larger weights acknowledge greater importance to the decision making process.

To this end, we formulate the task of searching for $\Omega$ as a feature importance ranking (FIR) problem²², to assign larger weights to more informative predictors. In particular, we will apply the Hilbert Schmidt Independence Criterion (HSIC) Lasso algorithm formulated as²³

$$\begin{aligned} \begin{aligned} \min _{\alpha \in {\mathbb {R}}^M} \quad&\frac{1}{2} \Vert \bar{{\textbf{L}}} – \sum \limits _{j=1}^{M} \alpha _j \bar{{\textbf{K}}}^{(j)} \Vert _{\text {Frob}}^2 + \lambda \Vert \alpha \Vert _1, \\ \text {s.t.} \quad&\alpha _1, \ldots , \alpha _M \ge 0, \end{aligned} \end{aligned}$$

(5)

where $\Vert \cdot \Vert _{\text {Frob}}$ is the Frobenius norm of matrices, i.e., $\Vert {\textbf{A}}\Vert _{\text {Frob}}=\sqrt{\sum _{i=1}^m\sum _{j=1}^nA_{ij}^2}$ for any ${\textbf{A}}\in {\mathbb {R}}^{m\times n}$ with the (i, j)-entry $A_{ij}$, and $\Vert \alpha \Vert _1=\sum _{i=1}^M|\alpha _i|$ is the $\ell ^1$-norm of the vector $\alpha =(\alpha _1,\dots ,\alpha _M)$, and $\lambda >0$ is a constant controlling the sparsity of the solution. Moreover, $\bar{{\textbf{K}}}^{(j)} = \Gamma {\textbf{K}}^{(j)} \Gamma$ and $\bar{{\textbf{L}}} = \Gamma {\textbf{L}} \Gamma$ are centered Gram matrices with the entries ${\textbf{K}}_{m,n}^{(j)} = k(p_{j,m},p_{j,n})$ and ${\textbf{L}}_{m,n} = l(y_m,y_n)$ defined by using some kernel functions k and l, where $y_i$ denotes the class label of the $i^{\textrm{th}}$ patient and $\Gamma = {\textbf{I}}_N – \frac{1}{N} {\textbf{1}}_N {\textbf{1}}^{\top }_N$ is the centering matrix. Moreover, for memory and computational efficiency, we use Block HSIC Lasso²⁴ in our experiments.

The pipeline of our feature extraction framework through moments is summarized in Fig. 1. The following case studies further show that the performance of the classification using the moment vectors ${\textbf{m}}_i$, generated from the moment kernel in (4), as features is comparable to those using other features with reduced computation time and increased robustness.

Case studies

Real-world data

We present three real-world datasets, including making informed decisions about breast cancer surgery, weight loss surgery, and liver transplant surgery by using pre-surgery patients’ clinical data. Specifically, we use the breast cancer dataset from the UCI Machine Learning Repository²⁵, which is a publicly available database with open access, for classification of breast cancer recurrence events. The Metabolic Bariatric Surgery Accreditation and Quality Improvement Program (MBSAQIP) dataset²⁶ is used for classification of the incidence of catastrophic events, including death, unplanned admission to ICU, and at least one re-operation within 30 days after weight loss surgery, and the Organ Procurement and Transplantation Network (OPTN) dataset²⁷ is used for classification of graft failure following a liver transplant surgery. Both MBSAQIP and OPTN datasets are publicly available and accessible upon request. The three datasets are briefly summarized in Fig. 2, and a more detailed summary of each dataset is included in the Supplemental Material.

All procedures included in this work were performed in accordance with the relevant regulations and guidelines, and all informed consents were obtained before the admission to respective medical institute took place. The breast cancer dataset was collected from the University Medical Center, Institute of Oncology, Ljubljana, Yugoslavia. The MBSAQIP dataset is a Health Insurance Portability and Accountability Act (HIPAA)-compliant data file containing cases submitted to the MBSAQIP Data Registry, which contains patient-level, aggregate data and does not identify hospitals, health care providers, or patients. The OPTN dataset is collected via an online Web application. Transplant professionals from hospitals, histocompatibility (tissue typing) laboratories, and organ procurement organizations located across the United States use the application to manage their list of waiting transplant candidates, access and complete electronic data collection forms, add donor information and run donor-recipient matching lists, access various transplant data reports and policies. No organs/tissues were procured from prisoners.

For each dataset, we use an 80% training and 20% testing split, and explore three types of feature engineering schemes ${\textbf{M}}$, ${\textbf{X}}$ and ${\textbf{X}}(\alpha )$ for classification, where ${\textbf{M}}$ contains the features generated by the moment kernel in (4), ${\textbf{X}}$ presents the preprocessed data (obtained by normalization and one-hot encoding), and ${\textbf{X}}(\alpha )$ consists of the features in ${\textbf{X}}$ after feature selection. Each set of features are used to train three classifiers, including logistic regression (LR), artificial neural networks (NNs), and gradient-boosting machines (GBMs), and we examine the computation time and the area under the Receiving Operator Characteristic (AUC) curve for the testing data.

As shown in Fig. 2, all three datasets are imbalanced. Imbalanced datasets contain significantly uneven class labels. To address the issue of data imbalance, we adopt the following accommodations in the training phase. For LR, observation weights are added according to the ratio of class-imbalance; for NNs, an error weight based on class distribution is added to punish the misclassification of the minority label; for GBMs, we use RUSBoost²⁸, a boosting method well-known for its robustness to class imbalance, to learn from the skewed training data. In the testing phase, the AUC is naturally immune to class imbalance so that the classification performance accurately reflects the performance of the classifiers.

Synthetic data

Finally, we also test robustness to number of samples, features, noise and missing values for each preprocessing scheme using five experiments : (a) noise-free data, (b) data with signal-to-noise ratio $SNR = 20$, (c)-(e) missing data in significant features, in which experimental data are synthetically generated. In each of these experiments, we have 10,000 samples ($N=10,000$) and 2,000 predictors ($M=2,000$), and among the 2,000 predictors only 5 of them are causal; namely, only five of them really contribute to the output labels, and the other 1,995 features are randomly generated, which are independent of the output labels. Causal features are generated by two Gaussian distributions with mean $(\mu _{i1},\mu _{i2}) = (0.3i, 0.7i), i = 1, \ldots , 5$ and the same variance $\sigma ^2 = 1$, representing data from two different classes.

The order of moments is set to be a tunable hyper-parameter, which, in this work, is chosen to optimize the AUC via cross-validation for each of the ML algorithms. Note that increasing the number of moments does not necessarily improve classification performance, as more moments (features) used may lead to a higher chance of overfitting. To illustrate this idea, in Fig. 5, we use another synthetic dataset with 100 observations and 100 predictors. The AUC results are given using ${\textbf{M}}$ with the order of moment M taken up to 20.