InfEHR: Clinical phenotype resolution through deep geometric learning on electronic health records

The Institutional Review Board of the Icahn School of Medicine at Mount Sinai approved the protocol for retrieving and analyzing all EHRs in this study. Use of data obtained from the MOVER dataset was approved by the Institutional Review Board of the University of California, Irvine Medical Center and main campus.

Overview of InfEHR

The premise of InfEHR is that more information is available than is typically used in individual clinical decision-making. The complexities of obtaining information from EHRs limit their utility. InfEHR is a geometric deep-learning approach for resolving clinical uncertainty using EHRs with minimal human intervention. The framework is designed to perform in realistic clinical settings where large volumes of labeled training data cannot be obtained and where existing knowledge is limited.

Three sequential modules make up the InfEHR framework. Module 1 takes in raw EHRs and produces EHR graphs in three successive steps: first, EHRs are preprocessed to remove invalid data; next, clinical events are automatically abstracted from the EHRs and embedded to form a set of nodes; finally, individual EHRs are aligned to the abstracted events and represented as graphs whose nodes are connected according to the naive temporal ordering in the patient EHR. In Module 2, an attention-based graph neural network (GNN) embeds EHR graphs using self-supervision. These embeddings, representing the complete patient record, are used in an automatic rules-generation engine to obtain initial probabilities for all unlabeled cases. In Module 3, uncertainty in these probabilities is resolved through semi-supervised training of the GNN using a specialized loss function. Module 3 can be used with any source of prior probability information.

Descriptions of each component module, corresponding detailed equations, and the specifics of the datasets used are provided below. A workflow diagram is provided in Supplementary Fig. 1.

Module 1: Processing Electronic Health Records into Graphs

Training Datasets

We obtained structured and unstructured data from 11 million electronic health records (EHRs) from the Mount Sinai Health System (MSHS), stored in the Mount Sinai Data Warehouse (CN-S) and accessed through the Extrico Health platform (PO-AKI), comprising time-varying measurements, medications, and clinical progress notes.

For potential neonatal CN-S cases, records were obtained by identifying individuals with at least 48 h of antibiotic exposure administered in the NICU and without categorical missingness (e.g., no vitals information). All antibiotic courses for such individuals meeting this requirement (n = 8067 individuals, 9256 antibiotic courses) were then extracted. A physician subject-matter expert manually confirmed the CN-S status for n = 3653 antibiotic courses. We applied a stratified split to the physician-confirmed dataset by birthweight to obtain a labeled training dataset of n = 2914 cases (80% of the total). Birthweight was chosen because it is an independent risk factor for CN-S³².

For potential PO-AKI cases, records were obtained for individuals undergoing surgery of any kind with presurgical hospitalization ≧ 2 days and who had in-hospital serum creatinine measurements taken at 72 h (n = 22,138) postoperatively to compute AKIN scores. For patients with multiple surgeries, only first surgeries with subsequent operations >72 h (n = 8031) were considered. A positive AKI diagnosis was assigned to any patient with AKIN score >1.

Validation Datasets

We used n = 729 cases from the stratified split (20% of the total) as a validation dataset for the CN-S task.

For the PO-AKI task we used the EPIC EHR cohort in the MOVER dataset from the University of California at Irvine Medical Center (UCIMC, n = 39,685) and included only patients with serum creatinine measured preoperatively (2 or more measurements) and at 72 h postoperatively (n = 2631). We applied the AKIN definitions to obtain labels as in MSHS.

EHR Preprocessing

We applied the preprocessing steps described below to the CN-S and PO-AKI training datasets (individually) and applied the results where relevant to the validation datasets. The MSHS consists of several individual hospitals with varying database capture and update protocols. As a result, some types of information were systematically unavailable in the data warehouse at the time of retrieval.

EHRs with such categorical missingness (e.g., no vitals) were excluded from model training given that this pattern of missingness likely resulted from database-specific variation (retained cases: n = 5213 CN-S, n = 4276 PO-AKI MSHS, excluded cases: n = 2854 CN-S, n = 3764 PO-AKI MSHS); however, we used all valid records, including incomplete records, for the density estimations that the node discovery process required (see Module 3 below).

Preprocessing Numerical Values

We include vitals measurements and laboratory results measured on at least 100 unique individuals and with representation from both labels (i.e., the measurement does not by itself identify a case). As a result, given only 137 patients with confirmed CN-S, we considered the vital signs of respiratory rate, spO2, temperature, pulse, and systolic/diastolic blood pressures to avoid bias from measurement type. We retained 25 unique vitals in PO-AKI. We considered 387 and 72 unique labs, and 47 and 280 distinct medications, in CN-S and PO-AKI, respectively.

We further processed continuous numerical values by dropping any value greater than three times the maximum or less than three times the minimum clinical reference range (deemed to be likely artifactual). We also applied these preprocessing steps to the UCIMC variables that correspond to MSHS variables and removed any UCIMC variables without an MSHS counterpart.

Preprocessing Nonnumerical Values

Nonnumerical observations corresponding to lab results were standardized by applying Levenshtein distance to coalesce all similar variations to the most frequently observed term. We processed categorical features derived from clinical notes from the MSHS datasets as follows: we applied QuickUMLS, a Unified Medical Language System (UMLS) matcher, to identify terms from clinical notes matching a UMLS term with high confidence (> 0.7). The extracted terms were further refined in the node discovery process. No clinical notes were available in the UCIMC dataset.
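The coalescing of nonnumerical lab values can be illustrated with a minimal sketch. The edit-distance threshold (`max_distance`), the lowercasing step, and the example strings are assumptions made for illustration; the text above does not specify them.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def coalesce_terms(values, max_distance=2):
    """Map each raw value to the most frequent term within max_distance edits.

    max_distance is a hypothetical threshold, not a value from the study.
    """
    counts = Counter(v.strip().lower() for v in values)
    canonical = []  # representative terms, most frequent first
    mapping = {}
    for term, _ in counts.most_common():
        match = next((c for c in canonical
                      if levenshtein(term, c) <= max_distance), None)
        if match is None:
            canonical.append(term)
            match = term
        mapping[term] = match
    return mapping

raw = ["Negative", "negative", "negatve", "NEGATIVE",
       "positive", "postive", "positive"]
mapping = coalesce_terms(raw)
```

Because terms are visited from most to least frequent, a rare misspelling is always absorbed by a more common variant rather than the other way round.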

Node Discovery and Embedding

We discovered the set of nodes comprising the global pool of clinical events using density-based selection procedures (node discovery). We applied this process separately to the CN-S and PO-AKI training datasets, then used the learned results from each training dataset to extract clinical events from its respective validation dataset.

We detail the operations involved, as they apply to continuous and discrete variables, in the following sections:

Continuous Variables

We fit kernel density estimations (KDE) to the set of all observations for all EHRs for each measurement type (e.g., heart rate, respiratory rate, white blood cell count, etc.). The resulting KDE curve indicated local densities by the intervals between local peaks that we then used to discretize the continuous measurement. The number and distance between peaks is set by a single bandwidth parameter that we determined empirically to satisfy the following constraints: the discretization must be shared by at least 100 unique individuals (e.g., a local density for blood glucose must contain measurements observed for at least 100 unique individuals) while maximizing the number of identified intervals.
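A minimal NumPy sketch of this discretization follows. It assumes that interval boundaries are placed at the valleys (local minima) between KDE peaks, that the bandwidth is chosen from a small hypothetical candidate list, and that every measurement in the toy data comes from a distinct patient; none of these details are stated in the text.

```python
import numpy as np

def gaussian_kde_curve(values, grid, bandwidth):
    """Evaluate a fixed-bandwidth Gaussian KDE on a grid."""
    z = (grid[:, None] - values[None, :]) / bandwidth
    norm = len(values) * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm

def density_cut_points(values, bandwidth, grid_size=512):
    """Return grid positions of strict local minima of the KDE (assumed cut points)."""
    grid = np.linspace(values.min(), values.max(), grid_size)
    dens = gaussian_kde_curve(values, grid, bandwidth)
    inner = np.arange(1, grid_size - 1)
    is_min = (dens[inner] < dens[inner - 1]) & (dens[inner] < dens[inner + 1])
    return grid[inner[is_min]]

def discretize(values, patient_ids, bandwidths, min_patients=100):
    """Pick the candidate bandwidth giving the most intervals, each of which
    is observed in at least min_patients unique individuals."""
    best = np.array([])
    for bw in bandwidths:
        cuts = density_cut_points(values, bw)
        bins = np.digitize(values, cuts)
        ok = all(len(set(patient_ids[bins == b])) >= min_patients
                 for b in range(len(cuts) + 1))
        if ok and len(cuts) > len(best):
            best = cuts
    return best

# Toy bimodal measurement: two well-separated value ranges.
values = np.concatenate([np.linspace(60, 80, 300), np.linspace(105, 130, 250)])
patient_ids = np.arange(len(values))
cuts = discretize(values, patient_ids, bandwidths=[2.0, 5.0])
```

Each resulting interval then becomes one candidate node for the measurement type.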

Discrete Variables

InfEHR uses discrete but time-varying information, including medications and clinical terms in notes. We included medications in the prescribable subset of RxNorm that were administered to at least 100 unique patients.

We extracted UMLS-synonymous terms from clinical notes using QuickUMLS (see preprocessing of nonnumerical data above). We weighted the collection of the extracted terms using term frequency inverse document frequency (TF-IDF), then applied nonnegative matrix factorization (NMF) with automatic determination of latent topic number (minimum components in the NMF H matrix such that the cophenetic correlation coefficient is ≧ 0.90). We analyzed the resulting low rank term-weight matrix to identify and retain terms strongly associated with any latent topic (≧ top 10% of distributed topic weights). This procedure simultaneously selects terms based on frequency and informational content to create a data-driven vocabulary of clinically meaningful terms from notes for use as graph nodes.

The node selection process automatically compresses the range of all clinical events to a subset based on the underlying density distributions of the dataset. This allows the discovery of nodes through a data-driven discovery process without human pre-specification or assumptions. Specifically, the number and content of nodes are not known a priori. Additional semantic information and implicit relational structures between nodes are encoded during node embedding, as described below.

Method of Node Embedding

We derived 64-dimensional numerical representations (embeddings) for the identified nodes described above. To compute the node embeddings, we first constructed a bipartite graph with partitions over individual patients and the identified global set of nodes. Next, we computed the overlap weighted projection of the clinical event nodes over the patients and retained only edges weighted at or above the 25th percentile of edge weights. We added nodes representing semantic types (e.g., the name of a lab measurement or vital sign category) to the projected graph and connected them to relevant nodes with maximum edge weight. The neighborhood of any individual node, therefore, included all nodes of the same semantic type as well as nodes across semantic types with high co-occurrence (indicated by high-weight edges). Nodes were encoded to reflect neighborhood information using the Node2Vec algorithm.
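The projection and pruning steps can be sketched as below. The sketch assumes the overlap weight is the Jaccard index of the patient sets of two events, which is one common definition of an overlap-weighted projection; the event names and patients are invented. Adding semantic-type nodes and running Node2Vec are not shown.

```python
from itertools import combinations
import numpy as np

def overlap_weighted_projection(patient_events):
    """Project a patient-event bipartite graph onto the event nodes.

    Edge weight = |shared patients| / |union of patients| (Jaccard; an assumption).
    """
    patients_of = {}
    for patient, events in patient_events.items():
        for ev in events:
            patients_of.setdefault(ev, set()).add(patient)
    edges = {}
    for a, b in combinations(sorted(patients_of), 2):
        shared = patients_of[a] & patients_of[b]
        if shared:
            edges[(a, b)] = len(shared) / len(patients_of[a] | patients_of[b])
    return edges

def prune_edges(edges, percentile=25):
    """Keep only edges weighted at or above the given percentile of edge weights."""
    threshold = np.percentile(list(edges.values()), percentile)
    return {pair: w for pair, w in edges.items() if w >= threshold}

# Hypothetical toy cohort: patient -> set of discovered clinical events.
patient_events = {
    "p1": {"glucose_high", "insulin"},
    "p2": {"glucose_high", "insulin", "hr_high"},
    "p3": {"hr_high", "insulin"},
    "p4": {"glucose_high"},
}
edges = overlap_weighted_projection(patient_events)
kept = prune_edges(edges)
```

In this toy cohort the weakest co-occurrence (`glucose_high` with `hr_high`, shared by one of four patients) falls below the 25th percentile and is dropped.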

The resulting collection of embeddings (including clinical events and semantic identifiers) forms a manifold that naturally encodes semantic clinical relationships into spatial distances between embeddings. We assigned to each node in an EHR graph the resulting relevant embedding, subject to some added components as described below. Note that we learned embeddings from the training datasets individually and used these embeddings in the validation dataset without retraining (see Fig. 7).

**Fig. 7: Electronic health records (EHRs) are represented automatically as Electronic Health Record graphs through an unsupervised process.**

Tuning the General Node Representation to Individual Temporal Contexts

After extracting the set of clinical events, we extracted the time stamps for their occurrences in individual EHRs. We adjusted all time stamps to reflect elapsed times by subtracting the earliest time stamp corresponding to a clinical event in each EHR. We aggregated all unique time stamps for all EHRs and derived 32-dimensional embeddings for each time stamp using the Time2Vec algorithm in Eq. (4):

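$$\mathrm{Time2Vec}(t)[k]=\begin{cases}{w}_{k}t+{b}_{k}, & k=0\\ \sin \left({w}_{k}t+{b}_{k}\right), & k\ge 1\end{cases}$$

(4)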

where:

$t $ time component (like time stamp, hour of day, etc.)

${w}_{k}$, ${b}_{k}$ Learnable parameters of the model

$k$ Position in the Time2Vec vector.

The general representation of a clinical event is formed by concatenating the event embedding with the semantic type embedding (e.g., the embedding of a certain KDE density for blood glucose is concatenated with the embedding for blood glucose). These generalized embeddings—consistent across patients—are tuned to the individual by adding the embedding of the time stamp for its occurrence. The resulting embeddings render clinical events in a machine-readable format. Although time stamps are not explicitly used as positional markers, the vectorization of time adds temporal information to representations of clinical events. The numerical distance between locally co-occurring but semantically distinct clinical events is reduced by the similarity of their time stamp components compared to events farther apart in time. Individual variation in temporal dynamics therefore shapes the representation of clinical events to the machine, transforming generalized clinical event representations to reflect individual contexts (see Fig. 7).
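As a rough illustration of how a node feature vector might be assembled, the sketch below uses the usual Time2Vec form (a linear first component and sine for the remaining positions, which is an assumption here) and combines the three parts by concatenation, matching the 32 + 64 + 64 = 160 input dimensions reported later in the text. The random parameters stand in for learned ones, and the variable names are illustrative.

```python
import numpy as np

def time2vec(t, w, b):
    """Time2Vec: position 0 is linear in t, the remaining positions are periodic."""
    z = np.outer(np.atleast_1d(t), w) + b   # shape (n_times, k)
    out = np.sin(z)
    out[:, 0] = z[:, 0]                     # linear term at k = 0
    return out

rng = np.random.default_rng(0)
w, b = rng.normal(size=32), rng.normal(size=32)   # learned in practice; random here

event_emb = rng.normal(size=64)   # e.g. one KDE interval of blood glucose
type_emb = rng.normal(size=64)    # e.g. the "blood glucose" semantic type
elapsed_hours = np.array([0.0, 6.5])              # same event seen at two times

time_emb = time2vec(elapsed_hours, w, b)                        # (2, 32)
general = np.concatenate([type_emb, event_emb])                 # (128,), shared across patients
node_features = np.hstack([time_emb, np.tile(general, (2, 1))]) # (2, 160)
```

The two rows share the same general event representation and differ only in their time components, which is what ties the node to an individual temporal context.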

Module 2: Deep Geometric Learning Approach

Notation

We represent individual EHRs as directed graphs by determining relevant clinical events $\varepsilon $ (such as a measurement value within a certain range, or the appearance of a term in a clinical note) where $\varepsilon $ is discovered through an automatic process (node discovery).

We derive embeddings for these clinical events by learning a manifold ${\mathcal{M}}$ comprising all events $\varepsilon $ and their respective semantic types $\tau $. We apply an operator (here, concatenation) to obtain the representations of all possible clinical events as shown in Eq. (5):

$${\rm E}=\phi \left(\varepsilon \in {\mathcal{M}},\tau \in {\mathcal{M}}\right):{{\mathbb{R}}}^{n}\to {{\mathbb{R}}}^{d}$$

(5)

Graphs representing patients are constructed by identifying the time stamp of each event $\varepsilon \in {\rm E}$ in the patient record, embedding the time stamp using a network trained on the Date2Vec objective, and concatenating the result with $\varepsilon $, resulting in the initial node embedding

$${h}_{i}^{\left(0\right)}\in {{\mathbb{R}}}^{m}\left(m > d > n\right)$$

The graph is defined as

$$G=\left(\varepsilon,V\left(t\right)\right)$$

(6)

with directed edges

$${e}_{i,j}=\left({v}_{i}\left(t\right) < {v}_{j}\left(t\right)\right)$$

(7)

Problem definition

Given the graph $G$, we train networks to learn whole graph representations in ${{\mathbb{R}}}^{d}$ for (1) self-supervised representations of EHRs and (2) computing likelihoods over clinical queries. Exact definitions of the loss functions used for training in (1) and (2) appear below.

Construction and Definitions of EHR Graphs

InfEHR computes likelihoods through sequential processing of EHRs. We obtain EHRs and then represent them as temporal graphs Eq. (3), according to these two definitions:

Definition 1: EHR

Given ${{\mathcal{R}}}_{i}$, comprising all medical records for patient $i$ occurring over the set of times ${T}_{i}$, we extract the electronic health record (EHR) of patient $i$, denoted as ${\mathcal{E}}{\mathcal{H}}{{\mathcal{R}}}_{i}$:

$${\mathcal{E}}{\mathcal{H}}{{\mathcal{R}}}_{i}=\{(r,t):r\in \{\text{vitals},\text{labs},\text{medications},\text{clinical notes}\}\ \text{and}\ t\in {T}_{i}\}$$

(9)

with $t$ bounded by:

$$\max \left({T}_{0,\text{defined}},{T}_{0,{\text{patient}}_{i}}\right)\le t\le \min \left({T}_{\max ,{\text{patient}}_{i}},{T}_{\max ,\text{defined}}\right)$$

(10)

where:

${T}_{{defined}}$ Timestamp of a clinical event or user-provided temporal duration.

${T}_{{{patient}}_{i}}$ An observed timestamp in the records of patient $i$.

${\mathcal{E}}{\mathcal{H}}{{\mathcal{R}}}_{{i}_{j}}$ Unique EHR identified by a specific clinical event in patient $i$'s records.

Here, ${T}_{{defined}}$ indicates the time stamp of a clinical event or user-provided temporal duration, and ${T}_{{{patient}}_{i}}$ corresponds to an observed time stamp in the records of patient $i$. In the case of multiple defined clinical events occurring in the records of patient $i$, each event results in a unique EHR identified by ${\mathcal{E}}{\mathcal{H}}{{\mathcal{R}}}_{{i}_{j}}$.

Definition 2: EHR Graph

We take EHRs (as defined above) and represent them as temporal graphs. We discover and collate nodes from the collected EHRs using unsupervised methods into a global node pool (Nodes), embed individual time stamps using Time2Vec (Times), and form temporal edges following (11) and the algorithm in Box 1.

$$\forall \,{\text{node}}_{j}\in G,\ j < i\wedge {\text{time}}_{j} < {\text{time}}_{i}:\ \text{create edge}\left({\text{node}}_{j}\to {\text{node}}_{i}\right).$$

(11)
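The edge rule in Eq. (11) can be read as the following small sketch, in which every node that appears earlier in the record and has a strictly earlier time stamp points to each later node. The event names and elapsed times are invented for illustration, and node embeddings are left out.

```python
def build_ehr_graph(events):
    """events: list of (event_id, elapsed_time) in record order.

    Returns (nodes, edges); edges are (j, i) index pairs meaning node_j -> node_i,
    created whenever j < i and time_j < time_i.
    """
    nodes = list(events)
    edges = [(j, i)
             for i, (_, t_i) in enumerate(nodes)
             for j, (_, t_j) in enumerate(nodes[:i])
             if t_j < t_i]
    return nodes, edges

# Hypothetical record: two events at admission, then two later events.
record = [("hr_high", 0.0), ("glucose_high", 0.0),
          ("insulin", 2.0), ("glucose_normal", 5.0)]
nodes, edges = build_ehr_graph(record)
```

Events sharing a time stamp (the first two here) are not connected to each other, so the result is a directed acyclic graph ordered by time.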

Training the GNN on Attributed EHR Graphs

We train a GNN to produce whole graph embeddings (dim = 128 self-supervised, dim = 164 semi-supervised) subject to additional processing layers under a self-supervised and semi-supervised objective (details below). We use a consistent architecture adapted to supervised and self-supervised training regimes.

Given an EHR Graph $G=(V,E)$, the model initially condenses and rewires the graph through a learned pooling operation resulting in:

$${G}^{{\prime} }={ASAPool}(G,\rho )$$

(12)

We derive the global representation $X$ and logits for ${G}^{{\prime} }$ as shown:

$${H}^{(1)}=\mathrm{ReLU}\left(\sum _{j\in {\mathcal{N}}(i)}{\alpha }_{ij}^{(1)}{W}^{(1)}{x}_{j}\right)$$

(13)

$$R={W}_{r}{H}^{(1)}$$

(14)

$${H}^{(2)}=\mathrm{ReLU}\left(\sum _{j\in {\mathcal{N}}(i)}{\alpha }_{ij}^{(2)}{W}^{(2)}{H}_{j}^{(1)}\right)+R$$

(15)

$$X=\frac{1}{\left|{V}^{{\prime} }\right|}\sum _{i\in {V}^{{\prime} }}{H}_{i}^{(2)}$$

(16)

$${Logits}={W}_{f}X+{b}_{f}$$

(17)

This network definition uses equations in sequential order (12, 13, 14, 15, 16, 17).

For all experiments we use input node feature dimensions d = 160 (from d = 32 for time embedding + d = 64 node semantic type embedding + d = 64 node value embedding). We use a node pooling ratio (rho) of 0.8 and attention heads for GAT layers = 2, and use hidden dimensions H(1) = 256, and H(2) = 128. We train on a self-supervised learning (SSL) objective (explained below) to produce a d = 128-dimensional representation (dimensions chosen from experience) and a d = 2-dimensional output for the semi-supervised objective.
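To make the data flow of Eqs. (13) to (17) concrete, here is a simplified NumPy forward pass with the stated dimensions (160 → 256 → 128 → 2). It is only a sketch under several assumptions: the ASAPooling step of Eq. (12) is skipped, a single attention head is used instead of two, the attention scoring follows the common GAT form with a LeakyReLU, self-loops are added so that nodes without in-neighbours keep their own features, and all weights are random placeholders.

```python
import numpy as np

def gat_layer(x, adj, W, a_src, a_dst):
    """Single-head graph attention over in-neighbours (plus a self-loop)."""
    h = x @ W                                              # (n, d_out)
    scores = (h @ a_dst)[:, None] + (h @ a_src)[None, :]   # scores[i, j]: i attends to j
    scores = np.where(scores > 0, scores, 0.2 * scores)    # LeakyReLU
    mask = adj.T + np.eye(len(x))                          # adj[j, i] = 1 for edge j -> i
    scores = np.where(mask > 0, scores, -np.inf)
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)              # softmax over neighbours
    return np.maximum(alpha @ h, 0.0)                      # ReLU

def forward(x, adj, p):
    h1 = gat_layer(x, adj, p["W1"], p["a1s"], p["a1d"])       # Eq. (13)
    r = h1 @ p["Wr"]                                          # Eq. (14), residual
    h2 = gat_layer(h1, adj, p["W2"], p["a2s"], p["a2d"]) + r  # Eq. (15)
    graph_vec = h2.mean(axis=0)                               # Eq. (16), mean readout
    return graph_vec, graph_vec @ p["Wf"] + p["bf"]           # Eq. (17)

rng = np.random.default_rng(0)
dims = {"W1": (160, 256), "a1s": (256,), "a1d": (256,), "Wr": (256, 128),
        "W2": (256, 128), "a2s": (128,), "a2d": (128,), "Wf": (128, 2), "bf": (2,)}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in dims.items()}

x = rng.normal(size=(4, 160))          # four event nodes with 160-d features
adj = np.zeros((4, 4))
for j, i in [(0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]:
    adj[j, i] = 1.0                    # temporal edges node_j -> node_i
graph_vec, logits = forward(x, adj, params)
```

The 128-dimensional `graph_vec` corresponds to the whole-graph representation $X$, and `logits` to the two-dimensional semi-supervised output.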

Self-Supervised Learning for Whole Graph Representations

We learn whole graph representations by training the InfEHR GNN using a self-supervised loss according to the algorithm in Box 2:

Box 1 Algorithm to construct EHR Graph

1. Input: $EHRs=\{{\rm{E}}{\rm{H}}{{\rm{R}}}_{1}\ldots,{\rm{E}}{\rm{H}}{{\rm{R}}}_{n}\}$ for a dataset of length $n$

2. Nodes $\leftarrow $ Embed (NodeDiscovery(EHRs))

3. Times $\leftarrow $ Embed({time(Node) | Node in Nodes})

4. for each $({r}_{i},{t}_{i})$ in ${\mathcal{E}}{\mathcal{H}}{{\mathcal{R}}}_{i}$ do

5. if ${r}_{i}$ has a representation in Nodes then

6. ${\text{node}}_{i}\leftarrow \text{Nodes}[{r}_{i}]$

7. ${\text{time}}_{i}\leftarrow \text{Times}[{t}_{i}]$

8. Concatenate $({\text{node}}_{i},{\text{time}}_{i})$ and insert into graph ${G}_{i}$

9. end if

10. for each ${\text{node}}_{j}$ in ${G}_{i}$ where $j < i$ and ${\text{time}}_{j} < {\text{time}}_{i}$ do

11. Create edge $({\text{node}}_{j}\to {\text{node}}_{i})$ in ${G}_{i}$

12. end for

13. end for

14. EHR Graphs ←$\{{G}_{1},\ldots,{G}_{n}\}$

Box 2 Self-supervised training algorithm: Detailed Graph Self-Supervised Learning with VICReg and MI

1. Input: Graph dataset ${\mathscr{G}}={\{{G}_{i}=({V}_{i},{E}_{i},{X}_{i})\}}_{i=1}^{N}$

2. where: ${V}_{i}$ is the set of nodes;

3. ${E}_{i}\subseteq {V}_{i}\times {V}_{i}$ is the set of edges;

4. ${X}_{i}\in {{\mathbb{R}}}^{\left|{V}_{i}\right|\times {d}_{{in}}}$ are node features;

5. $T$ Number of epochs;

6. Feature dimensions: ${d}_{{in}}$: (input), ${d}_{h}$: (hidden), ${d}_{e}$: (embedding).

7. Input: Loss weights: ${\lambda }_{{sim}}$, ${\lambda }_{\mathrm{var}}$, ${\lambda }_{\mathrm{cov}}$, ${\lambda }_{{mi}}$

8. ASAPooling ratio $r$, number of attention heads $K.$

9. Output:

10. Trained encoder ${f}_{\theta }$ for downstream tasks

11. // Initialization Phase

12. Initialize encoder ${f}_{\theta }$ with components:

13. ASAPooling layer: ASA: ${{\mathbb{R}}}^{{d}_{{in}}}\to {{\mathbb{R}}}^{r\cdot {|V|}}$

14. GAT layers: ${\mathrm{GAT}}_{k}:{{\mathbb{R}}}^{{d}_{k}}\to {{\mathbb{R}}}^{{d}_{k+1}}$ with $K$ heads;

15. Residual connections and dimension reduction;

16. Initialize projectors:

17. ${p}_{1}:{{\mathbb{R}}}^{{d}_{{in}}}\to {{\mathbb{R}}}^{{d}_{e}}$(input projector);

18. ${p}_{2}:{{\mathbb{R}}}^{{d}_{e}}\to {{\mathbb{R}}}^{{d}_{e}}$ (representation projector);

19. Both with architecture: (Linear → LayerNorm → ReLU) × 2 → Linear;

20. Initialize MI scoring matrix ${W}_{{mi}}\in {{\mathbb{R}}}^{{d}_{e}\times {d}_{e}}$;

21. // Training Loop

22. for $t=1$ to $T$ do

23. for each graph $G=(V,E,X)$ in ${\mathscr{G}}$ do

24. // Generate representations

25. $h\leftarrow {f}_{\theta }(X,E);{v}_{1}\leftarrow {p}_{1}(X);{v}_{2}\leftarrow {p}_{2}(h);$

26. // Compute VICReg losses

27. ${{\mathscr{L}}}_{{sim}}\leftarrow \frac{1}{|V|}\sum _{i=1}^{|V|}{\left\Vert {v}_{1}^{i}-{v}_{2}^{i}\right\Vert }_{2}^{2};\quad {S}_{j}(v)\leftarrow \sqrt{\mathrm{Var}\left({v}^{j}\right)+\epsilon }$

28. for $j\in [{d}_{e}]$;

29. ${{\mathscr{L}}}_{\mathrm{var}}\leftarrow \frac{1}{{d}_{e}}\sum _{j=1}^{{d}_{e}}\max \left(0,\gamma -{S}_{j}({v}_{1})\right)$

30. $C(v)\leftarrow \frac{1}{|V|-1}\left({v}^{T}v-\mathrm{diag}\left({v}^{T}v\right)\right);\quad {{\mathscr{L}}}_{\mathrm{cov}}\leftarrow \frac{1}{{d}_{e}}\sum _{i\ne j}\left({\left[C({v}_{1})\right]}_{ij}^{2}+{\left[C({v}_{2})\right]}_{ij}^{2}\right)$;

31. // Mutual Information Estimation

32. ${\mathrm{pool}}_{1}\leftarrow \mathrm{AvgPoolNeighbor}\left({v}_{1},E\right);\quad {\mathrm{glob}}_{1}\leftarrow \frac{1}{|V|}\sum _{i=1}^{|V|}{\mathrm{pool}}_{1}^{i}$;

33. ${\mathrm{corr}}_{1}\leftarrow \mathrm{WindowCorrupt}({v}_{1},w)$;

34. ${\mathrm{pool}}_{c1}\leftarrow \mathrm{AvgPoolNeighbor}({\mathrm{corr}}_{1},E)$;

35. ${\mathrm{glob}}_{c1}\leftarrow \frac{1}{|V|}\sum _{i=1}^{|V|}{\mathrm{pool}}_{c1}^{i};\quad {s}_{{pos}}\leftarrow \mathrm{softmax}\left({\mathrm{pool}}_{1}{W}_{{mi}}{\mathrm{glob}}_{1}^{T}\right)$;

36. ${s}_{{neg}}\leftarrow \mathrm{softmax}\left({\mathrm{pool}}_{1}{W}_{{mi}}{\mathrm{glob}}_{c1}^{T}\right);\quad {{\mathscr{L}}}_{{mi}}\leftarrow -\log \frac{\exp ({s}_{{pos}})}{\exp ({s}_{{pos}})+\exp \left({s}_{{neg}}\right)}$;

37. // Update Model

38. ${\mathscr{L}}\leftarrow {\lambda }_{{sim}}{{\mathscr{L}}}_{{sim}}+{\lambda }_{\mathrm{var}}{{\mathscr{L}}}_{\mathrm{var}}+{\lambda }_{\mathrm{cov}}{{\mathscr{L}}}_{\mathrm{cov}}+{\lambda }_{{mi}}{{\mathscr{L}}}_{{mi}}$

39. Update parameters using AdamW optimizer;

40. end

41. end

42. return trained encoder ${f}_{\theta }$, discard projectors ${p}_{1}$ and ${p}_{2}$

Self-supervised learning

Our SSL training algorithm uses a VICReg framework⁸⁶, more commonly used for image encoding, enhanced with mutual information (MI) estimation tailored to graph data. The procedure initializes an encoder with ASAPooling and GAT layers, along with two projectors for input and representation transformation. During training, each graph generates two views: raw features projected through p₁ and encoded features through p₂. The projectors, implemented as multilayer perceptrons with normalization and nonlinearities, serve as learnable transformations that map the input and encoded representations to a shared embedding space while preventing the collapse of information. This architectural choice enforces an information bottleneck that prevents the encoder from learning trivial solutions, while the projectors’ flexibility allows the contrastive learning objective to be optimized without constraining the encoder’s representation capacity. Post-training, the projectors are discarded, preserving the encoder’s learned manifold structure for downstream tasks.

The loss function combines four components: similarity loss (ensuring view alignment), variance loss (preventing dimensional collapse), covariance loss (decorrelating features), and mutual information (MI) loss (maximizing node-to-graph information while minimizing it for corrupted samples). The MI estimation uses a structured corruption scheme and InfoNCE-style loss computation.
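The three VICReg-style terms can be sketched in NumPy as follows. The hinge target `gamma`, the stabilizer `eps`, and the mean-centering before the covariance computation are assumptions taken from common VICReg practice rather than from the text; the variance hinge is applied to the first view only, following Box 2 as written.

```python
import numpy as np

def vicreg_terms(v1, v2, gamma=1.0, eps=1e-4):
    """Return (similarity, variance, covariance) terms for two views of shape (n, d)."""
    n, d = v1.shape
    # Similarity: mean squared Euclidean distance between paired rows.
    sim = np.mean(np.sum((v1 - v2) ** 2, axis=1))
    # Variance: hinge pushing each dimension's standard deviation up to gamma.
    std = np.sqrt(v1.var(axis=0, ddof=1) + eps)
    var = np.mean(np.maximum(0.0, gamma - std))

    def off_diagonal_sq(v):
        vc = v - v.mean(axis=0)            # centering is an assumption
        c = vc.T @ vc / (n - 1)
        return np.sum(c ** 2) - np.sum(np.diag(c) ** 2)

    # Covariance: squared off-diagonal covariance entries of both views.
    cov = (off_diagonal_sq(v1) + off_diagonal_sq(v2)) / d
    return sim, var, cov

# A fully collapsed embedding: identical rows in both views.
collapsed = np.ones((5, 3))
sim, var, cov = vicreg_terms(collapsed, collapsed)
```

For the collapsed input the similarity and covariance terms vanish while the variance term is close to `gamma`, which shows why the variance term is needed to rule out the trivial constant solution.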

The corruption scheme generates negative samples by shuffling node features within windows of the EHR graph (WindowCorrupt). The effect is to introduce random and unrealistic relationships and orderings between clinical events. The MI estimation encourages the model to distinguish valid clinical structures and their representations from these unrealistic examples.
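One plausible reading of WindowCorrupt is sketched below: node feature rows are permuted independently inside consecutive fixed-size windows of the temporally ordered node list. The window size and the use of consecutive, non-overlapping windows are assumptions; the text only states that features are shuffled within windows of the EHR graph.

```python
import numpy as np

def window_corrupt(x, window, rng):
    """Shuffle node feature rows independently inside consecutive windows."""
    corrupted = x.copy()
    for start in range(0, len(x), window):
        idx = np.arange(start, min(start + window, len(x)))
        corrupted[idx] = x[rng.permutation(idx)]
    return corrupted

x = np.arange(12).reshape(6, 2)   # six nodes, two features each
corrupted = window_corrupt(x, window=3, rng=np.random.default_rng(0))
```

Because rows never leave their window, the corrupted graph keeps its coarse temporal layout while local event orderings become unrealistic.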

The total loss is the weighted sum of the four components given in the model-update step of Box 2 and is optimized using AdamW. After training for $T$ = 1000 epochs, the encoder is preserved for downstream tasks while the projectors are discarded.

This SSL loss function encourages EHR representations that capture local temporal patterns within patient records and also global patient states, allowing for encodings that capture inter-patient variation (differing global states) simultaneously with encoding shared local temporal structures. This simultaneity promotes meaningful semantics in several ways: high-density regions are likely to represent patients with common clinical patterns or disease trajectories, whereas sparse regions may indicate rare conditions or unique patient presentations. Spatial distances can be used to infer the disease state as follows.

Deriving instance-level priors automatically

Label propagation using self-supervised embeddings

We train the GNN encoder to produce self-supervised embeddings as above and apply label propagation as described in ref. ⁶⁰ and implemented in scikit-learn. We hypothesize that training on a self-supervised objective, as described above, results in automatic alignment of phenotypically similar people so as to meet the assumption of label smoothing that semantic similarity is a function of spatial distance.

The label spreading algorithm iteratively learns a smooth classification function: we take the 110 labeled samples, spread label information to spatially proximate samples, and label each sample according to the flow of labels it receives during the propagation process. We apply the label spreading algorithm using an RBF kernel (gamma = 70) to determine probabilistic distances between embeddings and set the clamping parameter alpha, which controls the relative importance of the initially labeled examples in deciding the predicted labels for unlabeled examples, to 0.5 based on previous experience. We achieved recall of 0.18 (CN-S) and 0.34 (PO-AKI), with precision of 0.67 and 0.78, respectively (outperforming the clinical heuristic in both cases), suggesting that the spatial similarity assumptions held appreciably in the self-supervised embeddings.
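The study used the scikit-learn implementation; the sketch below is not that code but a small NumPy version of the closed-form label spreading solution on a normalized RBF graph, shown only to make the roles of gamma and alpha visible. The toy embeddings and labels are invented, and the exact correspondence between this `alpha` and the library's clamping parameter should be checked against the library documentation.

```python
import numpy as np

def label_spreading(X, y, gamma=70.0, alpha=0.5):
    """y holds class indices for labelled rows and -1 for unlabelled rows.

    Returns (classes, probs) with probs[i, c] the normalized label mass of class c.
    """
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-gamma * sq_dist)                 # RBF affinities
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    deg[deg == 0] = 1.0                          # guard isolated points
    S = W / np.sqrt(np.outer(deg, deg))          # symmetric normalization
    classes = np.unique(y[y >= 0])
    Y = (y[:, None] == classes[None, :]).astype(float)
    F = np.linalg.solve(np.eye(len(X)) - alpha * S, Y)   # closed-form propagation
    totals = F.sum(axis=1, keepdims=True)
    probs = np.divide(F, totals,
                      out=np.full_like(F, 1.0 / len(classes)),
                      where=totals > 0)
    return classes, probs

# Two tight clusters of embeddings with one labelled sample each.
X = np.array([[0.00, 0.00], [0.05, 0.00], [0.00, 0.05], [0.05, 0.05],
              [1.00, 1.00], [1.05, 1.00], [1.00, 1.05], [1.05, 1.05]])
y = np.array([0, -1, -1, -1, 1, -1, -1, -1])
classes, probs = label_spreading(X, y)
```

With a large gamma, affinities decay quickly with distance, so labels flow almost exclusively within each tight cluster.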

Integration of spatial information and structural information

We derive weak labeling heuristics from structural features of EHR graphs using label information provided from label propagation over the self-supervised embeddings (spatial information as described). Ordinarily, such weak heuristics are generated by a human expert, which introduces bias and is difficult in precisely those clinical settings where existing clinical knowledge is limited. We present a method to automatically generate them at scale in uncertain conditions and follow established guidelines for combining them⁶¹,⁶² to generate initial probabilities. We find that existing literature²¹,²²,⁸⁶ has corroborated a random sample of automatically generated heuristics. Another potential application of InfEHR could be generating hypotheses from weakly predictive heuristics.

Weakly supervised learning over uncertain priors

The ASAPooling operation⁸⁷ in the GNN uses an attention-based mechanism to derive cluster medoids and assign cluster memberships to nodes over a fixed receptive field to produce a new, pooled graph. Clusters are scored for inclusion in the pooled graph and reconnected with edge weights that indicate the topology of the original graph. We apply a message-passing algorithm over the pooled graph using graph self-attention to compute new node representations successively. InfEHR derives the node representations by learning an attentional coefficient that weighs the relative importance of a node to its neighbor in the aggregation phase of the message-passing algorithm. To avoid over-smoothing of node representations, we apply a residual connection between successive message-passing steps. Finally, to obtain the whole graph representation, we take the mean over node features for all nodes in the pooled graph, resulting in a single high-dimensional vector. We further process the whole graph representation using linear layers to return likelihoods according to the loss criterion described below.

Module 3: Resolving Prior Probabilities

GNN Training with Feature-Based Weighting of Kullback-Leibler Loss

We train the GNN (previously described) under our own loss function similar to the RQ loss proposed in ref. ⁶³ and include a small network to learn example specific loss weighting functions. RQ loss consists of a generative formulation allowing the optimization of the log-likelihood of learned graph features relative to an assumed underlying generative process (here a disease latent). The loss function is consistent with the overall data representation strategy in which we capture disease latents at multiple levels (from initial node encodings to the naive temporal EHR graphs). We extend this function further by learning a dynamic weighting mechanism that continuously adapts during training, learning to adjust sample importance based on evolving patterns in the penultimate layer representations. Modulating the RQ loss through learned weights focuses attention on the most informative samples as the feature space becomes progressively more structured throughout training.

Weighted RQ Loss

The weighted RQ loss minimizes the following function, jointly optimizing parameters for the primary model and for the weighting network:

$$\mathop{\min }\limits_{\theta ,\phi }\sum _{i\in B}\frac{\exp ({w}_{i}(\phi ))}{{\sum }_{j\in B}\exp \left({w}_{j}(\phi )\right)}\cdot \mathrm{KL}\left[q\left({\mathcal{l}}\mid {x}_{i};\theta \right)\,\Big\Vert \,{c}_{i}\cdot {p}_{i}({\mathcal{l}})\frac{q({\mathcal{l}}\mid {x}_{i};\theta )}{{\sum }_{j}q\left({\mathcal{l}}\mid {x}_{j};\theta \right)}\right]$$

(18)

Definitions:

$B$ Virtual batch

${f}_{i}$ Penultimate layer features

${c}_{i}$ Normalization constants ensuring ${\sum }_{{\mathcal{l}}}{r}_{i}({\mathcal{l}})=1$, where ${r}_{i}$ denotes the second argument of the KL term in Eq. (18)

${p}_{i}({\mathcal{l}})$ Prior beliefs about latent labels, derived from InfEHR heuristics

${w}_{i}(\phi )$ Weight computed as:

$$\sigma ({W}_{2}{ReLU}({W}_{1}\,{f}_{i}+{b}_{1})+{b}_{2})$$

(19)

where $\phi=\{{W}_{1},{W}_{2},{b}_{1},{b}_{2}\}$ are trainable weights and biases of the neural network.

Parameters:

θ Parameters of the primary model estimating $q({\mathcal{l}}\mid {x}_{i};\theta )$;

ϕ Parameters of the neural network calculating ${w}_{i}$;

Optimization is by simultaneous updates to $\theta $ and $\phi $, aligning the model’s outputs with instance importance in batch $B$.

In sequential order, this loss definition uses Eqs. (18, 19).
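A forward-only NumPy sketch of Eqs. (18) and (19) is given below to show how the pieces fit together for one virtual batch. The array shapes, the small `eps` inside the logarithms, and the toy numbers are assumptions; gradient computation and the joint update of $\theta $ and $\phi $ are not shown.

```python
import numpy as np

def weighted_rq_loss(q, priors, features, W1, b1, W2, b2, eps=1e-12):
    """q: (B, L) model posteriors, priors: (B, L) prior beliefs,
    features: (B, F) penultimate-layer features. Returns (loss, r)."""
    hidden = np.maximum(features @ W1 + b1, 0.0)
    w = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))     # Eq. (19), shape (B,)
    sample_w = np.exp(w) / np.exp(w).sum()            # softmax over the batch
    r = priors * q / q.sum(axis=0, keepdims=True)     # p_i(l) q_i(l) / sum_j q_j(l)
    r = r / r.sum(axis=1, keepdims=True)              # c_i: each row sums to 1
    kl = np.sum(q * (np.log(q + eps) - np.log(r + eps)), axis=1)
    return float(np.sum(sample_w * kl)), r

rng = np.random.default_rng(0)
features = rng.normal(size=(2, 8))                    # hypothetical 8-d features
W1, b1 = rng.normal(size=(8, 4)), np.zeros(4)
W2, b2 = rng.normal(size=4), 0.0

q = np.array([[0.9, 0.1], [0.2, 0.8]])
priors = np.array([[0.7, 0.3], [0.4, 0.6]])
loss, r = weighted_rq_loss(q, priors, features, W1, b1, W2, b2)

# Sanity case: uniform posteriors and uniform priors give a zero loss.
uniform = np.full((2, 2), 0.5)
zero_loss, _ = weighted_rq_loss(uniform, uniform, features, W1, b1, W2, b2)
```

The per-sample weights only rescale how much each instance's KL term contributes within the batch; the target `r` is what pulls the posteriors toward the priors.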

Performing Validation

We construct EHR graphs from the UCIMC data contained in the MOVER dataset by first aligning the records in UCIMC to the same namespace as the MSHS data (e.g., creating a mapping between the same medication or laboratory measurement with varying names) and then using the learned clinical manifold from MSHS data to extract clinical events from the UCIMC records into naive temporal graphs, as described previously. Notably, UCIMC data does not include clinical progress notes, which limits the full translation of UCIMC events to the MSHS manifold. All vitals measurement types in the UCIMC dataset had counterparts in MSHS; however, some laboratory measurements and medications in UCIMC had no correspondence in MSHS. We omit any such record from the temporal graphs.

To apply the GNN for semi-supervision from MSHS data to graphs from UCIMC, we ablated all nodes from clinical text in the MSHS graphs and retrained the GNN on the MSHS graphs. We applied this GNN to the UCIMC graphs to obtain initial probabilistic labels (see Fig. 6, InfEHR priors). We therefore transferred knowledge from previous training in the form of the learned clinical manifold and in the prior probabilities.

InfEHR is designed to learn dynamical temporal features that can be used for clinical uncertainty reduction. To show that InfEHR does this, we trained the InfEHR GNN on UCIMC graphs (n = 2427) constructed using the clinical manifold and embeddings from MSHS. Probabilistic prior labels were obtained from previous training on MSHS graphs and without human-provided labels. This parallels discriminative model training while maintaining consistency with the InfEHR loss function and framework. Using these priors (without human-provided labels) we trained for 20 epochs (by early stopping criterion). We used this trained GNN to compute final likelihoods (see Table 3 and Fig. 6).

We then performed benchmarking experiments with GRU-D and SeFT models, implemented and trained as in their reference implementations^64,65. Although both models ingest tabular, nongeometric data, we retained the same variables used to construct the graphs for the InfEHR GNN to facilitate comparison.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
