This section presents the architecture, learning framework, and overall workflow of the proposed hybrid reinforcement learning and knowledge graph-based system for financial risk optimization in healthcare systems. We describe the proposed methodology and the training and implementation details, and summarize the workflow in Fig. 1.

A complete methodology pipeline of our proposed Hybrid Reinforcement Learning and Knowledge Graph Framework.
Data preprocessing
To enable effective training of our hybrid reinforcement learning and knowledge graph framework, we applied advanced preprocessing techniques to two distinct datasets: the US Health Insurance Dataset and a synthetic Healthcare Classification Dataset. The primary goals of preprocessing were to normalize heterogeneous data types, extract semantic features, and generate structured input for reinforcement learning and graph-based modules. We detail each stage below.
Data cleaning and normalization
We began by addressing inconsistencies and missing values. As both datasets were complete, no imputation was required. All categorical variables were standardized; for example, variant gender labels for males and females were unified into consistent categories. Continuous features such as age, BMI, and charges were normalized using min-max scaling:
$$\begin{aligned} {\hat{x}}_{i} = \frac{x_i - \min (x)}{\max (x) - \min (x)} \end{aligned}$$
(1)
This transformation ensures that all input features reside within the range [0, 1], facilitating convergence in gradient-based optimization.
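As a concrete illustration, Eq. (1) can be applied column-wise with NumPy; the feature values below are illustrative, not taken from the datasets:

```python
import numpy as np

def min_max_scale(x: np.ndarray) -> np.ndarray:
    """Rescale a feature vector to [0, 1] as in Eq. (1)."""
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min)

# Example: scaling an age column
ages = np.array([18.0, 30.0, 45.0, 64.0])
scaled = min_max_scale(ages)
# The minimum maps to 0, the maximum to 1, and all values lie in [0, 1].
```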
Categorical embedding with deep learning
Categorical attributes such as region, insurance provider, admission type, and medical condition were embedded using trainable entity embeddings. Given a categorical variable \(c \in {\mathscr {C}}\) with cardinality \(|{\mathscr {C}}|\), we define its embedding as:
$$\begin{aligned} {\textbf{e}}_c = \text {Embedding}(c) \in {\mathbb {R}}^d \end{aligned}$$
(2)
We used \(d = \lfloor |{\mathscr {C}}|^{0.25} \rfloor\) as a heuristic for embedding dimension. These embeddings were jointly optimized with downstream learning tasks and served as inputs to both the neural policy network and the knowledge graph module.
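The embedding-dimension heuristic can be sketched as follows; the cardinalities are hypothetical, and the `max(1, ...)` guard is our addition to avoid zero-width embeddings for very small categories:

```python
import math

def embedding_dim(cardinality: int) -> int:
    """Heuristic d = floor(|C|^0.25) from the text, floored at 1."""
    return max(1, math.floor(cardinality ** 0.25))

# Hypothetical categorical variables and their cardinalities
dims = {name: embedding_dim(n) for name, n in {"region": 5, "provider": 700}.items()}
```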
Temporal parsing and feature engineering
The synthetic healthcare dataset included time-stamped columns (date of admission, discharge date). From these, we computed the hospitalization duration:
$$\begin{aligned} \text {Duration} = \text {DischargeDate} - \text {AdmissionDate} \end{aligned}$$
(3)
We also encoded time-based features such as weekday of admission and length-of-stay bins to capture temporal healthcare patterns.
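The temporal features above can be derived with the standard library; the length-of-stay bin cut-offs shown here are illustrative choices, not the ones used in our experiments:

```python
from datetime import date

def stay_features(admit: date, discharge: date) -> dict:
    """Compute the Eq. (3) duration plus simple temporal features."""
    duration = (discharge - admit).days
    # Bin length of stay (illustrative cut-offs).
    if duration <= 3:
        los_bin = "short"
    elif duration <= 10:
        los_bin = "medium"
    else:
        los_bin = "long"
    return {"duration_days": duration,
            "admit_weekday": admit.weekday(),  # 0 = Monday
            "los_bin": los_bin}

feats = stay_features(date(2023, 5, 1), date(2023, 5, 6))
```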
Semantic feature augmentation
To enhance the expressivity of input data, we applied autoencoding to structured tabular features. A deep variational autoencoder (VAE) was used to project structured patient profiles to a lower-dimensional latent space \({\textbf{z}}\):
$$\begin{aligned} & {\textbf{z}} \sim q_\phi ({\textbf{z}}|{\textbf{x}}) = {\mathscr {N}}(\mu _\phi ({\textbf{x}}), \sigma ^2_\phi ({\textbf{x}}) {\textbf{I}}) \end{aligned}$$
(4)
$$\begin{aligned} & \hat{{\textbf{x}}} = p_\theta ({\textbf{x}}|{\textbf{z}}) \end{aligned}$$
(5)
The encoder parameters \(\phi\) and decoder parameters \(\theta\) were trained by maximizing the evidence lower bound (ELBO), i.e., minimizing the VAE objective:
$$\begin{aligned} {\mathscr {L}}_{\text {VAE}} = -{\mathbb {E}}_{q_\phi ({\textbf{z}}|{\textbf{x}})}[\log p_\theta ({\textbf{x}}|{\textbf{z}})] + D_{\text {KL}}(q_\phi ({\textbf{z}}|{\textbf{x}}) \,||\, p({\textbf{z}})) \end{aligned}$$
(6)
This helped capture non-linear dependencies and regularize the latent feature space.
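For a diagonal-Gaussian posterior and a standard-normal prior, the KL term in Eq. (6) has a closed form, and the reconstruction term reduces to a squared error under a Gaussian decoder. A minimal numeric sketch (the inputs are toy values, and MSE stands in for the decoder log-likelihood up to constants):

```python
import numpy as np

def kl_diag_gaussian(mu: np.ndarray, log_var: np.ndarray) -> float:
    """Closed-form D_KL(N(mu, diag(sigma^2)) || N(0, I))."""
    return 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

def vae_loss(x, x_hat, mu, log_var) -> float:
    """Negative ELBO: reconstruction error plus the KL regularizer."""
    recon = float(np.sum((x - x_hat) ** 2))
    return recon + kl_diag_gaussian(mu, log_var)

# A perfect reconstruction with a prior-matched posterior gives zero loss.
mu, log_var = np.zeros(4), np.zeros(4)
x = x_hat = np.ones(4)
loss = vae_loss(x, x_hat, mu, log_var)
```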
Knowledge graph input generation
From the medical dataset, we extracted entity-relation triples to build a knowledge graph \({\mathscr {G}} = ({\mathscr {E}}, {\mathscr {R}})\), where each edge \(r_{ij} \in {\mathscr {R}}\) connects entities \(e_i\) and \(e_j\) from the set \({\mathscr {E}}\), such as:
$$\begin{aligned} (\text {Diabetes}, \textit{treated}\_\textit{with}, \text {Insulin}) \in {\mathscr {G}} \end{aligned}$$
(7)
Each patient profile was mapped to a subgraph embedding using a Graph Neural Network (GNN) encoder described in Section 3.
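A minimal in-memory representation of the extracted triples might look like the following; the specific facts beyond the Eq. (7) example are illustrative:

```python
from collections import defaultdict

# Entity-relation triples in the (head, relation, tail) form of Eq. (7).
triples = [
    ("Diabetes", "treated_with", "Insulin"),
    ("Diabetes", "associated_with", "Obesity"),       # illustrative
    ("Hypertension", "treated_with", "Lisinopril"),   # illustrative
]

# Index outgoing edges per (head, relation) for neighbourhood lookups,
# as needed when extracting a patient's subgraph.
adj = defaultdict(list)
for h, r, t in triples:
    adj[(h, r)].append(t)
```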
Train-validation-test split
We used stratified sampling based on test result categories (Normal, Abnormal, Inconclusive) and insurance charge distribution to ensure representation across risk levels. The dataset was divided as follows:
$$\begin{aligned} {\mathscr {D}}_{\text {train}}&= 70\% \text { of full dataset} \end{aligned}$$
(8)
$$\begin{aligned} {\mathscr {D}}_{\text {val}}&= 15\% \end{aligned}$$
(9)
$$\begin{aligned} {\mathscr {D}}_{\text {test}}&= 15\% \end{aligned}$$
(10)
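A stratified 70/15/15 split can be sketched in plain Python as below; in practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would serve the same purpose:

```python
import random
from collections import defaultdict

def stratified_split(labels, seed=0, frac=(0.70, 0.15, 0.15)):
    """Split sample indices 70/15/15 while preserving per-class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_tr = int(frac[0] * len(idxs))
        n_va = int(frac[1] * len(idxs))
        train += idxs[:n_tr]
        val += idxs[n_tr:n_tr + n_va]
        test += idxs[n_tr + n_va:]
    return train, val, test

# Toy label distribution over the three test-result categories
labels = ["Normal"] * 70 + ["Abnormal"] * 20 + ["Inconclusive"] * 10
tr, va, te = stratified_split(labels)
```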
The final input for the model was the concatenation of the normalized scalar features, the categorical embeddings, the VAE latent vector, and the knowledge graph subgraph embedding (see Eq. 13).
This enriched representation ensured that both structured data and domain knowledge contributed meaningfully to model training.
Proposed methodology
This section outlines the proposed hybrid framework, which integrates reinforcement learning (RL) with knowledge graph-augmented neural networks for optimizing financial risk in healthcare systems. The architecture includes three major components: (i) Knowledge Graph Construction and Embedding, (ii) Reinforcement Learning Formulation, and (iii) Policy Network Design and Training as presented in Fig. 2.

Visualization of the reinforcement learning framework.
Knowledge graph construction and embedding
The knowledge graph component is designed to encode structured medical relationships (e.g., disease–treatment pairs) into a form that enhances the RL agent’s contextual understanding of patient cases. It provides semantic structure and domain knowledge that pure tabular data cannot capture. To incorporate structured domain knowledge, we constructed a healthcare-specific knowledge graph \({\mathscr {G}} = ({\mathscr {E}}, {\mathscr {R}})\), where \({\mathscr {E}}\) denotes entities (e.g., diseases, medications, test results) and \({\mathscr {R}}\) denotes directed, labeled relations (e.g., treated_with, associated_with). Each triple \((h, r, t) \in {\mathscr {G}}\) represents a directed edge from head h to tail t under relation r.
$$\begin{aligned} (h, r, t) \in {\mathscr {G}}, \quad h,t \in {\mathscr {E}}, \; r \in {\mathscr {R}} \end{aligned}$$
(11)
We encoded this graph using a Relational Graph Convolutional Network (R-GCN), which learns embeddings \({\textbf{h}}_v\) for each entity v by aggregating information from its neighbors:
$$\begin{aligned} {\textbf{h}}_v^{(l+1)} = \sigma \left( \sum _{r \in {\mathscr {R}}} \sum _{u \in {\mathscr {N}}_r(v)} \frac{1}{c_{v,r}} {\textbf{W}}_r^{(l)} {\textbf{h}}_u^{(l)} + {\textbf{W}}_0^{(l)} {\textbf{h}}_v^{(l)} \right) \end{aligned}$$
(12)
Here, \({\mathscr {N}}_r(v)\) denotes the neighbors of v under relation r, \(c_{v,r}\) is a normalization constant, and \({\textbf{W}}_r^{(l)}\) is a relation-specific weight matrix at layer l.
The final entity embeddings \({\textbf{h}}_v\) for each relevant node in a patient’s subgraph are pooled to generate a knowledge-aware vector \({\textbf{z}}_{\text {KG}}\) for that patient.
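The propagation rule of Eq. (12) can be sketched without a graph library; this toy forward pass uses identity weights and one-hot features purely to show the aggregation, and uses the in-degree under each relation as the normalization constant \(c_{v,r}\):

```python
import numpy as np

def rgcn_layer(H, edges, W_rel, W_self):
    """One R-GCN layer (Eq. 12): relation-specific messages plus a self-loop.

    H      : (n, d) node feature matrix
    edges  : dict mapping relation -> list of directed (src, dst) pairs
    W_rel  : dict mapping relation -> (d, d) weight matrix
    W_self : (d, d) self-loop weight matrix W_0
    """
    n, _ = H.shape
    out = H @ W_self.T                       # self-connection term
    for r, edge_list in edges.items():
        deg = np.zeros(n)                    # c_{v,r}: in-degree under r
        for _, v in edge_list:
            deg[v] += 1
        for u, v in edge_list:
            out[v] += (W_rel[r] @ H[u]) / deg[v]
    return np.maximum(out, 0.0)              # ReLU as the nonlinearity

H = np.eye(3)                                # 3 nodes with one-hot features
edges = {"treated_with": [(0, 1), (2, 1)]}
W = {"treated_with": np.eye(3)}
H1 = rgcn_layer(H, edges, W, np.eye(3))
# Node 1 averages messages from nodes 0 and 2 and keeps its own feature.
```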
Reinforcement learning formulation
Reinforcement learning is used to optimize billing policies by learning from interactions with a simulated environment. It enables dynamic, sequential decision-making that balances diagnostic accuracy and cost-effectiveness over time. We model the healthcare billing optimization problem as a Markov Decision Process (MDP) defined by the tuple \(({\mathscr {S}}, {\mathscr {A}}, {\mathscr {P}}, {\mathscr {R}}, \gamma )\):
- \({\mathscr {S}}\): State space, representing a patient profile, including socio-clinical features and graph-based embeddings.
- \({\mathscr {A}}\): Action space, representing billing decisions such as predicted cost bins or resource allocation strategies.
- \({\mathscr {P}}\): State transition probability distribution, approximated using an environment simulator.
- \({\mathscr {R}}\): Reward function that balances financial cost and diagnostic accuracy.
- \(\gamma\): Discount factor for future rewards.
Each state \(s \in {\mathscr {S}}\) is defined as:
$$\begin{aligned} s = [{\textbf{x}}_{\text {norm}}; {\textbf{e}}_{\text {cat}}; {\textbf{z}}_{\text {VAE}}; {\textbf{z}}_{\text {KG}}] \end{aligned}$$
(13)
where \({\textbf{x}}_{\text {norm}}\) are normalized scalars, \({\textbf{e}}_{\text {cat}}\) are categorical embeddings, \({\textbf{z}}_{\text {VAE}}\) is the deep latent vector, and \({\textbf{z}}_{\text {KG}}\) is the knowledge graph embedding.
The reward \(r_t\) at each step t is computed as:
$$\begin{aligned} r_t = - \text {BillingCost}_t + \alpha \cdot \text {OutcomeScore}_t \end{aligned}$$
(14)
Here, \(\alpha\) controls the trade-off between financial and clinical objectives. OutcomeScore is derived from correct classification of test results (e.g., penalizing misdiagnoses).
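A minimal sketch of Eq. (14), assuming a simple \(\pm 1\) outcome score for correct versus incorrect classification and an illustrative value of \(\alpha\):

```python
def reward(billing_cost: float, predicted: str, actual: str,
           alpha: float = 0.5) -> float:
    """Eq. (14): negative billing cost plus alpha-weighted outcome score.

    The +1/-1 outcome scoring and alpha = 0.5 are illustrative choices,
    not the exact values used in our experiments.
    """
    outcome = 1.0 if predicted == actual else -1.0
    return -billing_cost + alpha * outcome

r_correct = reward(0.2, "Normal", "Normal")
r_wrong = reward(0.2, "Normal", "Abnormal")
# A misdiagnosis at the same cost yields a strictly lower reward.
```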
Policy network design
We employ a deep neural policy network \(\pi _\theta (a|s)\), parameterized by \(\theta\), that maps the state vector to a probability distribution over actions. The policy is trained to maximize the expected discounted return:
$$\begin{aligned} J(\theta ) = {\mathbb {E}}_{\pi _\theta } \left[ \sum _{t=0}^{T} \gamma ^t r_t \right] \end{aligned}$$
(15)
We implement \(\pi _\theta\) using a fully-connected feedforward architecture with ReLU activations and dropout regularization. The training is performed using either:
- Deep Q-Network (DQN): A value-based method where we learn a Q-function Q(s, a) and derive the policy as \(\pi (s) = \arg \max _a Q(s,a)\).
- Proximal Policy Optimization (PPO): A policy-gradient method where updates are clipped to avoid large steps:
$$\begin{aligned} {\mathscr {L}}^{\text {PPO}}(\theta ) = {\mathbb {E}}_t \left[ \min \left( r_t(\theta ) {\hat{A}}_t, \text {clip}(r_t(\theta ), 1 - \epsilon , 1 + \epsilon ) {\hat{A}}_t \right) \right] \end{aligned}$$
(16)
where \(r_t(\theta ) = \frac{\pi _\theta (a_t|s_t)}{\pi _{\theta _{\text {old}}}(a_t|s_t)}\) and \({\hat{A}}_t\) is the estimated advantage function.
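The per-step clipped surrogate of Eq. (16) can be computed directly; the ratios and advantages below are toy values:

```python
import numpy as np

def ppo_objective(ratio: np.ndarray, advantage: np.ndarray,
                  eps: float = 0.2) -> np.ndarray:
    """Element-wise clipped surrogate of Eq. (16) (to be maximized)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# A large ratio with positive advantage is capped at (1 + eps) * A;
# a small ratio with negative advantage is floored at (1 - eps) * A.
obj = ppo_objective(np.array([2.0, 0.5]), np.array([1.0, -1.0]))
```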
Architectural details
The proposed hybrid architecture integrates multiple modules: (i) a deep encoder for patient profile embeddings, (ii) a Relational Graph Convolutional Network (R-GCN) for knowledge-aware representations, and (iii) a reinforcement learning policy/value network. Table 1 summarizes the architectural components and associated parameter details.
The input dimensionality to the policy network is the result of concatenating the outputs of:
- Patient latent vector (\({\mathbb {R}}^{16}\))
- Embedded categorical features (\({\mathbb {R}}^{15}\))
- Knowledge graph embedding (\({\mathbb {R}}^{32}\))
- Hand-engineered features (\({\mathbb {R}}^{27}\))
The policy head consists of either a softmax distribution (for PPO) or Q-values (for DQN) over 5 discrete billing classes. ReLU was used as the activation function throughout the hidden layers. Dropout was applied to improve generalization.
All model weights were initialized using He-normal initialization. The total parameter count across all components is approximately 36.8K, ensuring computational tractability while maintaining high expressiveness.
System overview
The overall framework consists of several interconnected modules that work in a sequential pipeline. First, patient data comprising structured numeric features, categorical codes, and diagnostic information is preprocessed and transformed into three types of embeddings: normalized scalars, categorical vectors, and deep latent vectors via a Variational Autoencoder (VAE). These are concatenated to form the base patient state representation.
In parallel, a static knowledge graph (KG) is constructed based on known relationships among diagnoses, treatments, and tests. This KG is encoded using a Relational Graph Convolutional Network (R-GCN), producing semantic embeddings for each clinical entity. These embeddings are injected into the patient state vector, enriching it with domain-specific medical knowledge.
The complete, multi-modal state vector is then passed to the reinforcement learning agent, which is trained using either Deep Q-Network (DQN) or Proximal Policy Optimization (PPO). The agent learns to select billing actions (adjust cost or resource allocation) that maximize a reward function designed to balance financial efficiency and diagnostic correctness. The environment returns a scalar reward after each action, and the agent updates its policy accordingly using the observed transitions.
This integrated design allows the model to reason over complex clinical relationships while learning dynamic, cost-sensitive decision policies that generalize across diverse patient cases.
Training and implementation details
This section outlines the training procedures, loss functions, optimization strategy, and implementation setup used to train the hybrid framework. The objective is to optimize both financial decision-making and predictive reliability using a reinforcement learning policy network augmented with semantic and graph-structured inputs.
Loss functions
The loss functions depend on the chosen RL variant—either value-based (DQN) or policy-gradient based (PPO). In both cases, we use a reward signal that combines financial efficiency and diagnostic quality.
DQN Loss: The Q-network is trained by minimizing the Temporal Difference (TD) error:
$$\begin{aligned} {\mathscr {L}}_{\text {DQN}} = {\mathbb {E}}_{(s,a,r,s')} \left[ \left( r + \gamma \max _{a'} Q_{\theta ^-}(s', a') - Q_\theta (s, a) \right) ^2 \right] \end{aligned}$$
(17)
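The inner squared TD error of Eq. (17) for a single transition can be sketched as follows, with toy Q-values:

```python
import numpy as np

def td_error(r: float, gamma: float, q_next: np.ndarray, q_sa: float) -> float:
    """Squared TD error from Eq. (17) for one (s, a, r, s') transition.

    q_next : Q_{theta^-}(s', .) over all actions, from the frozen target net
    q_sa   : Q_theta(s, a) for the action actually taken
    """
    target = r + gamma * np.max(q_next)
    return (target - q_sa) ** 2

err = td_error(r=1.0, gamma=0.9, q_next=np.array([0.5, 2.0]), q_sa=2.0)
# target = 1.0 + 0.9 * 2.0 = 2.8, so the squared error is 0.8^2 = 0.64
```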
PPO Loss: The PPO objective is clipped to prevent large updates and stabilize training:
$$\begin{aligned} & r_t(\theta ) = \frac{\pi _\theta (a_t|s_t)}{\pi _{\theta _{\text {old}}}(a_t|s_t)} \end{aligned}$$
(18)
$$\begin{aligned} & {\mathscr {L}}_{\text {PPO}} = {\mathbb {E}}_t \left[ \min \left( r_t(\theta ) {\hat{A}}_t, \text {clip}(r_t(\theta ), 1 - \epsilon , 1 + \epsilon ) {\hat{A}}_t \right) \right] \end{aligned}$$
(19)
where \({\hat{A}}_t\) is the advantage function estimated using Generalized Advantage Estimation (GAE).
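GAE computes advantages as an exponentially weighted sum of TD residuals over a trajectory; a minimal sketch for a single finite episode (the \(\gamma = \lambda = 1\) test case reduces to reward-to-go):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one finite episode.

    values must hold V(s_0..s_T) including a final bootstrap value
    (0 for a terminal state); rewards has length T.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

adv = gae(rewards=[1.0, 1.0], values=[0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
# With gamma = lam = 1 and zero values, advantages equal reward-to-go: [2, 1]
```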
Training strategy
The model is trained over multiple episodes. In each episode:
1. A batch of patient profiles is processed to generate state vectors.
2. The agent interacts with the simulated environment by making billing decisions.
3. Rewards are computed using actual billing values and test outcomes.
4. Experiences are stored in a replay buffer (for DQN) or collected into batches (for PPO).
5. The policy is updated every k steps.
The environment simulation approximates transitions using changes in cost and test result feedback, with limited stochasticity introduced to mimic real-world healthcare uncertainty.
Hyperparameters and optimizer settings
The training setup was experimentally tuned to balance convergence speed and policy quality. Table 2 lists the key hyperparameters used in our experiments.
Validation and early stopping criteria
Validation was conducted on the held-out 15% validation split described above. The policy was evaluated based on cumulative reward, cost prediction error, and classification accuracy for diagnostic categories. Early stopping was triggered when no improvement in the average reward metric was observed over 10 validation rounds.
Model checkpoints were saved using the highest validation reward, and the final policy was evaluated on the test set using unseen patient profiles.
Algorithmic overview
Algorithm 1 outlines the end-to-end training process of the proposed hybrid reinforcement learning framework. The procedure begins by preprocessing the patient dataset and generating structured features, including categorical embeddings and VAE-based latent vectors. A medical knowledge graph is then constructed and embedded using a Relational Graph Convolutional Network to obtain semantic graph features. These representations are concatenated to form a comprehensive patient state vector. During each training episode, the reinforcement learning agent selects actions based on the current state, receives a reward that combines billing cost and diagnostic accuracy, and updates the policy using collected transitions. This iterative process continues until the policy converges to an optimal strategy for cost-aware decision-making.

Hybrid RL + Knowledge Graph Framework for Financial Risk Prediction
Theoretical complexity analysis
To complement the empirical runtime analysis, we present a theoretical complexity overview of the proposed hybrid framework. The overall time complexity is additive over three primary components: the Variational Autoencoder (VAE), the Relational Graph Convolutional Network (R-GCN), and the reinforcement learning policy network.
- VAE Encoder: Given input dimensionality d and latent space size z, the encoder and decoder each consist of L dense layers. The time complexity per forward pass is \(O(L \cdot d \cdot z)\) per sample, assuming uniform layer sizes.
- R-GCN: For a graph with n nodes, e edges, and hidden dimension h, each layer performs message passing with time complexity \(O(e \cdot h + n \cdot h^2)\). For K layers, the total complexity is \(O(K \cdot (e \cdot h + n \cdot h^2))\).
- Reinforcement Learning (PPO or DQN): The RL policy network has time complexity \(O(B \cdot S \cdot d^2)\) per update, where B is the batch size, S is the number of steps per episode, and d is the state vector size. PPO further incurs an additional cost due to policy clipping and advantage estimation, though it converges with fewer updates in practice.
