Knowledge representation
Knowledge in tabular data
In tabular data, the knowledge can be divided into two categories: human-level knowledge presented in the data description and implicit knowledge learned from the data. The former expresses the semantics of features in natural language and is intuitively understandable to humans. From a cognitive perspective22, we hold that a concept should capture the common characteristics abstracted from a set of features or lower-level concepts, consistent with human cognition, as shown in Fig. 1. Through hierarchical abstraction, we can naturally obtain a tree structure of features and concepts that is understandable to humans, with comprehensiveness levels ranging from low to high. We call this tree the concept tree. The other kind of knowledge can be learned from data and is expressed as the values of higher-level concepts relevant to the prediction targets, aggregated from the correlated features or lower-level concepts.
Suppose \(\mathscr {T} = \langle L, X, D \rangle\) is one type of tabular data structure, where L is the set of feature descriptions, X is the data, and D is the decision target. In a tabular dataset of the \(\mathscr {T}\) type, an instance \(x \in \mathbb {R}^n\) in X is defined as an n-element vector representing n scalar raw features in \(\mathscr {F}\), where \(\mathscr {F}\) is the set of raw features contained in the tabular data. In this paper, we assume that there are underlying feature groups in a tabular data structure. The features in a group are semantically similar and target-relevant and can be abstracted into a more general semantic unit, namely a concept. Note that some features may not be in any group and some may be in multiple groups. We are interested in mining the concepts for a decision target from the data description L and the data X, and in utilizing quantitative argumentation for explicit knowledge representation and reasoning to form an interpretable decision-making model.
Representing knowledge in quantitative argumentation frameworks
A concept tree generated from a tabular data \(\mathscr {T} = \langle L, X, D \rangle\) can be represented as a quantitative argumentation framework (QAF), denoted as \(QAF_{\mathscr {T}}: \langle \mathscr {A}_{\mathscr {T}}, E_{\mathscr {T}}, \beta _{\mathscr {T}}, \omega _{\mathscr {T}} \rangle\). Here, each argument \(a\in \mathscr {A}_{\mathscr {T}}\) represents a concept \(c \in \mathscr {C}\) or a feature \(f \in \mathscr {F}\), where \(\mathscr {A}_{\mathscr {T}} = \mathscr {C} \cup \mathscr {F}\). The edges \(E_{\mathscr {T}} \subseteq \mathscr {A}_{\mathscr {T}} \times \mathscr {A}_{\mathscr {T}}\) describe positive and negative correlations between these concepts and features. The framework also includes a function \(\beta _{\mathscr {T}}: \mathscr {A}_{\mathscr {T}} \rightarrow [0,1]\) that assigns a base score to each argument, and a function \(\omega _{\mathscr {T}}: E_{\mathscr {T}} \rightarrow \mathbb {R}\) that assigns a weight to each edge.
In our QAF, the arguments representing features are the leaf nodes. Their strength can be obtained directly from the data, without a base score function assigning initial values. Categorical features are target-encoded, and all features are subsequently transformed with a quantile transformer with a uniform output distribution, yielding a continuous strength score in the range [0, 1] that serves as the initial value of each feature argument. The functions \(\beta _{\mathscr {T}}\) and \(\omega _{\mathscr {T}}\) for the remaining arguments will be defined when we introduce the field-wise learning algorithm.
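As an illustration, this preprocessing can be sketched as follows. This is a minimal hand-rolled sketch (in practice library implementations such as scikit-learn's quantile transformer would be used); the toy column, the binary target, and the unsmoothed mean-target encoding are illustrative assumptions, not the exact pipeline of the paper.

```python
import numpy as np

def target_encode(cat_values, targets):
    """Replace each category with the mean target of that category
    (a simple, smoothing-free form of target encoding)."""
    groups = {}
    for c, t in zip(cat_values, targets):
        groups.setdefault(c, []).append(t)
    means = {c: float(np.mean(ts)) for c, ts in groups.items()}
    return np.array([means[c] for c in cat_values])

def quantile_transform_uniform(values):
    """Map values to their empirical quantile ranks in [0, 1]
    (a uniform-output quantile transform, up to tie handling)."""
    ranks = np.argsort(np.argsort(values))
    n = len(values)
    return ranks / (n - 1) if n > 1 else np.zeros(n)

# toy categorical feature with a binary target
cats = ["a", "b", "a", "c", "b", "a"]
y    = [1,   0,   1,   0,   1,   0]
encoded = target_encode(cats, y)
strengths = quantile_transform_uniform(encoded)
print(strengths.min(), strengths.max())  # strengths lie in [0, 1]
```

The resulting scores can be fed directly into the leaf arguments of the QAF as initial strengths.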
Knowledge acquisition: procedure of quantitative argumentation layer
In this section, we utilize \(P_{QAL}\) to mine semantic and quantitative knowledge of the concepts, as shown in Fig. 3. Semantic knowledge mining is realized by the semantic knowledge mining approach, which automatically searches for lower-level knowledge units (such as features and lower-level concepts) with similar meanings and abstracts higher-level concepts from the data description. Then, a field-wise learning algorithm is designed for quantitative knowledge mining: it learns the values of concepts and their relations from data and evaluates the concepts in order to delete the unimportant ones. In each \(P_{QAL}\), the selected concepts and the ungrouped features can be represented as a QAL, and their semantics are recorded in \(L’\) for the next \(P_{QAL}\).
Unlike conventional clustering or summarization approaches that produce static semantic groupings, our \(P_{QAL}\) dynamically integrates semantic similarity with quantitative reasoning through the QAF structure. This coupling allows each cluster to be assigned interpretable argumentative roles (support/attack), learned adaptively via field-wise optimization, thereby transforming unsupervised grouping into a reasoning-guided semantic abstraction process.

The overall workflow of CAM. (a) Illustrating a \(P_{QAL}\), which performs semantic grouping and concept abstraction by semantic knowledge mining approach, and performs concept selection by field-wise learning algorithm. Blue nodes represent newly generated concepts and yellow nodes represent features or lower-level concepts (whose children nodes are omitted in this figure). \(Tree^c\) denotes a set of structures of the possible concepts. \(L’\) represents a new data description that contains the semantics of the ungrouped features (or concepts) and the newly generated concepts for the next \(P_{QAL}\). (b) The architecture of CAM is built with stacked QALs by repeatedly utilizing \(P_{QAL}\).
Semantic knowledge mining
To simulate the process of abstracting concepts in human cognitive learning, we need to combine features with similar meanings and extract their shared characteristics as the meaning of the generated concepts. To achieve this goal, natural language preprocessing is necessary for semantic knowledge mining23. A pretrained multilingual language model transfers the natural language information into a vector space. In this way, the meanings of features or concepts are embedded from natural sentences into vectors.
Given a data description L of tabular data \(\mathscr {T}\), \(l_a \in L\) represents the description of a feature a, and \(l_a\) is also a set of words, so that description \(l_a\) can perform the intersection operation with other descriptions. It is worth noting that after the first round of \(P_{QAL}\), the descriptions also contain descriptions of concepts, which means that a can represent a feature or a concept. After the embedding, the descriptions are transferred into vectors. We denote the vector versions of L and \(l_a\) as \(\hat{L}\) and \(\hat{l_a}\) respectively. Suppose a group of features or concepts (denoted as \(\mathscr {A}_c=\{a_j,\ldots ,a_k\}, \mathscr {A}_c \subseteq \mathscr {A}_{\mathscr {T}}\)) can be combined to generate a higher-level concept c. The description of the generated concept can be defined as \(l_c = \bigcap \nolimits _{a \in \mathscr {A}_c} l_a\). The tree structure with root c is denoted as \(tree_c = \{c:\{a_j,\ldots ,a_k\}\}\). Our goal in semantic knowledge mining is to obtain all the possible concepts’ descriptions \(L^c\) and structures \(Tree^c\) from L, where \(L^c\) is a set of \(l_c\) and \(Tree^c\) is a set of \(tree_c\).
The clustering algorithm can be utilized to find the groups of descriptions. The agglomerative hierarchical clustering (AHC) algorithm24 is adopted since it can capture the hierarchical relationship between clusters, which other clustering algorithms cannot achieve. The main idea of the AHC algorithm is that each object starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy to form the tree structure. The semantic knowledge mining approach is designed based on the AHC algorithm to conduct semantic grouping and concept abstraction, as described in Algorithm 1. In Algorithm 1, we have designed a threshold, denoted as \(\mu\), to ensure that the semantics within each cluster are sufficiently similar, thereby guaranteeing that the concepts generated within each cluster are meaningful.
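The grouping step can be sketched with an off-the-shelf hierarchical clustering routine. In this sketch the four two-dimensional "embeddings" and the threshold value \(\mu = 0.5\) are made-up stand-ins (real description embeddings come from a pretrained language model and are high-dimensional); the SciPy `linkage`/`fcluster` calls illustrate the distance-threshold cut described for Algorithm 1.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# toy "embeddings" of four feature descriptions (assumed values)
emb = np.array([
    [1.0, 0.0],   # "number of inquiries in the last 6 months"
    [0.9, 0.1],   # "number of inquiries in the last 6 months excl. 7 days"
    [0.0, 1.0],   # "external risk estimate"
    [0.1, 0.9],   # "months since the most recent inquiry"
])

mu = 0.5  # similarity threshold: clusters are cut where distance exceeds mu
Z = linkage(pdist(emb, metric="cosine"), method="average")
labels = fcluster(Z, t=mu, criterion="distance")
print(labels)  # descriptions 0,1 share a cluster; 2,3 share another

# concept description = intersection of the members' word sets
l_1 = {"number", "of", "inquiries", "last", "6", "months"}
l_2 = {"number", "of", "inquiries", "last", "6", "months", "excl", "7", "days"}
l_c = l_1 & l_2  # the abstracted concept keeps only the shared words
```

Cutting the dendrogram at \(\mu\) keeps only clusters whose members are semantically close enough to yield a meaningful shared description.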

Semantic knowledge mining approach in \(P_{QAL}\)
Quantitative knowledge mining: the field-wise learning algorithm
After the process of semantic knowledge mining, assume that we obtain t possible concepts from n features, together with their structural information \(Tree^c\). In this part, the quantitative knowledge of the concepts needs to be mined from the data, and the importance of the concepts should be evaluated in order to eliminate the unimportant ones. Given training data \(X_{TR}\), we split it into a sub-training set \(X_{tr}\) and a validation set \(X_{vld}\), as shown in Fig. 4. Then we represent the knowledge in \(X_{tr}\) as a QAF and learn a CAM model \(\mathscr {L}(X_{tr}, QAF)\) with learning algorithm \(\mathscr {L}\). To evaluate this CAM model, we use the validation set \(X_{vld}\) to evaluate the same QAF and calculate the performance \(\mathscr {E}(\mathscr {L}(X_{tr}, QAF), X_{vld}, QAF)\), denoted as \(\mathscr {E}(QAF)\) for short. The performance is measured with metrics such as the area under the ROC curve (AUC).

Illustration of the field-wise learning algorithm for concept selection. The field-wise learning algorithm runs in two steps. In the first step, we learn and evaluate the \(QAF_0\). And then in step 2, we conduct a parallel learning and evaluation process to assess the importance of each concept and select the important ones. Parameters represent the learned values of edges and nodes in QAF by MLP.
Many researchers25 believe that the concept is “important” for the decision targets if its presence is necessary. Thus, we define the rule for evaluating the importance of concepts and select the important ones as follows:
$$\begin{aligned} {\text {If }} \mathscr {E}(QAF_{i}) \ge \mathscr {E}(QAF_{0}), {\text {then keep }} c_i. {\text { Else drop }} c_i. \end{aligned}$$
(3)
where \(\mathscr {E}(QAF_{0})\) represents the performance of the original quantitative argumentation framework (denoted as \(QAF_{0}\)), \(\mathscr {E}(QAF_{i})\) represents the performance of \(QAF_{i}\), and \(QAF_{i}\) is the QAF obtained by adding a new concept \(c_i\) to \(QAF_{0}\).
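The selection rule is straightforward to state in code. A minimal sketch, assuming the validation AUC values have already been computed for each candidate QAF (the concept names and scores here are illustrative, not results from the paper):

```python
def select_concepts(auc_base, auc_with_concept):
    """Keep concept c_i iff adding it does not degrade validation
    performance (the rule in Eq. (3)): E(QAF_i) >= E(QAF_0)."""
    return {c: auc for c, auc in auc_with_concept.items() if auc >= auc_base}

auc_base = 0.71                      # E(QAF_0), hypothetical
auc_with_concept = {                 # E(QAF_i) per candidate, hypothetical
    "Inquiry": 0.74,
    "Delinquency": 0.69,
    "Trade": 0.71,
}
kept = select_concepts(auc_base, auc_with_concept)
print(sorted(kept))  # ['Inquiry', 'Trade']
```

Note that a concept whose AUC merely ties the baseline is kept, since its presence does no harm and it may enable further abstraction in later layers.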
Though highly accurate, direct evaluation of feature (and concept) sets is often rather expensive. In real-world business scenarios, training a model to convergence may require substantial computational resources, so such direct evaluations are too expensive to be invoked repeatedly in the concept generation procedure. To improve evaluation efficiency, we propose a field-wise learning algorithm in CAM.
To accelerate concept evaluation, the field-wise learning algorithm runs in two steps. In the first step, we use \(X_{tr}\) to train an MLP that learns the strength of the nodes and edges of \(QAF_{0}\), from the previous QAL up to the decision node D, and evaluate \(QAF_{0}\) as \(\mathscr {E}(QAF_{0})\) on the validation set \(X_{vld}\). We chose the MLP as the learning model because a QAF and an MLP correspond in both structure and reasoning approach.
In the second step, we link the newly generated concept \(c_i\) with its structural information \(tree_c\) as a sub-framework of \(QAF_{0}\) and delete the edges that link the children of \(c_i\) directly to D; thus we obtain a new \(QAF_{i}\) by adding \(tree_c\) and removing the repeated arguments. Then, we use an MLP with the same structure as \(QAF_{i}\) to learn the unknown strengths of the nodes and edges of \(QAF_{i}\). The parts shared by \(QAF_{i}\) and \(QAF_{0}\) have already been learned in the first step, so the MLP only learns the strengths of the edges and nodes related to \(c_i\). Hence, the learning process is ‘field-wise’ and can be run in parallel.
Formally, the previous QAL is denoted as \(\mathscr {A}_{0} = \{a_1,\ldots , a_q\}\), where each \(a_i\), \(1\le i\le q\), may be a feature or a concept, and the value of \(a_i\) is denoted as \(s(a_i)\). In particular, in the first round of concept mining, \(a_i\) only represents features, and \(s(a_i)\) is the value in the interval [0, 1] obtained by data pre-processing of the feature value \(x_{a_i}\). The structure of a newly generated concept \(c_i\) is denoted as {\(c_i: \{a_j,a_k\}\)}. When \(a_i\) is a concept, the quantitative information of \(a_i\) and its children has been learned in the previous \(P_{QAL}\). In the first step, the MLP can be described as:
$$\begin{aligned} s(D) = \Phi \left( \sum _{a_i\in \mathscr {A}_{0}} w_i\times s(a_i) +b_D\right) \end{aligned}$$
(4)
where \(\Phi (z) = \frac{1}{1+\exp (-z)}\) is the logistic function, \(w_i \in \{w_1,\ldots ,w_q\}\) is the weight of argument \(a_i\), and \(b_D\) is the bias of the decision concept D.
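A small numerical sketch of Eq. (4), with hypothetical learned parameters (the weights, bias, and argument strengths below are made-up values for illustration, not learned from any dataset):

```python
import numpy as np

def phi(z):
    """Logistic function Phi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical learned parameters for a QAF_0 with three arguments
w   = np.array([1.2, -0.8, 0.5])   # edge weights omega((a_i, D))
b_D = -0.3                         # bias of the decision node D
s_a = np.array([0.9, 0.2, 0.6])    # argument strengths from preprocessing

s_D = phi(w @ s_a + b_D)           # Eq. (4): decision strength
beta_D = phi(b_D)                  # base score of D, beta(D) = Phi(b_D)
print(round(float(s_D), 3))
```

A negative weight (here on the second argument) makes that argument an attacker of D; a positive weight makes it a supporter.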
To represent the knowledge learned from \(X_{tr}\) in the form of a QAF, the edges between the arguments and the decision node D are represented as \(E=\{(a_1, D),\ldots ,(a_q, D)\}\), and \(w_i \in \{w_1,\ldots ,w_q\}\) is the strength of edge \((a_i, D)\). Thus, the function \(\omega\) in \(QAF_{0}\) can be instantiated as \(\omega ((a_i,D)) = w_i\), where \(w_i \in \{w_1,\ldots ,w_q\}\) and \((a_i,D) \in E\). \(b_D\) represents the initial score of D. But in a QAF, \(\beta (D) \in [0,1]\), so we define \(\beta (D) = \Phi (b_D)\) according to Eq. (2). In the second step, an MLP model can be described as:
$$\begin{aligned} s(D) = \Phi \left( \sum _{a_i \in \mathscr {A}_{0} \setminus \left\{ a_j, a_k\right\} } \left( w_i\times s\left( a_i\right) \right) +b_D + w_c\times \Phi \left( w_j^{\prime } \times s\left( a_j\right) + w_k^{\prime } \times s\left( a_k\right) +b_c \right) \right) \end{aligned}$$
(5)
where \(w_i\) is learned in the first step and is therefore fixed as a constant during the parallel training process, \(w_c\) is the weight of the newly generated concept \(c_i\), \(w_j^{\prime }\) and \(w_k^{\prime }\) are the new weights of \(a_j\) and \(a_k\) respectively, and \(b_c\) is the bias of \(c_i\). All the weights and biases can be represented in \(QAF_{i}\) to instantiate the \(\omega\) and \(\beta\) functions.
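The field-wise step of Eq. (5) can likewise be traced numerically. In this sketch, \(\mathscr {A}_{0}=\{a_1,a_2,a_3\}\) and the new concept groups \(\{a_2, a_3\}\); the frozen weight, the strengths, and the new sub-framework parameters are all hypothetical values (in the algorithm only \(w_c, w_j^{\prime }, w_k^{\prime }, b_c\) would actually be trained):

```python
import math

def phi(z):
    return 1.0 / (1.0 + math.exp(-z))

# strengths of the previous layer's arguments (made-up values)
s = {"a1": 0.9, "a2": 0.2, "a3": 0.6}
# parameters frozen from step 1
w_fixed = {"a1": 1.2}
b_D = -0.3
# parameters of the new sub-framework {c: {a2, a3}} plus the edge (c, D);
# fixed here for illustration, learned in the actual algorithm
w_c, w2_new, w3_new, b_c = 0.7, -0.5, 1.1, 0.0

s_c = phi(w2_new * s["a2"] + w3_new * s["a3"] + b_c)   # concept strength
z = sum(w_fixed[a] * s[a] for a in w_fixed) + b_D + w_c * s_c
s_D = phi(z)                                           # Eq. (5)
print(round(s_c, 3), round(s_D, 3))
```

Because the frozen terms never change, each candidate concept's parameters can be fit independently, which is what makes the per-concept training runs embarrassingly parallel.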
We continue the parallel learning process until all the strengths of the edges and nodes related to the newly generated concepts have been mined, obtaining a list of fully learned QAFs: \((QAF_{1},\ldots , QAF_{t})\). We evaluate the QAFs in parallel using \(X_{vld}\) and finally select the important concepts by rule (3). A new description \(L’\) for the next \(P_{QAL}\) is generated by selecting the descriptions of the important concepts stored in \(L^c\) and adding the descriptions of the ungrouped features or concepts stored in L.
Construct CAM model
Based on the proposed \(P_{QAL}\), we introduce CAM for tabular data decision-making. The construction of CAM is a hierarchical and iterative process, which sequentially stacks Quantitative Argumentation Layers (QALs) to progressively find and select meaningful and important concepts, as illustrated in Fig. 5.

The detailed workflow of CAM. The left figure illustrates the detailed workflow of CAM, which learns the structure of QALs by \(P_{QAL}\). \(L^i\) and \(QAL^i\) represent the generated data description and QAL after the i-th round of \(P_{QAL}\). The right figure shows how the architecture of CAM is built by stacking QALs. The grey dashed lines represent the lower nodes that are not grouped with others and will be omitted in the final CAM structure. The green and red dashed lines represent the omitted stacking processes.
The process begins with the raw feature layer (considered as \(QAL^0\)). The first round of \(P_{QAL}\) takes the initial features, their descriptions (L), and the data (X) as input. It first generates candidate concepts via the semantic knowledge mining approach and then selects the important ones using the field-wise learning algorithm. This forms the first layer, \(QAL^1\). The output of this layer—comprising the newly formed concepts and any ungrouped features—constitutes a new, more abstract set of nodes with updated descriptions (\(L^1\)).
Subsequently, \(L^1\) and the output values from \(QAL^1\) serve as the input for the next round of \(P_{QAL}\) to construct a higher-level layer, \(QAL^2\). This stacking process continues, with each new layer building a higher level of semantic abstraction upon the previous one. The construction process of CAM terminates when one of two end conditions is met: (1) No new meaningful concepts can be abstracted during the semantic knowledge mining step. (2) The performance of the base model in the current layer (\(QAF_0\)) is lower than that of the previous one, indicating that further abstraction is no longer beneficial.
Once the construction is complete, the decision node (D) is linked to the final learned QAL. The unknown parameters of the entire network (i.e., the weights and biases of the final connections) are then learned, similar to step 1 of the field-wise learning algorithm. This final, integrated hierarchical structure forms the complete quantitative argumentation framework (QAF) that constitutes the trained CAM model.
Dialogical explanation within CAM
Dialogical explanation mechanism within CAM
By leveraging the advantages of argumentation structure, CAM is capable of providing the underlying structure for generating dialogical explanations for users. A user may interact with CAM by requesting an explanation of a node (a decision, a concept, or a feature) in CAM. The structure of these explanations, specifically which arguments attack or support a given argument, is defined by the following sets.
Given a tabular data \(\mathscr {T} = \langle L, X, D \rangle\), a corresponding \(QAF_{\mathscr {T}}: \langle \mathscr {A}_{\mathscr {T}}, E_{\mathscr {T}}, \beta _{\mathscr {T}}, \omega _{\mathscr {T}} \rangle\), an instance \(x \in X\) and its decision D with strength s(D), an argumentation dialogue between a user and CAM consists of an explanation request \(\mathscr {Q}_{(a)}\) for a node \(a \in \mathscr {A}_{\mathscr {T}}\) and an explanation \(\mathscr {X}_{(a)}\) which CAM responds with. To generate these explanations, we first define the attacker set Att(a) and supporter set Sup(a) for any argument \(a \in \mathscr {A}_{\mathscr {T}}\). The construction of these sets is based on the weights of the edges from its direct child nodes. Specifically, for any direct child node b of a:
- If the edge weight \(\omega _{\mathscr {T}}(b, a)\) is positive, b is a supporter of a. The set of all supporters of a is defined as:
$$\begin{aligned} Sup(a) = \left\{ b \in \mathscr {A}_{\mathscr {T}} \mid (b, a) \in E_{\mathscr {T}} \wedge \omega _{\mathscr {T}}(b, a) > 0\right\} \end{aligned}$$
(6)
- If the edge weight \(\omega _{\mathscr {T}}(b, a)\) is negative, b is an attacker of a. The set of all attackers of a is defined as:
$$\begin{aligned} Att(a) = \left\{ b \in \mathscr {A}_{\mathscr {T}} \mid (b, a) \in E_{\mathscr {T}} \wedge \omega _{\mathscr {T}}(b, a) < 0\right\} \end{aligned}$$
(7)
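Eqs. (6) and (7) amount to partitioning each node's incoming edges by the sign of their weight. A minimal sketch, with toy edge weights standing in for learned \(\omega _{\mathscr {T}}\) values:

```python
# a QAF as an (edge -> weight) map; positive weight = support,
# negative weight = attack (toy weights, not learned values)
omega = {
    ("NumInquiry", "Inquiry"): 1.4,
    ("MsinceMostRecentInq", "Inquiry"): -0.6,
    ("Inquiry", "Risk"): 1.1,
    ("ExternalRiskEstimation", "Risk"): -0.9,
}

def sup(a):
    """Eq. (6): supporters of a are children connected by positive edges."""
    return {b for (b, target), w in omega.items() if target == a and w > 0}

def att(a):
    """Eq. (7): attackers of a are children connected by negative edges."""
    return {b for (b, target), w in omega.items() if target == a and w < 0}

print(sup("Risk"), att("Risk"))
```

These two sets are all the structure needed to phrase the "because ... although ..." answers in the dialogue templates.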
The user’s explanation request \(\mathscr {Q}_{(a)}\) is answered by the explanation \(\mathscr {X}_{(a)}\), which is constructed based on the strength of the arguments within these defined sets. According to Eqs. (1) and (2), for any argument \(a \in \mathscr {A}\), its strength s(a) can be obtained by the following formula:
$$\begin{aligned} s(a)=\Phi \left( \ln \left( \frac{\beta _{\mathscr {T}}(a)}{1-\beta _{\mathscr {T}}(a)}\right) +\sum _{b\in Att(a)}\omega _{\mathscr {T}}(b,a)\times s(b) +\sum _{b\in Sup(a)}\omega _{\mathscr {T}}(b,a)\times s(b)\right) \end{aligned}$$
(8)
Our intuition is that the dialogical explanation should be simpler than, but consistent with, CAM. The explanation of an argument a may consist of its supporters and attackers that have significant impacts on a. Therefore, we propose a mechanism to simplify the structure of CAM, called the SSC algorithm, which searches for the dominant arguments, as described in Algorithm 2. For any \(S \subseteq \mathscr {A}\), if \(S = \emptyset\), let \(max(S) = \emptyset\); else, let \(max(S) = argmax_{b \in S}(|\omega (b,a)\times s(b)|)\), where argmax refers to the argument b at which the absolute value of \(\omega _{\mathscr {T}}(b,a)\times s(b)\) is largest. In Algorithm 2, we aim to find the fewest supporters and attackers in Sup(a) and Att(a) with significant impacts on a that yield the same decision as CAM; the sets containing the selected supporters and attackers are denoted as \(Sup^{\prime }(a)\) and \(Att^{\prime }(a)\), respectively.
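The strength propagation of Eq. (8) and the \(max(S)\) selection used by the SSC algorithm can be sketched together. The children of the decision node, their edge weights, and the base score below are illustrative values only:

```python
import math

def phi(z):
    return 1.0 / (1.0 + math.exp(-z))

def strength(beta_a, contributions):
    """Eq. (8): combine the logit of the base score with the weighted
    strengths of all children. `contributions` holds (omega(b, a), s(b))
    pairs; attackers simply carry negative weights."""
    z = math.log(beta_a / (1 - beta_a))
    z += sum(w * s for w, s in contributions)
    return phi(z)

def dominant(children):
    """max(S) in the SSC algorithm: the child with the largest absolute
    impact |omega(b, a) * s(b)|, or None if the set is empty."""
    if not children:
        return None
    return max(children, key=lambda b: abs(children[b][0] * children[b][1]))

# toy children of the decision node: name -> (edge weight, strength)
kids = {"Inquiry": (1.1, 0.8), "ExternalRiskEstimation": (-0.9, 0.7)}
s_D = strength(0.5, kids.values())     # base score 0.5 contributes logit 0
print(dominant(kids), round(s_D, 3))
```

With a neutral base score of 0.5, the logit term vanishes and the decision is driven entirely by the children; here "Inquiry" has the larger absolute impact and would be selected first by SSC.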

Then, according to the simplified argumentation structure, we provide a simple argumentation dialogue for risk assessment as follows. Building on prior work in argumentation theory16, we define two phrase-generating functions, \(r^{safe}(a)\) and \(r^{risky}(a)\), for any argument \(a \in \mathscr {A}\). These functions are based on \(l_a\), which is a natural language description of the feature or concept that argument a represents. This natural language representation makes the model’s internal reasoning comprehensible to a human user. The determination of whether an argument’s value is “good” or “bad” is based on its impact on the target outcome’s risk. Specifically, if an argument’s value increases the risk of the target outcome, it is considered “bad” and triggers the “risky” response template. Conversely, if it decreases the risk, it is considered “good” and triggers the “safe” response template.
$$\begin{aligned}&\ r^{safe}(a) = ({\text {the}})\; l_a \; {\text {was good}};\\&\ r^{risky}(a) = ({\text {the}})\; l_a \; {\text {was bad}};\\&\ r^{risky}(\varnothing ) = r^{safe}(\varnothing ) = \{\}. \end{aligned}$$
We acknowledge that this dialogue template is specifically designed for the risk assessment problem to align with the application context and model logic. For different application domains, a new set of corresponding templates would be designed. Since this paper focuses on the risk assessment problem, we provide this specific template as a concrete example, demonstrating how our framework can be applied to a real-world task. An argumentation dialogue is such that for any \(a \in \mathscr {A}\):
$$\begin{aligned} {\text {if}}&\ a = D\; {\text {and}}\; s(D) > 0.5{:} \\&\mathscr {Q}(a)= \{{\text {Why was}}\; a\; {\text {assessed as high risk?}} \}\\&\mathscr {X}(a) = \{{\text {This case was assessed as high risk because}} \} \\&\qquad \qquad + \sum _{b \in Sup^{\prime }(a)} r^{risky}(b) +\{{\text {although}}\} + \sum _{b \in Att^{\prime }(a)}r^{safe}(b) ;\; {\text {else}}\\ {\text {if}}&\ a = D\; {\text {and}}\; s(D) \le 0.5 {:} \\&\mathscr {Q}(a)= \{ {\text {Why was}}\; a\; {\text {assessed as low risk?}} \}\\&\mathscr {X}(a) = \{ {\text {This case was assessed as low risk because}} \} \\&\qquad \qquad + \sum _{b \in Att^{\prime }(a)} r^{safe}(b) +\{{\text {although}}\} + \sum _{b \in Sup^{\prime }(a)}r^{risky}(b) ;\; {\text {else}} \\ {\text {if}}&\ a \in \mathscr {C}\; {\text {and}}\; a\in Sup(D){:} \\&\mathscr {Q}(a)= \{ {\text {Why was}}\; a \; {\text {considered to be bad?}}\}\\&\mathscr {X}(a) = \{ l_a \; {\text {was considered to be bad because}} \} \\&\qquad \qquad + \sum _{b \in Sup^{\prime }(a)} r^{risky}(b) +\{{\text {although}}\} + \sum _{b \in Att^{\prime }(a)}r^{safe}(b) ;\; {\text {else}}\\ {\text {if}}&\ a \in \mathscr {C}\; {\text {and}}\; a\in Att(D) {:} \\&\mathscr {Q}(a)= \{ {\text {Why was}}\; a \; {\text {considered to be good?}}\}\\&\mathscr {X}(a) = \{ l_a\; {\text {was considered to be good because}} \} \\&\qquad \qquad + \sum _{b \in Att^{\prime }(a)} r^{safe}(b) +\{{\text {although}}\} + \sum _{b \in Sup^{\prime }(a)}r^{risky}(b) ;\; {\text {else}}\\ {\text {if}}&\ a \in \mathscr {F}{:} \\&\mathscr {Q}(a)= \{ {\text {Why was}}\; a \; {\text {considered to be bad (or good)?}} \}\\&\mathscr {X}(a) = \{{\text {Because in this case,}} \; l_a\; {\text {was}}\; x_a \} ; \end{aligned}$$
where Att(D) and Sup(D) represent the arguments attacking or supporting D in a broad sense; these arguments are not necessarily directly connected to D. \(x_a\) denotes the input value of feature a.
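The top-level template for the decision node can be sketched as a small rendering function. The `explain_decision` helper and its joining of phrases with commas are assumptions made for this sketch; only the "because ... although ..." structure and the 0.5 threshold come from the templates above:

```python
def r_safe(desc):
    return [f"the {desc} was good"]

def r_risky(desc):
    return [f"the {desc} was bad"]

def explain_decision(s_D, sup_desc, att_desc):
    """Render the decision-node template given descriptions of the
    selected supporters (Sup'(D)) and attackers (Att'(D))."""
    if s_D > 0.5:
        head = "This case was assessed as high risk because "
        first = sum(map(r_risky, sup_desc), [])
        second = sum(map(r_safe, att_desc), [])
    else:
        head = "This case was assessed as low risk because "
        first = sum(map(r_safe, att_desc), [])
        second = sum(map(r_risky, sup_desc), [])
    text = head + ", ".join(first)
    if second:
        text += ", although " + ", ".join(second)
    return text + "."

print(explain_decision(0.84, ["inquiry information"], ["consolidated safe score"]))
```

The same pattern, with the "good"/"bad" roles swapped per the templates, covers the concept-level questions.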
Case study: explanation for risk estimation
Consider CAM in Fig. 6 for a risk assessment case (labeled as high risk by the bank) from the FICO dataset. The QAF in the figure is built by extracting the concepts and the dialectical relations between them from the raw features with the help of \(P_{QAL}\). The key reasoning paths found by the SSC algorithm are drawn with thick lines, starting with the strongest supporting and attacking arguments, “Inquiry” and “ExternalRiskEstimation”.

A visualization of a random high-risk case in a risk assessment task. (a) illustrates that CAM predicts a risk score of 0.84 for this case. (b) shows that the predicted risk score is 0.54 by using the simplified CAM. The feature values of this case are input through the grey lines. The green and red edges denote the support and attack relations, respectively. The value labeled on each edge represents the influence from the lower node to the upper node, calculated as the product of the lower node value and the edge value.
The base score of the decision node “Risk” is initially set to 0.5, while the base scores of the other concept nodes are calculated by the field-wise learning algorithm in CAM. According to the MLP-based reasoning method, the strength of “Risk” with the original QAF is \(s(Risk)=0.84\), which is regarded as a “high-risk case” because \(s(Risk)>0.5\). With the help of the SSC algorithm, the complex QAF can be simplified as shown in Fig. 6b. The strength of “Risk” with the simplified QAF is \(s(Risk)=0.54\), which yields the same decision as the original one but with a more intuitive and simpler reasoning process. A simple argumentation dialogue between a user and CAM may then proceed as follows:
User: Why was this case assessed as high risk?
CAM: This case was assessed as high risk because the information about this consumer’s credit bureau report pulled by a lending institution (the description of “Inquiry”) was bad, although the consolidated safe score (the description of “ExternalRiskEstimation”) was good.
User: Why was the “Inquiry” considered to be bad?
CAM: The “Inquiry” was considered to be bad because the number of times that a lending institution has pulled this consumer’s credit bureau report (the description of “NumInquiry”) was bad, although the months since the most recent inquiry excluding the last 7 days (the description of “MSinceMostRecentInqexcl7days”) was good.
User: Why was the “NumInquiry” considered to be bad?
CAM: The “NumInquiry” was considered to be bad because the number of inquiries in the last 6 months (the description of “NumInqLast6M”) was bad, and the number of inquiries in the last 6 months excluding the last 7 days (the description of “NumInqLast6Mexcl7days”) was bad.
User: Why was the “NumInqLast6M” considered to be bad?
CAM: Because the number of inquiries in the last 6 months was 9.
Analysis. From the explanation, we know that “Inquiry” has the largest positive influence on “Risk”, while “ExternalRiskEstimation” has the largest negative influence. From the perspective of banks, the number of inquiries increases when a customer actively applies for a new credit card or mortgage, and research shows that opening several credit accounts in a short period indicates greater credit risk. In this case, the customer applied 9 times for new credit cards or mortgages within 6 months; even though there was no new application record in the last month, the “Inquiry” information of this customer still looks bad. Although the bank has a high evaluation score (69) for this user, the influence of “Inquiry” outweighs that of “ExternalRiskEstimation”, so the customer is still assessed as a high risk for delinquency. The explanation reveals the key path of the reasoning process, making it more intuitive and straightforward. Moreover, the human knowledge embedded in the explanation makes it more acceptable and understandable.
