Analytical approach and methodology
The framework is designed to reveal correlations between key variables by focusing on their behavior in response to changes in a specific independent variable. We establish that the correlation between two dependent variables is implied if they exhibit concurrent changes when a common independent variable is varied, while holding all other factors constant. Specifically, the dependent variables are considered to be correlated when their partial derivatives with respect to the common independent variable are either both positive or both negative. Observing the same sign of these partial derivatives serves as a proxy for the underlying functional relationships between key variables, providing a reliable indication of the directional influence of each factor on the fidelity of the recovered data. This is particularly valuable in high-dimensional or nonlinear systems, where deriving closed-form expressions for direct relationships between key variables is computationally prohibitive or analytically impossible. By leveraging a common independent variable as a shared proxy, we approximate these correlations in a consistent framework that ensures comparability across key variables, enabling mathematically rigorous yet computationally feasible insights without requiring explicit functional forms.
We acknowledge that nonlinear dependencies may be significant in contexts where variable interactions exhibit strong non-monotonic behaviors or feedback loops, making monotonic patterns insufficient to capture the complexity of the system. However, simpler patterns are preferable when they can sufficiently describe the relationships of interest, as they are more interpretable and easier to analyze. In our context, the derivative-based method allows us to extract such monotonic or linear relationships, which are sufficient to provide meaningful insights by capturing the most critical directional tendencies. Moreover, this analytical approach aligns with our empirical evaluation, where we use the Pearson correlation coefficient, a metric that also assumes linear relationships. This consistency between the theoretical and empirical approaches ensures that our analysis remains coherent and meaningful within the context of this study. Our approach also aligns with established mathematical practice, where critical features are extracted without modeling the full complexity of the system. For example, in stability analysis, the sign of the real part of the eigenvalues provides sufficient information to determine whether perturbations will decay or grow over time. In optimization using quadratic forms, as another example, definiteness, derived from the signs of the eigenvalues, guarantees the existence of maxima or minima without requiring detailed modeling of the functional surface. Similarly, in our framework, the sign of partial derivatives serves as a proxy for the underlying functional relationships, enabling us to approximate correlations and directional tendencies without fully modeling the complexities of the nonlinear dependencies.
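To make the sign-of-derivative proxy concrete, the following minimal Python sketch (not part of the framework itself) sweeps a single independent variable \(s\) for two hypothetical dependent variables, checks that their finite-difference derivatives share a sign, and confirms that the Pearson correlation between them is then positive; the functions `f` and `g` and the range of `s` are illustrative assumptions.

```python
# Minimal sketch (illustrative only): the sign-of-derivative proxy for correlation.
import numpy as np
from scipy.stats import pearsonr

def f(s):                      # hypothetical dependent variable 1
    return np.log1p(s)         # monotonically increasing in s

def g(s):                      # hypothetical dependent variable 2
    return 1.0 - np.exp(-s)    # also monotonically increasing in s

s = np.linspace(0.1, 3.0, 50)          # sweep the common independent variable
df_ds = np.gradient(f(s), s)           # finite-difference partial derivatives
dg_ds = np.gradient(g(s), s)

same_sign = np.all(np.sign(df_ds) == np.sign(dg_ds))
r, _ = pearsonr(f(s), g(s))            # empirical counterpart used in this study

print(f"derivatives share a sign: {same_sign}, Pearson r = {r:.3f}")
# When both derivatives are positive (or both negative) over the sweep,
# the Pearson correlation between f and g is positive, matching the proxy.
```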
To establish the correlations, our framework introduces key definitions and theorems that underpin the analysis and lead to the two main corollaries. The full proofs for all theorems, along with the supporting lemmas, are provided in Appendix B. The two key corollaries highlight the following important relationships:
1. the correlation between the Kullback-Leibler divergence (KLD) of the true and assumed priors versus the deviation of recovered data, as implied in Corollary 1;
2. the correlation between the prediction error of the model versus the deviation of recovered data, as suggested by Corollary 2.
Our mathematical framework employs theoretical priors, likelihoods, and posteriors that are designed to define or approximate the empirical counterparts used in our experimental work presented elsewhere in the paper. Throughout this paper, when we compare these approaches, we specifically refer to them as either theoretical or empirical priors, likelihoods, and posteriors to clearly distinguish between the two.
The inverse estimation problem is inherently multivariate. However, in this theoretical framework, we focus on univariate aspects for two main reasons. First, the empirical counterpart of this problem is similarly approached in a univariate manner, where data is often analyzed and visualized one variable at a time. Even the empirical evaluation of the KLD is typically computed for marginal distributions due to the challenges of accurately estimating high-dimensional distributions. Thus, to maintain alignment with the empirical methodology, our framework mirrors this focus by primarily investigating univariate distributions. Second, the framework is designed to reveal correlations between key variables by analyzing their behavior through partial derivatives with respect to a single independent parameter. Since partial derivatives are inherently defined for single-variable changes, this univariate perspective makes the analysis tractable, allowing us to derive clearer theoretical expectations in the context of inverse estimation.
Key definitions and theorems
In this subsection, we establish the foundational definitions and theorems that underpin the theoretical framework of our study. These definitions and theorems provide the formal structure needed to explore the relationships that will be addressed in the subsequent subsections.
Definition 1
(Marginal assumed prior) The marginal assumed prior probability density function (pdf) for a parameter \(\theta _i\) is a univariate uniform distribution over the range \([-3, 3]\), where the probability density is \(\frac{1}{6}\) everywhere across this parameter range:
$$\begin{aligned} q(\theta _i) = U(\theta _i \mid -3, 3) = {\left\{ \begin{array}{ll} \frac{1}{6} & \text {if } -3 \le \theta _i \le 3 \\ 0 & \text {otherwise} \end{array}\right. } \end{aligned}$$
Interpretation. Definition 1 formalizes the use of a simple, uninformative prior for each parameter, as discussed previously. It adopts a uniform distribution over the specified range, ensuring that all values within the plausible parameter range \([-3, 3]\) are equally likely.
Definition 2
(Assumed prior) The multivariate assumed prior pdf \(q(\varvec{\Theta })\) is derived from the marginal version by assuming independence among the parameters, such that the pdf is uniform over a \(d\)-dimensional space with each dimension independently sampled from the univariate uniform distribution:
$$\begin{aligned} q(\varvec{\Theta }) = q(\theta _1, \theta _2, \dots , \theta _d) = \prod _{i=1}^d q(\theta _i) = \prod _{i=1}^d U(\theta _i \mid -3, 3) \end{aligned}$$
$$\begin{aligned} q(\varvec{\Theta }) = {\left\{ \begin{array}{ll} \left( \frac{1}{6} \right) ^d & \text {if } -3 \le \theta _i \le 3 \text { for all } i = 1, 2, \dots , d \\ 0 & \text {otherwise} \end{array}\right. } \end{aligned}$$
Interpretation. By assuming independence among the parameters, Definition 2 extends the univariate uniform prior to a multivariate space, which is essential because the actual inverse problem is inherently multivariate. This construct represents our theoretical assumed multivariate prior and is also implemented as the empirical assumed multivariate prior in our Bayesian inference experiments.
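As an illustration of Definitions 1 and 2, the short sketch below (with an assumed dimensionality \(d = 5\)) draws one sample from the assumed prior and confirms that the joint density inside \([-3, 3]^d\) equals \((1/6)^d\).

```python
# Sketch of Definitions 1-2: the assumed prior as independent U(-3, 3) marginals.
import numpy as np

d = 5                                   # illustrative dimensionality (assumption)
rng = np.random.default_rng(0)
theta = rng.uniform(-3.0, 3.0, size=d)  # one draw from q(Theta)

def q_marginal(theta_i):
    """Marginal assumed prior q(theta_i) = U(theta_i | -3, 3)."""
    return 1.0 / 6.0 if -3.0 <= theta_i <= 3.0 else 0.0

def q_joint(theta_vec):
    """Joint assumed prior q(Theta) = prod_i q(theta_i) = (1/6)^d inside the box."""
    return np.prod([q_marginal(t) for t in theta_vec])

print(q_joint(theta), (1.0 / 6.0) ** d)  # identical inside [-3, 3]^d
```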
Definition 3
(True prior) The true prior \(p(\varvec{\Theta })\) is derived from the dataset and assumed to follow a \(d\)-dimensional Gaussian distribution with mean vector \(\varvec{\mu }_{\text {p}}\) and covariance matrix \(\varvec{\Sigma }_{\text {p}}\):
$$\begin{aligned} p(\varvec{\Theta }) = {\mathcal {N}}_d(\varvec{\Theta } \mid \varvec{\mu }_{\text {p}}, \varvec{\Sigma }_{\text {p}}) \end{aligned}$$
Interpretation. In our theoretical construct, Definition 3 assumes that the true prior follows a \(d\)-dimensional Gaussian distribution characterized by a mean vector and a covariance matrix. This assumption provides a structured way to model the true prior, facilitating mathematical tractability and analytical convenience. However, it is important to note that the empirical true prior derived from the dataset may not necessarily adhere to this Gaussian assumption.
Definition 4
(Marginal true prior) The marginal true prior \(p(\theta _i)\) for each component \(\theta _i\) of the parameter vector \(\varvec{\Theta }\) is obtained by marginalizing the multivariate Gaussian prior \(p(\varvec{\Theta })\) over all components of \(\varvec{\Theta }\) except \(\theta _i\):
$$\begin{aligned} p(\theta _i) = \int \cdots \int p(\varvec{\Theta }) \, d\varvec{\Theta }_{\setminus i} \end{aligned}$$
where \(\varvec{\Theta }_{\setminus i}\) denotes all components of \(\varvec{\Theta }\) except \(\theta _i\). This marginal true prior \(p(\theta _i)\) follows a Gaussian distribution with mean \(\mu _{\text {p}}\) and variance \(\sigma _{\text {p}}^2\), derived from the multivariate Gaussian prior \(p(\varvec{\Theta })\):
$$\begin{aligned} p(\theta _i) = {\mathcal {N}}(\theta _i \mid \mu _{\text {p}}, \sigma _{\text {p}}^2) \end{aligned}$$
where \(-3 \le \mu _{\text {p}} \le 3\) and \(\sigma _{\text {p}} \le 1\).
Interpretation. In Definition 4, we obtain the marginal true prior distribution for a particular parameter \(\theta _i\) by integrating out the other dimensions of the multivariate Gaussian prior \(p(\varvec{\Theta })\) defined in Definition 3. This marginal true prior \(p(\theta _i)\) follows a Gaussian distribution with mean \(\mu _{\text {p}}\) and variance \(\sigma _{\text {p}}^2\), where \(-3 \le \mu _{\text {p}} \le 3\) and \(\sigma _{\text {p}} \le 1\). These bounds on the mean and standard deviation are chosen to simulate the empirical marginal true prior, examples of which are shown in Fig. 2. In our empirical setting, marginal true priors have a standard deviation of 1 and a mean of 0 (as this is defined by the problem setup where the data is z-normalized), but their forms can vary from unimodal to multimodal. Most multimodal priors feature a dominant mode along with smaller minor modes. By focusing on the major mode and disregarding the minor modes, the range \(-3 \le \mu _{\text {p}} \le 3\) and \(\sigma _{\text {p}} \le 1\) approximately simulates these empirical marginal true priors.
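The following sketch illustrates Definitions 3 and 4 with an assumed three-dimensional mean vector and covariance matrix (values chosen only for illustration): sampling from the multivariate Gaussian true prior and inspecting one coordinate recovers the Gaussian marginal \(p(\theta _i) = {\mathcal {N}}(\theta _i \mid \mu _{\text {p}}, \sigma _{\text {p}}^2)\).

```python
# Sketch of Definitions 3-4: a d-dimensional Gaussian true prior whose i-th
# marginal is N(mu_p, sigma_p^2). All numerical values are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
mu_vec = np.array([0.0, 0.5, -0.5])            # means within [-3, 3]
cov = np.array([[1.0, 0.3, 0.1],
                [0.3, 0.8, 0.2],
                [0.1, 0.2, 0.9]])              # marginal sigma_p <= 1 on the diagonal

samples = rng.multivariate_normal(mu_vec, cov, size=200_000)

i = 1                                           # inspect theta_i
mu_p, sigma_p = mu_vec[i], np.sqrt(cov[i, i])
print(samples[:, i].mean(), mu_p)               # marginal mean ~= mu_p
print(samples[:, i].std(),  sigma_p)            # marginal std  ~= sigma_p
# The marginal of a multivariate Gaussian is Gaussian, so p(theta_i) = N(mu_p, sigma_p^2):
print(norm(mu_p, sigma_p).pdf(mu_p))            # closed-form marginal density at its mode
```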
Definition 5
(Likelihood) The likelihood \(p(c \mid \varvec{\Theta })\) is assumed to follow a \(d\)-dimensional Gaussian distribution with mean vector \(\varvec{\mu }_{\ell }\) and covariance matrix \(\varvec{\Sigma }_{\ell }\):
$$\begin{aligned} p(c \mid \varvec{\Theta }) = {\mathcal {N}}_d (\varvec{\Theta } \mid \varvec{\mu }_{\ell }, \varvec{\Sigma }_{\ell }) \end{aligned}$$
Interpretation. We assume in Definition 5 that the theoretical likelihood function \(p(c \mid \varvec{\Theta })\) follows a \(d\)-dimensional Gaussian distribution characterized by a mean vector and a covariance matrix. However, it is important to note that the empirical likelihood derived from the machine learning model may not necessarily adhere to this Gaussian assumption.
Definition 6
(Marginal likelihood) The marginal likelihood \(p(c \mid \theta _i)\) for each class \(c\) conditioned on \(\theta _i\) is obtained by integrating (marginalizing) the multivariate Gaussian likelihood \(p(c \mid \varvec{\Theta })\) over all components of \(\varvec{\Theta }\) except \(\theta _i\):
$$\begin{aligned} p(c \mid \theta _i) = \int \cdots \int p(c \mid \varvec{\Theta }) \, d\varvec{\Theta }_{\setminus i} \end{aligned}$$
where \(\varvec{\Theta }_{\setminus i}\) denotes all components of \(\varvec{\Theta }\) except \(\theta _i\). This marginal likelihood follows a Gaussian distribution with mean \(\mu _{\ell }\) and variance \(\sigma _{\ell }^2\):
$$\begin{aligned} p(c \mid \theta _i) = {\mathcal {N}}(\theta _i \mid \mu _{\ell }, \sigma _{\ell }^2) \end{aligned}$$
where \(\mu _{\ell }\) is the scalar mean value corresponding to \(\theta _i\) derived from the mean vector \(\varvec{\mu }_{\ell }\) and \(\sigma _{\ell }^2\) is the scalar variance value corresponding to \(\theta _i\) derived from the covariance matrix \(\varvec{\Sigma }_{\ell }\) of the likelihood distribution \(p(c \mid \varvec{\Theta })\).
Interpretation. In Definition 6, we obtain the likelihood function for a particular parameter \(\theta _i\) by integrating out the other dimensions of the multivariate Gaussian likelihood \(p(c \mid \varvec{\Theta })\) defined in Definition 5. This integration yields a marginal likelihood that is a univariate Gaussian distribution with a specific mean (\(\mu _{\ell }\)) and variance (\(\sigma _{\ell }^2\)).
Definition 7
(Estimated posterior) The estimated posterior pdf \(q(\varvec{\Theta } \mid c)\) is obtained using Bayes’ theorem:
$$\begin{aligned} q(\varvec{\Theta } \mid c) = \frac{p(c \mid \varvec{\Theta }) q(\varvec{\Theta })}{p(c)} \end{aligned}$$
Interpretation. The estimated posterior, as defined in Definition 7, is derived using Bayes’ theorem, which combines the assumed prior distribution \(q(\varvec{\Theta })\) and the likelihood \(p(c \mid \varvec{\Theta })\) to update our knowledge about the parameters \(\varvec{\Theta }\) after observing a specific class label \(c\).
Definition 8
(Marginal estimated posterior) The marginal estimated posterior pdf \(q(\theta _i \mid c)\) for each component \(\theta _i\) is obtained by marginalizing the multivariate estimated posterior \(q(\varvec{\Theta } \mid c)\) over all other components of \(\varvec{\Theta }\):
$$\begin{aligned} q(\theta _i \mid c) = \int \cdots \int q(\varvec{\Theta } \mid c) \, d\varvec{\Theta }_{\setminus i} \end{aligned}$$
where \(\varvec{\Theta }_{\setminus i}\) denotes all components of \(\varvec{\Theta }\) except \(\theta _i\).
Interpretation. In Definition 8, we obtain the estimated posterior for a particular parameter \(\theta _i\) by integrating out the other dimensions of the estimated posterior \(q(\varvec{\Theta } \mid c)\) defined in Definition 7.
Definition 9
(True posterior) The true posterior pdf \(p(\varvec{\Theta } \mid c)\) is given by:
$$\begin{aligned} p(\varvec{\Theta } \mid c) = \frac{p(c \mid \varvec{\Theta }) p(\varvec{\Theta })}{p(c)} \end{aligned}$$
Interpretation. The theoretical true posterior, as defined in Definition 9, updates the true prior with the likelihood to reflect the distribution of the parameters after observing the class label using the Bayesian formula. When the prior assumptions and likelihood are perfectly accurate, the empirical true posterior derived from the dataset will align with this theoretical definition.
Definition 10
(Marginal true posterior) The marginal true posterior pdf \(p(\theta _i \mid c)\) for each component \(\theta _i\) is obtained by marginalizing the true posterior \(p(\varvec{\Theta } \mid c)\) over all components of \(\varvec{\Theta }\) except \(\theta _i\):
$$\begin{aligned} p(\theta _i \mid c) = \int \cdots \int p(\varvec{\Theta } \mid c) \, d\varvec{\Theta }_{\setminus i} \end{aligned}$$
where \(\varvec{\Theta }_{\setminus i}\) denotes all components of \(\varvec{\Theta }\) except \(\theta _i\).
Interpretation. In Definition 10, we obtain the theoretical true posterior for a particular parameter \(\theta _i\) by integrating out the other dimensions of the theoretical true posterior \(p(\varvec{\Theta } \mid c)\) defined in Definition 9.
Theorem 1
(Simplified marginal estimated posterior with uniform prior) The marginal estimated posterior distribution \(q(\theta _i \mid c)\) given class \(c\) is proportional to the product of the marginal likelihood \(p(c \mid \theta _i)\) and the marginal assumed prior \(q(\theta _i)\), i.e.,
$$\begin{aligned} q(\theta _i \mid c) \propto p(c \mid \theta _i) q(\theta _i) \text { for } -3 \le \theta _i \le 3 \end{aligned}$$
Interpretation. By using a uniform prior, the marginal posterior is as simple as the product of the marginal prior and marginal likelihood, properly normalized to have the total area under the pdf equal to 1. The constant nature of the uniform prior allows the posterior to be expressed directly in this simple analytical form. This enables the analysis, which is fundamentally a multivariate Bayesian problem, to be simplified to a univariate Bayesian analysis, thanks to the uniform priors. That is not the case when the prior is not uniform, where the marginal posterior is generally more complex and cannot be expressed as a simple product of the marginal prior and marginal likelihood.
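A minimal numerical check of Theorem 1, using assumed likelihood moments \(\mu _{\ell } = 1.0\) and \(\sigma _{\ell } = 0.8\): multiplying the marginal likelihood by the constant uniform prior and normalizing over \([-3, 3]\) reproduces the likelihood renormalized on that interval.

```python
# Numerical sketch of Theorem 1 with illustrative values mu_l = 1.0, sigma_l = 0.8.
import numpy as np
from scipy.stats import norm

mu_l, sigma_l = 1.0, 0.8                    # assumed for illustration
theta = np.linspace(-3.0, 3.0, 2001)        # support of the uniform assumed prior

prior = np.full_like(theta, 1.0 / 6.0)      # q(theta_i) from Definition 1
likelihood = norm(mu_l, sigma_l).pdf(theta) # p(c | theta_i) from Definition 6

unnormalized = likelihood * prior           # Theorem 1: q(theta_i | c) is proportional to this
posterior = unnormalized / np.trapz(unnormalized, theta)

# Because the prior is constant on [-3, 3], the posterior is just the likelihood
# renormalized over the same interval.
reference = likelihood / np.trapz(likelihood, theta)
print(np.max(np.abs(posterior - reference)))   # ~0 up to numerical error
```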
Theorem 1 relies on a uniform prior distribution \(q(\theta _i)\) as defined in Definition 2. However, the true prior \(p(\varvec{\Theta })\) is assumed to follow a Gaussian distribution as defined in Definition 3. The marginal true posterior \(p(\theta _i \mid c)\) is generally not proportional to \(p(c \mid \theta _i) p(\theta _i)\). Representing \(p(\theta _i \mid c)\) in terms of the marginal likelihood and marginal prior is generally intractable due to the complexities introduced by the dependencies between parameters and the need to perform high-dimensional integrations. This means that we cannot simplify the multivariate Bayesian inference to univariate Bayesian analysis if we were to use the exact marginal true posterior. However, we aim to retain the simplicity observed in the marginal estimated posterior shown in Theorem 1. To approximate the marginal true posterior \(p(\theta _i \mid c)\) within the Bayesian framework, we introduce its approximation \(\hat{p}(\theta _i \mid c)\) in Definition 11 as the marginal true posterior under the assumption of independence between parameters.
Definition 11
(Approximated marginal true posterior) The approximated marginal true posterior distribution \(\hat{p}(\theta _i \mid c)\) for each component \(\theta _i\) is the marginal true posterior associated with the assumption of diagonal covariance matrices for both the true prior \(p(\varvec{\Theta })\) and the likelihood \(p(c \mid \varvec{\Theta })\). This assumption implies independence between parameters and can be expressed as:
$$\begin{aligned} p(\varvec{\Theta }) = \prod _{j=1}^d p(\theta _j), \quad p(c \mid \varvec{\Theta }) = \prod _{j=1}^d p(c \mid \theta _j). \end{aligned}$$
Theorem 2
(Simplified marginal true posterior under diagonal covariance matrices assumption) The approximated marginal true posterior distribution \(\hat{p}(\theta _i \mid c)\) for each component \(\theta _i\), under the assumption of diagonal covariance matrices for both the true prior \(p(\varvec{\Theta })\) and the likelihood \(p(c \mid \varvec{\Theta })\), is proportional to the product of the marginal true prior \(p(\theta _i)\) and the marginal likelihood \(p(c \mid \theta _i)\):
$$\begin{aligned} \hat{p}(\theta _i \mid c) \propto p(c \mid \theta _i) p(\theta _i). \end{aligned}$$
Interpretation. In Definition 11, we define \(\hat{p}(\theta _i \mid c)\) as the approximated marginal true posterior under the assumption of diagonal covariance matrices for both the prior and likelihood. As established in Theorem 2, this assumption implies the proportionality between the approximated marginal true posterior \(\hat{p}(\theta _i \mid c)\), the marginal true prior \(p(\theta _i)\), and the marginal likelihood \(p(c \mid \theta _i)\). In the subsequent derivations, we will use this approximation \(\hat{p}(\theta _i \mid c)\) instead of \(p(\theta _i \mid c)\), as this approximation allows us to maintain a more tractable form for our Bayesian analysis.
One aspect of our framework involves the use of uniform assumed priors, which ensures that the analysis of the estimated posterior can be simplified into a univariate Bayesian analysis without introducing inaccuracies, as established in Theorem 1. Another aspect pertains to the analysis of true priors and true posteriors, where the assumption of independence between parameters, as specified in Definition 11, introduces a trade-off. As formalized in Theorem 2, this assumption allows the analysis of the true posterior to be simplified into a univariate Bayesian analysis for each parameter. The inaccuracy of this simplification arises from the degree to which the independence assumption diverges from the actual dependencies present in the distributions. The accuracy of the results therefore depends on how well the independence assumption approximates the true prior and likelihood. When the parameters exhibit weak or no dependencies, the simplification is highly accurate; conversely, as the parameter dependencies strengthen, the approximation introduces greater inaccuracy. However, this compromise between simplicity and accuracy employed in this framework is a common approach in similar methodologies. Similar independence-based assumptions are made in widely used algorithms such as Naïve Bayes, which often perform well in practice\(^{47,48,49}\). This suggests that the proposed simplification, while imperfect, is a theoretically sound and analytically tractable approach that provides meaningful insights in many scenarios.
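The trade-off discussed above can be probed numerically. The sketch below uses an illustrative two-parameter example with correlated (non-diagonal) prior and likelihood covariances; it compares the exact marginal posterior of \(\theta _1\) with the product-of-marginals approximation of Theorem 2, showing a small but nonzero discrepancy attributable to the ignored dependence.

```python
# Sketch of the independence trade-off behind Definition 11 / Theorem 2, using a
# two-parameter example with an illustrative correlated prior and likelihood.
import numpy as np
from scipy.stats import multivariate_normal, norm

grid = np.linspace(-6, 6, 401)
T1, T2 = np.meshgrid(grid, grid, indexing="ij")
pts = np.dstack([T1, T2])

prior = multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]]).pdf(pts)
like  = multivariate_normal([1.0, -0.5], [[0.8, 0.3], [0.3, 0.8]]).pdf(pts)

# Exact marginal true posterior for theta_1: normalize the product, then integrate out theta_2.
post = prior * like
post /= np.trapz(np.trapz(post, grid, axis=1), grid)
exact_marginal = np.trapz(post, grid, axis=1)

# Approximation of Theorem 2: product of the *marginal* prior and *marginal* likelihood.
approx = norm(0.0, 1.0).pdf(grid) * norm(1.0, np.sqrt(0.8)).pdf(grid)
approx /= np.trapz(approx, grid)

print(np.max(np.abs(exact_marginal - approx)))  # small, but nonzero, because of the
                                                # off-diagonal (dependence) terms
```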
Theorem 3
(Marginal estimated posterior as univariate Gaussian) The marginal estimated posterior \(q(\theta _i \mid c)\) for each parameter \(\theta _i\) is a univariate Gaussian distribution with mean \(\mu _{\ell }\) and variance \({\sigma _{\ell }}^2\):
$$\begin{aligned} q(\theta _i \mid c) = {\mathcal {N}}(\theta _i \mid \mu _{\ell }, {\sigma _{\ell }}^2) \text { for } -3 \le \theta _i \le 3 \end{aligned}$$
Interpretation. Under the theoretical framework defined, the derivation presented in Theorem 3 shows that the marginal estimated posterior distribution for a particular parameter \(\theta _i\) is a Gaussian distribution with the same mean and standard deviation as those of the likelihood function. This result follows directly from the use of a uniform prior within the parameter plausible range, which does not introduce any additional bias or change to the shape of the posterior distribution. Consequently, the posterior retains the characteristics of the likelihood.
Theorem 4
(Approximated marginal true posterior as univariate Gaussian) The approximated marginal true posterior \(\hat{p}(\theta _i \mid c)\) for each parameter \(\theta _i\) is a univariate Gaussian distribution represented as:
$$\begin{aligned} \hat{p}(\theta _i \mid c) = {\mathcal {N}}(\theta _i \mid \mu _{\phi }, \sigma _{\phi }^2) \end{aligned}$$
where \(\mu _{\phi } = \frac{\mu _{\text {p}} \sigma _{\ell }^2 + \mu _{\ell } \sigma _{\text {p}}^2}{\sigma _{\ell }^2 + \sigma _{\text {p}}^2}\) is the mean value of the posterior and \(\sigma _{\phi }^2 = \frac{\sigma _{\text {p}}^2 \sigma _{\ell }^2}{\sigma _{\text {p}}^2 + \sigma _{\ell }^2}\) is the variance of the posterior.
Interpretation. Under the theoretical framework defined, the derivation presented in Theorem 4 reveals that the approximated marginal true posterior \(\hat{p}(\theta _i \mid c)\) is a Gaussian distribution, with its mean being a weighted average of the means of the prior and the likelihood, and its variance being a combination of the variances of both.
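A worked instance of Theorem 4 with assumed moments \(\mu _{\text {p}} = 0\), \(\sigma _{\text {p}} = 1\), \(\mu _{\ell } = 1.5\), and \(\sigma _{\ell } = 0.5\): the posterior mean is pulled toward the narrower likelihood.

```python
# Worked example of Theorem 4 with illustrative prior/likelihood moments.
mu_p, sigma_p = 0.0, 1.0      # marginal true prior (Definition 4)
mu_l, sigma_l = 1.5, 0.5      # marginal likelihood (Definition 6)

mu_phi = (mu_p * sigma_l**2 + mu_l * sigma_p**2) / (sigma_l**2 + sigma_p**2)
var_phi = (sigma_p**2 * sigma_l**2) / (sigma_p**2 + sigma_l**2)

print(mu_phi, var_phi)        # 1.2 and 0.2: the posterior mean lies closer to the
                              # likelihood mean because the likelihood is narrower
```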
Decreasing data recovery deviation with smaller prior discrepancy
Investigative focus and hypothesis: In this subsection, we investigate whether there is a correlation between two key dependent variables: the discrepancy between the assumed prior and the true prior versus the deviation of the recovered data distribution from the true data distribution. The core question we seek to answer is whether changes in the accuracy of the assumed prior are reflected in the accuracy of data recovery under different conditions. These different dataset conditions are represented by varying standard deviations of the marginal true prior, which serve as a proxy for the diversity and variability empirically found in our datasets. By examining the relationship between these variables, we aim to analytically uncover whether more accurate priors lead to better data recovery outcomes.
Theorem 5
(KLD between marginal true prior and marginal assumed prior) The KLD \(D_{\text {KL}}(p(\theta _i) \, \Vert \, q(\theta _i))\) between the marginal true prior \(p(\theta _i)\) and the marginal assumed prior \(q(\theta _i)\) is given by:
$$\begin{aligned} D_{\text {KL}}(p(\theta _i) \, \Vert \, q(\theta _i)) = \log \left( \frac{1}{\sigma _{\text {p}}}\right) + \log \left( 6 \right) \left( \Phi \left( \frac{3 - \mu _{\text {p}}}{\sigma _{\text {p}}} \right) - \Phi \left( \frac{-3 - \mu _{\text {p}}}{\sigma _{\text {p}}} \right) \right) - \left( \frac{1}{2} + \log \left( \sqrt{2 \pi } \right) \right) \end{aligned}$$
where \(\Phi\) denotes the cumulative distribution function (CDF) of the standard normal distribution.
Interpretation. Theorem 5 provides an expression that measures the degree of divergence between the marginal assumed prior and the marginal true prior using KLD. A higher KLD value indicates a greater discrepancy between the assumed and true priors. This divergence is primarily affected by the standard deviation of the true prior, \(\sigma _{\text {p}}\), as well as the position of the true prior’s mean \(\mu _{\text {p}}\) relative to the bounds of the uniform assumed prior \([-3, 3]\).
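The closed form in Theorem 5 can be checked numerically. The sketch below uses the assumed values \(\mu _{\text {p}} = 0\) and \(\sigma _{\text {p}} = 1\), for which essentially all of the Gaussian mass lies inside \([-3, 3]\), and compares the closed form against direct numerical integration of \(p \log (p/q)\).

```python
# Sketch verifying Theorem 5 numerically for illustrative values mu_p = 0, sigma_p = 1
# (with sigma_p <= 1 the Gaussian mass outside [-3, 3] is negligible).
import numpy as np
from scipy.stats import norm

mu_p, sigma_p = 0.0, 1.0

# Closed form from Theorem 5.
kld_closed = (np.log(1.0 / sigma_p)
              + np.log(6.0) * (norm.cdf((3 - mu_p) / sigma_p) - norm.cdf((-3 - mu_p) / sigma_p))
              - (0.5 + np.log(np.sqrt(2 * np.pi))))

# Direct numerical integration of p(theta) * log(p(theta) / (1/6)) over [-3, 3].
theta = np.linspace(-3.0, 3.0, 20001)
p = norm(mu_p, sigma_p).pdf(theta)
kld_numeric = np.trapz(p * np.log(p / (1.0 / 6.0)), theta)

print(kld_closed, kld_numeric)   # agree closely; the tiny tail mass outside [-3, 3] is neglected
```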
Theorem 6
(KLD between approximated marginal true posterior and marginal estimated posterior) The KLD \(D_{\text {KL}}(\hat{p}(\theta _i \mid c) \, \Vert \, q(\theta _i \mid c))\) between the approximated marginal true posterior \(\hat{p}(\theta _i \mid c)\) and the marginal estimated posterior \(q(\theta _i \mid c)\) is given by:
$$\begin{aligned}&D_{\text {KL}}(\hat{p}(\theta _i \mid c) \, \Vert \, q(\theta _i \mid c)) = \log \left( \frac{\sigma _{\ell }}{\sigma _{\phi }}\right) + \frac{{\sigma _{\phi }}^2 + (\mu _{\phi } - \mu _{\ell })^2}{2 {\sigma _{\ell }}^2} - \frac{1}{2} \end{aligned}$$
Interpretation. Theorem 6 provides an expression that shows how much the marginal estimated posterior deviates from the approximated marginal true posterior, again measured by KLD. The KLD in this context is driven by both the difference in the means of the marginal true posterior \(\mu _{\phi }\) and the marginal likelihood \(\mu _{\ell }\), as well as their variances, \(\sigma _{\phi }\) and \(\sigma _{\ell }\). It is important to note that, as shown in Theorem 4, the mean and variance of the true posterior, \(\mu _{\phi }\) and \(\sigma _{\phi }^2\), are also influenced by the mean and variance of the marginal true prior, \(\mu _p\) and \(\sigma _p^2\).
Theorem 7
(Derivative of KLD between marginal true prior and marginal assumed prior with respect to standard deviation) The partial derivative of the KLD \(D_{\text {KL}}(p(\theta _i) \, \Vert \, q(\theta _i))\) between the marginal true prior \(p(\theta _i)\) and the marginal assumed prior \(q(\theta _i)\) with respect to the standard deviation \(\sigma _{\text {p}}\) of the marginal true prior is:
$$\begin{aligned} \frac{\partial }{\partial \sigma _{\text {p}}} D_{\text {KL}}(p(\theta _i) \, \Vert \, q(\theta _i)) = -\frac{1}{\sigma _{\text {p}}} - \log \left( 6 \right) \left( \left( \frac{3 - \mu _{\text {p}}}{\sigma _{\text {p}}^2} \right) \phi \left( \frac{3 - \mu _{\text {p}}}{\sigma _{\text {p}}} \right) + \left( \frac{3 + \mu _{\text {p}}}{\sigma _{\text {p}}^2} \right) \phi \left( \frac{-3 - \mu _{\text {p}}}{\sigma _{\text {p}}} \right) \right) \end{aligned}$$
where \(\phi\) denotes the probability density function (PDF) of the standard normal distribution.
Interpretation. The derivative of the KLD with respect to \(\sigma _{\text {p}}\) presented in Theorem 7 reveals how the discrepancy between the marginal true prior and the marginal assumed prior changes as the standard deviation of the marginal true prior increases. Since \(-3 \le \mu _{\text {p}} \le 3\) according to Definition 4, the factors \((3 - \mu _{\text {p}})\) and \((3 + \mu _{\text {p}})\) remain within a positive range. Additionally, \(\phi\) refers to the standard normal probability density function evaluated at specific points, which is also positive. Notably, given that all terms involved are positive, the derivative is exclusively negative. This indicates that as the standard deviation of the marginal true prior increases, the KLD consistently decreases, reflecting a reduced discrepancy between the true and assumed priors.
Theorem 8
(Derivative of KLD between approximated marginal true posterior and marginal estimated posterior) The derivative of the KLD between the approximated marginal true posterior \(\hat{p}(\theta _i \mid c)\) and the marginal estimated posterior \(q(\theta _i \mid c)\) with respect to the standard deviation \(\sigma _{\text {p}}\) of the marginal true prior is:
$$\begin{aligned} \frac{\partial }{\partial \sigma _{\text {p}}} D_{\text {KL}}(\hat{p}(\theta _i \mid c) \, \Vert \, q(\theta _i \mid c)) = -\frac{{\sigma _{\ell }}^2}{{\sigma _{\ell }}^2 {\sigma _{\text {p}}}+{\sigma _{\text {p}}}^3} -\frac{{\sigma _{\ell }}^2 {\sigma _{\text {p}}} \left( 2 ({\mu _{\text {p}}} - {\mu _{\ell }})^2 - \left( {\sigma _{\ell }}^2 + {\sigma _{\text {p}}}^2 \right) \right) }{({\sigma _{\ell }}^2 + {\sigma _{\text {p}}}^2)^3} \end{aligned}$$
Interpretation. The derivative of the KLD with respect to \(\sigma _{\text {p}}\) in Theorem 8 reveals how the discrepancy between the approximated marginal true posterior and the marginal estimated posterior changes as the standard deviation of the marginal true prior increases. The first term in the derivative is unambiguously negative, as it is composed solely of positive quantities. The second term is more complex, involving the interaction between the means \(\mu _{\text {p}}\) and \(\mu _{\ell }\), and the variances \(\sigma _{\text {p}}^2\) and \(\sigma _{\ell }^2\). For the second term to remain negative or only slightly positive, the distance between the mean parameters \(\mu _{\text {p}}\) and \(\mu _{\ell }\) needs to be sufficiently large, and both variances \(\sigma _{\text {p}}^2\) and \(\sigma _{\ell }^2\) must be sufficiently small. Given that \(-3 \le \mu _{\text {p}} \le 3\) and \(\sigma _{\text {p}} \le 1\) as per Definition 4, the expression \(2 (\mu _{\text {p}} - \mu _{\ell })^2 - (\sigma _{\ell }^2 + \sigma _{\text {p}}^2)\) tends to be positive or only slightly negative, ensuring that the second term is negative or slightly positive. Under these conditions, the entire derivative remains negative, indicating that as the standard deviation of the marginal true prior increases, the KLD decreases consistently, reflecting a reduced discrepancy between the true and estimated posteriors.
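The signs asserted in Theorems 7 and 8 can be verified numerically. The sketch below evaluates both closed-form derivatives on an illustrative grid of \((\mu _{\text {p}}, \sigma _{\text {p}})\) values with assumed likelihood moments, anticipating Corollary 1.

```python
# Sketch supporting Corollary 1: evaluate the closed-form derivatives of
# Theorems 7 and 8 on an illustrative grid and check their signs.
import numpy as np
from scipy.stats import norm

sigma_p_grid = np.linspace(0.2, 1.0, 9)
mu_p_grid = np.linspace(-2.0, 2.0, 9)
mu_l, sigma_l = 2.5, 0.5          # illustrative likelihood moments (assumed)

def d_kld_prior(mu_p, s_p):       # Theorem 7
    return (-1.0 / s_p
            - np.log(6.0) * (((3 - mu_p) / s_p**2) * norm.pdf((3 - mu_p) / s_p)
                             + ((3 + mu_p) / s_p**2) * norm.pdf((-3 - mu_p) / s_p)))

def d_kld_post(mu_p, s_p):        # Theorem 8
    first = -sigma_l**2 / (sigma_l**2 * s_p + s_p**3)
    second = -(sigma_l**2 * s_p * (2 * (mu_p - mu_l)**2 - (sigma_l**2 + s_p**2))
               / (sigma_l**2 + s_p**2)**3)
    return first + second

signs = [(np.sign(d_kld_prior(m, s)), np.sign(d_kld_post(m, s)))
         for m in mu_p_grid for s in sigma_p_grid]
print(all(sp < 0 for sp, _ in signs), all(sq < 0 for _, sq in signs))
# Both derivatives are negative across this illustrative grid, consistent with
# Corollary 1; near the smallest mean separations the second term of Theorem 8
# becomes slightly positive, but the first (negative) term dominates.
```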
Corollary 1
(Decrease of KLD with standard deviation) Since the derivative of the KLD with respect to the standard deviation \(\sigma _{\text {p}}\) is negative for both \(D_{\text {KL}}(p(\theta _i) \, \Vert \, q(\theta _i))\) (Theorem 7) and \(D_{\text {KL}}(\hat{p}(\theta _i \mid c) \, \Vert \, q(\theta _i \mid c))\) (Theorem 8), it follows that both \(D_{\text {KL}}(p(\theta _i) \, \Vert \, q(\theta _i))\) and \(D_{\text {KL}}(\hat{p}(\theta _i \mid c) \, \Vert \, q(\theta _i \mid c))\) decrease together with an increase in the standard deviation \(\sigma _{\text {p}}\) of the marginal true prior, provided that the distance between the mean parameters of the prior \(\mu _{\text {p}}\) and likelihood \(\mu _{\ell }\) is sufficiently large and both variances are sufficiently small to ensure the second term in Theorem 8 remains negative.
Interpretation. Corollary 1 ties together the results of the previous theorems, showing that as the standard deviation of the true prior increases, both the KLD between the true and assumed priors, and the KLD between the true and estimated posteriors decrease. This indicates that, under varying dataset conditions represented by different standard deviations of the marginal true prior, a more accurate assumed prior positively correlates with better data recovery outcomes.
Increasing data recovery deviation with higher model prediction error
Consider a machine learning model that employs a marginal inaccurate likelihood \(q(c \mid \theta _i)\) for a specific parameter \(\theta _i\), with a classification threshold set at 0.5. The marginal accurate likelihood \(p(c \mid \theta _i)\) serves as the ground truth, representing the correct likelihood distribution. The marginal inaccurate likelihood \(q(c \mid \theta _i)\) introduces the only source of error in the model’s predictions, as all other parameters in the model utilize their respective marginal accurate likelihoods \(p(c \mid \theta _j)\) for \(j \ne i\). This setup isolates the impact of the flawed likelihood on the parameter \(\theta _i\) as the sole cause of any inaccuracies.
Definition 12
(Marginal accurate likelihood and marginal inaccurate likelihood) The marginal accurate likelihood \(p(c \mid \theta _i)\) for each class \(c\) conditioned on \(\theta _i\) serves as the ground truth likelihood and is the same marginal likelihood previously defined in Definition 6, characterized by a univariate Gaussian distribution with mean \(\mu _{\ell }\) and variance \(\sigma _{\ell }^2\):
$$\begin{aligned} p(c \mid \theta _i) = {\mathcal {N}}(\theta _i \mid \mu _{\ell }, \sigma _{\ell }^2) \end{aligned}$$
The marginal inaccurate likelihood \(q(c \mid \theta _i)\) for each class \(c\) conditioned on \(\theta _i\) is assumed to follow a univariate Gaussian function with the same variance \(\sigma _{\ell }^2\) as the marginal accurate likelihood \(p(c \mid \theta _i)\). The mean \(\mu _q\) of the marginal inaccurate likelihood is shifted by \(\beta\) from the mean \(\mu _{\ell }\) of the accurate likelihood in a direction that deviates more from both \(\mu _{\ell }\) and \(\mu _{\phi }\) (mean of the approximated marginal true posterior). Therefore, the marginal inaccurate likelihood \(q(c \mid \theta _i)\) is defined as:
$$\begin{aligned} q(c \mid \theta _i) = {\mathcal {N}}(\theta _i \mid \mu _q, \sigma _{\ell }^2) = {\left\{ \begin{array}{ll} {\mathcal {N}}(\theta _i \mid \mu _{\ell } - \beta , \sigma _{\ell }^2) & \text {if } \mu _q = \mu _{\ell } - \beta , \\ {\mathcal {N}}(\theta _i \mid \mu _{\ell } + \beta , \sigma _{\ell }^2) & \text {if } \mu _q = \mu _{\ell } + \beta . \end{array}\right. } \end{aligned}$$
Let \(\alpha = |\mu _{\ell } - \mu _{\phi }|\) denote the absolute mean difference between the accurate likelihood and the posterior, \(\beta = |\mu _{\ell } - \mu _q|\) the absolute mean difference between the accurate likelihood and the inaccurate likelihood, and \(\gamma = |\mu _{\phi } - \mu _q|\) the absolute mean difference between the approximated marginal true posterior and the marginal inaccurate likelihood, such that \(\gamma = \alpha + \beta\). Figure 7 presents the two scenarios for this theoretical setup.

Fig. 7: Visual comparison of marginal accurate and inaccurate likelihoods with classification outcomes. This figure illustrates the marginal accurate likelihood \(p(c \mid \theta _i)\) and the marginal inaccurate likelihood \(q(c \mid \theta _i)\) for the parameter \(\theta _i\). In scenario (a), the marginal inaccurate likelihood \(q(c \mid \theta _i)\) is shifted by \(-\beta\) from the accurate likelihood \(p(c \mid \theta _i)\), and in scenario (b), it is shifted by \(+\beta\). The absolute mean differences \(\alpha\), \(\beta\), and \(\gamma\) between these distributions are indicated, reflecting their respective relationships. The shaded areas under the curves represent different classification outcomes (True Positive, False Positive, True Negative, and False Negative) relative to the decision threshold of 0.5.
Interpretation. Definition 12 establishes the framework for understanding how discrepancies between the marginal accurate likelihood and the marginal inaccurate likelihood introduce errors in the predictions of a machine learning model. The marginal accurate likelihood \(p(c \mid \theta _i)\) represents the ground truth distribution of the data, assuming it follows a univariate Gaussian function with mean \(\mu _{\ell }\) and variance \(\sigma _{\ell }^2\). On the other hand, the marginal inaccurate likelihood \(q(c \mid \theta _i)\), which the model erroneously uses, is also assumed to be Gaussian with the same variance \(\sigma _{\ell }^2\), but its mean \(\mu _q\) is shifted by \(\beta\) from the accurate mean \(\mu _{\ell }\). This shift in mean \(\beta\) can occur in either direction: \(\mu _{\ell } - \beta\) or \(\mu _{\ell } + \beta\). This discrepancy leads to classification errors because the model’s decision boundary (set at a threshold of 0.5) will misclassify some data points due to the inaccurate likelihood. The definition also introduces the concept of three key parameters: \(\alpha\), \(\beta\), and \(\gamma\), which quantify the relationships between the accurate likelihood, the inaccurate likelihood, and the posterior distribution. These discrepancies translate into false positive (FP) and false negative (FN) errors, which are depicted in the accompanying figure (Fig. 7).
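A small sketch of Definition 12 with assumed values for \(\mu _{\ell }\), \(\sigma _{\ell }\), \(\mu _{\phi }\), and \(\beta\): the inaccurate likelihood is shifted away from both \(\mu _{\ell }\) and \(\mu _{\phi }\), so the identity \(\gamma = \alpha + \beta\) holds by construction.

```python
# Sketch of Definition 12 with illustrative values: the inaccurate likelihood is the
# accurate one shifted by beta away from both mu_l and mu_phi.
import numpy as np

mu_l, sigma_l = 1.0, 0.5     # accurate likelihood moments (assumed)
mu_phi = 0.6                 # mean of the approximated marginal true posterior (assumed)
beta = 0.4                   # shift of the inaccurate likelihood

# Shift in the direction that moves mu_q further from both mu_l and mu_phi.
mu_q = mu_l + beta if mu_l >= mu_phi else mu_l - beta

alpha = abs(mu_l - mu_phi)   # posterior-to-accurate-likelihood mean gap
gamma = abs(mu_phi - mu_q)   # posterior-to-inaccurate-likelihood mean gap
print(alpha, beta, gamma, np.isclose(gamma, alpha + beta))   # gamma = alpha + beta
```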
Theorem 9
(Marginal estimated posterior via inaccurate likelihood) The marginal estimated posterior \(q(\theta _i \mid c)\) for the parameter \(\theta _i\), obtained via the inaccurate likelihood \(q(c \mid \theta _i) = {\mathcal {N}}(\theta _i \mid \mu _q, \sigma _{\ell }^2)\), is a univariate Gaussian distribution with mean \(\mu _q\) and variance \({\sigma _{\ell }}^2\):
$$\begin{aligned} q(\theta _i \mid c) = {\mathcal {N}}(\theta _i \mid \mu _q, {\sigma _{\ell }}^2) \text { for } -3 \le \theta _i \le 3 \end{aligned}$$
Interpretation. Theorem 9 describes the form of the marginal estimated posterior \(q(\theta _i \mid c)\) when using an inaccurate likelihood \(q(c \mid \theta _i) = {\mathcal {N}}(\theta _i \mid \mu _q, \sigma _{\ell }^2)\). The result shows that, under these conditions, the marginal estimated posterior \(q(\theta _i \mid c)\) is also a univariate Gaussian distribution with the same mean \(\mu _q\) as the inaccurate likelihood and the same variance \(\sigma _{\ell }^2\).
Investigative focus and hypothesis: In this subsection, we explore the potential correlation between two key dependent variables: the model’s prediction error versus the deviation of the recovered data distribution from the true data distribution. Our primary question is whether variations in the model’s classification error influence the accuracy of data recovery across different model conditions. These different model conditions are represented by varying discrepancies \(\beta = |\mu _{\ell } – \mu _q|\) between the marginal accurate and inaccurate likelihoods, which act as simplified proxies for the complexities and variability present in error-prone models. By analyzing the relationship between these variables, we seek to determine whether higher prediction errors in the model correspond to greater deviations in data recovery outcomes.
Theorem 10
(Error generated by marginal inaccurate likelihood) The error generated by this model, represented as the cumulative probability of false positives (FP) and false negatives (FN), arises due to the discrepancy \(\beta = |\mu _{\ell } – \mu _q|\) between \(\mu _q\) and \(\mu _{\ell }\).
Case 1: \(\ \mu _q = \mu _{\ell } – \beta\) (This case is depicted in Fig. 7a)
$$\begin{aligned} \text {FP}_1&= \int _{\theta _{\text {left}}}^{\theta _{\text {left}} + \beta } {\mathcal {N}}(\theta _i \mid \mu _q, \sigma _{\ell }^2) \, d\theta _i \\ \text {FN}_1&= \int _{\theta _{\text {right}}}^{\theta _{\text {right}} + \beta } {\mathcal {N}}(\theta _i \mid \mu _q, \sigma _{\ell }^2) \, d\theta _i \end{aligned}$$
Case 2: \(\ \mu _q = \mu _{\ell } + \beta\) (This case is depicted in Fig. 7b)
$$\begin{aligned} \text {FN}_2&= \int _{\theta _{\text {left}}'}^{\theta _{\text {left}}' + \beta } {\mathcal {N}}(\theta _i \mid \mu _q, \sigma _{\ell }^2) \, d\theta _i \\ \text {FP}_2&= \int _{\theta _{\text {right}}'}^{\theta _{\text {right}}' + \beta } {\mathcal {N}}(\theta _i \mid \mu _q, \sigma _{\ell }^2) \, d\theta _i \end{aligned}$$
where \(\theta _{\text {left}}\) and \(\theta _{\text {right}}\) are the left and right 0.5 points of the marginal inaccurate likelihood, respectively; while \(\theta _{\text {left}}'\) and \(\theta _{\text {right}}'\) are the left and right 0.5 points of the marginal accurate likelihood, respectively.
Interpretation. Theorem 10 quantifies the error in a model’s predictions caused by the discrepancy \(\beta = |\mu _{\ell } – \mu _q|\) between the marginal accurate and inaccurate likelihoods. The error is represented by the cumulative probabilities of false positives (FP) and false negatives (FN), which arise from this shift. The theorem shows how the magnitude of \(\beta\) directly influences prediction errors by shifting the inaccurate likelihood relative to the accurate one, resulting in misclassifications.
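The error integrals of Theorem 10 reduce to differences of Gaussian CDFs. The sketch below evaluates Case 1 under the assumption that the "0.5 points" are the abscissae at which the inaccurate likelihood density crosses the 0.5 decision threshold; all numerical values are illustrative.

```python
# Sketch of Theorem 10, Case 1 (mu_q = mu_l - beta). The "0.5 points" are taken here
# to be the abscissae where the inaccurate likelihood density crosses the 0.5
# classification threshold; the values of mu_l, sigma_l, beta are illustrative.
import numpy as np
from scipy.stats import norm

mu_l, sigma_l, beta = 1.0, 0.5, 0.3
mu_q = mu_l - beta                              # Case 1

# Solve N(theta | mu_q, sigma_l^2) = 0.5 for theta (requires the peak to exceed 0.5).
half_width = sigma_l * np.sqrt(-2.0 * np.log(0.5 * sigma_l * np.sqrt(2 * np.pi)))
theta_left, theta_right = mu_q - half_width, mu_q + half_width

dist_q = norm(mu_q, sigma_l)
fp1 = dist_q.cdf(theta_left + beta) - dist_q.cdf(theta_left)     # FP_1 in Theorem 10
fn1 = dist_q.cdf(theta_right + beta) - dist_q.cdf(theta_right)   # FN_1 in Theorem 10
print(fp1, fn1)
# Increasing beta widens the integration windows [theta_left, theta_left + beta] and
# [theta_right, theta_right + beta], so the error grows, consistent with Theorem 12.
```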
Theorem 11
(KLD between approximated marginal true posterior and marginal estimated posterior via inaccurate likelihood) The KLD \(D_{\text {KL}}(\hat{p}(\theta _i \mid c) \, \Vert \, q(\theta _i \mid c))\) between the approximated marginal true posterior \(\hat{p}(\theta _i \mid c)\) and the marginal estimated posterior \(q(\theta _i \mid c) = {\mathcal {N}}(\theta _i \mid \mu _q, \sigma _q^2)\) obtained via the inaccurate likelihood is given by:
$$\begin{aligned} D_{\text {KL}}(\hat{p}(\theta _i \mid c) \, \Vert \, q(\theta _i \mid c)) = \log \left( \frac{\sigma _q}{\sigma _{\phi }}\right) + \frac{{\sigma _{\phi }}^2 + (\alpha + \beta )^2}{2 \sigma _q^2} – \frac{1}{2} \end{aligned}$$
Interpretation. Theorem 11 provides the KLD between the approximated marginal true posterior \(\hat{p}(\theta _i \mid c)\) and the marginal estimated posterior \(q(\theta _i \mid c)\) when the estimation is based on an inaccurate likelihood. The KLD is driven by several factors, including the ratio of the standard deviations \(\sigma _q\) and \(\sigma _{\phi }\), and the squared deviation \((\alpha + \beta )^2\).
Theorem 12
(Derivative of error with respect to \(\beta\)) The derivative of the error \(E\) (represented as false positives or false negatives given in Theorem 10) with respect to the absolute mean difference \(\beta\) between the marginal accurate likelihood and the marginal inaccurate likelihood is given by:
$$\begin{aligned} \frac{\partial }{\partial \beta } \text {E} = {\mathcal {N}}(\theta _i = u \mid \mu _q, \sigma _{\ell }^2) \end{aligned}$$
where \(u\) denotes the upper bound of the integral of the respective error area given in Theorem 10.
Interpretation. Theorem 12 provides the derivative of the error \(E\) (represented as false positives or false negatives specified in Theorem 10) with respect to the absolute mean difference \(\beta\). Notably, since the Gaussian probability density function \({\mathcal {N}}(\theta _i = u \mid \mu _q, \sigma _{\ell }^2)\) is always positive, the derivative is exclusively positive. This means that as the discrepancy \(\beta\) between the accurate and inaccurate likelihoods increases, the error (false positives or false negatives) always increases.
Theorem 13
(Derivative of KLD between approximated marginal true posterior and marginal estimated posterior with respect to \(\beta\)) The derivative of the KLD \(D_{\text {KL}}(\hat{p}(\theta _i \mid c) \, \Vert \, q(\theta _i \mid c))\) between the approximated marginal true posterior \(\hat{p}(\theta _i \mid c)\) and the marginal estimated posterior \(q(\theta _i \mid c) = {\mathcal {N}}(\theta _i \mid \mu _q, \sigma _q^2)\) obtained via the inaccurate likelihood, taken with respect to the absolute mean difference \(\beta\) between the accurate and inaccurate likelihoods, is given by:
$$\begin{aligned} \frac{\partial }{\partial \beta } D_{\text {KL}}(\hat{p}(\theta _i \mid c) \, \Vert \, q(\theta _i \mid c)) = \frac{\alpha + \beta }{\sigma _q^2} \end{aligned}$$
Interpretation. Theorem 13 provides the derivative of the KLD \(D_{\text {KL}}(\hat{p}(\theta _i \mid c) \, \Vert \, q(\theta _i \mid c))\) with respect to the absolute mean difference \(\beta\). Since \(\alpha\) and \(\beta\) are both positive quantities, and \(\sigma _q^2\) is positive, the derivative is exclusively positive. This means that as the discrepancy \(\beta\) between the accurate and inaccurate likelihoods increases, the divergence between the true and estimated posterior grows, reflecting a worsening data reconstruction.
Corollary 2
(Positive relationship between model error and posterior KLD with \(\beta\)) Since the derivatives \(\frac{\partial }{\partial \beta } \text {E} = {\mathcal {N}}(\theta _i = u \mid \mu _q, \sigma _{\ell }^2)\) (Theorem 12) and \(\frac{\partial }{\partial \beta } D_{\text {KL}}(\hat{p}(\theta _i \mid c) \, \Vert \, q(\theta _i \mid c)) = \frac{\alpha + \beta }{\sigma _q^2}\) (Theorem 13) are both strictly positive, increasing the absolute mean difference between the accurate likelihood and the inaccurate likelihood \(\beta\) leads to simultaneous increases in the model error and posterior KLD.
Interpretation. Corollary 2 ties together the findings from the previous theorems, showing that as \(\beta\) increases, both the model error (represented by false positives and false negatives) and the KLD between the approximated marginal true posterior and the marginal estimated posterior increase. The strictly positive derivatives, \(\frac{\partial }{\partial \beta } \text {E}\) and \(\frac{\partial }{\partial \beta } D_{\text {KL}}(\hat{p}(\theta _i \mid c) \, \Vert \, q(\theta _i \mid c))\), indicate that any increase in \(\beta\) consistently leads to higher error rates and a greater divergence between the estimated and true posteriors. This correlation implies that, across different model conditions characterized by varying discrepancies between the marginal accurate and inaccurate likelihoods, higher model prediction errors are associated with greater deviations in data recovery. In other words, the accuracy of the model’s predictions positively correlates with the fidelity of the recovered posterior distribution.
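As a closing numerical illustration of Corollary 2, the sketch below sweeps \(\beta\), evaluates the Case 1 error of Theorem 10 and the KLD of Theorem 11 (with \(\sigma _q = \sigma _{\ell }\), as implied by Theorem 9), and confirms that both increase monotonically, yielding a strongly positive Pearson correlation; all parameter values are assumptions for illustration.

```python
# Sketch of Corollary 2: sweep beta and confirm that the model error (Theorem 10)
# and the posterior KLD (Theorem 11) rise together. All parameter values are
# illustrative; sigma_q is taken equal to sigma_l, as in Theorem 9.
import numpy as np
from scipy.stats import norm, pearsonr

mu_l, sigma_l, mu_phi, sigma_phi = 1.0, 0.5, 0.6, 0.4
alpha = abs(mu_l - mu_phi)
sigma_q = sigma_l

betas = np.linspace(0.05, 1.0, 20)

# Error for Case 1, using the 0.5 points of the inaccurate likelihood as in Theorem 10.
half_width = sigma_l * np.sqrt(-2.0 * np.log(0.5 * sigma_l * np.sqrt(2 * np.pi)))
errors = (norm.cdf((-half_width + betas) / sigma_l) - norm.cdf(-half_width / sigma_l)) \
       + (norm.cdf(( half_width + betas) / sigma_l) - norm.cdf( half_width / sigma_l))

# KLD from Theorem 11.
klds = np.log(sigma_q / sigma_phi) + (sigma_phi**2 + (alpha + betas)**2) / (2 * sigma_q**2) - 0.5

r, _ = pearsonr(errors, klds)
print(np.all(np.diff(errors) > 0), np.all(np.diff(klds) > 0), round(r, 3))
# Both sequences increase strictly with beta, so they are positively correlated,
# mirroring the directional relationship stated in Corollary 2.
```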