Formulation of logic rule explanations
For a given input x from the dataset X, our explanation α = (αx, αw) comprises two components: an antecedent αx and a linear weight αw. The antecedent αx and a consequent y together form a logic rule αx ⇒ y, as illustrated in Fig. 1a. The linear weight αw indicates the contribution of the atoms in the antecedent αx within the logic rule. The meanings of the symbols used in this paper are defined in Section C.1 of the Supplementary Information.
An antecedent αx represents the condition under which a rule applies and corresponds to an explanation expressed in logical form. It is defined as a sequence αx = (o1,…, oL), where each oi is an atom, and L is the length of the sequence. An atom is the smallest unit of explanation and corresponds to a single interpretable feature of a given input—for instance, a condition such as “awesome ≥2”. These interpretable features may differ from the features used by deep learning models. They can have different granularities (e.g., words or phrases vs. tokens), be based on statistical properties (e.g., word frequency), or be derived using external tools (e.g., grammatical tagging of a word). Mathematically, each atom oi is a Boolean-valued function that returns true if the ith interpretable feature is present in the input x, and false otherwise. Additional details about the atom selection process are provided in Section C.3 of the Supplementary Information. An input sample x is said to satisfy an antecedent αx if the logical condition αx(x) evaluates to true.
The consequent y denotes the model’s predicted output, given that the antecedent is satisfied. In a classification task, y typically corresponds to the predicted class label; in regression, y would be a real-valued number.
Finally, the linear weight αw models the contribution of each atom in the logical relationship αx ⇒ y. It is represented as a matrix \({{\boldsymbol{\alpha }}}_{w}=\left(\begin{array}{ccc}{w}_{11}&\ldots &{w}_{1L}\\ \ldots &\ldots &\ldots \\ {w}_{K1}&\ldots &{w}_{KL}\end{array}\right)\), where K is the number of possible classes and wki indicates the contribution of atom oi to the prediction of class k. The magnitude of each weight reflects the strength of its corresponding atom’s contribution to the output prediction.
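To make the formulation concrete, the following minimal sketch shows one possible way to represent count-based atoms and to test whether an input satisfies an antecedent, assuming the antecedent is the conjunction of its atoms; the helper names and the word-count atom format are illustrative assumptions rather than a prescribed interface.

```python
from collections import Counter

def make_count_atom(word, min_count):
    """Boolean-valued atom o_i: true iff `word` occurs at least `min_count` times in the input."""
    def atom(text):
        return Counter(text.lower().split())[word] >= min_count
    atom.name = f"{word} >= {min_count}"
    return atom

def satisfies(antecedent, text):
    """An input satisfies alpha_x when every atom evaluates to true (conjunction assumed)."""
    return all(atom(text) for atom in antecedent)

# Example antecedent: ("awesome >= 2", "tasty >= 1")
alpha_x = [make_count_atom("awesome", 2), make_count_atom("tasty", 1)]
print(satisfies(alpha_x, "awesome food and awesome service with a tasty dessert"))  # True
```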
Framework for deep logic rule reasoning
Let f denote a deep learning model that estimates the probability p(y∣x), where x is the input data sample and y is a candidate class. We upgrade model f to a self-explaining version by adding a logic rule explanation α. Then, we can reformulate p(y∣x) as
$$p({y}| {\bf{x}},b)=\sum _{{\boldsymbol{\alpha }}}p({y}| {\boldsymbol{\alpha }},{\bf{x}},b)p({\boldsymbol{\alpha }}| {\bf{x}},b)=\sum _{{\boldsymbol{\alpha }}}p({y}| {\boldsymbol{\alpha }})p({\boldsymbol{\alpha }}| {\bf{x}},b),\quad s.t.,\quad \Omega ({\boldsymbol{\alpha }})\le S$$
(3)
Here, b represents a human’s prior belief about the rules, e.g., the desirable form of atoms, Ω(α) is the number of logic rules required to explain a given input x, and S is the number of samples (logic rules chosen by the model). Eq. (3) includes two constraints essential for ensuring explainability. The first constraint, p(y∣α, x, b) = p(y∣α), requires that the explanation α contain all information in the input x and the belief b that is useful for predicting y. Without this constraint, the model may “cheat” by predicting y directly from the input instead of using the explanation, which decreases faithfulness. The second constraint, Ω(α) ≤ S, requires that the model be well explained using only S explanations, where S is small enough to ensure readability (S = 1 in our implementation). We can further decompose Eq. (3) based on the independence between the input x and the human prior belief b (proof and assumptions in Section C.2 of the Supplementary Information):
$$p({y}| {\bf{x}},b)=\sum _{{\boldsymbol{\alpha }}}p({y}| {\boldsymbol{\alpha }})p({\boldsymbol{\alpha }}| {\bf{x}},b)\propto \sum _{{\boldsymbol{\alpha }}}p(b| {\boldsymbol{\alpha }})\cdot p({y}| {\boldsymbol{\alpha }})\cdot p({\boldsymbol{\alpha }}| {\bf{x}}),\,\,s.t.,\,\,\Omega ({\boldsymbol{\alpha }})\le S$$
(4)
Then, we further decompose Eq. (4) using an antecedent αx and its linear weight αw:
$$\begin{array}{rcl} p(y|{\mathbf{x}},b)&\propto & \sum\limits_{{\boldsymbol{\alpha}}_x, {\boldsymbol{\alpha}}_w} p(b | {\boldsymbol{\alpha}}_x, {\boldsymbol{\alpha}}_w)\cdot p(y | {\boldsymbol{\alpha}}_x, {\boldsymbol{\alpha}}_w)\cdot p({\boldsymbol{\alpha}}_x, {\boldsymbol{\alpha}}_w | {\mathbf{x}}) \\ &=& \sum\limits_{{\boldsymbol{\alpha}}_x} p(b | {\boldsymbol{\alpha}}_x) \left(\sum\limits_{{\boldsymbol{\alpha}}_w} p(y | {\boldsymbol{\alpha}}_w) \cdot p({\boldsymbol{\alpha}}_w | {\boldsymbol{\alpha}}_x) \right) \cdot\ {p({\boldsymbol{\alpha}}_x | {\mathbf{x}})}, \\ &=& \sum\limits_{{\boldsymbol{\alpha}}_x} \underbrace{p(b | {\boldsymbol{\alpha}}_x)}_{\begin{array}{c}{\rm{Human}}\\ {\rm{prior}}\end{array}} \cdot \underbrace{p(y | {\boldsymbol{\alpha}}_x)}_{\begin{array}{c}{\rm{Consequent}}\\ {\rm{estimation}}\end{array}} \ \cdot\ \ {\underbrace{p({\boldsymbol{\alpha}}_x | {\mathbf{x}})}_{\begin{array}{c}{\rm{Deep}}\,{\rm{antecedent}}\\ {\rm{generation}}\end{array}}}, \quad s.t., \quad {{\Omega}}({\boldsymbol{\alpha}}_x) \leq S \end{array}$$
(5)
In Eq. (5), two additional constraints are introduced to ensure that the weight αw functions as a faithful explanation. The first constraint, p(y∣αw) = p(y∣αx, αw), is designed to prevent the model from bypassing the explanatory weight αw and relying instead on latent representations of the antecedent αx for predicting the consequent y. The second constraint, p(αw∣αx) = p(αw∣αx, x), ensures that the estimation of αw is based solely on the selected antecedent αx and not directly on the raw input x. This guards against information leakage that could undermine the interpretability of the explanation.
We assume p(b∣αx) = p(b∣αx, αw), as b represents a human’s prior belief about the rules encoded in αx, which should not depend on how the model weights them internally. The only difference between Eq. (5) and Eq. (4) lies in the use of the antecedent αx instead of the full explanation α. This implies that introducing the weight αw only affects the internal estimation of the consequent, and without explicit guidance, this process may diverge significantly from human expectations.
The three derived terms correspond to three main modules of the proposed framework, SELOR. The first component, human prior p(b∣αx), encodes human guidance on preferred rule forms, aiming to reduce the likelihood of misunderstanding, as discussed in Section “Human Prior p(b|αx)”. The second, consequent estimation p(y∣αx), models the relationship between the explanation αx and the predicted output y through the use of the weight αw. This weight is carefully estimated to ensure a meaningful and consistent relationship, so that each explanation naturally leads to the prediction according to human perception, as described in Section “Consequent Estimation p(y|αx)”. Lastly, deep antecedent generation p(α∣x) leverages the deep representation of input x learned by the given deep model f to infer an appropriate explanation α, as elaborated in Section “Deep Antecedent Generation p(α|x)”.
The sparsity constraint Ω(αx)≤S for the explanations can be enforced by sampling from p(αx∣x). In particular, we rewrite Eq. (5) as an expectation and estimate it through sampling:
$$\begin{array}{lll}p({y}| {\bf{x}},b)\;\propto \;\sum\limits_{{{\boldsymbol{\alpha }}}_{x}}p(b| {{\boldsymbol{\alpha }}}_{x})\cdot p({y}| {{\boldsymbol{\alpha }}}_{x})\cdot p({{\boldsymbol{\alpha }}}_{x}| {\bf{x}})\\\qquad\qquad =\mathop{{\mathbb{E}}}\limits_{{{\boldsymbol{\alpha }}}_{x} \sim p({{\boldsymbol{\alpha }}}_{x}| {\bf{x}})}p(b| {{\boldsymbol{\alpha }}}_{x})\cdot p({y}| {{\boldsymbol{\alpha }}}_{x})\approx \frac{1}{S}\sum\limits_{\begin{array}{c}s\in [1,S]\\ {{\boldsymbol{\alpha }}}_{x}^{(s)} \sim p({{\boldsymbol{\alpha }}}_{x}| {\bf{x}})\end{array}}p(b| {{\boldsymbol{\alpha }}}_{x}^{(s)})\,p({y}| {{\boldsymbol{\alpha }}}_{x}^{(s)})\end{array}$$
(6)
where \({{\boldsymbol{\alpha }}}_{x}^{(s)}\) is the sth sample of αx. For example, to maximize the approximation term with S = 1, the antecedent generator p(αx∣x) must find a single sample \({{\boldsymbol{\alpha }}}_{x}^{(s)}\) that yields the largest \(p(b| {{\boldsymbol{\alpha }}}_{x}^{(s)})p({y}| {{\boldsymbol{\alpha }}}_{x}^{(s)})\), and it needs to assign a high probability to the best \({{\boldsymbol{\alpha }}}_{x}^{(s)}\). Otherwise, other samples with a lower \(p(b| {{\boldsymbol{\alpha }}}_{x}^{(s)})p({y}| {{\boldsymbol{\alpha }}}_{x}^{(s)})\) may be generated, thereby decreasing p(y∣x, b). This ensures the sparsity of p(αx∣x), which improves the model interpretability.
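As a rough illustration of the sampled estimate in Eq. (6), the sketch below averages p(b∣αx)p(y∣αx) over S sampled antecedents; sample_antecedent, prior, and consequent are placeholder callables standing in for the modules described in the following sections.

```python
import torch

def estimate_p_y(x, sample_antecedent, prior, consequent, S=1):
    """Monte Carlo estimate of p(y | x, b) in Eq. (6) (placeholder callables).

    sample_antecedent(x) -> one antecedent alpha_x^(s) drawn from p(alpha_x | x)
    prior(alpha_x)       -> p(b | alpha_x), a scalar (1.0 when only hard priors are used)
    consequent(alpha_x)  -> p(y | alpha_x), a tensor of shape (K,)
    """
    samples = [prior(a) * consequent(a) for a in (sample_antecedent(x) for _ in range(S))]
    return torch.stack(samples).mean(dim=0)  # average over the S sampled rules
```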
Human prior p(b∣αx)
Human prior p(b∣αx) = ph(b∣αx)ps(b∣αx) consists of hard priors ph(b∣αx) and soft ones ps(b∣αx).
Hard priors delimit the feasible solution space for the rules: ph(b∣αx) = 0 if αx is not a feasible solution. Humans can easily define hard priors on αx by choosing the atom types, such as whether the interpretable features are words, phrases, or statistics like word frequency, and the antecedent’s maximum length L. SELOR does not require a predefined rule set; nonetheless, we allow users to provide one if it is more desirable in some application scenarios. A large solution space increases the time cost of deep logic rule reasoning (Section “Optimization and Time Complexity”) but also decreases the probability of introducing undesirable bias.
Soft priors model different levels of human preference for logic rules. For example, people may prefer shorter rules or high-coverage rules that satisfy many input samples. The energy function can parameterize such soft priors: \({p}_{s}(b| {{\boldsymbol{\alpha }}}_{x})\propto \exp (-{{\mathcal{L}}}_{b}({{\boldsymbol{\alpha }}}_{x}))\), where \({{\mathcal{L}}}_{b}\) is the loss function for punishing undesirable logic rules. We do not include any soft priors in our current implementation.
For example, suppose we are inducing logic rules to explain a sentiment classifier’s decision on restaurant reviews. The interpretable features αx may include binary indicators for the presence of words like “awesome”, “tasty”, or “not” in the input text. A hard prior ph(b∣αx) may rule out any rule that includes more than L = 2 words in the antecedent (e.g., a rule using “awesome” and “not” is allowed, but not one using “awesome”, “not”, and “tasty” together), if the user has defined a maximum antecedent length of 2 as part of their hard prior. A soft prior ps(b∣αx) can reflect a user’s preferences over logic rules. For instance, if a user prefers commonly used words, a rule like “awesome ≥ 1” may be favored over “pulchritudinous ≥1”, even though both convey a positive meaning.
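A minimal sketch of how such priors could be encoded, assuming atoms are (word, minimum count) pairs, a maximum antecedent length of L = 2 as the hard prior, and a hypothetical frequency-based soft prior (recall that our implementation uses hard priors only):

```python
import math

def hard_prior(antecedent, max_len=2):
    """p_h(b | alpha_x) in {0, 1}: feasible only if the rule respects the user-defined length limit."""
    return 1.0 if len(antecedent) <= max_len else 0.0

def soft_prior(antecedent, word_freq):
    """Unnormalized p_s(b | alpha_x) ∝ exp(-L_b(alpha_x)); here L_b penalizes rare words (illustrative)."""
    loss_b = sum(-math.log(word_freq.get(word, 1e-6)) for word, _ in antecedent)
    return math.exp(-loss_b)

print(hard_prior([("awesome", 1), ("not", 1), ("tasty", 1)]))  # 0.0: exceeds the maximum length of 2
```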
Consequent estimation p(y∣αx)
Consequent estimation models p(y∣αx), the relationship between the antecedent αx and the prediction y using the weight αw. The weight αw is computed to ensure a meaningful and consistent relationship, so that each explanation naturally leads to the prediction according to human perception. This is achieved by testing the logic rule αx ⇒ y across the entire training dataset, ensuring that it represents the human knowledge embedded in the data distribution.
A straightforward way to compute p(y∣αx) is an empirical estimation: first, collect all samples that satisfy antecedent αx, and then calculate the percentage of them that have label y (ref. 30). For example, given explanation αx = “awesome ≥2”, if we obtain all instances in which awesome appears more than twice and find that 90% of them have label y = positive sentiment, then p(y∣αx) = 0.9. A large p(y∣αx) corresponds to global patterns that naturally align with human perception. Mathematically, this is equivalent to approximating p(y∣αx) with the empirical probability \(\hat{p}({y}| {{\boldsymbol{\alpha }}}_{x})\):
$$\hat{p}({y}| {{\boldsymbol{\alpha }}}_{x})={n}_{{{\boldsymbol{\alpha }}}_{x},y}/{n}_{{{\boldsymbol{\alpha }}}_{x}}$$
(7)
where \({n}_{{{\boldsymbol{\alpha }}}_{x},y}\) is the number of training samples that satisfy the antecedent αx and have the consequent y, and \({n}_{{{\boldsymbol{\alpha }}}_{x}}\) is the number of training samples that satisfy the antecedent αx. Directly setting p(y∣αx) to \(\hat{p}(y| {{\boldsymbol{\alpha }}}_{x})\) causes three problems. First, when \({n}_{{{\boldsymbol{\alpha }}}_{x}}\) is not large enough, the empirical probability \(\hat{p}(y| {{\boldsymbol{\alpha }}}_{x})\) may be inaccurate, and this formulation has no way to model such uncertainty. Second, statistically modeling the probability of y based solely on αx, without detailing the contribution of each atom in αx, may leave users feeling that part of the explanation remains unaddressed. Third, computing \(\hat{p}(y| {{\boldsymbol{\alpha }}}_{x})\) for every antecedent is intractable, since the number of feasible antecedents grows exponentially with the antecedent length L.
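A naive transcription of the empirical estimate in Eq. (7) over the training set is sketched below (atoms are assumed to be Boolean callables, as in the earlier sketch); enumerating every antecedent this way is precisely the intractability noted above.

```python
def empirical_probability(antecedent, train_data, target_label):
    """hat{p}(y | alpha_x) = n_{alpha_x, y} / n_{alpha_x} over (text, label) training pairs."""
    covered = [(x, y) for x, y in train_data if all(atom(x) for atom in antecedent)]
    n_alpha = len(covered)                     # coverage n_{alpha_x}
    if n_alpha == 0:
        return 0.0, 0                          # undefined for zero coverage; caller must handle this
    n_alpha_y = sum(1 for _, y in covered if y == target_label)
    return n_alpha_y / n_alpha, n_alpha
```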
To address these problems, we employ a neural estimator of a categorical distribution, which jointly models \(\hat{p}(y| {{\boldsymbol{\alpha }}}_{x})\) and the uncertainty caused by low-coverage antecedents. For example, suppose the antecedent αx is “tasty ≥2”. If this antecedent is satisfied by only 3 training samples, among which 2 have the label y = negative sentiment, then the empirical estimate is \(\hat{p}(y| {{\boldsymbol{\alpha }}}_{x})=2/3\). However, since \({n}_{{{\boldsymbol{\alpha }}}_{x}}=3\) is small, the model considers this estimate uncertain, and the resulting p(y∣αx) is smoothed toward a more uniform distribution according to the learned β. In contrast, for a high-coverage antecedent like “great ≥1”, if 900 out of 1000 samples have positive sentiment, the empirical estimate of 0.9 is trusted more, and p(y∣αx) stays close to 0.9. See Section “Explainability Evaluation on Data Consistency” for the approximation capability of our model.
Assume that, given antecedent αx, the class y follows a categorical distribution, where each category corresponds to a class. We define β as the concentration hyperparameter of this categorical distribution, which controls how uniformly the probability is distributed across the classes – higher values of β lead to more uniform distributions, while lower values concentrate the probability on fewer classes. Then, according to the posterior predictive distribution, y takes one of K potential classes, and we may compute the probability of a new observation y given existing observations:
$$p(y| {{\boldsymbol{\alpha }}}_{x})=p({y}| {{\mathcal{Y}}}_{{{\boldsymbol{\alpha }}}_{x}},\beta )\approx \frac{\hat{p}({y}| {{\boldsymbol{\alpha }}}_{x}){n}_{{{\boldsymbol{\alpha }}}_{x}}+\beta }{{n}_{{{\boldsymbol{\alpha }}}_{x}}+K\beta }$$
(8)
Here, \({{\mathcal{Y}}}_{{{\boldsymbol{\alpha }}}_{x}}\) denotes \({n}_{{{\boldsymbol{\alpha }}}_{x}}\) observations of class label y obtained by checking the training data, and β is automatically trained. Eq. (8) becomes Eq. (7) when \({n}_{{{\boldsymbol{\alpha }}}_{x}}\) increases to ∞, and becomes a uniform distribution when \({n}_{{{\boldsymbol{\alpha }}}_{x}}\) goes to 0. Thus, a low-coverage antecedent with a small \({n}_{{{\boldsymbol{\alpha }}}_{x}}\) is considered uncertain (i.e., close to uniform distribution). By optimizing Eq. (8), our method automatically balances the empirical probability \(\hat{p}(y| {{\boldsymbol{\alpha }}}_{x})\) and the number of observations \({n}_{{{\boldsymbol{\alpha }}}_{x}}\). Probability p(y∣αx) also serves as the confidence score for the logic rule αx ⇒ y.
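Eq. (8) amounts to additive smoothing of the empirical estimate toward the uniform distribution; a direct transcription, reusing the low- and high-coverage examples above (β = 1 is an arbitrary illustrative value):

```python
def smoothed_probability(p_hat, n_alpha, K, beta):
    """Posterior predictive p(y | alpha_x) from Eq. (8).

    p_hat   : empirical probability hat{p}(y | alpha_x)
    n_alpha : coverage n_{alpha_x}
    K       : number of classes
    beta    : concentration parameter (learned in the model; fixed here for illustration)
    """
    return (p_hat * n_alpha + beta) / (n_alpha + K * beta)

# Low coverage -> pulled toward uniform 1/K; high coverage -> stays near p_hat.
print(smoothed_probability(2 / 3, 3, K=2, beta=1.0))     # 0.6   (uncertain estimate)
print(smoothed_probability(0.9, 1000, K=2, beta=1.0))    # ~0.899 (trusted estimate)
```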
We then employ the atom weight αw to model \(\hat{p}(y| {{\boldsymbol{\alpha }}}_{x})\) based on the contribution of each atom in αx. We adopt a deep neural network as the consequent estimator that predicts αw, which improves generalization to unseen cases and enhances noise handling. Details about this neural network are provided in Section C.3 of the Supplementary Information. Given the chosen antecedent αx for the input x, we denote an arbitrary data sample in the dataset as xj ∈ X. The candidate set of atoms for xj is denoted by \({\mathcal{C}}({{\bf{x}}}^{j})\). Each atom candidate in \({\mathcal{C}}({{\bf{x}}}^{j})\) must satisfy both global and local constraints. The hard priors discussed in Section “Human Prior p(b|αx)” provide the global constraint, ensuring that the atom conforms to a human-defined logical form. The local constraint requires that xj satisfies the atom. An atom “awesome > 1”, for example, is sampled only if xj mentions “awesome” more than once. Next, uj is the vector indicating whether each atom oi in αx is also included in the candidate set \({\mathcal{C}}({{\bf{x}}}^{j})\), i.e., \({u}_{i}^{j}={\mathbb{I}}({o}_{i}\in {\mathcal{C}}({{\bf{x}}}^{j}))\), where \({u}_{i}^{j}\) is the i-th element of uj and \({\mathbb{I}}\) is an indicator function. Additionally, we define the region \({{\mathcal{R}}}_{i}\) of atom oi as the set of training samples that satisfy oi, and \({\mathcal{R}}={{\mathcal{R}}}_{1}\cup \ldots \cup {{\mathcal{R}}}_{L}\) as the entire region of the antecedent αx. The deep model then predicts \({{\boldsymbol{\alpha }}}_{w}=\left(\begin{array}{ccc}{w}_{11}&\ldots &{w}_{1L}\\ \ldots &\ldots &\ldots \\ {w}_{K1}&\ldots &{w}_{KL}\end{array}\right)\) from αx and minimizes the following loss objective.
$${{\mathcal{L}}}_{w}=\sum _{({{\bf{x}}}^{j},{y}^{j})\in {\mathcal{R}}}CrossEntropyLoss({{\boldsymbol{\alpha }}}_{w}{{\bf{u}}}^{j},{y}^{j})$$
(9)
This regression, effectively a multinomial logistic regression over the atom indicators, tests combinations of atoms in αx on the training dataset, allowing the deep model to learn αw as the relationship between the prediction y and each atom, based on the human knowledge reflected in the labels. Since x naturally satisfies all atoms in αx, we sum αw across the atoms to derive a logit for each class. We then apply a softmax function across classes to obtain \(\tilde{p}(y| {{\boldsymbol{\alpha }}}_{x})\), the predicted empirical probability.
$$\tilde{p}({y}| {{\boldsymbol{\alpha }}}_{x})=\frac{\exp \left({\sum }_{i\in [1,L]}{w}_{yi}\right)}{{\sum }_{k\in [1,K]}\exp \left({\sum }_{i\in [1,L]}{w}_{ki}\right)}$$
(10)
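For concreteness, a PyTorch sketch of Eqs. (9) and (10) with placeholder values for the weight matrix and the indicator vector (in practice αw is produced by the neural consequent estimator):

```python
import torch
import torch.nn.functional as F

K, L = 2, 3                      # classes x atoms (illustrative sizes)
alpha_w = torch.randn(K, L)      # stand-in for the estimator's predicted atom weights

# Eq. (10): x satisfies every atom in alpha_x, so sum the weights over all atoms per class.
p_tilde = F.softmax(alpha_w.sum(dim=1), dim=0)

# Eq. (9): for a training sample x^j in the region R, u^j marks which atoms it satisfies.
u_j = torch.tensor([1.0, 0.0, 1.0])          # indicator vector u^j
y_j = torch.tensor(1)                        # its label
loss_w = F.cross_entropy((alpha_w @ u_j).unsqueeze(0), y_j.unsqueeze(0))
```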
Subsequently, we use the multi-task learning framework in ref. 31 to train the neural network, ensuring that its predicted probability \(\tilde{p}({y}| {{\boldsymbol{\alpha }}}_{x})\) aligns with the empirical probability \(\hat{p}({y}| {{\boldsymbol{\alpha }}}_{x})\). This alignment is achieved by minimizing the loss specified in the following equation.
$${{\mathcal{L}}}_{r}=\frac{1}{2{\sigma }_{p}^{2}}| | \hat{p}({y}| {\boldsymbol{\alpha }})-\tilde{p}({y}| {\boldsymbol{\alpha }})| {| }^{2}+\frac{1}{2{\sigma }_{n}^{2}}| | {n}_{{\boldsymbol{\alpha }}}-{\tilde{n}}_{{\boldsymbol{\alpha }}}| {| }^{2}+\log {\sigma }_{p}{\sigma }_{n},$$
(11)
where \({\tilde{n}}_{{\boldsymbol{\alpha }}}\) is the predicted coverage given by the neural model, and σp and σn are the standard deviations of the ground-truth probability and coverage, respectively. Finally, we combine the two loss objectives \({{\mathcal{L}}}_{r}\) and \({{\mathcal{L}}}_{w}\) with a hyperparameter λ to train the neural consequent estimator.
$${{\mathcal{L}}}_{c}={{\mathcal{L}}}_{r}+\lambda {{\mathcal{L}}}_{w}$$
(12)
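A sketch of the combined objective in Eqs. (11) and (12), assuming σp and σn are learned through their logarithms as in the multi-task weighting of ref. 31; the argument names are ours and all inputs are assumed to be tensors:

```python
import torch

def consequent_loss(p_hat, p_tilde, n_alpha, n_tilde, loss_w, log_sigma_p, log_sigma_n, lam=1.0):
    """L_c = L_r + lambda * L_w (Eqs. (11)-(12)); sigma_p and sigma_n are learnable via their logs."""
    sigma_p, sigma_n = log_sigma_p.exp(), log_sigma_n.exp()
    loss_r = ((p_hat - p_tilde) ** 2).sum() / (2 * sigma_p ** 2) \
             + ((n_alpha - n_tilde) ** 2).sum() / (2 * sigma_n ** 2) \
             + torch.log(sigma_p * sigma_n)
    return loss_r + lam * loss_w
```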
Deep antecedent generation p(α∣x)
Deep antecedent generation finds an antecedent αx for the explanation by reshaping the given deep model f. Specifically, we replace the prediction layer of the backbone model f with an explanation generator, so that the latent representation z of input x is mapped to an explanation instead of directly to a prediction (e.g., a class label). We outline the generation process before giving a formal definition. First, we precompute the embedding of each atom by averaging the embeddings of all training instances that satisfy the atom. During both training and inference, the antecedent generator sequentially selects atoms to form an explanation. At each selection step, the input embedding z is combined with the embeddings of the previously selected atoms via an encoder to form a latent representation h. We compute a probability distribution over candidate atoms based on their similarity to h, and an atom is sampled from this distribution using the Gumbel-softmax trick, excluding already selected atoms. This process repeats until a predefined number of atoms is selected, forming the final antecedent.
Formally, given z, the representation of input x in the last hidden layer of f, we generate the antecedent αx = (o1, …, oL) with a recursive formulation whose complexity is linear in L (Section “Optimization and Time Complexity”). Given z and o1, …, oi−1, we obtain atom oi by
$${{\bf{h}}}_{i}=Encoder([{\bf{z}};{{\bf{o}}}_{1}\ldots ;{{\bf{o}}}_{i-1}]),\quad p({o}_{i}| {\bf{x}},{o}_{1}\ldots ,{o}_{i-1})=\frac{{\mathbb{I}}({o}_{i}\in {\mathcal{C}}({\bf{x}}))\exp ({{\bf{h}}}_{i}^{T}{{\bf{o}}}_{i})}{{\sum }_{\tilde{o}}{\mathbb{I}}(\tilde{o}\in {\mathcal{C}}({\bf{x}}))\exp ({{\bf{h}}}_{i}^{T}\tilde{{\bf{o}}})}$$
(13)
where oi is the embedding of atom oi and Encoder is a neural sequence encoder such as a GRU (ref. 32) or Transformer (ref. 33). \({\mathbb{I}}\) is the indicator function, and \({\mathcal{C}}({\bf{x}})\) is the set of atom candidates for x. Note that we set the probability of atoms that do not satisfy the global or local constraints to zero, ensuring that only atoms satisfying the specified conditions can be chosen in the subsequent sampling process. We then sample oi from p(oi∣x, o1, …, oi−1) in a differentiable way to enable end-to-end training:
$${o}_{i}=Gumbel(p(\tilde{o}\in {\mathcal{C}}({\bf{x}})\subset {\mathcal{O}}\,| \,{\bf{x}},{o}_{1}\ldots ,{o}_{i-1})),\,\,\,p({{\boldsymbol{\alpha }}}_{x}| {\bf{x}})=\prod _{i\in [1,L]}p({o}_{i}| {\bf{x}},{o}_{1}\ldots ,{o}_{i-1})$$
(14)
Gumbel denotes the Straight-Through Gumbel-Softmax (ref. 34), a differentiable function for sampling discrete values. An atom oi is represented as a one-hot vector of dimension \(| {\mathcal{O}}|\), where \({\mathcal{O}}\) is the set of all atoms that satisfy the hard priors. The one-hot vector oi is then multiplied with the atom embedding matrix to derive its embedding oi.
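A simplified sketch of one selection step of the antecedent generator (Eqs. (13) and (14)); the encoder is assumed to return per-position hidden states (e.g., a Transformer encoder with batch_first=True), and the candidate mask encodes the hard-prior and local constraints together with the exclusion of already selected atoms:

```python
import torch
import torch.nn.functional as F

def select_next_atom(z, selected_embs, atom_embs, candidate_mask, encoder, tau=1.0):
    """One step of deep antecedent generation (illustrative, not the exact implementation).

    z              : (d,) input representation from the backbone f
    selected_embs  : list of (d,) embeddings of previously chosen atoms
    atom_embs      : (|O|, d) embedding matrix of all atoms
    candidate_mask : (|O|,) 1 for atoms in C(x) not yet selected, else 0
    encoder        : sequence encoder mapping (1, i, d) to per-position hidden states (1, i, d)
    """
    seq = torch.stack([z] + selected_embs).unsqueeze(0)        # (1, i, d)
    h_i = encoder(seq)[:, -1, :].squeeze(0)                    # (d,) latent state h_i
    logits = atom_embs @ h_i                                   # similarity h_i^T o for every atom
    logits = logits.masked_fill(candidate_mask == 0, float("-inf"))
    # Straight-through Gumbel-softmax: one-hot sample, differentiable w.r.t. the logits.
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
    return one_hot, one_hot @ atom_embs                        # one-hot atom and its embedding
```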
Optimization and time complexity
A deep logic rule reasoning model is learned in two steps. The first step optimizes the neural consequent estimator to learn p(y∣αx) by minimizing the loss \({{\mathcal{L}}}_{c}\) in Eq. (12). The second step converts the deep model f into an explainable version by maximizing p(y∣x, b) in Eq. (6) with a cross-entropy loss. This is equivalent to minimizing the loss \({{\mathcal{L}}}_{d}={{\mathcal{L}}}_{b}({{\boldsymbol{\alpha }}}_{x}^{(s)})-\log p({y}^{* }| {{\boldsymbol{\alpha }}}_{x}^{(s)})\). Here, \({{\mathcal{L}}}_{b}({{\boldsymbol{\alpha }}}_{x}^{(s)})\) punishes explanations that do not fit the human’s prior preference for rules, while \(-\log p({y}^{* }| {{\boldsymbol{\alpha }}}_{x}^{(s)})\) encourages an antecedent \({{\boldsymbol{\alpha }}}_{x}^{(s)}\) that leads to the ground-truth class y* with high confidence. We alternate between the two steps at every training batch. For stable optimization, the parameters of the antecedent generator are frozen during the first step, and the parameters of the consequent estimator are frozen during the second step.
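Schematically, the alternating optimization could be organized as in the following sketch; all modules, optimizers, and loss callables are placeholders rather than our actual implementation:

```python
import torch

def train_step(batch_x, batch_y, antecedent_gen, consequent_est,
               opt_ante, opt_cons, consequent_loss_fn, prior_loss_fn):
    """One alternating optimization step over a batch (illustrative placeholder modules)."""
    # Step 1: fit the consequent estimator (antecedent generator frozen).
    for p in antecedent_gen.parameters():
        p.requires_grad_(False)
    for p in consequent_est.parameters():
        p.requires_grad_(True)
    loss_c = consequent_loss_fn(consequent_est)          # L_c = L_r + lambda * L_w, Eq. (12)
    opt_cons.zero_grad()
    loss_c.backward()
    opt_cons.step()

    # Step 2: fit the antecedent generator (consequent estimator frozen).
    for p in antecedent_gen.parameters():
        p.requires_grad_(True)
    for p in consequent_est.parameters():
        p.requires_grad_(False)
    alpha_x = antecedent_gen(batch_x)                    # sampled via Gumbel-softmax, differentiable
    p_y = consequent_est(alpha_x).gather(1, batch_y.unsqueeze(1)).squeeze(1)
    loss_d = prior_loss_fn(alpha_x) - torch.log(p_y).mean()   # L_d = L_b - log p(y*|alpha_x)
    opt_ante.zero_grad()
    loss_d.backward()
    opt_ante.step()
```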
We analyze the per-sample time complexity of the modules in our method. The complexity is \(O(L\cdot | {\mathcal{O}}| )\) for antecedent generation and \(O({L}^{2}+L| {\mathcal{R}}| )\) for the neural consequent estimator. Therefore, the total time complexity is \(O(L| {\mathcal{O}}| +L| {\mathcal{R}}| )\), since \(L\ll | {\mathcal{O}}|\) and \(L\ll | {\mathcal{R}}|\).