AI Insights

Builder AI launches liquidation proceedings in Delaware after controversy over overstated sales, the Nate founder’s federal indictment, GameOn’s false data, and more

[Image: Gemini]

While the generative artificial intelligence (AI) craze approaches its peak, promises that “AI will do everything on its own” are collapsing across Silicon Valley. The bankruptcy of Builder AI, once revered as a unicorn, is a symbolic event.

According to the New York Times on the 31st (local time), Builder AI promoted itself aggressively on the strength of high growth in 2024, but a board investigation confirmed that sales had been overstated. After a management change and a liquidity crisis, the company entered liquidation proceedings in a Delaware court in the first half of 2025. As suspicion spread that people were handling the work behind the scenes of “Natasha”, the AI manager that was said to build apps automatically, management explained that “AI was an auxiliary tool and did not replace people,” but failed to restore trust.

The incident shows how easily verification of a technology’s actual level of automation and of its financial figures can be pushed aside while the “AI” label draws the attention of investors and the media.

Similar scenes played out on other stages. The shopping app “Nate” advertised that deep learning handled payment and checkout, but allegations arose that outsourced staff in the Philippines processed orders manually. In the spring of 2025, the U.S. Attorney’s Office for the Southern District of New York (SDNY) ultimately charged its founder with investor fraud.

San Francisco startup “GameOn” put forward an AI sports chatbot, but was indicted over false financial data, fake audit reports, and allegedly inflated sales. What these cases have in common is “AI-washing”: promoting processes that are largely performed by humans, or whose automation is immature, as if they were fully automatic.

‘AI that is actually done by humans’ is not a small problem at large corporations either. Amazon’s “Just Walk Out” was pitched as sensors and computer vision handling checkout automatically, but reports persisted that staff identified and reviewed transactions in actual operations. Amazon denied the accusations of exaggeration, but adjusted its store strategy to focus on smart carts.

Presto Automation, which introduced an automated voice-response solution for fast-food drive-throughs, was also found to have had a significant share of orders handled with human involvement during certain periods. A legal technology start-up advocated automating personal injury case documents, but when internal testimony was reported that much of the actual work depends on human review, the company emphasized that “the combination of AI and humans is essential for high quality.”

“The fall of Builder AI clearly shows what to believe and what to doubt in the current AI boom,” the New York Times wrote. “As the saying goes, AI sells, but automation does not; the gap between the actual level of the technology and market expectations remains large.”




AI Insights

Can AI bring more good than harm to the future of our jobs? Here’s what the data says



No generation is spared from the cultural upheaval of new technology. In the 2020s, it is AI fuelling that disruption.

Love it or fear it, artificial intelligence offers endless possibilities, something many people will by now have felt on a personal level.

While many fear for their jobs, others are seeing a widening of possibilities.

Max Hamilton is worried about the impact AI would have on creatives, including copyright issues. (Supplied: Max Hamilton)

Forced to seek change

Max Hamilton, a graphic designer with over two decades of experience, has already adapted her career to meet the threat of generative AI head on.

The increasingly scarce availability of jobs pushed her to venture into illustration work for children’s books.

“I saw that happening a few years ago and that’s when I pivoted,” Ms Hamilton said.

“I’ve been really focusing on using watercolour and hand drawing, which I did on purpose because I thought that might set me apart from having the computer-generated look.”

To stay ahead of the fast-changing landscape, she has also expanded her skill set to include writing, meaning she can be involved in every aspect of producing a book.

“As a creative, I think we like to think that our creativity is our special weapon,” Ms Hamilton said.

How will AI affect jobs?

Data shows that creatives like Ms Hamilton are right to expect increasing AI influence in their sectors.

A recent Jobs and Skills Australia (JSA) report confirms artificial intelligence will bring about an impending change to the labour market, either through automation or augmentation.

The body assessed various tasks within ANZSCO-listed occupations and ranked them based on the degree to which AI could impact them.

[Chart: the sectors JSA predicts are most likely to be automated by artificial intelligence, with existing workflows replaced.]

[Chart: the sectors most likely to be augmented by artificial intelligence, improving the output of existing workers.]

Evan Shellshear, an innovation and technology expert from The University of Queensland, explains what this means for the availability of jobs in the market.

“It’s not jobs that are at risk of AI, it’s actual tasks and skills,” Dr Shellshear said.

“We’re seeing certain skills and parts of jobs disappearing, but not necessarily whole occupations disappearing.”

The report further supports this, saying Generative AI technologies are more likely to help boost workers’ productivity, as opposed to replacing them, especially in high-skilled occupations.

In fact, Dr Shellshear believes there’s a likelihood AI will create job opportunities.

“It’s making a lot of things that were impossible, possible,” he says, especially for small businesses.

“Gen AI can lower the cost for things, expertise and knowledge that were out of reach in the past.”

An opportunity to create the unthinkable

Growing up a big fan of science fiction, Melanie Fisher jumped at the chance to experiment with Generative AI shortly after ChatGPT was released.

Ms Fisher started off by testing the tool’s knowledge of food regulation, with which she was familiar from years of experience in the industry.

“It came up with some untrue stuff, so early on I learnt you have to be careful,” said the 67-year-old, who is based in Canberra.


Melanie Fisher built an app for her 3-year-old grandchild and now it’s a bonding activity for the two. (Supplied: Melanie Fisher)

But Ms Fisher continued pushing the bounds of what AI could offer — using it to find new recipes and suggestions for things to do — before landing on the idea to create a game app for her 3-year-old granddaughter Lilly*.

When she heard AI could code, she thought to herself, “Oh I’d love to try that, but … I’m not an IT person or anything.”

So, she threw the question to ChatGPT.


Melanie spelled out her request on the generative AI tool and made it clear she had no relevant background. (Supplied: Melanie Fisher)

The tool recommended a program that allowed her to drag and drop different elements to produce a coherent story mode gameplay.

Ms Fisher didn’t have to look far for inspiration.

“[The game was] based on stories Lilly* and I made up about her being a girl pirate with her friends, and they have adventures together,” Ms Fisher said.

It took three weeks of work to bring her vision to reality, even getting the characters to loosely resemble Lilly*.

Now the game has become a special pet project for the duo.


Melanie continues to build on the gameplay with input from her granddaughter.   (Supplied: Melanie Fisher)

Drawing from her own experience, Ms Fisher sees AI as a double-edged sword.

“I think it’s a great leap forward for people, but I do very much worry it’s going to massively displace lots of people from work,” she said.

Transition still in early days

Many professionals such as recruiters, university staff and health practitioners have incorporated AI into their workflows.

More recently, Commonwealth Bank of Australia made headlines by slashing jobs due to artificial intelligence, only to later apologise and backtrack on its decision.

But news stories about corporate lay-offs and downsizing don’t necessarily point to an AI takeover, according to Professor Nicholas Davis, a former World Economic Forum executive who is now an artificial intelligence expert at the University of Technology Sydney (UTS).

He believes these trends are being driven by “early adopters” and foresees “a disconnect between expectation versus reality”.

“We’re likely to see organisations lay off people in anticipation of gains and then rehiring because it doesn’t quite work the way they expect,” said Professor Davis.

“We’re at very early stages of using the latest forms of AI at the enterprise level.

“Most organisations have yet to see a measurable positive impact on the bottom line.”

An example he provided is how the introduction of self-checkout machines at supermarkets resulted in higher levels of staff stress, customer frustration and costs from theft.

This has led a number of UK and US chains to reintroduce manned tills.

“The consumer experience is different to the organisational value and experience,” warns Professor Davis.


Nicholas Davis believes humans are still needed alongside AI for it to perform sustainably and reliably. (ABC News: Ian Cutmore)

How can we better prepare for an AI-driven world?

Despite having success with the app, Ms Fisher says, “I’ve learnt a little bit but I don’t think I could become a game developer.”

Speaking to this, Dr Shellshear agrees there’s a distinction to be made between what is possible with AI and the value humans have to offer.


Dr Evan Shellshear believes people should focus on harnessing the right skills for an AI-driven future. (Supplied: Dr Evan Shellshear)

While AI can help a person attain new skills, they still need education, training and real-world expertise to get to a professional level, he adds.

Having conducted his own research into AI’s impact on jobs, with a keen interest in what remains relevant in the future, he found professions involving communication, management, collaboration, creativity, and assisting and caring to be the most difficult to replicate.

Other human traits such as problem-solving, resilience and attentiveness are also irreplaceable, says Professor Davis.

But he says that having a varied set of skills can put you at an advantage.

“The more you’re able to add value, the less it matters that things get taken away,” he said.

“But if your job is doing one specific thing or creating one style, then that’s where it gets problematic.”

“Embracing, engaging and reinventing is how you benefit.”

Here is Dr Shellshear’s advice on staying ahead of the game:

“Recognise its impact on your life as an individual, especially from a job perspective, and ask yourself: ‘How do I position myself to continue to add value with these tools around me?’

“At some point, you have to learn how to integrate [AI] into your workflows, otherwise you [risk no longer being] efficient or relevant.”

*Name changed for privacy




AI Insights

Toward faithful and human-aligned self-explanation of deep models



Formulation of logic rule explanations

For a given input x from the dataset X, our explanation α = (αx, αw) comprises two components: an antecedent αx and a linear weight αw. The antecedent αx and a consequent y together form a logic rule αx ⇒ y, as illustrated in Fig. 1a. The linear weight αw indicates the contribution of the atoms in the antecedent αx within the logic rule. Meanings of symbols used in this paper are defined in Section C.1 of the Supplementary Information.

An antecedent αx represents the condition under which a rule applies and corresponds to an explanation expressed in logical form. It is defined as a sequence αx = (o1,…, oL), where each oi is an atom, and L is the length of the sequence. An atom is the smallest unit of explanation and corresponds to a single interpretable feature of a given input—for instance, a condition such as “awesome ≥2”. These interpretable features may differ from the features used by deep learning models. They can have different granularities (e.g., words or phrases vs. tokens), be based on statistical properties (e.g., word frequency), or be derived using external tools (e.g., grammatical tagging of a word). Mathematically, each atom oi is a Boolean-valued function that returns true if the ith interpretable feature is present in the input x, and false otherwise. Additional details about the atom selection process are provided in Section C.3 of the Supplementary Information. An input sample x is said to satisfy an antecedent αx if the logical condition αx(x) evaluates to true.
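
As a concrete illustration of the definitions above, here is a minimal Python sketch of an atom as a Boolean-valued predicate over a word-count feature; the class name, the token-counting feature, and the example sentence are illustrative assumptions, not the paper’s implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    """Smallest unit of explanation: one condition over an interpretable feature.

    Atom(word="awesome", min_count=2) encodes the condition "awesome >= 2".
    """
    word: str
    min_count: int

    def __call__(self, x: str) -> bool:
        # Boolean-valued function: True iff the interpretable feature holds for x.
        return x.lower().split().count(self.word) >= self.min_count

def satisfies(antecedent: list[Atom], x: str) -> bool:
    """An input satisfies an antecedent iff every atom in the sequence is True."""
    return all(atom(x) for atom in antecedent)

awesome_twice = Atom(word="awesome", min_count=2)
print(awesome_twice("awesome food and awesome service"))  # True
```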

The consequent y denotes the model’s predicted output, given that the antecedent is satisfied. In a classification task, y typically corresponds to the predicted class label; in regression, y would be a real-valued number.

Finally, the linear weight αw models the contribution of each atom in the logical relationship αx ⇒ y. It is represented as a matrix \({{\boldsymbol{\alpha }}}_{w}=\left(\begin{array}{ccc}{w}_{11}&\ldots &{w}_{1L}\\ \ldots &\ldots &\ldots \\ {w}_{K1}&\ldots &{w}_{KL}\end{array}\right)\), where K is the number of possible classes and wki indicates the contribution of atom oi to the prediction of class k. The magnitude of each weight reflects the strength of its corresponding atom’s contribution to the output prediction.

Framework for deep logic rule reasoning

Let us denote f as a deep learning model that estimates the probability p(y|x), where x is the input data sample and y is a candidate class. We upgrade model f to a self-explaining version by adding a logic rule explanation α. Then, we can reformulate p(y|x) as

$$p({y}| {\bf{x}},b)=\sum _{{\boldsymbol{\alpha }}}p({y}| {\boldsymbol{\alpha }},{\bf{x}},b)p({\boldsymbol{\alpha }}| {\bf{x}},b)=\sum _{{\boldsymbol{\alpha }}}p({y}| {\boldsymbol{\alpha }})p({\boldsymbol{\alpha }}| {\bf{x}},b),\quad s.t.,\quad \Omega ({\boldsymbol{\alpha }})\le S$$

(3)

Here, b represents a human’s prior belief about the rules, e.g., the desirable form of atoms, Ω(α) is the number of logic rules required to explain a given input x, and S is the number of samples (logic rules chosen by the model). Eq. (3) includes two constraints essential for ensuring explainability. The first constraint, p(y|α, x, b) = p(y|α), requires that the explanation α contain all information in the input x and the belief b that is useful for predicting y. Without this constraint, the model may “cheat” by predicting y directly from the input instead of using the explanation, which leads to a decrease in faithfulness. The second constraint, Ω(α) ≤ S, requires that the model can be well explained using only S explanations, where S is small enough to ensure readability (S = 1 in our implementation). We can further decompose Eq. (3) based on the independence between the input x and the human prior belief b (proof and assumptions in Section C.2 of the Supplementary Information):

$$p({y}| {\bf{x}},b)=\sum _{{\boldsymbol{\alpha }}}p({y}| {\boldsymbol{\alpha }})p({\boldsymbol{\alpha }}| {\bf{x}},b)\propto \sum _{{\boldsymbol{\alpha }}}p(b| {\boldsymbol{\alpha }})\cdot p({y}| {\boldsymbol{\alpha }})\cdot p({\boldsymbol{\alpha }}| {\bf{x}}),\,\,s.t.,\,\,\Omega ({\boldsymbol{\alpha }})\le S$$

(4)

Then, we further decompose Eq. (4) using an antecedent αx and its linear weight αw:

$$\begin{array}{rcl} p(y|{\mathbf{x}},b)&\propto & \sum\limits_{{\boldsymbol{\alpha}}_x, {\boldsymbol{\alpha}}_w} p(b | {\boldsymbol{\alpha}}_x, {\boldsymbol{\alpha}}_w)\cdot p(y | {\boldsymbol{\alpha}}_x, {\boldsymbol{\alpha}}_w)\cdot p({\boldsymbol{\alpha}}_x, {\boldsymbol{\alpha}}_w | {\mathbf{x}}) \\ &=& \sum\limits_{{\boldsymbol{\alpha}}_x} p(b | {\boldsymbol{\alpha}}_x) \left(\sum\limits_{{\boldsymbol{\alpha}}_w} p(y | {\boldsymbol{\alpha}}_w) \cdot p({\boldsymbol{\alpha}}_w | {\boldsymbol{\alpha}}_x) \right) \cdot\ {p({\boldsymbol{\alpha}}_x | {\mathbf{x}})}, \\ &=& \sum\limits_{{\boldsymbol{\alpha}}_x} \underbrace{p(b | {\boldsymbol{\alpha}}_x)}_{\begin{array}{c}{\rm{Human}}\\ {\rm{prior}}\end{array}} \cdot \underbrace{p(y | {\boldsymbol{\alpha}}_x)}_{\begin{array}{c}{\rm{Consequent}}\\ {\rm{estimation}}\end{array}} \ \cdot\ \ {\underbrace{p({\boldsymbol{\alpha}}_x | {\mathbf{x}})}_{\begin{array}{c}{\rm{Deep}}\,{\rm{antecedent}}\\ {\rm{generation}}\end{array}}}, \quad s.t., \quad {{\Omega}}({\boldsymbol{\alpha}}_x) \leq S \end{array}$$

(5)

In Eq. (5), two additional constraints are introduced to ensure that the weight αw functions as a faithful explanation. The first constraint, p(y|αw) = p(y|αx, αw), is designed to prevent the model from bypassing the explanatory weight αw and relying instead on latent representations of the antecedent αx for predicting the consequent y. The second constraint, p(αw|αx) = p(αw|αx, x), ensures that the estimation of αw is based solely on the selected antecedent αx and not directly on the raw input x. This guards against information leakage that could undermine the interpretability of the explanation.

We assume p(b|αx) = p(b|αx, αw), as b represents a human’s prior belief about the rules encoded in αx, which should not depend on how the model weights them internally. We can observe that the only difference between Eq. (5) and Eq. (4) lies in the use of an antecedent αx instead of the full explanation α. This implies that the introduction of the weight αw only affects the internal estimation process of the consequent, and without explicit guidance, this process may diverge significantly from human expectations.

The three derived terms correspond to the three main modules of the proposed framework, SELOR. The first component, human prior p(b|αx), encodes human guidance on preferred rule forms, aiming to reduce the likelihood of misunderstanding, as discussed in Section “Human Prior p(b|αx)”. The second, consequent estimation p(y|αx), models the relationship between the explanation αx and the predicted output y through the use of the weight αw. This weight is carefully estimated to ensure a meaningful and consistent relationship, so that each explanation naturally leads to the prediction according to human perception, as described in Section “Consequent Estimation p(y|αx)”. Lastly, deep antecedent generation p(αx|x) leverages the deep representation of input x learned by the given deep model f to infer an appropriate explanation α, as elaborated in Section “Deep Antecedent Generation p(αx|x)”.

The sparsity constraint Ω(αx) ≤ S for the explanations can be enforced by sampling from p(αx|x). In particular, we rewrite Eq. (5) as an expectation and estimate it through sampling:

$$\begin{array}{lll}p({y}| {\bf{x}},b)\;\propto \;\sum\limits_{{{\boldsymbol{\alpha }}}_{x}}p(b| {{\boldsymbol{\alpha }}}_{x})\cdot p({y}| {{\boldsymbol{\alpha }}}_{x})\cdot p({{\boldsymbol{\alpha }}}_{x}| {\bf{x}})\\\qquad\qquad =\mathop{{\mathbb{E}}}\limits_{{{\boldsymbol{\alpha }}}_{x} \sim p({{\boldsymbol{\alpha }}}_{x}| {\bf{x}})}p(b| {{\boldsymbol{\alpha }}}_{x})\cdot p({y}| {{\boldsymbol{\alpha }}}_{x})\approx \frac{1}{S}\sum\limits_{\begin{array}{c}s\in [1,S]\\ {{\boldsymbol{\alpha }}}_{x}^{(s)} \sim p({{\boldsymbol{\alpha }}}_{x}| {\bf{x}})\end{array}}p(b| {{\boldsymbol{\alpha }}}_{x}^{(s)})\,p({y}| {{\boldsymbol{\alpha }}}_{x}^{(s)})\end{array}$$

(6)

where \({{\boldsymbol{\alpha }}}_{x}^{(s)}\) is the sth sample of αx. For example, to maximize the approximation term with S = 1, the antecedent generator p(αx|x) must find a single sample \({{\boldsymbol{\alpha }}}_{x}^{(s)}\) that yields the largest \(p(b| {{\boldsymbol{\alpha }}}_{x}^{(s)})p({y}| {{\boldsymbol{\alpha }}}_{x}^{(s)})\), and it needs to assign a high probability to the best \({{\boldsymbol{\alpha }}}_{x}^{(s)}\). Otherwise, other samples with a lower \(p(b| {{\boldsymbol{\alpha }}}_{x}^{(s)})p({y}| {{\boldsymbol{\alpha }}}_{x}^{(s)})\) may be generated, thereby decreasing p(y|x, b). This ensures the sparsity of p(αx|x), which improves the model interpretability.
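
A hedged sketch of the Monte-Carlo approximation in Eq. (6): p(y|x, b) is estimated by averaging the product of the prior score and the rule confidence over S sampled antecedents. The three callables are placeholders for the modules introduced in the following sections, not actual functions from the paper’s code.

```python
def estimate_p_y(x, y, sample_antecedent, prior_prob, consequent_prob, S=1):
    """Monte-Carlo estimate of p(y | x, b) following Eq. (6).

    sample_antecedent(x)        -> one antecedent drawn from p(alpha_x | x)
    prior_prob(alpha_x)         -> p(b | alpha_x), human-prior score of the rule
    consequent_prob(alpha_x, y) -> p(y | alpha_x), confidence of the rule
    """
    total = 0.0
    for _ in range(S):
        alpha_x = sample_antecedent(x)
        total += prior_prob(alpha_x) * consequent_prob(alpha_x, y)
    return total / S
```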

Human prior p(b|αx)

The human prior p(b|αx) = ph(b|αx) ps(b|αx) consists of hard priors ph(b|αx) and soft priors ps(b|αx).

Hard priors define the feasible solution space for the rules: ph(b|αx) = 0 if αx is not a feasible solution. Humans can easily define hard priors on αx by choosing the atom types, such as whether the interpretable features are words, phrases, or statistics like word frequency, and the antecedent’s maximum length L. SELOR does not require a predefined rule set. Nonetheless, we allow users to provide one if it is more desirable in some application scenarios. A large solution space increases the time cost for deep logic rule reasoning (Section “Optimization and Time Complexity”) but also decreases the probability of introducing undesirable bias.

Soft priors model different levels of human preference for logic rules. For example, people may prefer shorter rules or high-coverage rules that satisfy many input samples. The energy function can parameterize such soft priors: \({p}_{s}(b| {{\boldsymbol{\alpha }}}_{x})\propto \exp (-{{\mathcal{L}}}_{b}({{\boldsymbol{\alpha }}}_{x}))\), where \({{\mathcal{L}}}_{b}\) is the loss function for punishing undesirable logic rules. We do not include any soft priors in our current implementation.

For example, suppose we are inducing logic rules to explain a sentiment classifier’s decision on restaurant reviews. The interpretable features αx may include binary indicators for the presence of words like “awesome”, “tasty”, or “not” in the input text. A hard prior ph(b|αx) may rule out any rule that includes more than L = 2 words in the antecedent (e.g., a rule using “awesome” and “not” is allowed, but not one using “awesome”, “not”, and “tasty” together), if the user has defined a maximum antecedent length of 2 as part of their hard prior. A soft prior ps(b|αx) can reflect a user’s preferences over logic rules. For instance, if a user prefers commonly used words, a rule like “awesome ≥ 1” may be favored over “pulchritudinous ≥ 1”, even though both convey a positive meaning.
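
The restaurant-review example above might be scored as in the following sketch, reusing the illustrative Atom class from earlier; the length cap of 2 and the word-frequency penalty are the example preferences from this paragraph, not a prescribed configuration (the current implementation uses no soft priors).

```python
import math

def hard_prior(antecedent, max_length=2):
    """p_h(b | alpha_x): 1 if the rule is feasible, 0 otherwise (here, a length cap)."""
    return 1.0 if len(antecedent) <= max_length else 0.0

def soft_prior(antecedent, word_freq):
    """Unnormalized p_s(b | alpha_x) proportional to exp(-L_b); here L_b penalises rare words."""
    loss_b = sum(-math.log(word_freq.get(atom.word, 1e-6)) for atom in antecedent)
    return math.exp(-loss_b)
```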

Consequent estimation p(y|αx)

Consequent estimation models p(y|αx), the relationship between the antecedent αx and the prediction y, using the weight αw. The weight αw is computed to ensure a meaningful and consistent relationship, so that each explanation naturally leads to the prediction according to human perception. This is achieved by testing the logic rule αx ⇒ y across the entire training dataset, ensuring that it represents the human knowledge embedded in the data distribution.

A straightforward way to compute p(y|αx) is empirical estimation: first, collect all samples that satisfy the antecedent αx, and then calculate the percentage of them that have label y (ref. 30). For example, given the explanation αx = “awesome ≥2”, if we obtain all instances in which “awesome” appears at least twice and find that 90% of them have the label y = positive sentiment, then p(y|αx) = 0.9. A large p(y|αx) corresponds to global patterns that naturally align with human perception. Mathematically, this is equivalent to approximating p(y|αx) with the empirical probability \(\hat{p}({y}| {{\boldsymbol{\alpha }}}_{x})\):

$$\hat{p}({y}| {{\boldsymbol{\alpha }}}_{x})={n}_{{{\boldsymbol{\alpha }}}_{x},y}/{n}_{{{\boldsymbol{\alpha }}}_{x}}$$

(7)

where \({n}_{{{\boldsymbol{\alpha }}}_{x},y}\) is the number of training samples that satisfy the antecedent αx and have the consequent y, and \({n}_{{{\boldsymbol{\alpha }}}_{x}}\) is the number of training samples that satisfy the antecedent αx. Directly setting p(y|αx) to \(\hat{p}(y| {{\boldsymbol{\alpha }}}_{x})\) can cause three problems. First, when \({n}_{{{\boldsymbol{\alpha }}}_{x}}\) is not large enough, the empirical probability \(\hat{p}(y| {\boldsymbol{\alpha }})\) may be inaccurate, and the modeling of such uncertainty is inherently missing in this formulation. Second, statistically modeling the probability of y based solely on αx, without details about the contributions of each atom in αx, may leave users feeling there is still an unaddressed part of the explanation. Third, computing \(\hat{p}(y| {\boldsymbol{\alpha }})\) for every antecedent α is intractable, since the number of feasible antecedents A increases exponentially with the antecedent length L.
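
For reference, the naive empirical estimate of Eq. (7) can be sketched as a direct pass over a small in-memory training set, reusing the illustrative `satisfies` helper from earlier; this is the baseline whose three problems motivate the neural estimator described next.

```python
def empirical_probability(antecedent, y, train_data):
    """Eq. (7): n_{alpha_x, y} / n_{alpha_x} over training pairs (x_j, y_j)."""
    covered = [(xj, yj) for xj, yj in train_data if satisfies(antecedent, xj)]
    n_alpha = len(covered)
    if n_alpha == 0:
        return 0.0, 0                       # no coverage; caller must handle
    n_alpha_y = sum(1 for _, yj in covered if yj == y)
    return n_alpha_y / n_alpha, n_alpha     # empirical probability and coverage
```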

To address the aforementioned problems, we employ a neural estimation of a categorical distribution, which jointly models \(\hat{p}(y| {{\boldsymbol{\alpha }}}_{x})\) and the uncertainty caused by low-coverage antecedents. For example, suppose the antecedent αx is “tasty ≥2”. If this antecedent is satisfied by only 3 training samples, among which 2 have the label y = negative sentiment, then the empirical estimate \(\hat{p}(y| {{\boldsymbol{\alpha }}}_{x})=2/3\). However, since \({n}_{{{\boldsymbol{\alpha }}}_{x}}=3\) is small, the model considers this estimate uncertain, and the resulting p(y|αx) is smoothed toward a more uniform distribution according to the learned β. In contrast, for a high-coverage antecedent like “great ≥1”, if 900 out of 1000 samples have positive sentiment, the empirical estimate 0.9 will be trusted more, and p(y|αx) will stay close to 0.9. See results in Section “Explainability Evaluation on Data Consistency” for the approximation capability of our model.

Assume that, given antecedent αx, the class y follows a categorical distribution, where each category corresponds to a class. We define β as the concentration hyperparameter of this categorical distribution, which controls how uniformly the probability is distributed across the classes – higher values of β lead to more uniform distributions, while lower values concentrate the probability on fewer classes. Then, according to the posterior predictive distribution, y takes one of K potential classes, and we may compute the probability of a new observation y given existing observations:

$$p(y| {{\boldsymbol{\alpha }}}_{x})=p({y}| {{\mathcal{Y}}}_{{{\boldsymbol{\alpha }}}_{x}},\beta )\approx \frac{\hat{p}({y}| {{\boldsymbol{\alpha }}}_{x}){n}_{{{\boldsymbol{\alpha }}}_{x}}+\beta }{{n}_{{{\boldsymbol{\alpha }}}_{x}}+K\beta }$$

(8)

Here, \({{\mathcal{Y}}}_{{{\boldsymbol{\alpha }}}_{x}}\) denotes the \({n}_{{{\boldsymbol{\alpha }}}_{x}}\) observations of the class label y obtained by checking the training data, and β is trained automatically. Eq. (8) becomes Eq. (7) when \({n}_{{{\boldsymbol{\alpha }}}_{x}}\) increases to ∞, and becomes a uniform distribution when \({n}_{{{\boldsymbol{\alpha }}}_{x}}\) goes to 0. Thus, a low-coverage antecedent with a small \({n}_{{{\boldsymbol{\alpha }}}_{x}}\) is considered uncertain (i.e., close to a uniform distribution). By optimizing Eq. (8), our method automatically balances the empirical probability \(\hat{p}(y| {{\boldsymbol{\alpha }}}_{x})\) and the number of observations \({n}_{{{\boldsymbol{\alpha }}}_{x}}\). The probability p(y|αx) also serves as the confidence score for the logic rule αx ⇒ y.
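
A small sketch of the smoothing in Eq. (8), built on the empirical counts above; in the actual model β is a trained scalar, so fixing it here is purely for illustration.

```python
def smoothed_probability(antecedent, y, train_data, num_classes, beta=1.0):
    """Eq. (8): posterior predictive estimate of p(y | alpha_x).

    Low-coverage antecedents are pulled toward the uniform distribution 1/K;
    high-coverage antecedents stay close to the empirical estimate.
    """
    p_hat, n_alpha = empirical_probability(antecedent, y, train_data)
    return (p_hat * n_alpha + beta) / (n_alpha + num_classes * beta)

# "tasty >= 2" satisfied by 3 samples, 2 of them negative: empirical 2/3,
# smoothed (K = 2, beta = 1) to (2 + 1) / (3 + 2) = 0.6, i.e. less certain.
```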

We then employ the atom weight αw to model \(\hat{p}(y| {{\boldsymbol{\alpha }}}_{x})\) based on the contribution of each atom in αx. We adopt a deep neural network as the consequent estimator that predicts αw, which improves generalization to unseen cases and enhances noise handling. Details about this deep neural network are given in Section C.3 of the Supplementary Information. Given the chosen antecedent αx for the given input x, we denote an arbitrary data sample in the dataset by xj ∈ X. The candidate set of atoms for xj is denoted by \({\mathcal{C}}({{\bf{x}}}^{j})\). Each atom candidate in \({\mathcal{C}}({{\bf{x}}}^{j})\) should satisfy both global and local constraints. The hard priors discussed in Section “Human Prior p(b|αx)” provide the global constraint, ensuring that the atom conforms to a human-defined logical form. The local constraint requires that xj satisfies the atom. An atom “awesome > 1”, for example, is sampled only if xj mentions “awesome” more than once. Next, uj is the vector that indicates whether each atom oi in αx is also included in the candidate set \({\mathcal{C}}({{\bf{x}}}^{j})\), i.e., \({u}_{i}^{j}={\mathbb{I}}({o}_{i}\in {\mathcal{C}}({{\bf{x}}}^{j}))\), where \({u}_{i}^{j}\) is the i-th element of uj and \({\mathbb{I}}\) denotes an indicator function. Additionally, we define the region \({{\mathcal{R}}}_{i}\) of the atom oi as the set of training data samples that satisfy atom oi, and we define \({\mathcal{R}}={{\mathcal{R}}}_{1}\cup \ldots \cup {{\mathcal{R}}}_{L}\) as the entire region of the antecedent αx. The deep model then predicts \({{\boldsymbol{\alpha }}}_{w}=\left(\begin{array}{ccc}{w}_{11}&\ldots &{w}_{1L}\\ \ldots &\ldots &\ldots \\ {w}_{K1}&\ldots &{w}_{K\,L}\end{array}\right)\) from αx and minimizes the following loss objective.

$${{\mathcal{L}}}_{w}=\sum _{({{\bf{x}}}_{j},{y}_{j})\in {\mathcal{R}}}CrossEntropyLoss({{\boldsymbol{\alpha }}}_{w}^{T}{{\bf{u}}}^{j},{y}_{j})$$

(9)

This regression tests combinations of atoms in αx on the training dataset, allowing the deep model to learn αw as the relationship between the prediction y and each atom, based on the human knowledge reflected in the labels. Since x naturally satisfies all atoms in αx, we calculate the sum of αw across the atoms to derive a logit for each class. We then apply a softmax function across classes to obtain \(\tilde{p}(y| {{\boldsymbol{\alpha }}}_{x})\), which represents the predicted empirical probability.

$$\tilde{p}({y}| {{\boldsymbol{\alpha }}}_{x})=\frac{\exp \left({\sum }_{i\in [1,L]}{w}_{yi}\right)}{{\sum }_{k\in [1,K]}\exp \left({\sum }_{i\in [1,L]}{w}_{ki}\right)}$$

(10)
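
The two equations above can be sketched in PyTorch as follows: the consequent estimator outputs a K × L weight matrix αw, the region loss of Eq. (9) scores the indicator-masked per-sample logits against the labels, and Eq. (10) turns the summed atom weights into the predicted probability. Tensor shapes and the float 0/1 indicator matrix are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def predicted_probability(alpha_w: torch.Tensor) -> torch.Tensor:
    """Eq. (10): softmax over per-class sums of atom weights.

    alpha_w: (K, L) weight matrix predicted for one antecedent of length L.
    """
    logits = alpha_w.sum(dim=1)      # the input x satisfies every atom in alpha_x
    return F.softmax(logits, dim=0)

def region_loss(alpha_w: torch.Tensor, u: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Eq. (9): cross-entropy of the atom-weight logits over the region samples.

    u: (N, L) float 0/1 indicators of which atoms each region sample x_j satisfies.
    y: (N,)  integer class labels of the region samples.
    """
    logits = u @ alpha_w.t()         # (N, K) class logits per region sample
    return F.cross_entropy(logits, y)
```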

Subsequently, we use the multi-task learning framework in ref. 31 to train the neural network, ensuring that its predicted probability \(\tilde{p}({y}| {{\boldsymbol{\alpha }}}_{x})\) aligns with the empirical probability. This alignment is achieved by minimizing the loss specified in the following equation.

$${{\mathcal{L}}}_{r}=\frac{1}{2{\sigma }_{p}^{2}}| | \hat{p}({y}| {\boldsymbol{\alpha }})-\tilde{p}({y}| {\boldsymbol{\alpha }})| {| }^{2}+\frac{1}{2{\sigma }_{n}^{2}}| | {n}_{{\boldsymbol{\alpha }}}-{\tilde{n}}_{{\boldsymbol{\alpha }}}| {| }^{2}+\log {\sigma }_{p}{\sigma }_{n},$$

(11)

where \({\tilde{n}}_{{\boldsymbol{\alpha }}}\) is the predicted coverage given by the neural model, and σp and σn are the standard deviations of the ground-truth probability and coverage, respectively. Finally, we combine the two loss objectives \({{\mathcal{L}}}_{r}\) and \({{\mathcal{L}}}_{w}\) with a hyperparameter λ for training the neural consequent estimator.

$${{\mathcal{L}}}_{c}={{\mathcal{L}}}_{r}+\lambda {{\mathcal{L}}}_{w}$$

(12)
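
A hedged sketch of the combined objective in Eqs. (11) and (12), parameterising the two standard deviations through learnable log-values for numerical stability; that parameterisation and the default λ are assumptions, not details given in the text.

```python
import torch
import torch.nn as nn

class ConsequentLoss(nn.Module):
    """L_c = L_r + lambda * L_w, with L_r the uncertainty-weighted regression loss."""

    def __init__(self, lam: float = 1.0):
        super().__init__()
        self.lam = lam
        # learnable log standard deviations for the probability and coverage targets
        self.log_sigma_p = nn.Parameter(torch.zeros(()))
        self.log_sigma_n = nn.Parameter(torch.zeros(()))

    def forward(self, p_hat, p_tilde, n_true, n_pred, loss_w):
        # Eq. (11): weight each regression term by its learned uncertainty.
        loss_r = (
            0.5 * torch.exp(-2 * self.log_sigma_p) * (p_hat - p_tilde).pow(2).sum()
            + 0.5 * torch.exp(-2 * self.log_sigma_n) * (n_true - n_pred).pow(2).sum()
            + self.log_sigma_p + self.log_sigma_n
        )
        # Eq. (12): add the atom-weight loss L_w with hyperparameter lambda.
        return loss_r + self.lam * loss_w
```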

Deep antecedent generation p(αx|x)

Deep antecedent generation finds an antecedent αx for the explanation by reshaping the given deep model f. Specifically, we replace the prediction layer of the backbone model f with an explanation generator, so that the latent representation z of input x is mapped to an explanation instead of directly to a prediction (e.g., a class label). We outline the generation process with a formal definition. First, we precompute the embedding of each atom by averaging the embeddings of all training instances that satisfy the atom. During both training and inference, the antecedent generator sequentially selects atoms to form an explanation. At each selection step, the input embedding z is combined with the embeddings of the previously selected atoms to form a latent representation h via an encoder. We compute a probability distribution over candidate atoms based on their similarity to h, and an atom is sampled from this distribution using the Gumbel-softmax trick, excluding already selected atoms. This process repeats until a predefined number of atoms is selected, forming the final antecedent.

Formally, given z, which is the representation of input x in the last hidden layer of f, we generate the explanation α = (o1, …, oL) with a recursive formulation. Note that this process has a complexity that is linear in L (Section “Optimization and Time Complexity”). Specifically, given z and o1, …, oi−1, we obtain atom oi by

$${{\bf{h}}}_{i}=Encoder([{\bf{z}};{{\bf{o}}}_{1}\ldots ;{{\bf{o}}}_{i-1}]),\quad p({o}_{i}| {\bf{x}},{o}_{1}\ldots ,{o}_{i-1})=\frac{{\mathbb{I}}({o}_{i}\in {\mathcal{C}}({\bf{x}}))\exp ({{\bf{h}}}_{i}^{T}{{\bf{o}}}_{i})}{{\sum }_{\tilde{o}}{\mathbb{I}}(\tilde{o}\in {\mathcal{C}}({\bf{x}}))\exp ({{\bf{h}}}_{i}^{T}\tilde{{\bf{o}}})}$$

(13)

where oi is the embedding of atom oi and Encoder is a neural sequence encoder such as a GRU (ref. 32) or a Transformer (ref. 33). \({\mathbb{I}}\) is the indicator function, and \({\mathcal{C}}({\bf{x}})\) is the set of atom candidates for x. Note that we set the probability of atoms that do not satisfy the global or local constraints to zero. This ensures that only atoms satisfying the specified conditions will be chosen in the subsequent sampling process. We then sample oi from p(oi|x, o1, …, oi−1) in a differentiable way to ensure end-to-end training:

$${o}_{i}=Gumbel(p(\tilde{o}\in {\mathcal{C}}({\bf{x}})\subset {\mathcal{O}}\,| \,{\bf{x}},{o}_{1}\ldots ,{o}_{i-1})),\,\,\,p({{\boldsymbol{\alpha }}}_{x}| {\bf{x}})=\prod _{i\in [1,L]}p({o}_{i}| {\bf{x}},{o}_{1}\ldots ,{o}_{i-1})$$

(14)

Gumbel is the Straight-Through Gumbel-Softmax (ref. 34), a differentiable function for sampling discrete values. An atom oi is represented as a one-hot vector of dimension \(| {\mathcal{O}}|\), where \({\mathcal{O}}\) is the set of all atoms that satisfy the hard priors. This one-hot vector is then multiplied with the embedding matrix of atoms to derive the embedding oi.
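
A condensed PyTorch sketch of the generation loop in Eqs. (13) and (14): at each step the encoder state is compared with the precomputed atom embeddings, infeasible or already-selected atoms are masked out, and one atom is drawn with the straight-through Gumbel-softmax so sampling stays differentiable. The encoder interface and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def generate_antecedent(z, atom_emb, candidate_mask, encoder, length):
    """Sequentially sample `length` atoms for one input representation z.

    z:              (d,)     last-hidden-layer representation of the input x
    atom_emb:       (|O|, d) precomputed atom embeddings
    candidate_mask: (|O|,)   bool, True for atoms meeting the hard/local constraints
    encoder:        module mapping the sequence [z; o_1; ...; o_{i-1}] to h_i of size d
    """
    chosen, available = [], candidate_mask.clone()
    prefix = [z]
    for _ in range(length):
        h_i = encoder(torch.stack(prefix).unsqueeze(0)).squeeze(0)   # (d,)
        scores = atom_emb @ h_i                                      # similarity, Eq. (13)
        scores = scores.masked_fill(~available, float("-inf"))       # zero out infeasible atoms
        one_hot = F.gumbel_softmax(scores, tau=1.0, hard=True)       # ST sample, Eq. (14)
        chosen.append(one_hot)
        available = available & (one_hot < 0.5)        # exclude the atom just selected
        prefix.append(one_hot @ atom_emb)              # embedding o_i of the sampled atom
    return chosen                                      # list of one-hot atom vectors
```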

Optimization and time complexity

A deep logic rule reasoning model is learned in two steps. The first step optimizes the neural consequent estimator to learn p(y|αx) by minimizing the loss \({{\mathcal{L}}}_{c}\) in Eq. (12). The second step converts the deep model f into an explainable version by maximizing p(y|x, b) in Eq. (6) with a cross-entropy loss. This is equivalent to minimizing the loss \({{\mathcal{L}}}_{d}=-{{\mathcal{L}}}_{b}({{\boldsymbol{\alpha }}}_{x}^{(s)})-\log p({y}^{* }| {{\boldsymbol{\alpha }}}_{x}^{(s)})\). Here, \(-{{\mathcal{L}}}_{b}({{\boldsymbol{\alpha }}}_{x}^{(s)})\) penalizes explanations that do not fit humans’ prior preferences for rules, while \(\log p({y}^{* }| {{\boldsymbol{\alpha }}}_{x}^{(s)})\) encourages an antecedent \({{\boldsymbol{\alpha }}}_{x}^{(s)}\) that leads to the ground-truth class y* with high confidence. We repeat the first and second steps at every batch iteration. For stable optimization, the parameters of the antecedent generator are frozen during the first step, and those of the consequent estimator are frozen during the second step.
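
The alternating schedule described above might look like the following sketch; the two loss callables and the optimizer plumbing are simplified placeholders rather than the actual training code.

```python
def train_step(batch, antecedent_generator, consequent_estimator,
               opt_generator, opt_estimator, consequent_loss, explanation_loss):
    # Step 1: update the consequent estimator (antecedent generator frozen).
    for p in antecedent_generator.parameters():
        p.requires_grad_(False)
    opt_estimator.zero_grad()
    consequent_loss(batch, consequent_estimator).backward()    # L_c, Eq. (12)
    opt_estimator.step()
    for p in antecedent_generator.parameters():
        p.requires_grad_(True)

    # Step 2: update the antecedent generator (consequent estimator frozen).
    for p in consequent_estimator.parameters():
        p.requires_grad_(False)
    opt_generator.zero_grad()
    explanation_loss(batch, antecedent_generator).backward()   # L_d, via Eq. (6)
    opt_generator.step()
    for p in consequent_estimator.parameters():
        p.requires_grad_(True)
```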

We analyze the per-sample time complexity of the modules in our method. The complexity is \(O(L\cdot | {\mathcal{O}}| )\) for antecedent generation and \(O({L}^{2}+L| {\mathcal{R}}| )\) for the neural consequent estimator. Therefore, the total time complexity is \(O(L| {\mathcal{O}}| +L| {\mathcal{R}}| )\), since \(L\ll | {\mathcal{O}}|\) and \(L\ll | {\mathcal{R}}|\).


