AI Insights
3 Genius Artificial Intelligence (AI) Stocks Billionaires Are Buying That You Should Too

Key Points
- Nvidia is a popular holding among fund managers.
- Bill Ackman took a massive stake in Amazon during Q2.
- Taiwan Semiconductor is an excellent way to play the artificial intelligence (AI) arms race thanks to its neutrality.
Looking at what billionaire hedge fund managers are doing is a great idea for investors, as it gives you a chance to see what top minds in the investing world think about trends like artificial intelligence (AI). If these massive funds start to sell out of their positions, it could be a warning sign for investors that the party is over. But if they’re increasing their stakes, it could be a bullish indicator.
Three prominent hedge fund managers recently increased their stakes in some of the top names in AI. All three purchases indicate that the AI arms race is gaining momentum, and there is still significant room for these top AI picks to make investors substantial profits.
1. Philippe Laffont: Nvidia
Philippe Laffont runs Coatue Management, which recently purchased more of one of the most prominent stocks in the AI realm: Nvidia (NASDAQ: NVDA). Investors gained access to this information after the fund's Form 13F was made public 45 days after Q2 ended. The SEC requires this filing from any fund with more than $100 million under management, which lets investors track billionaire hedge fund managers' moves over time.
In Q2, Coatue Management increased its Nvidia stake by 34%. This is notable, as the fund had been selling Nvidia stock over the prior three quarters. Coatue clearly sees momentum in Nvidia's stock, and that view makes sense when you look at the business.
Nvidia is about to benefit from some major tailwinds, including gaining its China export business back once the U.S. government approves its export license. Additionally, the AI hyperscalers are all announcing record data center capital expenditures for next year, which bodes well for increased GPU demand.
Although Nvidia has been one of the top-performing stocks over the past few years, there is still plenty of room for it to run, with massive AI computing demand still being fulfilled.
2. Bill Ackman: Amazon
Bill Ackman, who runs Pershing Square Capital Management, unveiled a massive $1.28 billion stake in Amazon (NASDAQ: AMZN) during Q2. The position makes up about 9.3% of the fund's portfolio, so this is no small bet.
It’s also a smart one, as Amazon has extensive exposure to the AI space through its cloud computing platform, Amazon Web Services (AWS). AWS allows clients to rent computing power from Amazon’s servers to run AI workloads on. Computing clusters are expensive to build for fledgling AI companies, so renting makes a ton of sense here.
Additionally, AWS is a huge part of Amazon’s profit picture, making up 53% of total operating profits in Q2.
Amazon is a smart AI stock pick, and with Ackman bullish on it, it’s a great sign for investors.
3. Stanley Druckenmiller: Taiwan Semiconductor
Last is Taiwan Semiconductor (NYSE: TSM), which Stanley Druckenmiller's Duquesne Family Office purchased. The firm increased its stake by 28% in Q2, making Taiwan Semi the fifth-largest position in its portfolio. Taiwan Semiconductor is also the firm's largest exposure to AI, making it a sizable bet.
Taiwan Semiconductor is a chip manufacturer that produces chips for some of the biggest names in the tech industry, like Apple and Nvidia. It has huge momentum and is putting up excellent growth. In Q2, TSMC’s revenue increased by 44% in U.S. dollars, and that growth appears to be sticking around.
As demand for AI computing power increases, so will chip demand. Because TSMC is a critical supplier to nearly every company in this space, it appears to be a top stock pick to capitalize on the AI build-out.
Keithen Drury has positions in Amazon, Nvidia, and Taiwan Semiconductor Manufacturing. The Motley Fool has positions in and recommends Amazon, Apple, Nvidia, and Taiwan Semiconductor Manufacturing. The Motley Fool has a disclosure policy.
AI Insights
LifeGPT: topology-agnostic generative pretrained transformer model for cellular automata

Code, data, and additional animations/figures are available at https://github.com/lamm-mit/LifeGPT.
Model architecture and hardware information
LifeGPT was constructed in Python using the “x-transformers” library [65]. The models in this study were trained on a workstation equipped with a high-end CUDA-compatible GPU (RTX A4000, NVIDIA, Santa Clara, CA, USA) for a total of 50 epochs on a 10,000-sample training set.
Hyperparameters
Hyperparameters were initially selected heuristically to maximize performance within the 16 GB of VRAM available on the GPU primarily used for training (RTX A4000, NVIDIA, Santa Clara, CA, USA). Unless otherwise stated, all instances of LifeGPT were trained with the hyperparameters described in Table 1. The batch size was initially set to 20 samples and was decreased to 5 samples for later versions of LifeGPT due to memory limitations encountered when using FCM (see “Forgetful causal masking (FCM) implementation”).
Datasets
Data generation overview
To generate training sets, validation sets, and testing sets, the same basic strategy was used. First, IC game-states were generated stochastically as 2D, 32 × 32 NumPy arrays. Depending on the exact algorithm used, the generated IC game-states would collectively form either high-entropy or broad-entropy datasets. Next, a custom Life Python class was used to generate the corresponding next-game-state (NGS) for every previously generated IC. Lastly, each IC and its corresponding NGS were concatenated within a string, and every generated pair was stored within a dataframe for future retrieval.
Data topology
Transformer models are architected to process data as 1D arrays. Therefore, to teach LifeGPT the rules of a 2D CA algorithm, such as Life, the 2D data from each time slice of the game had to be flattened into a 1D array. In this way, LifeGPT functioned similarly to a vision transformer, in which 2D data is flattened into a 1D array whose entries are tokenizable image patches [26]. However, because of the low resolution of the 32 × 32 toroidal grid on which Life was simulated to generate our training data, we were able to encode every pixel of each time slice of the game in a 1D array (as opposed to grouping pixels into patches).
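As a minimal sketch of this per-cell flattening (illustrative names only, not the authors' code), each 32 × 32 game-state can be serialized row by row into a 1,024-character string:

```python
# Minimal sketch of per-cell flattening; names are illustrative, not the authors' code.
import numpy as np

grid = (np.random.random((32, 32)) < 0.5).astype(np.uint8)  # a 32 x 32 binary game-state

# Row-major flattening: every cell becomes one character in a 1D sequence,
# rather than being grouped into multi-pixel patches as in a vision transformer.
flat = "".join(str(cell) for cell in grid.flatten())
assert len(flat) == 32 * 32  # 1024 single-cell entries
```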
Instruction tuning
In order to encode the time progression of the game into the training set, the initial-state and next-state 1D arrays were placed within a prompt string, which was subsequently tokenized to form a vector. Specifically, both 1D arrays were converted to strings and placed within a larger string containing start and end tokens (@ and $, respectively), a task statement, and bracket delimiters (e.g., “@PredictNextState…”).
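A hedged sketch of this prompt assembly is shown below; the square-bracket delimiters are assumptions, since the example above is truncated in this excerpt.

```python
# Illustrative prompt assembly. The square-bracket delimiters are placeholders
# (the exact delimiter characters are not reproduced in this excerpt).
def build_prompt(ic_flat: str, ngs_flat: str) -> str:
    # '@' start token, '$' end token, task statement, bracketed game-states.
    return f"@PredictNextState[{ic_flat}][{ngs_flat}]$"

example = build_prompt("0" * 1024, "1" * 1024)
print(example[:30], "...", example[-5:])
```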
Tokenization
We employed a byte-level tokenizer that operates on UTF-8 encoded text. UTF-8 is a variable-width character encoding capable of representing every character in the Unicode standard, which allows the tokenizer to process a wide range of scripts, symbols, and special characters uniformly. By converting the text into its byte-level representation, our approach ensures consistent tokenization across different languages and handles out-of-vocabulary words and non-standard text, such as emojis or code, effectively. This method allows for robust and flexible processing of diverse textual data. Tokenization resulted in a vector suitable as input to the embedding layer of the transformer model.
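A generic byte-level tokenizer of this kind can be sketched as follows; this illustrates the approach rather than the exact tokenizer class used.

```python
# Generic byte-level tokenization on UTF-8 text (illustrative, not the exact class used).
import torch

def tokenize(text: str) -> torch.Tensor:
    """Map each UTF-8 byte of the prompt to an integer ID in [0, 255]."""
    return torch.tensor(list(text.encode("utf-8")), dtype=torch.long)

tokens = tokenize("@PredictNextState[0101]$")
print(tokens.shape, tokens[:5])  # a 1D vector suitable for the embedding layer
```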
Training set generation
High-entropy IC set generation
High-entropy IC game-states were generated by effectively flipping a coin 1024 times to designate the states (0 or 1) on a 32 × 32 grid. When considering the configuration space of a binary 2D array \(M\in {\{0,1\}}^{32\times 32}\), the following formula may be used to describe its Shannon entropy [66] (informational entropy):
$$H(M)=-\sum _{x\in \{0,1\}}{p}_{x}{\log }_{2}{p}_{x}$$
(1)
(This is also known as the binary entropy function [67].) Here, px is the probability of finding the value x in the 32 × 32 array M. px is defined as:
$${p}_{x}=\frac{1}{3{2}^{2}}\mathop{\sum }\limits_{i=1}^{32}\mathop{\sum }\limits_{j=1}^{32}{\delta }_{{M}_{ij},x}$$
(2)
where Mij is the element of M in the ith row and jth column, and \({\delta }_{{M}_{ij},x}\) is the Kronecker delta function, which is equal to 1 if Mij = x and 0 otherwise.
Thus, for a “50–50 coin toss” scenario (\({p}_{0}={p}_{1}=\frac{1}{2}\)), H(M) is at its maximum value of 1 Sh. Moreover, since binary data necessitates the condition p0 + p1 = 1, only one probability value is needed to fully describe the entropy of a given array M. We therefore denote the ordering of a given IC by a single order parameter, η, where η = p1. When considering the order parameter of a set of ICs, it is important to note that, because IC generation is always a stochastic process, the exact η of any given IC in the set cannot be predicted with certainty. For this reason, we characterize IC sets with the symbol 〈η〉, denoting the expected order parameter.
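As a quick worked example of equation (1) (an illustrative helper, not part of the authors' code), the entropy can be evaluated directly from the order parameter:

```python
# Worked example of the binary entropy in equation (1) as a function of eta = p1.
import numpy as np

def binary_entropy(eta: float) -> float:
    """H(M) in shannons (Sh); 0*log2(0) is taken as 0 by convention."""
    probs = np.array([eta, 1.0 - eta])
    probs = probs[probs > 0]
    return float(-(probs * np.log2(probs)).sum())

print(binary_entropy(0.5))  # 1.0 Sh: the maximum-entropy, "50-50 coin toss" case
print(binary_entropy(0.9))  # ~0.47 Sh: a highly ordered, mostly-alive IC
```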
To generate high-entropy ICs, a binary array was constructed by evaluating random.random() < 0.5 for each element (using the “random” module in Python; see https://python.readthedocs.io/en/latest/library/random.html). If the expression evaluated to True, the element was set to 1; otherwise, it was set to 0. This method resulted in a training set with a binomial, experimentally measured η distribution (Fig. 5A).
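A minimal sketch of this coin-toss generation, assuming the element-wise rule described above:

```python
# Sketch of high-entropy IC generation mirroring the coin-toss rule described above.
import random
import numpy as np

def high_entropy_ic(size: int = 32) -> np.ndarray:
    """Each cell is 1 if random.random() < 0.5, and 0 otherwise."""
    return np.array(
        [[1 if random.random() < 0.5 else 0 for _ in range(size)] for _ in range(size)],
        dtype=np.uint8,
    )

ic = high_entropy_ic()
print(f"measured eta = {ic.mean():.3f}")  # clusters around 0.5 across a large IC set
```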
Broad-entropy IC set generation
To create a broad-entropy IC set, first, a vector was created representing a set of order parameters ranging from 0 to 1. The length of this vector was set to the desired number of samples in the dataset (10,000 for training, 1000 for validation). This set of order parameters may be thought of as containing different expected probabilities for finding a 1 in an IC.
Then, the same procedure as for the high-entropy IC set was followed, with two exceptions: (1) instead of random.random() < 0.5, the determining expression for each element of each IC array was random.random() < η, and (2) each IC was generated using a unique η from the aforementioned vector (see “Training set generation”). This strategy ensured that the IC set represented a broad range of ordering, from all 0s, to a 50–50 mix of 0s and 1s, to all 1s (Fig. 5B).
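A minimal sketch of this broad-entropy generation, assuming linearly spaced order parameters as described above (illustrative code, not the authors' script):

```python
# Sketch of broad-entropy IC set generation: one target order parameter per sample,
# spread linearly from 0 to 1.
import random
import numpy as np

n_samples = 10_000                        # 1000 for the validation set
etas = np.linspace(0.0, 1.0, n_samples)   # expected order parameter for each IC

def ic_from_eta(eta: float, size: int = 32) -> np.ndarray:
    return np.array(
        [[1 if random.random() < eta else 0 for _ in range(size)] for _ in range(size)],
        dtype=np.uint8,
    )

broad_set = [ic_from_eta(eta) for eta in etas]  # spans all 0s through all 1s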
Next-game-state generation
NGSs were calculated from IC arrays by applying Life rules on a toroidal grid (see the update_grid() function in game.py).
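For reference, a minimal stand-in for such a toroidal Life update (not the authors' update_grid() implementation) can be written with wrapped neighbour sums:

```python
# Minimal Life update on a toroidal (periodic) grid; a stand-in for update_grid() in game.py.
import numpy as np

def next_game_state(grid: np.ndarray) -> np.ndarray:
    """Apply Conway's Game of Life rules with periodic boundary conditions."""
    # Sum the 8 neighbours by rolling the grid in every direction (wrapping at the edges).
    neighbours = sum(
        np.roll(np.roll(grid, dx, axis=0), dy, axis=1)
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    survive = (grid == 1) & ((neighbours == 2) | (neighbours == 3))
    born = (grid == 0) & (neighbours == 3)
    return (survive | born).astype(np.uint8)

ic = (np.random.random((32, 32)) < 0.5).astype(np.uint8)
ngs = next_game_state(ic)
```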
Reshaping data
To make handling of the training set data easier, the final stage of the training set generator reshapes the data into a list of sub-lists, where each sub-list corresponds to a specific IC. Within each sub-list, two strings are stored: one corresponding to a flattened IC and one corresponding to the flattened NGS (see the generate_sets() function in game.py).
Validation set generation
Validation sets were generated using the same methods as in “Training set generation”; the random.random() function provides sufficiently random IC generation that training and validation sets remained entirely independent. Combined with the enormous space of possible 32 × 32 binary arrays (\({2}^{32\times 32}\approx 1.80\times 1{0}^{308}\) unique possibilities), this made the likelihood of even a single sample being identical between a 10,000-sample training set and a 1000-sample validation set negligible (see “Learning abilities”). This, in turn, ensured that over the course of model training, training loss and validation loss remained independent of one another.
Testing set generation
A 10-sample testing set was constructed to validate the performance of models during and after training in a manner other than inspecting the validation and training losses. Five samples in the testing set were generated stochastically in the same manner as in “Training set generation,” and five samples were manually defined to match known periodic and complex patterns found in Life (Fig. 3). NGSs were recursively generated for a total of 10 states (including the IC) per sample, for all 10 samples in the testing set.
Dataset generation for differently sized grids
For the datasets (training, validation, testing) for LifeGPT-MultiGrid (see “Learning life on differently sized grids”), the only differences in the procedure were to specify different grid sizes (WG ∈ {2, 4, 8, 16}) during IC generation, and to introduce a padding character (“p”), which was appended as many times as needed to the end of each sub-list whose grid size was smaller than the largest specified grid size, such that all sub-lists were the same length.
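A simplified sketch of this padding step follows; the exact placement of the padding characters within each sub-list is an assumption here.

```python
# Sketch of multi-grid padding: smaller grids are flattened and right-padded with 'p'
# so every sample has the length of the largest grid (16 x 16 here).
import numpy as np

GRID_WIDTHS = (2, 4, 8, 16)
MAX_LEN = max(GRID_WIDTHS) ** 2

def flatten_and_pad(grid: np.ndarray) -> str:
    flat = "".join(map(str, grid.flatten()))
    return flat + "p" * (MAX_LEN - len(flat))

small = (np.random.random((4, 4)) < 0.5).astype(np.uint8)
padded = flatten_and_pad(small)
assert len(padded) == MAX_LEN  # 16 data characters followed by 240 'p' characters
```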
Forgetful causal masking (FCM) implementation
FCM was implemented using the “x-transformers” library [65], where it is built into the AutoregressiveWrapper class by default. FCM was enabled by setting mask_prob to 0.15, a value empirically shown to be effective by Liu et al. [68].
FCM involves randomly masking a predetermined percentage of past tokens during the learning process, in addition to standard causal attention masking. The authors [68] argue that this method prevents over-attending to more recent tokens in a given sequence, encouraging attention to tokens in the “distant past.” We implemented FCM in our model, which increased the rate at which model accuracy improved with each epoch. Furthermore, FCM enabled our model to achieve 100% accuracy on our testing set with a sampling temperature of 1.0 in fewer than 50 epochs, which was previously unattainable when training with a broad-entropy dataset.
Implementing FCM increased the GPU RAM requirements of our LifeGPT, necessitating a decrease in batch size from 20 to 5 samples.
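A minimal sketch of enabling FCM through the x-transformers AutoregressiveWrapper is shown below; the model dimensions and sequence length are placeholders rather than the Table 1 hyperparameters.

```python
# Sketch of FCM via x-transformers' AutoregressiveWrapper (mask_prob = 0.15 as cited above).
# Model dimensions and sequence length are placeholders, not the Table 1 hyperparameters.
import torch
from x_transformers import TransformerWrapper, Decoder, AutoregressiveWrapper

model = TransformerWrapper(
    num_tokens=256,                                   # byte-level vocabulary
    max_seq_len=2100,                                 # placeholder; must fit the full prompt
    attn_layers=Decoder(dim=512, depth=6, heads=8),   # placeholder sizes
)

# mask_prob = 0.15 randomly hides 15% of past tokens on top of standard causal masking.
wrapped = AutoregressiveWrapper(model, mask_prob=0.15)

optimizer = torch.optim.Adam(wrapped.parameters(), lr=1e-4)
batch = torch.randint(0, 256, (5, 2100))              # batch size of 5, as noted above
loss = wrapped(batch)                                 # wrapper returns the training loss
loss.backward()
optimizer.step()
```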
Model development
Training was initially conducted with high-entropy data. Due to the (pseudo)random nature of our training set generation script (see “Training set generation”) and the high number of samples in the training set (10,000), there was some diversity of training data entropy despite the use of a static order parameter (η = 0.5) (Fig. 5A). Nevertheless, observed model accuracy issues for low-entropy ICs prompted the use of broad-entropy datasets (Fig. 5B), which resulted in improved performance. Later, LifeGPT-MultiGrid (see “Learning life on differently sized grids”) was developed using a modified dataset to show that the LifeGPT framework allowed for simultaneous learning of multiple grid sizes.
Accuracy benchmarking and sampling temperature effects
The testing dataset consisted of 10 flattened 32 × 32 binary arrays, representing initial states in Life, and their resulting iterations, numbered one through ten, in accordance with Life state-transition rules on a toroidal (periodic) grid. Depending on the type of model being trained (the number of desired time-step jump predictions), different columns in the testing dataset would be selected as the ground truth. Accuracy at each checkpoint (every 2 epochs, starting with epoch 2) was determined by prompting the model with the task statement (e.g., “@PredictNextState…”) and each testing IC, then comparing the predicted NGS to the ground truth cell by cell:
$$A=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{\delta }_{{y}_{i}{\hat{y}}_{i}}$$
(3)
where A is the accuracy of the model, N is the total number of cell predictions across the testing dataset (N = 32 × 32 × 10 = 10,240 cells for a dataset with ten pairs of 32 × 32 grid game examples), yi is the ground-truth value, \({\hat{y}}_{i}\) is the predicted value, and δ is the Kronecker delta function, which equals 1 if \({y}_{i}={\hat{y}}_{i}\) and 0 otherwise. An accuracy score was computed once every 2 epochs, starting with epoch 2, for each model sampling temperature in the set {0, 0.25, 0.5, 0.75, 1}.
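As a small illustrative helper (not the authors' benchmarking script), equation (3) reduces to the mean element-wise agreement between prediction and ground truth:

```python
# Sketch of the accuracy metric in equation (3): the mean Kronecker-delta agreement
# between predicted and ground-truth cells across the testing set.
import numpy as np

def accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """y_true, y_pred: flattened binary arrays of equal length (N cells in total)."""
    return float((y_true == y_pred).mean())

# e.g., ten 32 x 32 predictions stacked into one vector of N = 10,240 cells
y_true = np.random.randint(0, 2, 10 * 32 * 32)
y_pred = y_true.copy()
y_pred[:100] ^= 1                  # corrupt 100 cells
print(accuracy(y_true, y_pred))    # ~0.990
```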
Training set entropy effects experimental procedure
The goal of this experiment was to determine what effect, if any, the ordering of the ICs making up the training data for LifeGPT would have on accuracy (A) when the model was fed ICs generated with varying expected order parameters (〈η〉IC). We used two versions of LifeGPT: one trained on high-entropy training data, and the other on broad-entropy training data. Next, a broad-entropy testing set (comprising 110 samples with 〈η〉IC values spaced linearly from 0 to 1) was generated in the same manner as the broad-entropy training set. The stochasticity of the IC generation process ensured that both broad-entropy sets remained independent. Both models were then benchmarked on each sample in a manner similar to the method in “Accuracy benchmarking and sampling temperature effects,” the only difference being that A was calculated for each sample in the testing set, as opposed to an average over all samples. Finally, A versus 〈η〉IC was plotted for both models (see Fig. 4).
Autoregressive loop implementation
The autoregressive loop is simply an implementation of LifeGPT in which the model is placed inside a loop: the portion of its output corresponding to the NGS is converted into an input tensor and fed back into LifeGPT for a desired number of iterations. As such, the NGS output of the previous loop iteration serves as the IC for the next loop iteration. In this way, the autoregressive loop is able to “run” Life in a recursive manner similar to the original algorithm. We ran the autoregressive loop using two versions of LifeGPT trained on the broad-entropy training set: one that stopped training at epoch 16 (chosen because it was the earliest instance of A = 1.0 at a sampling temperature of 1), and one that continued training until epoch 50, across sampling temperatures of 0, 0.25, 0.5, 0.75, and 1. We compared the NGSs output by our autoregressive loop method with the ground-truth NGSs generated with the Life algorithm, and created animations for all model-sampling temperature combinations, showing the progression of the ground-truth Life system, the autoregressive loop-generated NGSs, and the discrepancy between the two (see the sketch after this paragraph).
We also ran the autoregressive loop (and the Life algorithm) for 249 iterations (resulting in 250 game states, including the ICs) for all 10 samples in the testing set, using only the epoch 50, sampling temperature = 0 version of LifeGPT due to time and compute constraints. For each game state, we compared LifeGPT's predictions to the ground-truth (GT) Life algorithm's output using the metric “Error Rate,” defined as:
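The loop itself can be sketched as follows; predict_ngs stands in for a full LifeGPT inference call (tokenize the prompt, generate, extract the NGS substring) and is a hypothetical placeholder, not the authors' implementation.

```python
# Sketch of the autoregressive loop: the model's predicted NGS becomes the next IC.
import numpy as np

def run_autoregressive_loop(ic: np.ndarray, predict_ngs, n_iterations: int = 9) -> list:
    """Repeatedly feed the model's own output back in as the next initial condition."""
    states = [ic]
    for _ in range(n_iterations):
        states.append(predict_ngs(states[-1]))  # model output -> next input (hypothetical callable)
    return states

# With a perfect model this reproduces the ground-truth Life trajectory; the
# per-state discrepancy is what the error rate in equation (4) below quantifies.
```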
$${\rm{Error}}\,{\rm{Rate}}=1-\frac{1}{G}\mathop{\sum }\limits_{i=1}^{G}{\delta }_{{y}_{i}{\hat{y}}_{i}}$$
(4)
where Error Rate is the fraction of cells the model predicted incorrectly, G is the total number of cells comprising each game state (G = 32 × 32 = 1024 cells), yi is the ground-truth value, \({\hat{y}}_{i}\) is the predicted value, and δ is the Kronecker delta function.
LifeGPT-multigrid experimental procedure
Accuracy characterization was performed in the same manner as described in “Accuracy benchmarking and sampling temperature effects,” aside from the use of a different testing dataset. A testing set of 100 samples (25 samples per WG for WG ∈ {2, 4, 8, 16}) was created utilizing broad-entropy IC generation. Inference was performed for each sample, and average accuracies were calculated for each 25-sample group in accordance with equation (3).
Use of generative AI
Some Python scripts used for data generation, model training, data processing, and figure generation were written with the assistance of GPT-3.5, GPT-4, and GPT-4o from OpenAI. All scripts generated or edited in this manner were carefully reviewed, validated, and, where errors were found, manually corrected by an author prior to use in our work.
AI Insights
OpenAI Plans India Data Center in Major Stargate Expansion

OpenAI is seeking to build a massive new data center in India that could mark a major step forward in Asia for its Stargate-branded artificial intelligence infrastructure push.