LifeGPT: topology-agnostic generative pretrained transformer model for cellular automata

Codes, data, and additional animations/figures are available at https://github.com/lamm-mit/LifeGPT.
Model architecture and hardware information
LifeGPT was constructed in Python using the “x-transformers” library65. The models in this study were trained on a workstation equipped with a high-end CUDA-compatible GPU (RTX A4000, NVIDIA, Santa Clara, CA, USA) for a total of 50 epochs on a 10,000-sample training set.
Hyperparameters
Hyperparameters were initially selected heuristically for optimal performance within the 16 GB of VRAM available on the GPU primarily used for training (RTX A4000, NVIDIA, Santa Clara, CA, USA). Unless otherwise stated, all instances of LifeGPT were trained with the set of hyperparameters described in Table 1. The batch size was initially set to 20 samples and was decreased to 5 samples for later versions of LifeGPT due to memory limitations encountered when using FCM (see “Forgetful causal masking (FCM) implementation”).
Datasets
Data generation overview
To generate the training, validation, and testing sets, the same basic strategy was used. First, IC game-states were generated stochastically as 2D, 32 × 32 NumPy arrays. Depending on the exact algorithm used, the generated IC game-states collectively formed either high-entropy or broad-entropy datasets. Next, a custom Life Python class was used to generate the corresponding NGS for every previously generated IC. Lastly, each IC and its corresponding NGS were concatenated within a string. Every generated pair was subsequently stored within a dataframe for future retrieval.
Data topology
Transformer models are architected to process data as 1D sequences. Therefore, to teach LifeGPT the rules of a 2D CA algorithm such as Life, the 2D data from each time slice of the game had to be flattened into a 1D array. In this way, LifeGPT functioned similarly to a vision transformer, in which 2D data is flattened into a 1D array whose entries are tokenizable image patches26. However, due to the low resolution of the 32 × 32 toroidal grid on which Life was simulated to generate our training data, we were able to encode every pixel of each time-slice of the game in a 1D array (as opposed to grouping pixels into patches).
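As a minimal illustration of this flattening (array and variable names here are ours, not the authors'), a 32 × 32 game-state is unraveled row by row into a length-1024 sequence in which every cell becomes its own token:

```python
import numpy as np

# A 2D game-state on the 32 x 32 grid (randomly populated for illustration).
grid = np.random.default_rng(0).integers(0, 2, size=(32, 32), dtype=np.uint8)

# Row-major flattening: cell (i, j) maps to position i * 32 + j in the 1D sequence.
sequence = grid.flatten()
assert sequence.shape == (1024,)

# Each cell is kept as its own character rather than being grouped into patches.
state_string = "".join(map(str, sequence))
```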
Instruction Tuning
In order to encode the time-progression of the game into the training set, the initial-state and next-state 1D arrays were placed within a prompt string, which was subsequently tokenized to form a vector. Specifically, both 1D arrays were converted to strings and placed within a larger string containing start and end tokens (@ and $, respectively), a task statement (“PredictNextState”), and bracket delimiters.
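A minimal sketch of this prompt construction is shown below. Only the @ start token, $ end token, and “PredictNextState” task statement are taken from the text; the exact bracket layout used here is a hypothetical stand-in, since the full example string is not reproduced above.

```python
def build_prompt(ic_string: str, ngs_string: str) -> str:
    """Wrap a flattened IC and its NGS into a single training prompt.

    The '@' start token, '$' end token, and 'PredictNextState' task statement
    follow the text; the bracket delimiters here are a hypothetical layout.
    """
    return f"@PredictNextState<{ic_string}> [{ngs_string}]$"

# Toy example with 4-cell states (real states are 1024 characters long):
prompt = build_prompt("0110", "1001")
# -> '@PredictNextState<0110> [1001]$'
```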
Tokenization
We employed a byte-level tokenizer that operates on UTF-8 encoded text. UTF-8 is a variable-width character encoding capable of representing every character in the Unicode standard, which allows the tokenizer to process a wide range of scripts, symbols, and special characters uniformly. By converting the text into its byte-level representation, our approach ensures consistent tokenization across different languages and handles out-of-vocabulary words and non-standard text, such as emojis or code, effectively. This method allows for robust and flexible processing of diverse textual data. Tokenization resulted in a vector suitable as input to the embedding layer of the transformer model.
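The byte-level tokenization described above can be sketched as follows (a simplified stand-in for the tokenizer actually used, assuming a vocabulary of 256 byte values):

```python
def tokenize(text: str) -> list[int]:
    """Map a prompt string to byte-level token IDs in the range 0-255."""
    return list(text.encode("utf-8"))

def detokenize(token_ids: list[int]) -> str:
    """Invert the mapping, decoding UTF-8 bytes back into text."""
    return bytes(token_ids).decode("utf-8")

tokens = tokenize("@PredictNextState")
assert detokenize(tokens) == "@PredictNextState"
```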
Training set generation
High-entropy IC set generation
High-entropy IC game-states were generated by effectively flipping a coin 1024 times to designate the states (0 or 1) on a 32 × 32 grid. When considering the configuration space of a binary 2D array M ∈ {0, 1}32×32, the following formula may be used to describe its Shannon entropy66 (informational entropy):
$$H(M)=-\sum _{x\in \{0,1\}}{p}_{x}{\log }_{2}{p}_{x}$$
(1)
(This is also known as the binary entropy function67.) Here, px is the probability of finding the value x in the 32 × 32 array M, defined as:
$${p}_{x}=\frac{1}{3{2}^{2}}\mathop{\sum }\limits_{i=1}^{32}\mathop{\sum }\limits_{j=1}^{32}{\delta }_{{M}_{ij},x}$$
(2)
where Mij is an element of M in the ith row and jth column, and \({\delta }_{{M}_{ij},x}\) is the Kronecker delta function, which is equal to 1 if Mij = x and 0 otherwise.
Thus, for a “50–50 coin toss” scenario (\({p}_{0}={p}_{1}=\frac{1}{2}\)), H(M) is at its maximum and is equal to 1 Sh. Moreover, since binary data necessitates the condition p0 + p1 = 1, only one probability value is needed to fully describe the entropy of a given array M. We therefore denote the ordering of a given IC by a single order parameter, η = p1. When considering the order parameter of a set of ICs, it is important to note that, because IC generation is a stochastic process, the exact η of any given IC in the set cannot be predicted with certainty. For this reason, we characterize IC sets with the symbol 〈η〉, denoting the expected order parameter.
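Equations (1) and (2) can be evaluated directly on a candidate IC; a minimal sketch (function names are ours, not from the authors' code):

```python
import numpy as np

def order_parameter(M: np.ndarray) -> float:
    """eta = p_1, the fraction of 1-valued cells in the binary array M (Eq. 2)."""
    return float(M.mean())

def shannon_entropy(M: np.ndarray) -> float:
    """Binary Shannon entropy H(M) in shannons (Eq. 1)."""
    p1 = order_parameter(M)
    p0 = 1.0 - p1
    return float(-sum(p * np.log2(p) for p in (p0, p1) if p > 0.0))

# A near-50-50 "coin toss" array gives H(M) close to the 1 Sh maximum.
coin_toss = (np.random.default_rng(0).random((32, 32)) < 0.5).astype(np.uint8)
print(order_parameter(coin_toss), shannon_entropy(coin_toss))
```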
To generate high-entropy ICs, a binary array was constructed by evaluating random.random() < 0.5 for each element (using the “random” module in Python—see https://python.readthedocs.io/en/latest/library/random.html). If the expression evaluated to True, the element was set to 1; otherwise, 0. This method resulted in a training set with a binomially distributed, experimentally measured η (Fig. 5A).
Broad-entropy IC set generation
To create a broad-entropy IC set, first, a vector was created representing a set of order parameters ranging from 0 to 1. The length of this vector was set to the desired number of samples in the dataset (10,000 for training, 1000 for validation). This set of order parameters may be thought of as containing different expected probabilities for finding a 1 in an IC.
Then, the same procedure as for the high-entropy IC set was followed, with two exceptions: (1) instead of random.random() < 0.5, the value of each element in each IC array was determined by the inequality random.random() < η, and (2) each IC was generated using a unique η from the aforementioned vector (see “Training set generation”). This strategy ensured that the IC set represented a broad range of ordering, from all 0s, to 50–50 0s and 1s, to all 1s (Fig. 5B).
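Both IC generators can be sketched in a few lines (a minimal illustration using the “random” module as described; function and variable names are ours):

```python
import random

def generate_ic(eta: float, grid: int = 32) -> list[list[int]]:
    """Generate one IC: each cell is 1 with probability eta, else 0."""
    return [[1 if random.random() < eta else 0 for _ in range(grid)]
            for _ in range(grid)]

# High-entropy set: every IC uses eta = 0.5 (a fair coin toss per cell).
high_entropy_set = [generate_ic(0.5) for _ in range(10_000)]

# Broad-entropy set: each IC draws its own eta from a vector spanning 0 to 1.
n_samples = 10_000
etas = [i / (n_samples - 1) for i in range(n_samples)]
broad_entropy_set = [generate_ic(eta) for eta in etas]
```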
Next-game-state generation
NGSs were calculated from IC arrays by applying Life rules assuming a toroidal grid (see the update_grid() function in game.py).
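The update rule on a toroidal grid can be sketched with periodic neighbor sums via np.roll; this is a minimal stand-in, and the authors' update_grid() in game.py may be implemented differently:

```python
import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    """One Life update on a toroidal (periodically wrapped) grid."""
    # Sum of the eight neighbors; np.roll provides the wrap-around at the edges.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Birth on exactly 3 live neighbors; survival on 2 or 3 live neighbors.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(np.uint8)
```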
Reshaping data
To make the handling of training set data easier, the final stage of the training set generator involves reshaping the data into a list of sub-lists, in which each sub-list corresponds to a specific IC. Within each sub-list, two strings are stored: one corresponding to a flattened IC and one corresponding to a flattened NGS (see the generate_sets() function in game.py).
Validation set generation
Validation sets were generated using the same methods as in “Training set generation”; the random.random() function provides sufficiently random IC generation that the training and validation sets remained entirely independent. Combined with the extremely large space of possible 32 × 32 binary arrays (\(2^{32\times 32}\approx 1.80\times 10^{308}\) unique possibilities), this made the likelihood of even a single sample being identical between a 10,000-sample training set and a 1000-sample validation set negligible (see “Learning abilities”). This, in turn, ensured that training loss and validation loss remained independent of one another over the course of model training.
Testing set generation
A 10-sample testing set was constructed to validate the performance of models during and after training, in a manner other than by inspecting the validation and training losses. Five samples in the testing set were generated stochastically in the same manner as in “Training set generation,” and 5 samples were manually defined to match known periodic and complex patterns found in Life (Fig. 3). NGSs were recursively generated for a total of 10 states (including the IC) per sample, for all 10 samples in the testing set.
Dataset generation for differently sized grids
For the LifeGPT-MultiGrid datasets (training, validation, and testing; see “Learning life on differently sized grids”), the only differences in the procedure were to specify different grid sizes (WG ∈ {2, 4, 8, 16}) during IC generation and to introduce a padding character (“p”), which was appended as many times as needed to the end of each sub-list whose grid size was smaller than the largest specified grid size, such that all sub-lists had the same length.
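The padding step can be sketched as follows (a minimal illustration; the padding character “p” comes from the text, while the helper name and the choice to pad each flattened state string individually are our assumptions):

```python
def pad_state(state: str, max_grid: int = 16, pad_char: str = "p") -> str:
    """Right-pad a flattened game-state string from a smaller grid so that it
    matches the length of the largest grid (max_grid * max_grid cells)."""
    target_len = max_grid * max_grid
    return state + pad_char * (target_len - len(state))

# A flattened 2x2 state (4 cells) padded to the 16x16 length (256 characters):
padded = pad_state("0110")
assert len(padded) == 256 and padded.endswith("p")
```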
Forgetful causal masking (FCM) implementation
FCM was implemented using the “x-transformers” library65. FCM is built into this library as part of the AutoregressiveWrapper class; it was enabled by setting mask_prob to 0.15, a value empirically shown to be effective by Liu et al.68.
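In the x-transformers API, this amounts to passing mask_prob when wrapping the decoder. The sketch below illustrates the pattern; the dimension, depth, head count, and sequence length are placeholders rather than the values from Table 1:

```python
from x_transformers import TransformerWrapper, Decoder, AutoregressiveWrapper

model = TransformerWrapper(
    num_tokens=256,                                   # byte-level vocabulary
    max_seq_len=2100,                                 # placeholder: long enough for the IC + NGS prompt
    attn_layers=Decoder(dim=512, depth=6, heads=8),   # placeholder architecture
)

# mask_prob = 0.15 enables forgetful causal masking: during training, 15% of
# past tokens are randomly masked in addition to the standard causal mask.
model = AutoregressiveWrapper(model, mask_prob=0.15)
```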
FCM involves randomly masking a predetermined percentage of past tokens during the learning process, in addition to standard causal attention masking. The authors68 argue that this method prevents over-attending to more recent tokens in a given sequence, encouraging attention to tokens in the “distant past.” Incorporating FCM into our model increased the rate at which model accuracy improved with each epoch. Furthermore, FCM enabled our model to achieve 100% accuracy on our testing set with a sampling temperature of 1.0 in fewer than 50 epochs, which was previously unattainable when training with a broad-entropy dataset.
Implementing FCM increased the GPU RAM requirements of our LifeGPT, necessitating a decrease in batch size from 20 to 5 samples.
Model development
Training was initially conducted with high-entropy data. Due to the (pseudo)random nature of our training set generation script (see “Training set generation”) and the high number of samples in the training set (10,000), there was some diversity of training data entropy despite the use of a static order parameter (η = 0.5) (Fig. 5A). Nevertheless, observed model accuracy issues for low-entropy ICs prompted the use of broad-entropy datasets (Fig. 5B), which resulted in improved performance. Later, LifeGPT-MultiGrid (see “Learning life on differently sized grids”) was developed using a modified dataset to show that the LifeGPT framework allowed for simultaneous learning of multiple grid sizes.
Accuracy benchmarking
The testing dataset consisted of 10 flattened 32 × 32 binary arrays, representing initial states in Life, and their resulting iterations (numbered one through ten) in accordance with Life state-transition rules on a toroidal (periodic) grid. Depending on the type of model being trained (the number of desired time-step jump predictions), different columns in the testing dataset were selected as the ground truth. Accuracy at each checkpoint (every 2 epochs, starting with epoch 2) was determined by inputting the task statement and tokenized IC for each testing sample, generating the predicted NGS, and comparing it cell-by-cell against the ground truth:
$$A=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{\delta }_{{y}_{i}{\hat{y}}_{i}}$$
(3)
where A is the accuracy of the model, N is the total number of cell predictions across the testing dataset (N = 32 × 32 × 10 = 10,240 cells for a dataset with ten pairs of 32 × 32 grid game examples), yi is the ground truth value, \({\hat{y}}_{i}\) is the predicted value, and δ is the Kronecker delta function, which equals 1 if \({y}_{i}={\hat{y}}_{i}\) and 0 otherwise. An accuracy score was computed once every 2 epochs, starting with epoch 2, for each model sampling temperature in the set {0, 0.25, 0.5, 0.75, 1}.
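Equation (3) reduces to a cell-wise comparison over the whole testing set; a minimal sketch (function and argument names are ours):

```python
import numpy as np

def accuracy(ground_truth: np.ndarray, predictions: np.ndarray) -> float:
    """Eq. (3): fraction of correctly predicted cells across the testing set.

    Both arrays hold flattened binary game-states, e.g. shape (10, 1024) for
    ten 32 x 32 testing samples, giving N = 10,240 cell predictions.
    """
    assert ground_truth.shape == predictions.shape
    return float((ground_truth == predictions).mean())
```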
Training set entropy effects experimental procedure
The goal of this experiment was to determine what effect, if any, the ordering of the ICs making up the training data for LifeGPT would have on accuracy (A) when the model was fed ICs generated with varying expected order parameters (〈η〉IC). We used two versions of LifeGPT: one trained on high-entropy training data, and the other on broad-entropy training data. Next, a broad-entropy testing set (comprising 110 samples, with 〈η〉IC values ranging linearly from 0 to 1) was generated in the same manner as the broad-entropy training set. The stochasticity of the IC generation process ensured both broad-entropy sets remained independent. Both models were then benchmarked on each sample in a manner similar to the method in “Accuracy benchmarking and sampling temperature effects,” the only difference being that A was calculated for each sample in the testing set, as opposed to being averaged over all samples. Finally, A versus 〈η〉IC was plotted for both models (see Fig. 4).
Autoregressive loop implementation
The autoregressive loop is simply an implementation in which LifeGPT is placed inside a loop: the portion of its output corresponding to the NGS is converted into an input tensor and fed back into LifeGPT for a desired number of iterations. As such, the NGS output of the previous loop iteration serves as the IC of the next loop iteration. In this way, the autoregressive loop is able to “run” Life in a recursive manner similar to the original algorithm. We ran the autoregressive loop using two versions of LifeGPT trained on the broad-entropy training set: one that stopped training at epoch 16 (chosen because this version was the earliest instance of A = 1.0 for sampling temperature = 1), and one that continued training until epoch 50, across sampling temperatures 0, 0.25, 0.5, 0.75, and 1. We compared the NGSs output by our autoregressive loop method with the ground truth NGSs generated with the Life algorithm, and created animations for all model-sampling-temperature combinations, showing the progression of the ground truth Life system, the autoregressive-loop-generated NGSs, and the discrepancy between the two.
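A minimal sketch of the loop is given below. It assumes the trained AutoregressiveWrapper model from the earlier sketch, the hypothetical prompt layout sketched above, and byte-level tokenization; the authors' actual implementation may differ in these details:

```python
import torch

def run_autoregressive_loop(model, ic_string: str, n_iterations: int,
                            temperature: float, device: str = "cpu") -> list[str]:
    """Recursively 'run' Life with LifeGPT: each predicted NGS becomes the next IC."""
    states = [ic_string]
    for _ in range(n_iterations):
        prompt = f"@PredictNextState<{states[-1]}> ["                        # hypothetical prompt prefix
        tokens = torch.tensor([list(prompt.encode("utf-8"))], device=device)
        out = model.generate(tokens, seq_len=1026, temperature=temperature)  # sample the NGS tokens
        text = bytes(out[0].tolist()).decode("utf-8", errors="ignore")
        ngs = "".join(ch for ch in text if ch in "01")[:1024]                # keep the 1024 predicted cells
        states.append(ngs)                                                   # feed back as the next IC
    return states
```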
We also ran the autoregressive loop (and the Life algorithm) for 249 iterations (resulting in 250 game states, including the ICs), using only the epoch 50, sampling temperature = 0 version of LifeGPT due to time and compute constraints, for all 10 samples in the testing set. For each game state, we compared LifeGPT’s predictions to the ground-truth Life algorithm’s output using the metric “Error Rate,” defined as:
$${\rm{Error}}\,{\rm{Rate}}=1-\frac{1}{G}\mathop{\sum }\limits_{i=1}^{G}{\delta }_{{y}_{i}{\hat{y}}_{i}}$$
(4)
where Error Rate is the fraction of cells the model predicted incorrectly, G is the total number of cells comprising each game state (G = 32 × 32 = 1024 cells), yi is the ground truth value, \({\hat{y}}_{i}\) is the predicted value, and δ is the Kronecker delta function.
LifeGPT-multigrid experimental procedure
Accuracy characterization was performed in the same manner as described in “Accuracy benchmarking and sampling temperature effects,” aside from the use of a different testing dataset. A testing set of 100 samples (25 samples per WG, for WG ∈ {2, 4, 8, 16}) was created utilizing broad-entropy IC generation. Inference was performed for each sample, and average accuracies were calculated for each 25-sample group in accordance with equation (3).
Use of generative AI
Some Python scripts used for data generation, model training, data processing, and figure generation were written with the assistance of GPT-3.5, GPT-4, and GPT-4o from OpenAI. All scripts generated/edited in this manner were carefully reviewed, validated, and manually corrected, in the case of errors, by an author prior to implementation in our work.
Artificial intelligence helps break barriers for Hispanic homeownership

For many Hispanics, the road to homeownership is filled with obstacles, including loan officers who don’t speak Spanish or aren’t familiar with buyers who may not fit the boxes of a traditional mortgage applicant.
Some mortgage experts are turning to artificial intelligence to bridge the gap. They want AI to help loan officers find the best lender for a potential homeowner’s specific situation, while explaining the process clearly and navigating residency, visa or income requirements.
This new use of a bilingual AI has the potential to better serve homebuyers in Hispanic and other underrepresented communities. And it’s launching as federal housing agencies have begun to switch to English-only services, part of President Donald Trump’s push to make it the official language of the United States. His executive order in August called the change a way to “reinforce shared national values, and create a more cohesive and efficient society.”
The number of limited-English households tripled over the past four decades, according to the Urban Institute, a nonprofit research organization based in Washington, D.C. The institute says these households struggle to navigate the mortgage process, making it difficult for them to own a home, which is a key factor in building generational wealth.
The nonprofit Hispanic Organization of Mortgage Experts launched an AI platform built on ChatGPT last week, which lets loan officers and mortgage professionals quickly search the requirements of more than 150 lenders, instead of having to contact them individually.
The system, called Wholesale Search, uses an internal database that gives customized options for each buyer. HOME also offers a training program for loan officers called Home Certified with self-paced classes on topics like income and credit analysis, compliance rules and intercultural communication.
Cubie Hernandez, the organization’s chief technology and learning officer, said the goal is to help families have confidence during the mortgage process while pushing the industry to modernize. “Education is the gateway to opportunity,” he said.
HOME founder Rogelio Goertzen said the platform is designed to handle complicated cases like borrowers without a Social Security number, having little to no credit history, or being in the U.S. on a visa.
Loan officer Danny Velazquez of GFL Capital said the platform has changed his work. Before, he had to contact 70 lenders one by one, wait for answers and sometimes learn later that they wouldn’t accept the buyer’s situation.
The AI tool lets him see requirements in one place, narrow the list and streamline the application. “I am just able to make the process faster and get them the house,” Velazquez said.
One of Velazquez’s recent clients was Heriberto Blanco-Joya, 38, who bought his first home this year in Las Vegas. Spanish is Blanco-Joya’s first language, so he and his wife expected the process to be confusing.
Velazquez told him exactly what paperwork he needed, explained whether his credit score was enough to buy a home, and answered questions quickly.
“He provided me all the information I needed to buy,” Blanco-Joya said. “The process was pleasant and simple.”
From their first meeting to closing day took about six weeks.
Mortgage experts and the platform’s creators acknowledge that artificial intelligence creates new risks. Families rely on accurate answers about loans, immigration status and credit requirements. If AI gives wrong information, the consequences could be serious.
Goertzen, the CEO of HOME, said his organization works to reduce errors by having the AI pull information directly from lenders and loan officers. The platform’s database is updated whenever new loan products appear, and users can flag any problems to the developers.
“When there are things that are incorrect, we are constantly correcting it,” Goertzen said. “AI is a great tool, but it doesn’t replace that human element of professionalism, and that is why we are constantly tweaking and making sure it is correct.”
Jay Rodriguez, a mortgage broker at Arbor Financial Group, said figuring out the nuances of different investors’ requirements can mean the difference between turning a family away and getting them approved.
Rodriguez said HOME’s AI platform is especially helpful for training new loan officers and for coaching teams on how to better serve their communities.
Better Home & Finance Holding Company, an AI-powered mortgage lender, has created an AI platform called Tinman. It helps loan officers find lenders for borrowers who have non-traditional income or documents, which is common among small business owners.
They also built a voice-based assistant called Betsy that manages more than 127,000 borrower interactions each month. A Spanish-language version is in development.
“Financial literacy can be challenging for Hispanic borrowers or borrowers in other underserved populations,” Pierce said. “Tools like Betsy can interact and engage with customers in a way that feels supportive and not judgmental.”