Ethics & Policy

Mamba Explained

Published

1 year ago

March 28, 2024

Kola Ayonrinde

The State Space Model taking on Transformers

Right now, AI is eating the world.

And by AI, I mean Transformers. Practically all the big breakthroughs in AI over the last few years are due to Transformers.

Mamba, however, is one of an alternative class of models called State Space Models (SSMs). Importantly, for the first time, Mamba promises similar performance (and crucially similar scaling laws) as the Transformer whilst being feasible at long sequence lengths (say 1 million tokens). To achieve this long context, the Mamba authors remove the “quadratic bottleneck” in the Attention Mechanism. Mamba also runs fast – like “up to 5x faster than Transformer fast”¹.

Scaling Laws for Mamba vs other Language Models — Mamba performs similarly (or slightly better than) other Language Models on The Pile (source)

Gu and Dao, the Mamba authors write:

Mamba enjoys fast inference and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modelling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.

Here we’ll discuss:

The advantages (and disadvantages) of Mamba (🐍) vs Transformers (🤖),
Analogies and intuitions for thinking about Mamba, and
What Mamba means for Interpretability, AI Safety and Applications.

Problems with Transformers – Maybe Attention Isn’t All You Need

We’re very much in the Transformer-era of history. ML used to be about detecting cats and dogs. Now, with Transformers, we’re generating human-like poetry, coding better than the median competitive programmer, and solving the protein folding problem.

But Transformers have one core problem. In a transformer, every token can look back at every previous token when making predictions. For this lookback, we cache detailed information about each token in the so-called KV cache.

attention — When using the Attention Mechanism, information from all previous tokens can be passed to the current token

This pairwise communication means a forward pass is O(n²) time complexity in training (the dreaded quadratic bottleneck), and each new token generated autoregressively takes O(n) time. In other words, as the context size increases, the model gets slower.

To add insult to injury, storing this key-value (KV) cache requires O(n) space. Consequently, the dreaded CUDA out-of-memory (OOM) error becomes a significant threat as the memory footprint expands. If space were the only concern, we might consider adding more GPUs; however, with latency increasing quadratically, simply adding more compute might not be a viable solution.

On the margin, we can mitigate the quadratic bottleneck with techniques like Sliding Window Attention or clever CUDA optimisations like FlashAttention. But ultimately, for super long context windows (like a chatbot which remembers every conversation you’ve shared), we need a different approach.

Foundation Model Backbones

Fundamentally, all good ML architecture backbones have components for two important operations:

Communication between tokens
Computation within a token

In transformers, this is Attention (communication) and MLPs (computation). We improve transformers by optimising these two operations².

We would like to substitute the Attention component³ with an alternative mechanism for facilitating inter-token communication. Specifically, Mamba employs a Control Theory-inspired State Space Model, or SSM, for Communication purposes while retaining Multilayer Perceptron (MLP)-style projections for Computation.

Like a Transformer made up of stacked transformer blocks, Mamba is made up of stacked Mamba blocks as above.

We would like to understand and motivate the choice of the SSM for sequence transformations.

Motivating Mamba – A Throwback to Temple Run

Imagine we’re building a Temple Run agent⁴. It chooses if the runner should move left or right at any time.

To successfully pick the correct direction, we need information about our surroundings. Let’s call the collection of relevant information the state. Here the state likely includes your current position and velocity, the position of the nearest obstacle, weather conditions, etc.

Claim 1: if you know the current state of the world and how the world is evolving, then you can use this to determine the direction to move.

Note that you don’t need to look at the whole screen all the time. You can figure out what will happen to most of the screen by noting that as you run, the obstacles move down the screen. You only need to look at the top of the screen to understand the new information and then simulate the rest.

This lends itself to a natural formulation. Let h be the hidden state, relevant knowledge about the world. Also let x be the input, the observation that you get each time. h’ then represents the derivative of the hidden state, i.e. how the state is evolving. We’re trying to predict y, the optimal next move (right or left).

Now, Claim 1 states that from the hidden state h, h’, and the new observation x, you can figure out y.

More concretely, h, the state, can be represented as a differential equation (Eq 1a):

$h’(t) = \mathbf{A}h(t) + \mathbf{B}x(t)$

Knowing h allows you to determine your next move y (Eq 1b):

$y(t) = \mathbf{C}h(t) + \mathbf{D}x(t)$

The system’s evolution is determined by its current state and newly acquired observations. A small new observation is enough, as the majority of the state can be inferred by applying known state dynamics to its previous state. That is, most of the screen isn’t new, it’s just a continuation of the previous state’s natural downward trajectory. A full understanding of the state would enable optimal selection of the subsequent action, denoted as y.

You can learn a lot about the system dynamics by observing the top of the screen. For instance, increased velocity of this upper section suggests an acceleration of the rest of the screen as well, so we can infer that the game is speeding up⁵. In this way, even if we start off knowing nothing about the game and only have limited observations, it becomes possible to gain a holistic understanding of the screen dynamics fairly rapidly.

What’s the State?

Here, state refers to the variables that, when combined with the input variables, fully determine the future system behaviour. In theory, once we have the state, there’s nothing else we need to know about the past to predict the future. With this choice of state, the system is converted to a Markov Decision Process. Ideally, the state is a fairly small amount of information which captures the essential properties of the system. That is, the state is a compression of the past⁶.

Discretisation – How To Deal With Living in a Quantised World

Okay, great! So, given some state and input observation, we have an autoregressive-style system to determine the next action. Amazing!

In practice though, there’s a little snag here. We’re modelling time as continuous. But in real life, we get new inputs and take new actions at discrete time steps⁷.

We would like to convert this continuous-time differential equation into a discrete-time difference equation. This conversion process is known as discretisation. Discretisation is a well-studied problem in the literature. Mamba uses the Zero-Order Hold (ZOH) discretisation⁸. To give an idea of what’s happening morally, consider a naive first-order approximation⁹.

From Equation 1a, we have

$h’(t) = \mathbf{A}h(t) + \mathbf{B}x(t)$

And for small ∆,

$h’(t) \approx \frac{h(t+\Delta) – h(t)}{\Delta}$

by the definition of the derivative.

We let:

$h_t = h(t)$

and

$h_{t+1} = h(t + \Delta)$

and substitute into Equation 1a giving:

$h_{t+1} – h_t \approx \Delta (\mathbf{A}h_t + \mathbf{B}x_t)$
$\Rightarrow h_{t+1} \approx (I + \Delta \mathbf{A})h_t + (\Delta
\mathbf{B})x_t$

Hence, after renaming the coefficients and relabelling indices, we have the discrete representations:

Equation 2 — The Discretised Version of the SSM Equation

If you’ve ever looked at an RNN before¹⁰ and this feels familiar – trust your instincts:

We have some input x, which is combined with the previous hidden state by some transform to give the new hidden state. Then we use the hidden state to calculate the output at each time step.

Understanding the SSM Matrices

Now, we can interpret the A, B, C, D matrices more intuitively:

A is the transition state matrix. It shows how you transition the current state into the next state. It asks “How should I forget the less relevant parts of the state over time?”
B is mapping the new input into the state, asking “What part of my new input should I remember?”¹¹
C is mapping the state to the output of the SSM. It asks, “How can I use the state to make a good next prediction?”¹²
D is how the new input passes through to the output. It’s a kind of modified skip connection that asks “How can I use the new input in my prediction?”

Visual SSM Equations — Visual Representation of The SSM Equations

Additionally, ∆ has a nice interpretation – it’s the step size, or what we might call the linger time or the dwell time. For large ∆, you focus more on that token; for small ∆, you skip past the token immediately and don’t include it much in the next state.

And that’s it! That’s the SSM, our ~drop-in replacement for Attention (Communication) in the Mamba block. The Computation in the Mamba architecture comes from regular linear projections, non-linearities, and local convolutions.

Okay great, that’s the theory – but does this work? Well…

Effectiveness vs Efficiency: Attention is Focus, Selectivity is Prioritisation

At WWDC ‘97, Steve Jobs famously noted that “focusing is about saying no”. Focus is ruthless prioritisation. It’s common to think about Attention positively as choosing what to notice. In the Steve Jobs sense, we might instead frame Attention negatively as choosing what to discard.

There’s a classic intuition pump in Machine Learning known as the Cocktail Party Problem¹³. Imagine a party with dozens of simultaneous loud conversations:

Question:

How do we recognise what one person is saying when others are talking at the same time?¹⁴

Answer:

The brain solves this problem by focusing your “attention” on a particular stimulus and hence drowning out all other sounds as much as possible.

Transformers use Dot-Product Attention to focus on the most relevant tokens. A big reason Attention is so great is that you have the potential to look back at everything that ever happened in its context. This is like photographic memory when done right.¹⁵

Transformers (🤖) are extremely effective. But they aren’t very efficient. They store everything from the past so that they can look back at tokens with theoretically perfect recall.

Traditional RNNs (🔁) are the opposite – they forget a lot, only recalling a small amount in their hidden state and discarding the rest. They are very efficient – their state is small. Yet they are less effective as discarded information cannot be recovered.

We’d like something closer to the Pareto frontier of the effectiveness/efficiency tradeoff. Something that’s more effective than traditional RNNs and more efficient than transformers.

The Mamba Architecture seems to offer a solution which pushes out the Pareto frontier of effectiveness/efficiency.

SSMs are as efficient as RNNs, but we might wonder how effective they are. After all, it seems like they would have a hard time discarding only unnecessary information and keeping everything relevant. If each token is being processed the same way, applying the same A and B matrices as if in a factory assembly line for tokens, there is no context-dependence. We would like the forgetting and remembering matrices (A and B respectively) to vary and dynamically adapt to inputs.

The Selection Mechanism

Selectivity allows each token to be transformed into the state in a way that is unique to its own needs. Selectivity is what takes us from vanilla SSM models (applying the same A (forgetting) and B (remembering) matrices to every input) to Mamba, the Selective State Space Model.

In regular SSMs, A, B, C and D are learned matrices – that is

$\mathbf{A} = \mathbf{A}_{\theta}$ etc. (where θ represents the learned parameters)

With the Selection Mechanism in Mamba, A, B, C and D are also functions of x. That is $\mathbf{A} = \mathbf{A}_{\theta(x)}$ etc; the matrices are context dependent rather than static.

SSM Algorithm — Mamba (right) differs from traditional SSMs by allowing A,B,C matrices to be **selective** i.e. context dependent (source)

Making A and B functions of x allows us to get the best of both worlds:

We’re selective about what we include in the state, which improves effectiveness vs traditional SSMs.
Yet, since the state size is bounded, we improve on efficiency relative to the Transformer. We have O(1), not O(n) space and O(n) not O(n²) time requirements.

The Mamba paper authors write:

The efficiency vs. effectiveness tradeoff of sequence models is characterized by how well they compress their state: efficient models must have a small state, while effective models must have a state that contains all necessary information from the context. In turn, we propose that a fundamental principle for building sequence models is selectivity: or the context-aware ability to focus on or filter out inputs into a sequential state. In particular, a selection mechanism controls how information propagates or interacts along the sequence dimension.

Humans (mostly) don’t have photographic memory for everything they experience within a lifetime – or even within a day! There’s just way too much information to retain it all. Subconsciously, we select what to remember by choosing to forget, throwing away most information as we encounter it. Transformers (🤖) decide what to focus on at recall time. Humans (🧑) also decide what to throw away at memory-making time. Humans filter out information early and often.

If we had infinite capacity for memorisation, it’s clear the transformer approach is better than the human approach – it truly is more effective. But it’s less efficient – transformers have to store so much information about the past that might not be relevant. Transformers (🤖) only decide what’s relevant at recall time. The innovation of Mamba (🐍) is allowing the model better ways of forgetting earlier – it’s focusing by choosing what to discard using Selectivity, throwing away less relevant information at memory-making time¹⁶.

The Problems of Selectivity

Applying the Selection Mechanism does have its gotchas though. Non-selective SSMs (i.e. A,B not dependent on x) are fast to compute in training. This is because the component of

Yt which depends on xi can be expressed as a linear map, i.e. a single matrix that can be precomputed!

For example (ignoring the D component, the skip connection):

$$y_2 = \mathbf{C}\mathbf{B}x_2 + \mathbf{C}\mathbf{A}\mathbf{B}x_1 +
\mathbf{C}\mathbf{A}\mathbf{A}\mathbf{B}x_0$$

If we’re paying attention, we might spot something even better here – this expression can be written as a convolution. Hence we can apply the Fast Fourier Transform and the Convolution Theorem to compute this very efficiently on hardware as in Equation 3 below.

We can calculate Equation 2, the SSM equations, efficiently in the Convolutional Form, Equation 3.

Unfortunately, with the Selection Mechanism, we lose the convolutional form. Much attention is given to making Mamba efficient on modern GPU hardware using similar hardware optimisation tricks to Tri Dao’s Flash Attention¹⁷. With the hardware optimisations, Mamba is able to run faster than comparably sized Transformers.

Machine Learning for Political Economists – How Large Should The State Be?

The Mamba authors write, “the efficiency vs. effectiveness tradeoff of sequence models is characterised by how well they compress their state”. In other words, like in political economy¹⁸, the fundamental problem is how to manage the state.

🔁 Traditional RNNs are anarchic

They have a small, minimal state. The size of the state is bounded. The compression of state is poor.

🤖 Transformers are communist

They have a maximally large state. The “state” is just a cache of the entire history with no compression. Every context token is treated equally until recall time.

🐍Mamba has a compressed state

…but it’s selective about what goes in. Mamba says we can get away with a small state if the state is well focused and effective¹⁹.

The upshot is that state representation is critical. A smaller state is more efficient; a larger state is more effective. The key is to selectively and dynamically compress data into the state. Mamba’s Selection Mechanism allows for context-dependent reasoning, focusing and ignoring. For both performance and interpretability, understanding the state seems to be very useful.

Information Flow in Transformer vs Mamba

How do Transformers know anything? At initialization, a transformer isn’t very smart. It learns in two ways:

Training data (Pretraining, SFT, RLHF etc)
In context-data

Training Data

Models learn from their training data. This is a kind of lossy compression of input data into the weights. We can think of the effect of pretraining data on the transformer kinda like the effect of your ancestor’s experiences on your genetics – you can’t recall their experiences, you just have vague instincts about them²⁰.

In Context-Data

Transformers use their context as short-term memory, which they can recall with ~perfect fidelity. So we get In-Context Learning, e.g. using induction heads to solve the Indirect Object Identification task, or computing Linear Regression.

Retrieval

Note that Transformers don’t filter their context at all until recall time. So if we have a bunch of information we think might be useful to the Transformer, we filter it outside the Transformer (using Information Retrieval strategies) and then stuff the results into the prompt. This process is known as Retrieval Augmented Generation (RAG). RAG determines relevant information for the context window of a transformer. A human with the internet is kinda like a RAG system – you still have to know what to search but whatever you retrieve is as salient as short-term memory to you.

Information Flow for Mamba

Training Data acts similarly for Mamba. However, the lines are slightly blurred for in-context data and retrieval. In-context data for Mamba is compressed/filtered similar to retrieval data for transformers. This in-context data is also accessible for look-up like for transformers (although with somewhat lower fidelity).

Transformer context is to Mamba states what short-term is to long-term memory. Mamba doesn’t just have “RAM”, it has a hard drive²¹ ²².

Swapping States as a New Prompting Paradigm

Currently, we often use RAG to give a transformer contextual information.

With Mamba-like models, you could instead imagine having a library of states created by running the model over specialised data. States could be shared kinda like LoRAs for image models.

For example, I could do inference on 20 physics textbooks and, say, 100 physics questions and answers. Then I have a state which I can give to you. Now you don’t need to add any few-shot examples; you just simply ask your question. The in-context learning is in the state.

In other words, you can drag and drop downloaded states into your model, like literal plug-in cartridges. And note that “training” a state doesn’t require any backprop. It’s more like a highly specialised one-pass fixed-size compression algorithm. This is unlimited in-context learning applied at inference time for zero-compute or latency²³.

The structure of an effective LLM call goes from…

System Prompt
Preamble
Few shot-examples
Question

…for Transformers, to simply…

Inputted state (with problem context, initial instructions, textbooks, and few-shot examples)
Short question

…for Mamba.

This is cheaper and faster than few-shot prompting (as the state is infinitely reusable without inference cost). It’s also MUCH cheaper than finetuning and doesn’t require any gradient updates. We could imagine retrieving states in addition to context.

Mamba & Mechanistic Interpretability

Transformer interpretability typically involves:

understanding token relationships via attention,
understanding circuits, and
using Dictionary Learning for unfolding MLPs.

Most of the ablations that we would like to do for Mamba are still valid, but understanding token communication (1) is now more nuanced. All information moves between tokens via hidden states instead of the Attention Mechanism which can “teleport” information from one sequence position to another.

For understanding in-context learning (ICL) tasks with Mamba, we will look to intervene on the SSM state. A classic task in-context learning task is Indirect Object Identification in which a model has to finish a paragraph like:

Then, Shelby and Emma had a lot of fun at the school. [Shelby/Emma] gave an apple to [BLANK]

The model is expected to fill in the blank with the name that is not repeated in the paragraph. In the chart below we can see that information is passed from the [Shelby/Emma] position to the final position via the hidden state (see the two blue lines in the top chart).

Since it’s hypothesised that much of In-Context Learning in Transformers is downstream of more primitive sequence position operations (like Induction Heads), Mamba being able to complete this task suggests a more general In-Context Learning ability.

What’s Next for Mamba & SSMs?

Mamba-like models are likely to excel in scenarios requiring extremely long context and long-term memory. Examples include:

Processing DNA
Generating (or reasoning over) video
Writing novels

An illustrative example is agents with long-term goals.

Suppose you have an agent interacting with the world. Eventually, its experiences become too much for the context window of a transformer. The agent then has to compress or summarise its experiences into some more compact representation.

But how do you decide what information is the most useful as a summary? If the task is language, LLMs are actually fairly good at summaries – okay, yeah, you’ll lose some information, but the most important stuff can be retained.

However, for other disciplines, it might not be clear how to summarise. For example, what’s the best way to summarise a 2 hour movie?²⁴. Could the model itself learn to do this naturally rather than a hacky workaround like trying to describe the aesthetics of the movie in text?

This is what Mamba allows. Actual long-term memory. A real state where the model learns to keep what’s important. Prediction is compression – learning what’s useful to predict what’s coming next inevitably leads to building a useful compression of the previous tokens.

The implications for Assistants are clear:

Your chatbot co-evolves with you. It remembers.

The film HER is looking better and better as time goes on 😳

Agents & AI Safety

One reason for positive updates in existential risk from AGI is Language Models. Previously, Deep-RL agents trained via self-play looked set to be the first AGIs. Language models are inherently much safer since they aren’t trained with long-term goals²⁵.

The potential for long-term sequence reasoning here brings back the importance of agent-based AI safety. Few agent worries are relevant to Transformers with an 8k context window. Many are relevant to systems with impressive long-term memories and possible instrumental goals.

The Best Collab Since Taco Bell & KFC: 🤖 x 🐍

The Mamba authors show that there’s value in combining Mamba’s long context with the Transformer’s high fidelity over short sequences. For example, if you’re making long videos, you likely can’t fit a whole movie into a Transformer’s context for attention²⁶. You could imagine having Attention look at the most recent frames for short-term fluidity and an SSM for long-term narrative consistency²⁷.

This isn’t the end for Transformers. Their high effectiveness is exactly what’s needed for many tasks. But now Transformers aren’t the only option. Other architectures are genuinely feasible.

So we’re not in the post-Transformer era. But for the first time, we’re living in the post-only-Transformers era²⁸. And this blows the possibilities wide open for sequence modelling with extreme context lengths and native long-term memory.

Two ML researchers, Sasha Rush (HuggingFace, Annotated Transformer, Cornell Professor) and Jonathan Frankle (Lottery Ticket Hypothesis, MosaicML, Harvard Professor), currently have a bet here.

Currently Transformers are far and away in the lead. With 3 years left, there’s now a research direction with a fighting chance.

All that remains to ask is: Is Attention All We Need?

1. see Figure 8 in the Mamba paper.

2. And scaling up with massive compute.

3. More specifically the scaled dot-product Attention popularised by Transformers

4. For people who don’t see Temple Run as the cultural cornerstone it is 🤣 Temple Run was an iPhone game from 2011 similar to Subway Surfer

5. Here we assume the environment is sufficiently smooth.

6. One pretty important constraint for this to be efficient is that we don’t allow the individual elements of the state vector to interact with each other directly. We’ll use a combination of the state dimensions to determine the output but we don’t e.g. allow the velocity of the runner and the direction of the closest obstacle (or whatever else was in our state) to directly interact. This helps with efficient computation and we achieve this practically by constraining A to be a diagonal matrix.

7. Concretely consider the case of Language Models – each token is a discrete step

8. ZOH also has nice properties for the initialisations – we want A_bar to be close to the identity so that the state can be mostly maintained from timestep to timestep if desired. ZOH gives A_bar as an exponential so any diagonal element initialisations close to zero give values close to 1

9. This is known as the Euler discretisation in the literature

10. It’s wild to note that some readers might not have, we’re so far into the age of Attention that RNNs have been forgotten!

11. B is like the Query (Q) matrix for Transformers.

12. C is like the Output (O) matrix for Transformers.

13. Non-alcoholic options also available!

14. Especially as all voices roughly occupy the same space on the audio frequency spectrum Intuitively this seems really hard!

15. Note that photographic memory doesn’t necessarily imply perfect inferences from that memory!

16. To be clear, if you have a short sequence, then a transformer should theoretically be a better approach. If you can store the whole context, then why not!? If you have enough memory for a high-resolution image, why compress it into a JPEG? But Mamba-style architectures are likely to hugely outperform with long-range sequences.

17. More details are available for engineers interested in CUDA programming – Tri’s talk, Mamba paper section 3.3.2, and the official CUDA code are good resources for understanding the Hardware-Aware Scan

18. or in Object Oriented Programming

19. Implications to actual Political Economy are left to the reader but maybe Gu and Dao accidentally solved politics!?

20. This isn’t a perfect analogy as human evolution follows a genetic algorithm rather than SGD.

21. Albeit a pretty weird hard drive at that – it morphs over time rather than being a fixed representation.

22. As a backronym, I’ve started calling the hidden_state the state space dimension (or selective state dimension) which shortens to SSD, a nice reminder for what this object represents – the long-term memory of the system.

23. I’m thinking about this similarly to the relationship between harmlessness finetuning and activation steering. State swapping, like activation steering, is an inference time intervention giving comparable results to its train time analogue.

24. This is a very non-trivial problem! How do human brains represent a movie internally? It’s not a series of the most salient frames, nor is it a text summary of the colours, nor is it a purely vibes-based summary if you can memorise some lines of the film.

25. They’re also safer since they inherently understand (though don’t necessarily embody) human values. It’s not all clear that how to teach an RL agent human morality.

26. Note that typically an image (i.e. a single frame) counts as >196 tokens, and movies are typically 24 fps so you’ll fill a 32k context window in 7 seconds 🤯

27. Another possibility that I’m excited about is applying optimisation pressure to the state itself as well as the output to have models that respect particular use cases.

28. This is slightly hyperbolic, the TS-Mixer for time series, Gradient Boosting Trees for tabular data and Graph Neural Networks for weather prediction exist and are currently used, but these aren’t at the core of AI

Author Bio

Kola Ayonrinde is a Research Scientist and Machine Learning Engineer with a flair for writing. He integrates technology and creativity, focusing on applying machine learning in innovative ways and exploring the societal impacts of tech advancements.

Acknowledgements

This post was originally posted on Kola’s personal blog.

Thanks to Gonçalo for reading an early draft, Jaden for the nnsight library used for the Interpretability analysis and Tessa for Mamba patching visualisations.Also see: Mamba paper, Mamba Python code, Annotated S4, Nathan Labenz podcast

Citation

For attribution in academic contexts or books, please cite this work as

Kola Ayonrinde, "Mamba Explained," The Gradient, 2024

@article{Ayonrinde2024mamba,
    author = {Kola Ayonrinde},
    title = {Mamba Explained},
    journal = {The Gradient},
    year = {2024},
    howpublished = {\url{https://thegradient.pub/mamba-explained},
}

Source link

Ethics & Policy

7 Life-Changing Books Recommended by Catriona Wallace | Books

Published

18 hours ago

August 30, 2025

Girish Shukla

7 Life-Changing Books Recommended by Catriona Wallace (Picture Credit – Instagram)

Some books ignite something immediate. Others change you quietly, over time. For Dr Catriona Wallace—tech entrepreneur, AI ethics advocate, and one of Australia’s most influential business leaders, books are more than just ideas on paper. They are frameworks, provocations, and spiritual companions. Her reading list offers not just guidance for navigating leadership and technology, but for embracing identity, power, and inner purpose. These seven titles reflect a mind shaped by disruption, ethics, feminism, and wisdom. They are not trend-driven. They are transformational.

1. Lean In by Sheryl Sandberg

A landmark in feminist career literature, Lean In challenges women to pursue their ambitions while confronting the structural and cultural forces that hold them back. Sandberg uses her own journey at Facebook and Google to dissect gender inequality in leadership. The book is part memoir, part manifesto, and remains divisive for valid reasons. But Wallace cites it as essential for starting difficult conversations about workplace dynamics and ambition. It asks, simply: what would you do if you weren’t afraid?

2. Women and Power: A Manifesto by Mary Beard

In this sharp, incisive book, classicist Mary Beard examines the historical exclusion of women from power and public voice. From Medusa to misogynistic memes, Beard exposes how narratives built around silence and suppression persist today. The writing is fiery, brief, and packed with centuries of insight. Wallace recommends it for its ability to distil complex ideas into cultural clarity. It’s a reminder that power is not just a seat at the table; it is a script we are still rewriting.

3. The World of Numbers by Adam Spencer

A celebration of mathematics as storytelling, this book blends fun facts, puzzles, and history to reveal how numbers shape everything from music to human behaviour. Spencer, a comedian and maths lover, makes the subject inviting rather than intimidating. Wallace credits this book with sparking new curiosity about logic, data, and systems thinking. It’s not just for mathematicians. It’s for anyone ready to appreciate the beauty of patterns and the thinking habits that come with them.

4. Small Giants by Bo Burlingham

This book is a love letter to companies that chose to be great instead of big. Burlingham profiles fourteen businesses that opted for soul, purpose, and community over rapid growth. For Wallace, who has founded multiple mission-driven companies, this book affirms that success is not about scale. It is about integrity. Each story is a blueprint for building something meaningful, resilient, and values-aligned. It is a must-read for anyone tired of hustle culture and hungry for depth.

5. The Misogynist Factory by Alison Phipps

A searing academic work on the production of misogyny in modern institutions. Phipps connects the dots between sexual violence, neoliberalism, and resistance movements in a way that is as rigorous as it is radical. Wallace recommends this book for its clear-eyed confrontation of how systemic inequality persists beneath performative gestures. It equips readers with language to understand how power moves, morphs, and resists change. This is not light reading. It is a necessary reading for anyone seeking to challenge structural harm.

6. Tribes by Seth Godin

Godin’s central idea is simple but powerful: people don’t follow brands, they follow leaders who connect with them emotionally and intellectually. This book blends marketing, leadership, and human psychology to show how movements begin. Wallace highlights ‘Tribes’ as essential reading for purpose-driven founders and changemakers. It reminds readers that real influence is built on trust and shared values. Whether you’re leading a company or a cause, it’s a call to speak boldly and build your own tribe.

7. The Tibetan Book of Living and Dying by Sogyal Rinpoche

Equal parts spiritual guide and philosophical reflection, this book weaves Tibetan Buddhist teachings with Western perspectives on mortality, grief, and rebirth. Wallace turns to it not only for personal growth but also for grounding ethical decision-making in a deeper sense of purpose. It’s a book that speaks to those navigating endings—personal, spiritual, or professional and offers a path toward clarity and compassion. It does not offer answers. It offers presence, which is often far more powerful.

The books that shape us are often those that disrupt us first. Catriona Wallace’s list is not filled with comfort reads. It’s made of hard questions, structural truths, and radical shifts in thinking. From feminist manifestos to Buddhist reflections, from purpose-led business to systemic critique, this bookshelf is a mirror of her own leadership—decisive, curious, and grounded in values. If you’re building something bold or seeking language for change, there’s a good chance one of these books will meet you where you are and carry you further than you expected.

Source link

Ethics & Policy

Hyderabad: Dr. Pritam Singh Foundation hosts AI and ethics round table at Tech Mahindra

Published

22 hours ago

August 30, 2025

Telangana Today

The Dr. Pritam Singh Foundation and IILM University hosted a Round Table on “Human at Core: AI, Ethics, and the Future” in Hyderabad. Leaders and academics discussed leveraging AI for inclusive growth while maintaining ethics, inclusivity, and human-centric technology.

Published Date – 30 August 2025, 12:57 PM

Hyderabad: The Dr. Pritam Singh Foundation, in collaboration with IILM University, hosted a high-level Round Table Discussion on “Human at Core: AI, Ethics, and the Future” at Tech Mahindra, Cyberabad.

The event, held in memory of the late Dr. Pritam Singh, pioneering academic, visionary leader, and architect of transformative management education in India, brought together policymakers, business leaders, and academics to explore how India can harness artificial intelligence (AI) while safeguarding ethics, inclusivity, and human values.

In his keynote address, Padmanabhaiah Kantipudi, IAS (Retd.), Chairman of the Administrative Staff College of India (ASCI),

paid tribute to Dr. Pritam Singh, describing him as a nation-builder who bridged academia, business, and governance.
The Round Table theme, Leadership: AI, Ethics, and the Future, underscored India’s opportunity to leverage AI for inclusive growth across healthcare, agriculture, education, and fintech—while ensuring technology remains human-centric and trustworthy.

Source link

Ethics & Policy

AI ethics: Bridging the gap between public concern and global pursuit – Pennsylvania

Published

1 day ago

August 29, 2025

Black Chronicle News Service

(The Center Square) – Those who grew up in the 20th and 21st centuries have spent their lives in an environment saturated with cautionary tales about technology and human error, projections of ancient flood myths onto modern scenarios in which the hubris of our species brings our downfall.

They feature a point of no return, dubbed the “singularity” by Manhattan Project physicist John von Neumann, who suggested that technology would advance to a stage after which life as we know it would become unrecognizable.

Some say with the advent of artificial intelligence, that moment has come. And with it, a massive gap between public perception and the goals of both government and private industry. While states court data center development and tech investments, polling from Pew Research indicates Americans outside the industry have strong misgivings about AI.

In Pennsylvania, giants like Amazon and Microsoft have pledged to spend billions building the high-powered infrastructure required to enable the technology. Fostering this progress is a rare point of agreement between the state’s Democratic and Republican leadership, even bringing Gov. Josh Shapiro to the same event – if not the same stage – as President Donald Trump.

Pittsburgh is rebranding itself as the “global capital of physical AI,” leveraging its blue-collar manufacturing reputation and its prestigious academic research institutions to depict the perfect marriage of code and machine. Three Mile Island is rebranding itself as Crane Clean Energy Center, coming back online exclusively to power Microsoft AI services. Some legislators are eager to turn the lights back on fossil fuel-burning plants and even build new ones to generate the energy required to feed both AI and the everyday consumers already on the grid.

– Advertisement –

At the federal level, Trump has revoked guardrails established under the Biden administration with an executive order entitled “Removing Barriers to American Leadership in Artificial Intelligence.” In July, the White House released its “AI Action Plan.”

The document reads, “We need to build and maintain vast AI infrastructure and the energy to power it. To do that, we will continue to reject radical climate dogma and bureaucratic red tape, as the Administration has done since Inauguration Day. Simply put, we need to ‘Build, Baby, Build!’”

To borrow an analogy from Shapiro’s favorite sport, it’s a full-court press, and there’s hardly a day that goes by that messaging from the state doesn’t tout the thrilling promise of the new AI era. Next week, Shapiro will be returning to Pittsburgh along with a wide array of luminaries to attend the AI Horizons summit in Bakery Square, a hub for established and developing tech companies.

According to leaders like Trump and Shapiro, the stakes could not be higher. It isn’t just a race for technological prowess — it’s an existential fight against China for control of the future itself. AI sits at the heart of innovation in fields like biotechnology, which promise to eradicate disease, address climate collapse, and revolutionize agriculture. It also sits at the heart of defense, an industry that thrives in Pennsylvania.

Yet, one area of overlap in which both everyday citizens and AI experts agree is that they want to see more government control and regulation of the technology. Already seeing the impacts of political deepfakes, algorithmic bias, and rogue chatbots, AI has far outpaced legislation, often to disastrous effect.

In an interview with The Center Square, Penn researcher Dr. Michael Kearns said that he’s less worried about autonomous machines becoming all-powerful than the challenges already posed by AI.

– Advertisement –

Kearns spends his time creating mathematical models and writing about how to embed ethical human principles into machine code. He believes that in some areas like chatbots, progress may have reached a point where improvements appear incremental for the average user. He cites the most recent ChatGPT update as evidence.

“I think the harms that are already being demonstrated are much more worrisome,” said Kearns. “Demographic bias, chatbots hurling racist invectives because they were trained on racist material, privacy leaks.”

Kearns says that a major barrier to getting effective regulatory policy is incentivizing experts to leave behind engaging work in the field as researchers and lucrative roles in tech in order to work on policy. Without people who understand how the algorithms operate, it’s difficult to create “auditable” regulations, meaning there are clear tests to pass.

Kearns pointed to ISO 420001. This is an international standard that focuses on process rather than outcome to guide developers in creating ethical AI. He also noted that the market itself is a strong guide. When someone gets hurt or hurts someone else using AI, it’s bad for business, incentivizing companies to do their due diligence.

He also noted crossroads where two ethical issues intersect. For instance, companies are entrusted with their users’ personal data. If policing misuse of the product requires an invasion of privacy, like accessing information stored on the cloud, there’s only so much that can be done.

OpenAI recently announced that it is scanning user conversations for concerning statements and escalating them to human teams, who may contact authorities when deemed appropriate. For some, the idea of alerting the police to someone suffering from mental illness is a dangerous breech. Still, it demonstrates the calculated risks AI companies have to make when faced with reports of suicide, psychosis, and violence arising out of conversations with chatbots.

Kearns says that even with the imperative for self-regulation on AI companies, he expects there to be more stumbling blocks before real improvement is seen in the absence of regulation. He cites watchdogs like the investigative journalists at ProPublica who demonstrated machine bias against Black people in programs used to inform criminal sentencing in 2016.

Kearns noted that the “headline risk” is not the same as enforceable regulation and mainly applies to well-established companies. For the most part, a company with a household name has an investment in maintaining a positive reputation. For others just getting started or flying under the radar, however, public pressure can’t replace law.

One area of AI concern that has been widely explored in the media is the use of AI by those who make and enforce the law. Kearns said, for his part, he’s found “three-letter agencies” to be “among the most conservative of AI adopters just because of the stakes involved.

In Pennsylvania, AI is used by the state police force.

In an email to The Center Square, PSP Communications Director Myles Snyder wrote, “The Pennsylvania State Police, like many law enforcement agencies, utilizes various technologies to enhance public safety and support our mission. Some of these tools incorporate AI-driven capabilities. The Pennsylvania State Police carefully evaluates these tools to ensure they align with legal, ethical, and operational considerations.”

PSP was unwilling to discuss the specifics of those technologies.

AI is also used by the U.S. military and other militaries around the world, including those of Israel, Ukraine, and Russia, who are demonstrating a fundamental shift in the way war is conducted through technology.

In Gaza, the Lavender AI system was used to identify and target individuals connected with Hamas, allowing human agents to approve strikes with acceptable numbers of civilian casualties, according to Israeli intelligence officials who spoke to The Guardian on the matter. Analysis of AI use in Ukraine calls for a nuanced understanding of the way the technology is being used and ways in which it should be regulated by international bodies governing warfare in the future.

Then, there are the more ephemeral concerns. Along with the long-looming “jobpocalypse,” many fear that offloading our day-to-day lives into the hands of AI may deplete our sense of meaning. Students using AI may fail to learn. Workers using AI may feel purposeless. Relationships with or grounded in AI may lead to disconnection.

Kearns acknowledged that there would be disruption in the classroom and workplace to navigate but it would also provide opportunities for people who previously may not have been able to gain entrance into challenging fields.

As for outsourcing joy, he asked “If somebody comes along with a robot that can play better tennis than you and you love playing tennis, are you going to stop playing tennis?”

Source link