Journalist Karen Hao on Sam Altman, OpenAI & the “Quasi-Religious” Push for Artificial Intelligence

This is a rush transcript. Copy may not be in its final form.

AMY GOODMAN: This is Democracy Now!, democracynow.org, The War and Peace Report. I’m Amy Goodman.

In this holiday special, we continue with the journalist Karen Hao, author of the new book, Empire of AI: Dreams and Nightmares in Sam Altman’s OpenAI. She came into our studio in May. She talked about how AI will impact workers.

KAREN HAO: One of the things that we have seen is this technology is already having a huge impact on jobs, not necessarily because the technology itself is really capable of replacing jobs, but because it is perceived as capable enough that executives are laying off workers. And we need some kind of guardrails to actually prevent these companies from continuing to try and develop labor-automating technologies, and to shift them toward producing labor-assistive technologies.

AMY GOODMAN: What do you mean?

KAREN HAO: So, OpenAI, their definition of what they call artificial general intelligence is highly autonomous systems that outperform humans in most economically valuable work. So they explicitly state that they are trying to automate jobs away. I mean, what is economically valuable work but the things that people do to get paid?

But there’s this really great book called Power and Progress by MIT economists Daron Acemoglu and Simon Johnson, who mention that technology development, all technology revolutions, they take a labor-automating approach, not because of inevitability, but because the people at the top choose to automate those jobs away. They choose to design the technology so that they can sell it to executives and say, “You can shrink your costs by laying off all these workers and using our AI services instead.”

But in the past, we’ve seen studies that, for example, suggest that if you develop an AI tool that a doctor uses, rather than replacing the doctor, you will actually get better healthcare for patients. You will get better cancer diagnoses. If you develop an AI tool that teachers can use, rather than just an AI tutor that replaces the teacher, your kids will get better educational outcomes. And so, that’s what I mean by labor-assistive rather than labor-automating.

AMY GOODMAN: And explain what you mean, because I think a lot of people don’t even understand artificial intelligence. And when you say “replace the doctor,” what are you talking about?

KAREN HAO: Right. So, these companies, they try to develop a technology that they position as an everything machine that can do anything. And so, they will try to say, “You can use this — you can talk to ChatGPT for therapy.” No, you cannot. ChatGPT is not a licensed therapist. And, in fact, these models actually spew lots of medical misinformation. And there have been lots of examples of, actually, users being psychologically harmed by the model, because the model will continue to reinforce self-harming behaviors. And we’ve even had cases where children who speak to chatbots and develop huge emotional relationships with these chatbots have actually killed themselves after using these chatbot systems. But that’s what I mean when these companies are trying to develop labor-automating tools. They’re positioning it as: You can now hire this tool instead of hire a worker.

AMY GOODMAN: So, you’ve talked about Sam Altman, and in Part 1, we touched on who he is, but I’d like you to go more deeply into what — who Sam Altman is, how he exploded onto the U.S. scene testifying before Congress, actually warning about the dangers of AI. So that really protected him, in a way, people seeing him as a prophet. That’s a P-R-O-P-H-E-T.

KAREN HAO: Right.

AMY GOODMAN: But now we can talk about the other kind of profit, P-R-O-F-I-T.

KAREN HAO: Yeah.

AMY GOODMAN: And how OpenAI was formed? How is OpenAI different from AI?

KAREN HAO: OpenAI is a — I mean, it was originally founded as a nonprofit, as I mentioned. And Altman specifically, when he was thinking about, “How do I make a fundamental AI research lab that is going to make a big splash?” he chose to make it a nonprofit because he identified that if he could not compete on capital — and he was relatively late to the game. Google already had a monopoly on a lot of top AI research talent at the time. If he could not compete on capital, and he could not compete in terms of being a first mover, he needed some other kind of ingredient there to really recruit talent, recruit public goodwill and establish a name for OpenAI. So he —

AMY GOODMAN: A gimmick.

KAREN HAO: — identified a mission. He identified: Let me make this a nonprofit, and let me give it a really compelling mission. So, the mission of OpenAI is to ensure artificial general intelligence benefits all of humanity. And one of the quotes that I open my book with is this quote that Sam Altman cited himself in 2013 in his blog. He was an avid blogger back in the day, talking about his learnings on business and strategy and Silicon Valley startup life. And the quote is, “Successful people build companies. More successful people build countries. The most successful people build religions.” And then he reflects on that quote in his blog, saying, “It appears to me that the best way to build a religion is actually to build a company.”

AMY GOODMAN: And so, talk about how Altman was then forced out of the company and then came back. And also, I just found it so fascinating that you were able to speak with so many OpenAI workers. You thought — 

KAREN HAO: Yeah.

AMY GOODMAN: — there was a kind of total ban on you.

KAREN HAO: Yes. Yeah, exactly. So, I was the first journalist to profile OpenAI. I embedded within the company for three days in 2019, and then my profile published in 2020 for MIT Technology Review. And at the time, I identified in the profile this tension that I was seeing, where it was a nonprofit by name, but behind the scenes, a lot of the public values that they espoused were actually the opposite of how they operated. So, they espoused transparency, but they were highly secretive. They espoused collaborativeness, but they were highly competitive. And they espoused that they had no commercial intent, but, in fact — they had just gotten a $1 billion investment from Microsoft — it seemed like they were rapidly going to develop commercial intent. And so I wrote that into the profile, and OpenAI was deeply unhappy about it, and they refused to talk to me for three years.

And so, when OpenAI took up this mission of artificial general intelligence, they were able to essentially shape and mold what they wanted this technology to be, based on what is most convenient for them. But when they identified it, it was at a time when scientists really looked down on even the term “AGI.” And so, they absorbed just a small group of self-identified AGI believers. This is why I call it quasi-religious: because there’s no scientific evidence that we can actually develop AGI, the people who have this strong conviction that they will do it, and that it’s going to happen soon, are operating purely on belief. And they talk about it as a belief, too. But there are two factions within this belief system of the AGI religion: There are people who think AGI is going to bring us to utopia, and there are people who think AGI is going to destroy all of humanity. Both of them believe that it is possible, it’s coming soon, and therefore, they conclude that they need to be the ones to control the technology and not democratize it.

And this is ultimately what leads to your question of what happened when Sam Altman was fired and rehired. Through the history of OpenAI, there’s been a lot of clashing between the boomers and doomers about who should actually —

AMY GOODMAN: The boomers and doomers.

KAREN HAO: The boomers and the doomers.

AMY GOODMAN: Those that say it’ll bring us the apocalypse.

KAREN HAO: So, utopia, boomers, and those that say it’ll destroy humanity, the doomers. And they have clashed relentlessly and aggressively about how quickly to build the technology, how quickly to release the technology.

AMY GOODMAN: And I want to take this up until today, to, in January, the Trump administration announcing the Stargate Project, a $500 billion project to boost AI infrastructure in the United States. This is OpenAI’s Sam Altman speaking alongside President Trump.

SAM ALTMAN: I think this will be the most important project of this era. And as Masa said, for AGI to get built here, to create hundreds of thousands of jobs, to create a new industry centered here, we wouldn’t be able to do this without you, Mr. President.

AMY GOODMAN: He also there referred to AGI —

KAREN HAO: Exactly.

AMY GOODMAN: — artificial general intelligence. Explain what happened here and what this is. And has it actually happened?

KAREN HAO: So, Altman, before Trump was elected, he already was sensing, through observation, that it was possible that the administration would shift and that he would need to start politicking quite heavily to ingratiate himself to a new administration. Altman is very strategic. He was under a lot of pressure at the time, as well, because his original co-founder, Elon Musk, now has great beef with him. Musk feels like Altman used his name and his money to set up OpenAI, and then he got nothing in return. So, Musk had been suing him, is still suing him, and suddenly became first buddy of the Trump administration.

So, Altman basically cleverly orchestrated this announcement, where, by the way, the announcement is quite strange, because the Trump — President Trump is not — it’s not the U.S. government giving $500 billion. It’s private investment coming into the U.S. from places like SoftBank.

AMY GOODMAN: Which is?

KAREN HAO: Which is one of the largest investment funds, run by Masayoshi Son, a Japanese businessman who made a lot of his wealth from the previous tech era. So, it’s not even the U.S. government that’s providing this money.

AMY GOODMAN: And take that right through to now, that golf trip that Elon Musk was on, but so was Sam Altman — 

KAREN HAO: Yes.

AMY GOODMAN: — to the fury of Elon Musk. And then a deal was sealed in Abu Dhabi —

KAREN HAO: Yeah. So —

AMY GOODMAN: — that didn’t include Elon Musk, but was about OpenAI.

KAREN HAO: Exactly. So, Altman has continued to try and use the U.S. government as a way to get access to more places and more powerful spaces to build out this empire. And one of the things, because OpenAI’s computational infrastructure needs are so aggressive, you know, I had an OpenAI employee tell me, “We’re running out of land and power.” So, they are running out of resources in the U.S., which is why they’re trying to get access to land and energy in other places. The Middle East has a lot of land and has a lot of energy, and they’re willing to strike deals. And that is why Altman was part of that trip, looking to strike a deal. And what they — the deal that they struck was to build a massive data center, or multiple data centers, in the Middle East, using their land and their energy.

But one of the things that OpenAI has recently rolled out, they call it the OpenAI for Countries program, and it is this idea that they want to install OpenAI hardware and software in places around the world, and it explicitly says, “We want to build democratic AI rails. We want to install our hardware and software as a foundation of democratic AI globally, so that we can stop China from installing authoritarian AI globally.”

But the thing that he does not acknowledge is that there is nothing democratic about what he’s doing. You know, The Atlantic executive editor says, “We need to call these companies for what they are.” They are techno-authoritarians. They do not ask the public for any perspective on how they develop the technology, what data they train the technology on, where they develop these data centers. In fact, these data centers are often developed under the cover of night, under shell companies. Like, Meta recently entered New Mexico under the shell company named Greater Kudu LLC.

AMY GOODMAN: Greater Kudu?

KAREN HAO: Greater Kudu LLC. And once the deal was actually closed, and the residents couldn’t do anything about it anymore, that’s when it was revealed: “Surprise, we’re Meta. And you’re going to get a data center that drinks all of your freshwater.”

AMY GOODMAN: And then there was this whole controversy in Memphis around a data center.

KAREN HAO: Yes. So, that is the data center that Elon Musk is building. So, meanwhile, Musk is saying, “Altman is terrible. Everyone should use my AI.” And, of course, his AI is also being developed using the same environmental and public health costs. So, he built this massive supercomputer called Colossus in Memphis, Tennessee, that’s training Grok, the chatbot that people can access through X. And that is being powered by around 35 unlicensed methane gas turbines that are pumping thousands of tons of toxic air pollutants into the greater Memphis community. And that community has long suffered a lack of access to clean air, a fundamental human right.

AMY GOODMAN: So, I want to go to, interestingly, Sam Altman testifying in front of Congress last month about solutions to the high energy consumption of artificial intelligence.

SAM ALTMAN: In the short term, I think this probably looks like more natural gas, although there are some applications where I think solar can really help. In the medium term, I hope it’s advanced nuclear fission and fusion. More energy is important well beyond AI.

AMY GOODMAN: So, that’s OpenAI’s Sam Altman. He was testifying before the Senate and talking about everything from solar to nuclear power — 

KAREN HAO: Yeah.

AMY GOODMAN: — something that was fought in the United States by environmental activists for decades. So, you have these huge, old nuclear power plants, but many say you can’t make them safe, no matter how small and smart you make them.

KAREN HAO: This is one of the things — of the many things that I’m concerned about with the current trajectory of AI development. This is a second-order, third-order effect: because these companies are trying to claim that the AI development approach they took doesn’t have climate harms, they are explicitly evoking nuclear again and again and again, as if nuclear will solve the problem. And it has been effective. I have talked with certain AI researchers who thought the problem was solved because of nuclear. And in order to try and actually build more and more nuclear plants, they are lobbying governments to try and unwind the regulatory structure around nuclear power plant building. I mean, this is crazy on so many levels: they’re not just trying to develop the AI technology recklessly, they are also trying to lay down energy and nuclear infrastructure with this move-fast, break-things ideology.

AMY GOODMAN: But for those who are environmentalists and have long opposed nuclear, will they be sucked in by the solar alternative?

KAREN HAO: But that — so, data centers have to run 24/7, so they cannot actually run on just renewables. That is why the companies keep trying to evoke nuclear as the solve-all. But solar does not actually work when we do not have sufficient energy storage solutions for that 24/7 operation.

AMY GOODMAN: We’re talking to Karen Hao, author of Empire of AI: Dreams and Nightmares in Sam Altman’s OpenAI. You mentioned earlier China. You live in Hong Kong.

KAREN HAO: Yes.

AMY GOODMAN: You’ve covered Chinese AI, U.S. AI for years.

KAREN HAO: Yeah.

AMY GOODMAN: Explain what’s happening in China right now.

KAREN HAO: Yeah, so, the — I have to sort of explain the dynamic between China and the U.S. first. So, China and the U.S. are the largest hubs for AI research. They have the largest concentrations of AI research talent globally. Other than Silicon Valley, China really is the only rival in terms of talent density and the amount of capital investment and the amount of infrastructure that is going into AI development.

In the last few years, what we have seen is the U.S. government has been aggressively trying to stay number one, and one of the mechanisms that they have used is export controls. A key input into these AI models is the computational infrastructure — the computer chips installed in data centers for training these models. In order to develop the AI models, companies are using the most bleeding-edge computer chip technology. It’s like every two years, a new chip comes out, and they immediately start using that to train the next generation of AI models. Those computer chips are designed by American companies, the most prominent one being Nvidia in California. And so, the U.S. government has been trying to use export controls to prevent Chinese companies from getting access to the most cutting-edge computer chips. That has all been under the recommendation of Silicon Valley saying, “This is the way to prevent China from being number one. Put export controls on them, and don’t regulate us at all, so we can stay number one, and they will fall behind.”

What has happened instead is, because there is a strong base of AI research talent in China, under the constraints of fewer computational resources, Chinese companies have actually been able to innovate and develop the same level of AI model capabilities as American companies, with two orders of magnitude less computational resources, less energy, less data. So, I’m talking specifically about the Chinese company High-Flyer, which developed this model called DeepSeek earlier this year, that briefly tanked the global economy. The company said that training this one AI model cost around $6 million, when OpenAI was training models that cost hundreds of millions, if not billions, of dollars. And that delta demonstrated to people that what Silicon Valley has tried to convince everyone of for the last few years, that this is the only path to getting more AI capabilities, is totally false. And actually, the techniques that the Chinese company was using were ones that existed in the literature and just had to be assembled. They used a lot of engineering sophistication to do that, but they weren’t actually using fundamentally new techniques. They were ones that already existed.

AMY GOODMAN: So, let me ask you something, Karen. The latest news, as you’re traveling in the United States, before you go back to Hong Kong, of Trump’s attack on academia, how this fits in? How could Trump’s attack on international students, specifically targeting the, what, more than 250,000, a quarter of a million, Chinese students —

KAREN HAO: Yeah.

AMY GOODMAN: — and revoking their visas — 

KAREN HAO: Yeah.

AMY GOODMAN: — impact the future of the AI industry? But not just Chinese students, because what’s going on here now is terrifying students around the world.

KAREN HAO: Yes.

AMY GOODMAN: And because labs are shutting down in all kinds of ways here, U.S. students, as well, deciding to go abroad.

KAREN HAO: This is just the latest action that the U.S. government has taken over the last few years to really alienate a key talent pool for U.S. innovation. Originally, there were more Chinese researchers working in the U.S. contributing to U.S. AI than there were in China, because just a few years ago, Chinese researchers aspired to work for American companies. They wanted to move to the U.S. They wanted to contribute to the U.S. economy. They didn’t want to go back to their home country.

But it started with what was called the China Initiative, the first Trump-era initiative to try and criminalize Chinese academics, or ethnically Chinese academics, some of whom were actually Americans: based on just paperwork errors, they would accuse them of being spies. That was one of the first actions. Then, of course, the pandemic happened, and the U.S.-China trade escalations started amplifying anti-Chinese rhetoric. All of these, and now the potential ban on international students, have led more and more Chinese researchers to just opt for staying at home and contributing to the Chinese AI ecosystem.

And this was a prerequisite to High-Flyer pulling off DeepSeek. If there had not been that concentration and build-up of AI talent in China, they probably would have had a much harder time innovating around, circumventing these export controls that the U.S. government was imposing on them. But because they now have a high concentration of top talent, some of the top talent globally, when those restrictions were imposed, they were able to innovate around them. So, DeepSeek is literally a product of this continuation of that alienation.

And with the U.S. continuing to take this stance, it is just going to get worse. And as you mentioned, it’s not just Chinese researchers. I literally just talked to a friend in academia that said she’s considering going to Europe now, because she just cannot survive without that public funding. And European countries are seeing a critical opportunity, offering million-dollar packages: “Come here. We’ll give you a lab. We’ll give you millions of dollars of funding.” I mean, this is the fastest way to brain drain this country.

AMY GOODMAN: I mean, it’s what many are saying: “The U.S.’s brain drain is their brain gain.”

KAREN HAO: Yes.

AMY GOODMAN: And this also reminds us of history. You have the Chinese rocket scientist Qian Xuesen, who, in the 1950s, was inexplicably held under house arrest for years, and then Eisenhower had him deported to China. He became the father of Chinese rocket science and of China’s entry into space.

KAREN HAO: Yeah.

AMY GOODMAN: And he said he would never again step foot into the United States, even though originally that was the only place he wanted to live.

KAREN HAO: Yes, and there was, I believe, a government official, a U.S. government official, who said that was the dumbest mistake the U.S. ever made.

AMY GOODMAN: We talk about the brain drain and the brain gain. OK, again, some more rhyming, the doomers and the boomers. I want to talk about what an AI apocalypse looks like, meaning how it brings us to apocalypse, but also how people say it could lead us to a utopia. What are the two tracks, trajectories?

KAREN HAO: It’s a great question. And I ask boomers and doomers this all the time: Can you articulate to me exactly how we get there? And the issue is that they cannot. And this is why I call it quasi-religious. It really is based on belief.

I mean, I was talking with one researcher who identified as a boomer, and I said — you know, his eyes were wide, and he really lit up, saying, “You know, once we get to AGI, game over. Everything becomes perfect.” And I asked him, I was like, “Can you explain to me: How does AGI feed people that haven’t — don’t have food on the table right now?” And he was like, “Oh, you’re talking about, like, the floor floor and how to elevate their quality of life.” And I was like, “Yes, because they are also part of all of humanity.” And he was like, “I’m not really sure how that would happen, but I think it could help the middle class get more economic opportunity.” And I was like, “OK, but how does that happen, as well?” And he was like, “Well, once these come — once we have AGI, and it can just create trillions of dollars of economic value, we can just give them cash payouts.” And I was like, “Who’s giving them cash payouts? What institutions are giving them?” You know, like, it doesn’t — when you actually test their logic, it doesn’t really hold.

And with the doomers, I mean, it’s the same thing. Like, their belief is — ultimately, what I realized when reporting on the book is they believe AGI is possible because of their belief of how the human brain works. They believe human intelligence is inherently fully computational. So, if you have enough data and you have enough computational resources, you will inevitably be able to recreate human intelligence. It’s just a matter of time. And to them, the reason why that would lead to an apocalyptic scenario is humans, we learn and improve our intelligence through communication, and communication is inefficient. We miscommunicate all the time. And so, for AI intelligences, they would be able to rapidly get smarter and smarter and smarter by having perfect communication with one another as digital intelligences. And so, many of these people who self-identify as doomers say there has never been in the history of the universe a species that was superior to another species — a species that was able to rule over a more superior species. So they think that, ultimately, AI will evolve into a higher species and then start ruling us, and then maybe decide to get rid of us altogether.

AMY GOODMAN: As we begin to wrap up, I’m wondering if you can talk about any model of a country, not a company, that is pioneering a way of democratically controlled artificial intelligence.

KAREN HAO: I don’t think it’s actively happening right now. The EU has had the EU AI Act, which is their major piece of legislation trying to develop a risk-based, rights-based framework for governing AI deployment.

But to me, one of the keys of democratic AI governance is also democratically developing AI, and I don’t think any country is really doing that. And what I mean by that is, AI has a supply chain. It needs data. It needs land. It needs energy. It needs water. And it also needs spaces in which to deploy the technology, spaces these companies need access to — schools, hospitals, government agencies. Silicon Valley has done a really good job over the last decade of making people feel that their collectively owned resources are Silicon Valley’s. You know, I talk with friends all the time who say, “We don’t have data privacy anymore. So what is more data to these companies? I’m fine just giving them all of my data.”

But that data is yours. You know, that intellectual property is the writers’ and artists’ intellectual property. That land is a community’s land. Those schools are the students’ and teachers’ schools. The hospitals are the doctors’ and nurses’ and patients’ hospitals. These are all sites of democratic contestation in the deployment — in the development and the deployment of AI. And just like those Chilean water activists that we talked about, who aggressively understood that that freshwater was theirs, and they were not willing to give it up unless they got some kind of mutually beneficial agreement for it, we need to have that spirit in protecting our data, our land, our water and our schools, so that companies inevitably will have to adjust their approach, because they will no longer get access to the resources they need or the spaces that they need to deploy in.

AMY GOODMAN: In 2022, Karen, you wrote a piece for MIT Technology Review headlined “A new vision of artificial intelligence for the people: In a remote rural town in New Zealand, an Indigenous couple is challenging what AI could be and who it should serve.” Who are they?

KAREN HAO: This was a wonderful story that I did, where the couple, they run Te Hiku Media. It’s a nonprofit Māori radio station in New Zealand. And the Māori people have suffered a lot of the same challenges as many Indigenous peoples around the world. The history of colonization led them to rapidly lose their language, and there are very few Māori speakers in the world anymore. And so, in the last few years, there has been an attempt to revive the language, and the New Zealand government has tried to repent by trying to encourage the revival of that language.

But this nonprofit radio station, they had all of this wonderful archival material, archival audio of their ancestors speaking the Māori language, that they wanted to provide to Māori speakers and Māori learners around the world as an educational resource. The problem is, in order to do that, they needed to transcribe the audio so that Māori learners could actually listen, see what was being said, click on the words, understand the translation, and actually turn it into an active learning tool. But there were so few Māori speakers who could speak at that advanced level that they realized they had to turn to AI.

And this is a key part of my book’s argument: I’m not critiquing all AI development. I’m specifically critiquing the scale-at-all-costs approach that Silicon Valley has taken. But there are many different kinds of beneficial AI models, including what they ended up doing.

So, they took a fundamentally different approach. First and foremost, they asked their community, “Do we want this AI tool?” Once the community said yes, then they moved to the next step of asking people to fully consent to donating data for the training of this tool. They explained to the community what this data was for, how it would be used, how they would then guard that data and make sure that it wasn’t used for other purposes. They collected around a couple hundred hours of audio data in just a few days, because the community rallied support around this project. And only a couple hundred hours was enough to create a high-performing speech recognition model, which is crazy when you think about the scales of data that these Silicon Valley companies require. And that is, once again, a lesson that can be learned: there’s actually plenty of research that shows, when you have highly curated small data sets, you can create very powerful AI models. And then, once they had that tool, they were able to do exactly what they wanted: to open-source this educational resource to their community.

And so, my vision for AI development in the future is to have more small, task-specific AI models that are not trained on vast, polluted data sets, but small, curated data sets, and therefore only need small amounts of computational power and can be deployed in challenges that we actually need to tackle for humanity — mitigating climate change by integrating more renewable energy into the grid, improving healthcare by doing more drug discovery.

AMY GOODMAN: So, as we finally do wrap up, what were you most shocked by? You’ve been doing this journalism, this research for years. What were you most shocked by in writing Empire of AI?

KAREN HAO: I originally thought that I was going to write a book focused on vertical harms of the AI supply chain — here’s how labor exploitation happens in the AI industry, here’s how the environmental harms are arising out of the AI industry. And at the end of my reporting, I realized that there is a horizontal harm that’s happening here. Every single community that I spoke to, whether it was artists having their intellectual property taken or Chilean water activists having their freshwater taken, they all said that when they encountered the empire, they initially felt exactly the same way: a complete loss of agency to self-determine their future. And that is when I realized the horizontal harm here is AI is threatening democracy. If the majority of the world is going to feel this loss of agency over self-determining their future, democracy cannot survive. And again, specifically Silicon Valley’s approach, scale-at-all-costs AI development.

AMY GOODMAN: But you also chronicle the resistance. You talk about how the Chilean water activists felt at first — 

KAREN HAO: Yes, exactly.

AMY GOODMAN: — how the artists feel at first.

KAREN HAO: Yes.

AMY GOODMAN: So, talk about the strategies that these people have employed, and if they’ve been effective.

KAREN HAO: So, the amazing thing is that there has since been so much pushback. The artists have then said, “Wait a minute. We can sue these companies.” The Chilean water activists said, “Wait a minute. We can fight back and protect these water resources.” The Kenyan workers that I spoke to who were contracted by OpenAI, they said, “We can unionize and escalate our story to international media attention.”

And so, even these communities, which you could argue are the most vulnerable in the world and have the least amount of agency, were the ones that remembered that they do have agency and that they can seize that agency and fight back. And it was remarkably heartening to encounter those people, who reminded me that, actually, the first step to reclaiming democracy is remembering that no one can take your agency away.

AMY GOODMAN: Karen Hao, author of the new book, Empire of AI: Dreams and Nightmares in Sam Altman’s OpenAI. Go to democracynow.org to see the full interview. And that does it for this special broadcast. I’m Amy Goodman. Thanks so much for joining us.




AI Testing and Evaluation: Learnings from pharmaceuticals and medical devices

DANIEL CARPENTER: Thanks for having me. 

SULLIVAN: Dan, before we dissect policy, let’s rewind the tape to your origin story. Can you take us to the moment that you first became fascinated with regulators rather than, say, politicians? Was there a spark that pulled you toward the FDA story? 

CARPENTER: At one point during graduate school, I was studying a combination of American politics and political theory, and I did a summer interning at the Department of Housing and Urban Development. And I began to think, why don’t people study these administrators more and the rules they make, the, you know, inefficiencies, the efficiencies? Really more from, kind of, a descriptive standpoint, less from a normative standpoint. And I was reading a lot that summer about the Food and Drug Administration and some of the decisions it was making on AIDS drugs. That was a, sort of, a major, …

SULLIVAN: Right. 

CARPENTER: … sort of, you know, moment in the news, in the global news as well as the national news during, I would say, what? The late ’80s, early ’90s? And so I began to look into that.

SULLIVAN: So now that we know what pulled you in, let’s zoom out for our listeners. Give us the whirlwind tour. I think most of us know pharma involves years of trials, but what’s the part we don’t know?

CARPENTER: So I think when most businesses develop a product, they all go through some phases of research and development and testing. And I think what’s different about the FDA is, sort of, two- or three-fold.

First, a lot of those tests are much more stringently specified and regulated by the government, and second, one of the reasons for that is that the FDA imposes not simply safety requirements upon drugs in particular but also efficacy requirements. The FDA wants you to prove not simply that it’s safe and non-toxic but also that it’s effective. And the final thing, I think, that makes the FDA different is that it stands as what I would call the “veto player” over R&D [research and development] to the marketplace. The FDA basically has, sort of, this control over entry to the marketplace.

And so what that involves is usually first, a set of human trials where people who have no disease take it. And you’re only looking for toxicity generally. Then there’s a set of Phase 2 trials, where they look more at safety and a little bit at efficacy, and you’re now examining people who have the disease that the drug claims to treat. And you’re also basically comparing people who get the drug, often with those who do not.

And then finally, Phase 3 involves a much more direct and large-scale attack, if you will, or assessment of efficacy, and that’s where you get the sort of large randomized clinical trials that are very expensive for pharmaceutical companies, biomedical companies to launch, to execute, to analyze. And those are often the sort of core evidence base for the decisions that the FDA makes about whether or not to approve a new drug for marketing in the United States.

SULLIVAN: Are there differences in how that process has, you know, changed through other countries and maybe just how that’s evolved as you’ve seen it play out? 

CARPENTER: Yeah, for a long time, I would say that the United States had probably the most stringent regime of regulation for biopharmaceutical products until, I would say, about the 1990s and early 2000s. It used to be the case that a number of other countries, especially in Europe but around the world, basically waited for the FDA to mandate tests on a drug and only after the drug was approved in the United States would they deem it approvable and marketable in their own countries. And then after the formation of the European Union and the creation of the European Medicines Agency, gradually the European Medicines Agency began to get a bit more stringent.  

But, you know, over the long run, there’s been a lot of, sort of, heterogeneity, a lot of variation over time and space, in the way that the FDA has approached these problems. And I’d say in the last 20 years, it’s begun to partially deregulate, namely, you know, trying to find all sorts of mechanisms or pathways for really innovative drugs for deadly diseases without a lot of treatments to basically get through the process at lower cost. For many people, that has not been sufficient. They’re concerned about the cost of the system. Of course, then the agency also gets criticized by those who believe it’s too lax. It is potentially letting ineffective and unsafe therapies on the market.

SULLIVAN: In your view, when does the structured model genuinely safeguard patients and where do you think it maybe slows or limits innovation?

CARPENTER: So I think the worry is that if you approach pharmaceutical approval as a world where only things can go wrong, then you’re really at a risk of limiting innovation. And even if you end up letting a lot of things through, if by your regulations you end up basically slowing down the development process or making it very, very costly, then there’s just a whole bunch of drugs that either come to market too slowly or they come to market not at all because they just aren’t worth the kind of cost-benefit or, sort of, profit analysis of the firm. You know, so that’s been a concern. And I think it’s been one of the reasons that the Food and Drug Administration as well as other world regulators have begun to basically try to smooth the process and accelerate the process at the margins.

The other thing is that they’ve started to basically make approvals on the basis of what are called surrogate endpoints. So the idea is that a cancer drug, we really want to know whether that drug saves lives, but if we wait to see whose lives are saved or prolonged by that drug, we might miss the opportunity to make judgments on the basis of, well, are we detecting tumors in the bloodstream? Or can we measure the size of those tumors in, say, a solid cancer? And then the further question is, is the size of the tumor basically a really good correlate or predictor of whether people will die or not, right? Generally, the FDA tends to be less stringent when you’ve got, you know, a remarkably innovative new therapy and the disease being treated is one that just doesn’t have a lot of available treatments, right.

The one thing that people often think about when they’re thinking about pharmaceutical regulation is they often contrast, kind of, speed versus safety …

SULLIVAN: Right.  

CARPENTER: … right. And that’s useful as a tradeoff, but I often try to remind people that it’s not simply about whether the drug gets out there and it’s unsafe. You know, you and I as patients and even doctors have a hard time knowing whether something works and whether it should be prescribed. And the evidence for knowing whether something works isn’t just, well, you know, Sally took it or Dan took it or Kathleen took it, and they seem to get better or they didn’t seem to get better.  

The really rigorous evidence comes from randomized clinical trials. And I think it’s fair to say that if you didn’t have the FDA there as a veto player, you wouldn’t get as many randomized clinical trials and the evidence probably wouldn’t be as rigorous for whether these things work. And as I like to put it, basically there’s a whole ecology of expectations and beliefs around the biopharmaceutical industry in the United States and globally, and to some extent, it’s undergirded by all of these tests that happen.  

SULLIVAN: Right.  

CARPENTER: And in part, that means it’s undergirded by regulation. Would there still be a market without regulation? Yes. But it would be a market in which people had far less information about, and confidence in, the drugs being taken. And so I think it’s important to recognize that kind of confidence-boosting potential of, kind of, a scientific regulatory base. 

SULLIVAN: Actually, if we could double-click on that for a minute, I’d love to hear your perspective on what happens once testing has been completed and there are results. Can you walk us through how those results actually shape the next steps and decisions for a particular drug, and how regulators actually think about using that data to influence what happens next with it?

CARPENTER: Right. So it’s important to understand that every drug is approved for what’s called an indication. It can have a first primary indication, which is the main disease that it treats, and then others can be added as more evidence is shown. But a drug is not something that just kind of exists out there in the ether. It has to have the right form of administration. Maybe it should be injected. Maybe it should be ingested. Maybe it should be administered only at a clinic because it needs to be kind of administered in just the right way. As doctors will tell you, dosage is everything, right.  

And so one of the reasons that you want those trials is not simply a, you know, yes or no answer about whether the drug works, right. It’s not simply if-then. It’s literally what goes into what you might call the dose response curve. You know, how much of this drug do we need to basically, you know, get the benefit? At what point does that fall off significantly that we can basically say, we can stop there? All that evidence comes from trials. And that’s the kind of evidence that is required on the basis of regulation.  

Because it’s not simply a drug that’s approved. It’s a drug and a frequency of administration. It’s a method of administration. And so the drug isn’t just, there’s something to be taken off the shelf and popped into your mouth. I mean, sometimes that’s what happens, but even then, we want to know what the dosage is, right. We want to know what to look for in terms of side effects, things like that.
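
To make the dose-response idea Carpenter describes concrete, here is a minimal, illustrative Python sketch that fits a standard Hill-equation dose-response curve to hypothetical trial data. The doses, responses, and parameter values are invented for demonstration; this is not how any regulator's analysis actually looks, only the shape of the question "how much drug buys how much benefit."

```python
# Minimal sketch: fitting a Hill-equation dose-response curve to hypothetical data.
# The doses, responses, and starting parameters are invented for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def hill(dose, emax, ec50, n):
    """Hill equation: response rises with dose and saturates at emax."""
    return emax * dose**n / (ec50**n + dose**n)

# Hypothetical dose (mg) and observed response (e.g., % symptom reduction).
doses = np.array([0.5, 1, 2, 4, 8, 16, 32], dtype=float)
responses = np.array([5, 12, 25, 41, 58, 66, 69], dtype=float)

params, _ = curve_fit(hill, doses, responses, p0=[70, 4, 1])
emax, ec50, n = params
print(f"Estimated Emax={emax:.1f}, EC50={ec50:.1f} mg, Hill slope={n:.2f}")

# A regulator-style question: beyond what dose does added benefit fall off?
# Here, the dose achieving 90% of the maximal effect (solving hill(d) = 0.9*emax).
dose_90 = ec50 * 9 ** (1 / n)
print(f"Dose giving ~90% of maximal effect: {dose_90:.1f} mg")
```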

SULLIVAN: Going back to that point, I mean, it sounds like we’re making a lot of progress from a regulation perspective in, you know, sort of speed and getting things approved but doing it in a really balanced way. I mean, any other kind of closing thoughts on the tradeoffs there or where you’re seeing that going?

CARPENTER: I think you’re going to see some move in the coming years—there’s already been some of it—to say, do we always need a really large Phase 3 clinical trial? And to what degree do we need the, like, you know, all the i’s dotted and the t’s crossed or a really, really large sample size? And I’m open to innovation there. I’m also open to the idea that we consider, again, things like accelerated approvals or pathways for looking at different kinds of surrogate endpoints. I do think, once we do that, then we also have to have some degree of follow-up.

SULLIVAN: So I know we’re getting close to out of time, but maybe just a quick rapid fire if you’re open to it. Biggest myth about clinical trials?

CARPENTER: Well, some people tend to think that the FDA performs them. You know, it’s companies that do it. And the only other thing I would say is the company that does a lot of the testing and even the innovating is not always the company that takes the drug to market, and it tells you something about how powerful regulation is in our system, in our world, that you often need a company that has dealt with the FDA quite a bit and knows all the regulations and knows how to dot the i’s and cross the t’s in order to get a drug across the finish line.

SULLIVAN: If you had a magic wand, what’s the one thing you’d change in regulation today?

CARPENTER: I would like people to think a little bit less about just speed versus safety and, again, more about this basic issue of confidence. I think it’s fundamental to everything that happens in markets but especially in biopharmaceuticals.

SULLIVAN: Such a great point. This has been really fun. Just thanks so much for being here today. We’re really excited to share your thoughts out to our listeners. Thanks.

[TRANSITION MUSIC] 

CARPENTER: Likewise. 

SULLIVAN: Now to the world of medical devices. I’m joined by Professor Timo Minssen. Professor Minssen, it’s great to have you here. Thank you for joining us today. 

TIMO MINSSEN: Yeah, thank you very much, it’s a pleasure.

SULLIVAN: Before getting into the regulatory world of medical devices, tell our audience a bit about your personal journey or your origin story, as we’re asking our guests. How did you land in regulation, and what’s kept you hooked in this space?

MINSSEN: So I started out as a patent expert in the biomedical area, starting with my PhD thesis on patenting biologics in Europe and in the US. So during that time, I was mostly interested in patent and trade secret questions. But at the same time, I also developed and taught courses in regulatory law and held talks on regulating advanced medical therapy medicinal products. I then started to lead large research projects on legal challenges in a wide variety of health and life science innovation frontiers. I also started to focus increasingly on AI-enabled medical devices and software as a medical device, resulting in several academic articles in this area and also in the regulatory area and a book on the future of medical device regulation.  

SULLIVAN: Yeah, what’s kept you hooked in the space?

MINSSEN: It’s just incredibly exciting, in particular right now with everything that is going on, you know, in the software arena, in the marriage between AI and medical devices. And this is really challenging not only societies but also regulators and authorities in Europe and in the US.

SULLIVAN: Yeah, it’s a super exciting time to be in this space. You know, we talked to Daniel a little earlier and, you know, I think similar to pharmaceuticals, people have a general sense of what we mean when we say medical devices, but most listeners may picture like a stethoscope or a hip implant. The word “medical device” reaches much wider. Can you give us a quick, kind of, range from perhaps very simple to even, I don’t know, sci-fi and then your 90-second tour of how risk assessment works and why a framework is essential?

MINSSEN: Let me start out by saying that the WHO [World Health Organization] estimates that today there are approximately 2 million different kinds of medical devices on the world market, and as of the FDA’s latest update that I’m aware of, the FDA has authorized more than 1,000 AI-, machine learning-enabled medical devices, and that number is rising rapidly.

So in that context, I think it is important to understand that medical devices can be any instrument, apparatus, implement, machine, appliance, implant, reagent for in vitro use, software, material, or other similar or related articles that are intended by the manufacturer to be used alone or in combination for a medical purpose. And the spectrum of what constitutes a medical device can thus range from very simple devices such as tongue depressors, contact lenses, and thermometers to more complex devices such as blood pressure monitors, insulin pumps, MRI machines, implantable pacemakers, and even software as a medical device or AI-enabled monitors or drug device combinations, as well.

So talking about regulation, I think it is also very important to stress that medical devices are used in many diverse situations by very different stakeholders. And testing has to take this variety into consideration, and it is intrinsically tied to regulatory requirements across various jurisdictions.

During the pre-market phase, medical testing establishes baseline safety and effectiveness metrics through bench testing, performance standards, and clinical studies. And post-market testing ensures that real-world data informs ongoing compliance and safety improvements. So testing is indispensable in translating technological innovation into safe and effective medical devices. And while particular details of pre-market and post-market review procedures may slightly differ among countries, most developed jurisdictions regulate medical devices similarly to the US or European models. 

So most jurisdictions with medical device regulation classify devices based on their risk profile, intended use, indications for use, technological characteristics, and the regulatory controls necessary to provide a reasonable assurance of safety and effectiveness.
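
As a rough illustration of the risk-based classification Minssen describes, the sketch below maps the familiar US device classes to their typical regulatory pathways. It is a simplification for orientation only; real classification also turns on intended use, indications, and technological characteristics, as he notes.

```python
# Simplified illustration of risk-based device classification (US-style).
# Real classification weighs intended use, indications for use, and technological
# characteristics; this lookup is for orientation only.
from dataclasses import dataclass

@dataclass
class DeviceClass:
    risk: str
    typical_pathway: str
    examples: list[str]

US_DEVICE_CLASSES = {
    "Class I": DeviceClass(
        risk="low",
        typical_pathway="general controls; many devices exempt from premarket review",
        examples=["tongue depressor", "manual stethoscope"],
    ),
    "Class II": DeviceClass(
        risk="moderate",
        typical_pathway="510(k) premarket notification showing substantial equivalence",
        examples=["blood pressure monitor", "many AI-enabled diagnostic tools"],
    ),
    "Class III": DeviceClass(
        risk="high (life-sustaining, implanted, or novel mechanism)",
        typical_pathway="premarket approval (PMA) backed by clinical evidence",
        examples=["implantable pacemaker"],
    ),
}

for name, cls in US_DEVICE_CLASSES.items():
    print(f"{name}: {cls.risk} risk -> {cls.typical_pathway}")
```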

SULLIVAN: So medical devices face a pretty prescriptive multi-level testing path before they hit the market. From your vantage point, what are some of the downsides of that system and when does it make the most sense?

MINSSEN: One primary drawback is, of course, the lengthy and expensive approval process. High-risk devices, for example, often undergo years of clinical trials, which can cost millions of dollars, and this can create a significant barrier for startups and small companies with limited resources. And even for moderate-risk devices, the regulatory burden can slow product development and time to the market.

And the approach can also limit flexibility. Prescriptive requirements may not accommodate emerging innovations like digital therapeutics or AI-based diagnostics in a feasible way. And in such cases, the framework can unintentionally stifle innovation by discouraging creative solutions or iterative improvements, which, as a matter of fact, can also put patients at risk when you don’t use new technologies and AI. And additionally, the same level of scrutiny may be applied to low-risk devices, where the extensive testing and documentation may also be disproportionate to the actual patient risk.

However, the prescriptive model is highly appropriate where we have high testing standards for high-risk medical devices, in my view, particularly those that are life-sustaining, implanted, or involve new materials or mechanisms.

I also wanted to say that I think that these higher compliance thresholds can be OK and necessary if you have a system where authorities and stakeholders also have the capacity and funding to enforce, monitor, and achieve compliance with such rules in a feasible, time-effective, and straightforward manner. And this, of course, requires resources, novel solutions, and investments.

SULLIVAN: A range of tests are undertaken across the life cycle of medical devices. How do these testing requirements vary across different stages of development and across various applications?

MINSSEN: Yes, that’s a good question. So I think first it is important to realize that testing is conducted by various entities, including manufacturers, independent third-party laboratories, and regulatory agencies. And it occurs throughout the device life cycle, beginning with iterative testing during the research and development stage, advancing to pre-market evaluations, and continuing into post-market monitoring. And the outcomes of these tests directly impact regulatory approvals, market access, and device design refinements, as well. So the testing results are typically shared with regulatory authorities and in some cases with healthcare providers and the broader public to enhance transparency and trust.

So if you talk about the different phases that play a role here … so let’s turn to the pre-market phase, where manufacturers must demonstrate that the device conforms to safety and performance benchmarks defined by regulatory authorities. Pre-market evaluations include functional bench testing, biocompatibility assessments, and software validation, for example, all of which are integral components of a manufacturer’s submission. 

But, yes, testing also, and we already touched upon that, extends into the post-market phase, where it continues to ensure device safety and efficacy. Post-market surveillance relies on testing to monitor real-world performance and identify emerging risks. By integrating real-world evidence into ongoing assessments, manufacturers can address unforeseen issues, update devices as needed, and maintain compliance with evolving regulatory expectations. And I think this is particularly important in this new generation of medical devices that are AI-enabled or machine learning-enabled.

I think we have to understand that in this field of AI-enabled medical devices, you know, the devices and the algorithms that are working with them can improve over the lifetime of a product. So actually, not only could you assess them and make sure that they remain safe, you could also sometimes lower the risk category by finding evidence that these devices are actually becoming more precise and safer. So it can both heighten the risk category or lower the risk category, and that’s why this continuous testing is so important.

SULLIVAN: Given what you just said, how should regulators handle a device whose algorithm keeps updating itself after approval?

MINSSEN: Well, it has to be an iterative process that is feasible and straightforward and that is based on very efficient, both time-efficient and performance-efficient, communication between the regulatory authorities and the medical device developers, right. We need to have the sensors in place that spot potential changes, and we need to have the mechanisms in place that allow us to quickly react to these changes, both in regulatory terms and technologically. 

So I think communication is important, and we need to have the pathways and the feedback loops in the regulation that quickly allow us to monitor these self-learning algorithms and devices.
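
One way to picture the feedback loop Minssen describes is a simple post-market monitoring check that compares a deployed model's rolling performance against the baseline demonstrated at clearance and flags drift for review. The baseline figure, threshold, and field layout below are invented for illustration, not drawn from any actual regulatory scheme.

```python
# Illustrative post-market performance monitoring for an AI-enabled device.
# Baseline metric, alert margin, and the data format are hypothetical.
from statistics import mean

BASELINE_SENSITIVITY = 0.92   # performance demonstrated at clearance (assumed)
ALERT_MARGIN = 0.03           # tolerated drop before escalation (assumed)

def rolling_sensitivity(outcomes: list[tuple[bool, bool]]) -> float:
    """outcomes: (model_flagged, condition_present) pairs from real-world use."""
    positives = [flagged for flagged, present in outcomes if present]
    return mean(positives) if positives else float("nan")

def review_needed(outcomes: list[tuple[bool, bool]]) -> bool:
    """Flag the device for manufacturer/regulator review if sensitivity drifts."""
    current = rolling_sensitivity(outcomes)
    return current < BASELINE_SENSITIVITY - ALERT_MARGIN

# Hypothetical month of field data: (model_flagged, condition_present).
field_data = [(True, True)] * 85 + [(False, True)] * 15 + [(False, False)] * 400
print(f"rolling sensitivity: {rolling_sensitivity(field_data):.2f}")
print("escalate to review:", review_needed(field_data))
```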

SULLIVAN: It sounds like it’s just … there’s such a delicate balance between advancing technology and really ensuring public safety. You know, if we clamp down too hard, we stifle that innovation. You already touched upon this a bit. But if we’re too lax, we risk unintended consequences. And I’d just love to hear how you think the field is balancing that and any learnings you can share.

MINSSEN: So this is very true, and you just touched upon a very central question also in our research and our writing. And this is also the reason why medical device regulation is so fascinating and continues to evolve in response to rapid advancements in technologies, particularly dual technologies regarding digital health, artificial intelligence, for example, and personalized medicine.

And finding the balance is tricky, because a related major future challenge is the increasing regulatory jungle and the complex interplay between evolving regulatory landscapes that regulate AI more generally.

We really need to make sure that the regulatory authorities that deal with this, that need to find the right balance between promoting innovation and mitigating and preventing risks, have the capacity to do so. So this requires investments, and it also requires new ways to regulate this technology more flexibly, for example through regulatory sandboxes and so on.

SULLIVAN: Could you just expand upon that a bit and double-click on what it is you’re seeing there? What excites you about what’s happening in that space?

MINSSEN: Yes, well, the research of my group at the Center for Advanced Studies in Bioscience Innovation Law is very broad. I mean, we are looking into gene editing technologies. We are looking into new biologics. We are looking into medical devices, as well, obviously, but also other technologies in advanced medical computing.

And what we see across the line here is that there is an increasing demand for more adaptive and flexible regulatory frameworks for these new technologies, in particular when they have new uses — regulations that focus more on the product than on the process. And I have recently written a report, for example, on emerging biotechnologies and bio-solutions for the EU commission. And even in that area, regulatory sandboxes are increasingly important and increasingly being considered.

So this idea of regulatory sandboxes originally developed in the financial sector, and it is now penetrating into other sectors, including synthetic biology, emerging biotechnologies, gene editing, AI, and quantum technology as well. It basically creates an environment where actors can test new ideas in close collaboration with, and under the oversight of, regulatory authorities.

But implementing this in the AI sector also leads us to a lot of questions and challenges. For example, the authorities that are governing, monitoring, and deciding on these regulatory sandboxes need to have the capacity to do so. There are issues relating to competition law, which you call antitrust law in the US, because the question is, who can enter the sandbox and how may they compete after they exit it? And there are many open questions about how we should work with these sandboxes and how we should implement them.

[TRANSITION MUSIC] 

SULLIVAN: Well, Timo, it has just been such a pleasure to speak with you today.

MINSSEN: Yes, thank you very much. 

SULLIVAN: And now I’m happy to introduce Chad Atalla.

Chad is senior applied scientist in Microsoft Research New York City’s Sociotechnical Alignment Center, where they contribute to foundational responsible AI research and practical responsible AI solutions for teams across Microsoft.

Chad, welcome!

CHAD ATALLA: Thank you.

SULLIVAN: So we’ll kick off with a couple of questions just to dive right in. Tell me a little bit more about the Sociotechnical Alignment Center, or STAC. I know it was founded in 2022. I’d love to just learn a little bit more about what the group does, how you’re thinking about evaluating AI, and maybe just give us a sense of some of the projects you’re working on.

ATALLA: Yeah, absolutely. The name is quite a mouthful.

SULLIVAN: It is! [LAUGHS] 

ATALLA: So let’s start by breaking that down and seeing what that means.

SULLIVAN: Great.

ATALLA: So modern AI systems are sociotechnical systems, meaning that the social and technical aspects are deeply intertwined. And we’re interested in aligning the behaviors of these sociotechnical systems with some values. Those could be societal values; they could be regulatory values, organizational values, etc. And to make this alignment happen, we need the ability to evaluate the systems.

So my team is broadly working on an evaluation framework that acknowledges the sociotechnical nature of the technology and the often-abstract nature of the concepts we’re actually interested in evaluating. As you noted, it’s an applied science team, so we split our time between some fundamental research and time to bridge the work into real products across the company. And I also want to note that to power this sort of work, we have an interdisciplinary team drawing upon the social sciences, linguistics, statistics, and, of course, computer science.

SULLIVAN: Well, I’m eager to get into our takeaways from the conversation with both Daniel and Timo. But maybe just to double-click on this for a minute, can you talk a bit about some of the overarching goals of the AI evaluations that you noted? 

ATALLA: So evaluation is really the act of making valuative judgments based on some evidence, and in the case of AI evaluation, that evidence might come from tests or measurements, right. And the reason we’re doing this in the first place is, most often, to make decisions and claims.

So perhaps I am going to make a claim about a model that I’m producing, and I want to say that it’s better than this other model. Or we are asking whether a certain product is safe to ship. All of these decisions need to be informed by good evaluation and therefore good measurement or testing. And I’ll also note that in the regulatory conversation, risk is often what we want to evaluate. So that is a goal in and of itself. And I’ll touch more on that later.

SULLIVAN: I read a recent paper that you put out with some of our colleagues from Microsoft Research, the University of Michigan, and Stanford, arguing that evaluating generative AI is a social science measurement challenge. Maybe for those who haven’t read the paper, what does this mean? And can you tell us a little bit more about what motivated you and your coauthors?

ATALLA: So the measurement tasks involved in evaluating generative AI systems are often abstract and contested. That means they cannot be directly measured and must instead be measured indirectly via other observable phenomena. This is very different from the older machine learning paradigm, where, let’s say, I had a system that took a picture of a traffic light and told you whether it was green, yellow, or red at a given time.

If we wanted to evaluate that system, the task is much simpler. But modern generative AI systems are general purpose and have open-ended output, and the language in a whole chat or across multiple paragraphs of output can have a lot of different properties. And because these are general-purpose systems, we don’t know exactly what task they’re supposed to be carrying out.

So then the question becomes, if I want to make some decision or claim—maybe I want to make a claim that this system has human-level reasoning capabilities—well, what does that mean? Do I have the same impression of what that means as you do? And how do we know whether the downstream, you know, measurements and tests that I’m conducting actually will support my notion of what it means to have human-level reasoning, right? Difficult questions. But luckily, social scientists have been dealing with these exact sorts of challenges for multiple decades in fields like education, political science, and psychometrics. So we’re really attempting to avoid reinventing the wheel here and trying to learn from their past methodologies.

And so the rest of the paper goes on to delve into a four-level framework, a measurement framework, that’s grounded in the measurement theory from the quantitative social sciences that takes us all the way from these abstract and contested concepts through processes to get much clearer and eventually reach reliable and valid measurements that can power our evaluations.
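To make the idea of moving from an abstract concept to concrete measurements a bit more tangible, here is a minimal Python sketch. The concept name, indicators, rubric, and scores below are invented for illustration; this is not the four-level framework from the paper itself, only a toy version of the general progression it describes.

```python
# Minimal sketch: moving from an abstract concept to an aggregated measurement.
# The concept name, indicators, and annotations below are hypothetical examples,
# not the framework described in the paper.

from statistics import mean

# 1. Abstract concept we care about (contested, not directly measurable).
concept = "helpfulness"

# 2. Systematize it into concrete, observable indicators.
indicators = ["answers the question asked", "gives actionable detail", "avoids factual errors"]

# 3. Annotate each system output against each indicator (0 = absent, 1 = present).
#    In practice these labels would come from trained annotators or a vetted auto-rater.
annotations = [
    {"answers the question asked": 1, "gives actionable detail": 1, "avoids factual errors": 1},
    {"answers the question asked": 1, "gives actionable detail": 0, "avoids factual errors": 1},
    {"answers the question asked": 0, "gives actionable detail": 0, "avoids factual errors": 1},
]

# 4. Aggregate instance-level annotations into a single measurement for the concept.
per_output_scores = [mean(a[i] for i in indicators) for a in annotations]
concept_score = mean(per_output_scores)

print(f"{concept} score: {concept_score:.2f}")  # about 0.67 on this toy data
```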

SULLIVAN: I love that. I mean, that’s the whole point of this podcast, too, right? To really build on the learnings and frameworks that we’re taking from industries that have been thinking about this for much longer. Maybe from your vantage point, what are some of the biggest day-to-day hurdles in building solid AI evaluations? Do we need more shared standards, or are bespoke methods the way to go? I would love to just hear your thoughts on that.

ATALLA: So let’s talk about some of those practical challenges. And I want to briefly go back to what I mentioned about risk before, all right. Oftentimes, some of the regulatory environment is requiring practitioners to measure the risk involved in deploying one of their models or AI systems. Now, risk is importantly a concept that includes both event and impact, right. So there’s the probability of some event occurring. For the case of AI evaluation, perhaps this is us seeing a certain AI behavior exhibited. Then there’s also the severity of the impacts, and this is a complex chain of effects in the real world that happen to people, organizations, systems, etc., and it’s a lot more challenging to observe the impacts, right.

So if we’re saying that we need to measure risk, we have to measure both the event and the impacts. But realistically, right now, the field is not doing a very good job of actually measuring the impacts. This requires vastly different techniques and methodologies where if I just wanted to measure something about the event itself, I can, you know, do that in a technical sandbox environment and perhaps have some automated methods to detect whether a certain AI behavior is being exhibited. But if I want to measure the impacts? Now, we’re in the realm of needing to have real people involved, and perhaps a longitudinal study where you have interviews, questionnaires, and more qualitative evidence-gathering techniques to truly understand the long-term impacts. So that’s a significant challenge.

Another is that, you know, let’s say we forget about the impacts for now and we focus on the event side of things. Still, we need datasets, we need annotations, and we need metrics to make this whole thing work. When I say we need datasets, if I want to test whether my system has good mathematical reasoning, what questions should I ask? What are my set of inputs that are relevant? And then when I get the response from the system, how do I annotate them? How do I know if it was a good response that did demonstrate mathematical reasoning or if it was a mediocre response? And then once I have an annotation of all of these outputs from the AI system, how do I aggregate those all up into a single informative number?
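Atalla’s breakdown of risk into event and impact, and of the event side into datasets, annotations, and an aggregated metric, can be sketched in a few lines. The prompts, labels, and severity weight below are hypothetical placeholders, not a real evaluation.

```python
# Toy sketch of the event side of risk measurement: a dataset of inputs,
# an annotation step, aggregation into a single number, and then a rough
# risk score that combines event frequency with an assumed impact severity.
# All prompts, labels, and weights here are hypothetical.

dataset = ["Prompt A", "Prompt B", "Prompt C", "Prompt D"]

# Pretend annotations: did the system exhibit the behavior of concern on each input?
# (1 = behavior observed, 0 = not observed). In practice this is the hard part:
# writing guidelines and getting reliable human or automated labels.
behavior_observed = [0, 1, 0, 0]

# Aggregate annotations into a single metric: the event rate.
event_rate = sum(behavior_observed) / len(dataset)

# Risk also needs an impact term. Here we simply assume a severity weight on a 0-1 scale;
# measuring real-world impact would require longitudinal, human-centered studies.
assumed_severity = 0.8

risk_score = event_rate * assumed_severity
print(f"event rate = {event_rate:.2f}, assumed severity = {assumed_severity}, risk ≈ {risk_score:.2f}")
```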

SULLIVAN: Earlier in this episode, we heard Daniel and Timo walk through the regulatory frameworks in pharma and medical devices. I’d be curious what pieces of those mature systems are already showing up or at least may be bubbling up in AI governance.

ATALLA: Great question. You know, Timo was talking about the pre-market and post-market testing difference. Of course, this is similarly important in the AI evaluation space. But again, these have different methodologies and serve different purposes.

So within the pre-deployment phase, we don’t have evidence of how people are going to use the system. And when we have these general-purpose AI systems, to understand what the risks are, we really need to have a sense of what might happen and how they might be used. So there are significant challenges there where I think we can learn from other fields and how they do pre-market testing. And the difference in that pre- versus post-market testing also ties to testing at different stages in the life cycle.

For AI systems, we already see some regulations saying you need to start with the base model and do some evaluation of the base model, some basic attributes, some core attributes, of that base model before you start putting it into any real products. But once we have a product in mind, we have a user base in mind, we have a specific task—like maybe we’re going to integrate this model into Outlook and it’s going to help you write emails—now we suddenly have a much crisper picture of how the system will interact with the world around it. And again, at that stage, we need to think about another round of evaluation.

Another part that jumped out to me in what they were saying about pharmaceuticals is that sometimes approvals can be based on surrogate endpoints. So this is like we’re choosing some heuristic. Instead of measuring the long-term impact, which is what we actually care about, perhaps we have a proxy that we feel like is a good enough indicator of what that long-term impact might look like.  

This is occurring in the AI evaluation space right now and is often perhaps even the default here since we’re not seeing that many studies of the long-term impact itself. We are seeing, instead, folks constructing these heuristics or proxies and saying if I see this behavior happen, I’m going to assume that it indicates this sort of impact will happen downstream. And that’s great. It’s one of the techniques that was used to speed up and reduce the barrier to innovation in the other fields. And I think it’s great that we are applying that in the AI evaluation space. But special care is, of course, needed to ensure that those heuristics and proxies you’re using are reasonable indicators of the greater outcome you’re looking for.
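As a rough illustration of the surrogate-endpoint idea in an AI setting, the following sketch gates a release decision on a proxy behavior rate while stating the proxy-to-impact assumption explicitly. The proxy, threshold, and counts are all invented.

```python
# Sketch of a surrogate-endpoint style check: instead of measuring the long-term
# impact we actually care about, we gate on a proxy behavior rate and make the
# assumption linking proxy to impact explicit. The proxy name and threshold are invented.

PROXY_ASSUMPTION = (
    "If the model produces confident medical advice without caveats in more than "
    "2% of sampled health queries, we assume elevated risk of downstream harm."
)

def passes_surrogate_gate(flagged: int, sampled: int, threshold: float = 0.02) -> bool:
    """Return True if the observed proxy rate stays under the assumed threshold."""
    proxy_rate = flagged / sampled
    return proxy_rate <= threshold

print(PROXY_ASSUMPTION)
print("Gate passed:", passes_surrogate_gate(flagged=7, sampled=500))  # 1.4% -> True
```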

SULLIVAN: What are some of the promising ideas from maybe pharma or med device regulation that maybe haven’t made it to AI testing yet and maybe should? And where would you urge technologists, policymakers, and researchers to focus their energy next?

ATALLA: Well, one of the key things that jumped out to me in the discussion about pharmaceuticals was driving home the emphasis that there is a holistic focus on safety and efficacy. These go hand in hand and decisions must be made while considering both pieces of the picture. I would like to see that further emphasized in the AI evaluation space.

Often, we are seeing evaluations of risk being separated from evaluations of performance or quality or efficacy, but neither piece of the puzzle on its own is enough for us to make informed decisions. And that ties back into my desire to really also see us measuring the impacts.

So we see Phase 3 trials as something that occurs in the medical devices and pharmaceuticals field. That’s not something that we are doing an equivalent of in the AI evaluation space at this time. These are really cost intensive. They can last years and really involve careful monitoring of that holistic picture of safety and efficacy. And realistically, we are not going to be able to put that on the critical path to getting specific individual AI models or AI systems vetted before they go out into the world. However, I would love to see a world in which this sort of work is prioritized and funded or required. Think of how, with social media, it took quite a long time for us to understand that there are some long-term negative impacts on mental health, and we have the opportunity now, while the AI wave is still building, to start prioritizing and funding this sort of work. Let it run in the background and as soon as possible develop a good understanding of the subtle, long-term effects.

More broadly, I would love to see us focus on reliability and validity of the evaluations we’re conducting because trust in these decisions and claims is important. If we don’t focus on building reliable, valid, and trustworthy evaluations, we’re just going to continue to be flooded by a bunch of competing, conflicting, and largely meaningless AI evaluations.

SULLIVAN: In a number of the discussions we’ve had on this podcast, we’ve talked about how it’s not just one entity that needs to ensure safety across the board, and I’d just love to hear how you think about some of those ecosystem collaborations, whether we see ourselves as more of a platform company or look at the places where these AI models are being deployed at the application level. Tell me a little bit about how you think about, sort of, the stakeholders in that mix and where responsibility lies across the board.

ATALLA: It’s interesting. In this age of general-purpose AI technologies, we’re often seeing one company or organization being responsible for building the foundational model. And then many, many other people will take that model and build it into specific products that are designed for specific tasks and contexts.

Of course, in that, we already see that there is a responsibility of the owners of that foundational model to do some testing of the central model before they distribute it broadly. And then again, there is responsibility of all of the downstream individuals digesting that and turning it into products to consider the specific contexts that they are deploying into and how that may affect the risks we’re concerned with or the types of quality and safety and performance we need to evaluate.

Again, because that field of risks we may be concerned with is so broad, some of them also require an immense amount of expertise. Let’s think about whether AI systems can enable people to create dangerous chemicals or dangerous weapons at home. It’s not that every AI practitioner is going to have the knowledge to evaluate this, so in some of those cases, we really need third-party experts, people who are experts in chemistry, biology, etc., to come in and evaluate certain systems and models for those specific risks, as well.

So I think there are many reasons why multiple stakeholders need to be involved, partly from who owns what and is responsible for what and partly from the perspective of who has the expertise to meaningfully construct the evaluations that we need.

SULLIVAN: Well, Chad, this has just been great to connect, and in a few of our discussions, we’ve done a bit of a lightning round, so I’d love to just hear your 30-second responses to a few of these questions. Perhaps favorite evaluation you’ve run so far this year? 

ATALLA: So I’ve been involved in trying to evaluate some language models for whether they infer sensitive attributes about people. So perhaps you’re chatting with a chatbot, and it infers your religion or sexuality based on things you’re saying or how you sound, right. And in working to evaluate this, we encounter a lot of interesting questions. Like, what is a sensitive attribute? What makes these attributes sensitive, and what are the differences that make it inappropriate for an AI system to infer these things about a person, whereas realistically, whenever I meet a person on the street, my brain is immediately forming first impressions and some assumptions about them? So it’s a very interesting and thought-provoking evaluation to conduct, and it makes you think about the norms we place upon people interacting with other people and the norms we place upon AI systems interacting with people.

SULLIVAN: That’s fascinating! I’d love to hear the AI buzzword you’d retire tomorrow. [LAUGHTER]

ATALLA: I would love to see the term “bias” being used less when referring to fairness-related issues and systems. Bias happens to be a highly overloaded term in statistics and machine learning and has a lot of technical meanings and just fails to perfectly capture what we mean in the AI risk sense.

SULLIVAN: And last one. One metric we’re not tracking enough.

ATALLA: I would say over-blocking, and this comes into that connection between the holistic picture of safety and efficacy. It’s too easy to produce systems that throw safety to the wind and focus purely on utility or achieving some goal, but simultaneously, the other side of the picture is possible, where we can clamp down too hard and reduce the utility of our systems and block even benign and useful outputs just because they border on something sensitive. So it’s important for us to track that over-blocking and actively track that tradeoff between safety and efficacy.
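Tracking over-blocking alongside safety implies reporting two complementary rates: how often harmful prompts slip through and how often benign prompts are wrongly refused. A minimal sketch, with made-up counts:

```python
# Track the safety/utility tradeoff with two complementary rates.
# The harmful and benign test sets and the refusal counts are hypothetical.

def rates(refused_on_harmful, n_harmful, refused_on_benign, n_benign):
    """Return (miss rate on harmful prompts, over-block rate on benign prompts)."""
    miss_rate = 1 - refused_on_harmful / n_harmful      # harmful content not blocked
    over_block_rate = refused_on_benign / n_benign      # benign content wrongly blocked
    return miss_rate, over_block_rate

miss, over_block = rates(refused_on_harmful=92, n_harmful=100,
                         refused_on_benign=6, n_benign=200)
print(f"miss rate = {miss:.2f}, over-block rate = {over_block:.2f}")  # 0.08 and 0.03
```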

SULLIVAN: Yeah, we talk a lot about this on the podcast, too, of how do you both make things safe but also ensure innovation can thrive, and I think you hit the nail on the head with that last piece.

[MUSIC] 

Well, Chad, this was really terrific. Thanks for joining us and thanks for your work and your perspectives. And another big thanks to Daniel and Timo for setting the stage earlier in the podcast.

And to our listeners, thanks for tuning in. You can find resources related to this podcast in the show notes. And if you want to learn more about how Microsoft approaches AI governance, you can visit microsoft.com/RAI. 

See you next time! 

[MUSIC FADES]

AI Research

Cyber Command creates new AI program in fiscal 2026 budget

U.S. Cyber Command’s budget request for fiscal 2026 includes funding to begin a new project specifically for artificial intelligence.

While the budget proposal would allot just $5 million for the effort — a small portion of Cybercom’s $1.3 billion research and development spending plan — the stand-up of the program follows congressional direction to prod the command to develop an AI roadmap.

In the fiscal 2023 defense policy bill, Congress charged Cybercom and the Department of Defense chief information officer — in coordination with the chief digital and artificial intelligence officer, director of the Defense Advanced Research Projects Agency, director of the National Security Agency and the undersecretary of defense for research and engineering — to jointly develop a five-year guide and implementation plan for rapidly adopting and acquiring AI systems, applications, supporting data and data management processes for cyber operations forces.

Cybercom created its roadmap shortly thereafter along with an AI task force.

The new project within Cybercom’s R&D budget aims to develop core data standards, curate and tag collected data that meet those standards so it can be effectively integrated into AI and machine learning solutions, and more efficiently develop artificial intelligence capabilities to meet operational needs.

The effort is directly related to the task of furthering the roadmap.

As a result of that roadmap, the command decided to house its task force within its elite Cyber National Mission Force.  

The command created the program by pulling funds from its operations and maintenance budget and moving them to the R&D budget from fiscal 2025 to fiscal 2026.

The command outlined five categories of various AI applications across its enterprise and other organizations, including vulnerabilities and exploits; network security, monitoring, and visualization; modeling and predictive analytics; persona and identity; and infrastructure and transport.

Specifically, the command’s AI project, Artificial Intelligence for Cyberspace Operations, will aim to develop and conduct pilots while investing in infrastructure to leverage commercial AI capabilities. The command’s Cyber Immersion Laboratory will develop, test and evaluate cyber capabilities and support operational assessments performed by third parties, the budget documents state.

In fiscal 2026, the command plans to spend the $5 million to support the CNMF in piloting AI technologies through an agile 90-day pilot cycle, according to the documents, which will ensure quick success or failure. That fast-paced methodology allows the CNMF to quickly test and validate solutions against operational use cases with flexibility to adapt to evolving cyber threats.

The CNMF will also look to explore ways to improve threat detection, automate data analysis, and enhance decision-making processes in cyber operations, according to budget documents.


Written by Mark Pomerleau

Mark Pomerleau is a senior reporter for DefenseScoop, covering information warfare, cyber, electronic warfare, information operations, intelligence, influence, battlefield networks and data.


AI Research

How artificial intelligence is transforming feline medicine

A powerful survival instinct that cats have is the ability to hide pain. As both predators and prey, cats evolved to mask signs of illness or injury to avoid being seen as vulnerable. In the wild, this meant safety. In clinical practice today, it often means late diagnoses and missed opportunities for early care. Cats are subtle when they’re unwell, and this contributes to one of the most persistent challenges in veterinary medicine: feline under-medicalization.

The feline care gap

Although an estimated 74 million domesticated cats are in the US, only 40% receive annual veterinary care, compared to 82% of dogs.1,2 According to the CATalyst Council, the current feline veterinary market is valued at $12 billion; however, if cat utilization matched that of dogs, the market could expand to $32 billion, representing an untapped $20 billion opportunity.3 Encouragingly, feline visits and revenue have grown even as other segments have plateaued. Yet detecting what cats instinctively hide continues to be the greatest barrier.

Understanding silent symptoms

Illness in cats rarely begins with dramatic signs. Instead, it starts with small behavioral shifts:

  • A subtle decline in jumping
  • Changes in sleep cycles
  • Less frequent or reduced play behavior
  • Changes in eating and drinking behavior
  • Altered litter box behavior and usage
  • Reduced grooming or altered body posture
  • Withdrawal or increased nighttime activity

Because cats tend to suppress these signs, especially in stressful environments like veterinary clinics, they often go unnoticed. This creates a species-specific blind spot in veterinary medicine.

Aligning with feline instincts to close the care gap

The medicalization gap isn’t just a matter of caregiver hesitation or access; it’s rooted in feline biology. Cats have evolved to hide signs of vulnerability. To close the gap, veterinary medicine must find ways to work with this reality, not against it.

This is where artificial intelligence (AI)-powered tools come in. By observing cats continuously in their natural, stress-free environment, new technologies are surfacing subtle early changes that humans alone often miss. These tools don’t force cats to communicate; they interpret what cats are already expressing through behavior.

Emerging tools that translate the subtle

One innovation in this space is Moggie, a behavior-tracking wearable explicitly designed for cats. Moggie monitors core feline behaviors, including walking, jumping, resting, grooming, and play, by using AI to detect deviations from each cat’s individual baseline. It helps surface meaningful behavioral changes that could indicate early illness or discomfort.

This technology is passive, non-invasive, and designed for continuous use in the cat’s natural habitat, where stress levels are low and instinctual behaviors are most pronounced.
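The baseline-deviation idea described above can be illustrated with a simple per-cat anomaly check; the daily jump counts and the z-score threshold below are invented, and commercial wearables presumably use far richer models than this sketch.

```python
# Toy per-cat baseline check: flag a behavior when today's count drifts far
# from that cat's own rolling baseline. Counts and threshold are made up.

from statistics import mean, stdev

def flag_deviation(history, today, z_threshold=2.0):
    """Return True if today's value deviates more than z_threshold SDs from this cat's baseline."""
    baseline_mean = mean(history)
    baseline_sd = stdev(history) or 1.0  # avoid division by zero on flat histories
    z = (today - baseline_mean) / baseline_sd
    return abs(z) > z_threshold

# Daily jump counts for one cat over two weeks, then a noticeably lower day.
jump_history = [41, 38, 44, 40, 39, 42, 43, 37, 40, 41, 39, 42, 38, 40]
print(flag_deviation(jump_history, today=22))  # True: a drop worth mentioning to the vet
```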

Early patterns from caregivers and clinics using AI-driven behavior tools have shown promise:

  • Decreased jumping flagged early-stage arthritis
  • Fragmented sleep and pacing pointed toward hyperthyroidism
  • Reduced grooming revealed underlying dental pain

These insights don’t replace veterinary exams; they complement them, giving caregivers and clinicians richer, continuous behavioral context between visits.

From silence to signals

By equipping caregivers with objective insights and reducing reliance on subjective observation alone, AI-powered tools can:

  • Prompt earlier veterinary visits
  • Improve chronic disease monitoring
  • Strengthen caregiver-clinic communication
  • Increase trust and compliance over time

This approach aligns directly with the CATalyst Council’s call for proactive, tech-enabled, and cat-centric models of care that meet cats on their terms, not just in exam rooms.4

Cats will always hide pain, but that doesn’t mean we have to miss it. With AI, veterinary medicine is learning to detect the subtle signals cats have been conveying to us all along.

References

  1. American Veterinary Medical Association. U.S. Pet Ownership Statistics. American Veterinary Medical Association. Published 2024. https://www.avma.org/resources-tools/reports-statistics/us-pet-ownership-statistics
  2. 2025 Hill’s Pet Nutrition World of the cat report. Hill’s Pet Nutrition. Accessed July 2, 2025. https://na.hillsvna.com/en_US/resources-2/view/244
  3. CATalyst Council Releases First 2025 Market Insights Report: Feline Veterinary Care Emerges as Industry Growth Driver. News release. CATalyst Council. April 30, 2025. Accessed July 2, 2025. https://catalystcouncil.org/catalyst-council-releases-first-2025-market-insights-report-feline-veterinary-care-emerges-as-industry-growth-driver/
  4. Garrison G. The Feline Factor. Vet Advantage. Published May 2025. Accessed July 2, 2025. https://vet-advantage.com/vet-advantage/the-feline-factor/


