AI Research
How we built our multi-agent research system
Claude now has Research capabilities that allow it to search across the web, Google Workspace, and any integrations to accomplish complex tasks.
The journey of this multi-agent system from prototype to production taught us critical lessons about system architecture, tool design, and prompt engineering. A multi-agent system consists of multiple agents (LLMs autonomously using tools in a loop) working together. Our Research feature involves an agent that plans a research process based on user queries, and then uses tools to create parallel agents that search for information simultaneously. Systems with multiple agents introduce new challenges in agent coordination, evaluation, and reliability.
This post breaks down the principles that worked for us—we hope you’ll find them useful to apply when building your own multi-agent systems.
Benefits of a multi-agent system
Research work involves open-ended problems where it’s very difficult to predict the required steps in advance. You can’t hardcode a fixed path for exploring complex topics, as the process is inherently dynamic and path-dependent. When people conduct research, they tend to continuously update their approach based on discoveries, following leads that emerge during investigation.
This unpredictability makes AI agents particularly well-suited for research tasks. Research demands the flexibility to pivot or explore tangential connections as the investigation unfolds. The model must operate autonomously for many turns, making decisions about which directions to pursue based on intermediate findings. A linear, one-shot pipeline cannot handle these tasks.
The essence of search is compression: distilling insights from a vast corpus. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent. Each subagent also provides separation of concerns—distinct tools, prompts, and exploration trajectories—which reduces path dependency and enables thorough, independent investigations.
Once intelligence reaches a threshold, multi-agent systems become a vital way to scale performance. For instance, although individual humans have become more intelligent in the last 100,000 years, human societies have become exponentially more capable in the information age because of our collective intelligence and ability to coordinate. Even generally-intelligent agents face limits when operating as individuals; groups of agents can accomplish far more.
Our internal evaluations show that multi-agent research systems excel especially for breadth-first queries that involve pursuing multiple independent directions simultaneously. We found that a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval. For example, when asked to identify all the board members of the companies in the Information Technology S&P 500, the multi-agent system found the correct answers by decomposing this into tasks for subagents, while the single-agent system failed to find the answer with slow, sequential searches.
Multi-agent systems work mainly because they help spend enough tokens to solve the problem. In our analysis, three factors explained 95% of the performance variance in the BrowseComp evaluation (which tests the ability of browsing agents to locate hard-to-find information). We found that token usage by itself explains 80% of the variance, with the number of tool calls and the model choice as the two other explanatory factors. This finding validates our architecture that distributes work across agents with separate context windows to add more capacity for parallel reasoning. The latest Claude models act as large efficiency multipliers on token use, as upgrading to Claude Sonnet 4 is a larger performance gain than doubling the token budget on Claude Sonnet 3.7. Multi-agent architectures effectively scale token usage for tasks that exceed the limits of single agents.
There is a downside: in practice, these architectures burn through tokens fast. In our data, agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats. For economic viability, multi-agent systems require tasks whose value is high enough to pay for the increased token use. Further, some domains that require all agents to share the same context or involve many dependencies between agents are not a good fit for multi-agent systems today. For instance, most coding tasks involve fewer truly parallelizable tasks than research, and LLM agents are not yet great at coordinating and delegating to other agents in real time. We’ve found that multi-agent systems excel at valuable tasks that involve heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools.
Architecture overview for Research
Our Research system uses a multi-agent architecture with an orchestrator-worker pattern, where a lead agent coordinates the process while delegating to specialized subagents that operate in parallel.
When a user submits a query, the lead agent analyzes it, develops a strategy, and spawns subagents to explore different aspects simultaneously. The subagents act as intelligent filters, iteratively using search tools to gather information (in this example, on AI agent companies in 2025) and then returning a list of companies to the lead agent so it can compile a final answer.
Traditional approaches using Retrieval Augmented Generation (RAG) use static retrieval. That is, they fetch some set of chunks that are most similar to an input query and use these chunks to generate a response. In contrast, our architecture uses a multi-step search that dynamically finds relevant information, adapts to new findings, and analyzes results to formulate high-quality answers.
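To make the orchestrator-worker flow concrete, here is a minimal Python sketch of the control loop described above. The model and search calls are stubs rather than any real API, and the three-subagent split is an illustrative assumption; the point is the shape of the loop: plan, fan out in parallel, search iteratively, condense, compile.

```python
# Minimal sketch of the orchestrator-worker pattern: a lead agent plans subtasks,
# subagents search iteratively in parallel with separate contexts, and the lead
# compiles the condensed results. All calls below are placeholder stubs.
import asyncio

async def call_model(prompt: str) -> str:
    """Stub for an LLM call; a real system would call a model client here."""
    return f"model-output({prompt[:40]}...)"

async def search(query: str) -> list[str]:
    """Stub for a search tool; returns placeholder documents."""
    return [f"doc about {query}"]

async def run_subagent(task: str, max_steps: int = 3) -> str:
    """Each subagent searches iteratively, then condenses its findings."""
    findings: list[str] = []
    query = task
    for _ in range(max_steps):
        findings.extend(await search(query))
        # Let the model pick the next, narrower query based on what it found.
        query = await call_model(f"Refine the search for: {task}\nFindings so far: {findings}")
    return await call_model(f"Summarize the key findings for: {task}\n{findings}")

async def run_lead_agent(user_query: str) -> str:
    # 1. The lead agent decomposes the query into independent subtasks.
    plan = await call_model(f"Split into independent research subtasks: {user_query}")
    subtasks = [f"{user_query} (angle {i + 1})" for i in range(3)]  # stand-in for parsing `plan`
    # 2. Subagents run in parallel, each with its own context window.
    summaries = await asyncio.gather(*(run_subagent(t) for t in subtasks))
    # 3. The lead agent compiles a final answer from the condensed summaries.
    return await call_model(f"Compile a final answer to '{user_query}' from: {summaries}")

if __name__ == "__main__":
    print(asyncio.run(run_lead_agent("Which AI agent companies launched products in 2025?")))
```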

Prompt engineering and evaluations for research agents
Multi-agent systems have key differences from single-agent systems, including a rapid growth in coordination complexity. Early agents made errors like spawning 50 subagents for simple queries, scouring the web endlessly for nonexistent sources, and distracting each other with excessive updates. Since each agent is steered by a prompt, prompt engineering was our primary lever for improving these behaviors. Below are some principles we learned for prompting agents:
- Think like your agents. To iterate on prompts, you must understand their effects. To help us do this, we built simulations using our Console with the exact prompts and tools from our system, then watched agents work step-by-step. This immediately revealed failure modes: agents continuing when they already had sufficient results, using overly verbose search queries, or selecting incorrect tools. Effective prompting relies on developing an accurate mental model of the agent, which can make the most impactful changes obvious.
- Teach the orchestrator how to delegate. In our system, the lead agent decomposes queries into subtasks and describes them to subagents. Each subagent needs an objective, an output format, guidance on the tools and sources to use, and clear task boundaries (a minimal sketch of such a task description follows this list). Without detailed task descriptions, agents duplicate work, leave gaps, or fail to find necessary information. We started by allowing the lead agent to give simple, short instructions like ‘research the semiconductor shortage,’ but found these instructions were often vague enough that subagents misinterpreted the task or performed the exact same searches as other agents. For instance, one subagent explored the 2021 automotive chip crisis while two others duplicated work investigating current 2025 supply chains, with no effective division of labor.
- Scale effort to query complexity. Agents struggle to judge appropriate effort for different tasks, so we embedded scaling rules in the prompts. Simple fact-finding requires just 1 agent with 3-10 tool calls, direct comparisons might need 2-4 subagents with 10-15 calls each, and complex research might use more than 10 subagents with clearly divided responsibilities. These explicit guidelines help the lead agent allocate resources efficiently and prevent overinvestment in simple queries, which was a common failure mode in our early versions.
- Tool design and selection are critical. Agent-tool interfaces are as critical as human-computer interfaces. Using the right tool is efficient—often, it’s strictly necessary. For instance, an agent searching the web for context that only exists in Slack is doomed from the start. With MCP servers that give the model access to external tools, this problem compounds, as agents encounter unseen tools with descriptions of wildly varying quality. We gave our agents explicit heuristics: for example, examine all available tools first, match tool usage to user intent, search the web for broad external exploration, or prefer specialized tools over generic ones. Bad tool descriptions can send agents down completely wrong paths, so each tool needs a distinct purpose and a clear description.
- Let agents improve themselves. We found that the Claude 4 models can be excellent prompt engineers. When given a prompt and a failure mode, they are able to diagnose why the agent is failing and suggest improvements. We even created a tool-testing agent—when given a flawed MCP tool, it attempts to use the tool and then rewrites the tool description to avoid failures. By testing the tool dozens of times, this agent found key nuances and bugs. This process for improving tool ergonomics resulted in a 40% decrease in task completion time for future agents using the new description, because they were able to avoid most mistakes.
- Start wide, then narrow down. Search strategy should mirror expert human research: explore the landscape before drilling into specifics. Agents often default to overly long, specific queries that return few results. We counteracted this tendency by prompting agents to start with short, broad queries, evaluate what’s available, then progressively narrow focus.
- Guide the thinking process. Extended thinking mode, which leads Claude to output additional tokens in a visible thinking process, can serve as a controllable scratchpad. The lead agent uses thinking to plan its approach, assessing which tools fit the task, determining query complexity and subagent count, and defining each subagent’s role. Our testing showed that extended thinking improved instruction-following, reasoning, and efficiency. Subagents also plan, then use interleaved thinking after tool results to evaluate quality, identify gaps, and refine their next query. This makes subagents more effective in adapting to any task.
- Parallel tool calling transforms speed and performance. Complex research tasks naturally involve exploring many sources. Our early agents executed sequential searches, which was painfully slow. For speed, we introduced two kinds of parallelization: (1) the lead agent spins up 3-5 subagents in parallel rather than serially; (2) the subagents use 3+ tools in parallel. These changes cut research time by up to 90% for complex queries, allowing Research to do more work in minutes instead of hours while covering more information than other systems.
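As referenced in the delegation item above, here is a minimal sketch of how a structured task description and explicit effort-scaling rules might look. The field names and numeric thresholds are illustrative assumptions, not a production schema; they only show how an objective, output format, tool guidance, boundaries, and a call budget can be made explicit for each subagent.

```python
# Sketch of a subagent task description plus effort-scaling heuristics.
# All fields and numbers are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class SubagentTask:
    objective: str            # what the subagent should find out
    output_format: str        # e.g. "bulleted list of companies with sources"
    tool_guidance: list[str]  # which tools and sources to prefer
    boundaries: str           # what is explicitly out of scope
    max_tool_calls: int = 10  # budget to discourage over-searching

# Effort-scaling heuristics of the kind embedded in the orchestrator prompt,
# expressed here as data: (subagent count, tool calls per subagent).
SCALING_RULES = {
    "simple_fact": (1, 5),
    "comparison": (3, 12),
    "deep_research": (10, 15),
}

def plan_tasks(query: str, complexity: str) -> list[SubagentTask]:
    n_agents, calls_each = SCALING_RULES[complexity]
    return [
        SubagentTask(
            objective=f"Investigate one distinct aspect of: {query}",
            output_format="concise summary with cited sources",
            tool_guidance=["web_search for external info", "prefer primary sources"],
            boundaries="do not duplicate the other subagents' angles",
            max_tool_calls=calls_each,
        )
        for _ in range(n_agents)
    ]

print(len(plan_tasks("semiconductor shortage: 2021 vs 2025", "comparison")))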
Our prompting strategy focuses on instilling good heuristics rather than rigid rules. We studied how skilled humans approach research tasks and encoded these strategies in our prompts—strategies like decomposing difficult questions into smaller tasks, carefully evaluating the quality of sources, adjusting search approaches based on new information, and recognizing when to focus on depth (investigating one topic in detail) vs. breadth (exploring many topics in parallel). We also proactively mitigated unintended side effects by setting explicit guardrails to prevent the agents from spiraling out of control. Finally, we focused on a fast iteration loop with observability and test cases.
Effective evaluation of agents
Good evaluations are essential for building reliable AI applications, and agents are no different. However, evaluating multi-agent systems presents unique challenges. Traditional evaluations often assume that the AI follows the same steps each time: given input X, the system should follow path Y to produce output Z. But multi-agent systems don’t work this way. Even with identical starting points, agents might take completely different valid paths to reach their goal. One agent might search three sources while another searches ten, or they might use different tools to find the same answer. Because we don’t always know what the right steps are, we usually can’t just check if agents followed the “correct” steps we prescribed in advance. Instead, we need flexible evaluation methods that judge whether agents achieved the right outcomes while also following a reasonable process.
Start evaluating immediately with small samples. In early agent development, changes tend to have dramatic impacts because there is abundant low-hanging fruit. A prompt tweak might boost success rates from 30% to 80%. With effect sizes this large, you can spot changes with just a few test cases. We started with a set of about 20 queries representing real usage patterns. Testing these queries often allowed us to clearly see the impact of changes. We often hear that AI developer teams delay creating evals because they believe that only large evals with hundreds of test cases are useful. However, it’s best to start with small-scale testing right away with a few examples, rather than delaying until you can build more thorough evals.
LLM-as-judge evaluation scales when done well. Research outputs are difficult to evaluate programmatically, since they are free-form text and rarely have a single correct answer. LLMs are a natural fit for grading outputs. We used an LLM judge that evaluated each output against criteria in a rubric: factual accuracy (do claims match sources?), citation accuracy (do the cited sources match the claims?), completeness (are all requested aspects covered?), source quality (did it use primary sources over lower-quality secondary sources?), and tool efficiency (did it use the right tools a reasonable number of times?). We experimented with multiple judges to evaluate each component, but found that a single LLM call with a single prompt outputting scores from 0.0-1.0 and a pass-fail grade was the most consistent and aligned with human judgements. This method was especially effective when the eval test cases did have a clear answer, and we could use the LLM judge to simply check if the answer was correct (i.e. did it accurately list the pharma companies with the top 3 largest R&D budgets?). Using an LLM as a judge allowed us to scalably evaluate hundreds of outputs.
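As a rough illustration of the single-call judge described above, the sketch below asks one (stubbed) model call for rubric scores from 0.0 to 1.0 plus a pass/fail verdict and parses the result. The prompt wording, JSON schema, and aggregation rule are assumptions made for illustration.

```python
# Sketch of a single-prompt LLM judge with a rubric; the judge call is a stub.
import json

RUBRIC = ["factual_accuracy", "citation_accuracy", "completeness",
          "source_quality", "tool_efficiency"]

JUDGE_PROMPT = """You are grading a research report against the query.
Query: {query}
Report: {report}
Return JSON with a 0.0-1.0 score for each of {criteria} and an overall "pass" boolean."""

def call_judge_model(prompt: str) -> str:
    # Stub: a real implementation would send the prompt to an LLM here.
    return json.dumps({c: 1.0 for c in RUBRIC} | {"pass": True})

def grade(query: str, report: str) -> dict:
    raw = call_judge_model(JUDGE_PROMPT.format(query=query, report=report, criteria=RUBRIC))
    scores = json.loads(raw)
    # Simple aggregation: fail if any criterion is very low, regardless of "pass".
    scores["pass"] = scores["pass"] and all(scores[c] >= 0.3 for c in RUBRIC)
    return scores

print(grade("Which three pharma companies have the largest R&D budgets?", "…report text…"))
```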
Human evaluation catches what automation misses. People testing agents find edge cases that evals miss. These include hallucinated answers on unusual queries, system failures, or subtle source selection biases. In our case, human testers noticed that our early agents consistently chose SEO-optimized content farms over authoritative but less highly-ranked sources like academic PDFs or personal blogs. Adding source quality heuristics to our prompts helped resolve this issue. Even in a world of automated evaluations, manual testing remains essential.
Multi-agent systems have emergent behaviors, which arise without specific programming. For instance, small changes to the lead agent can unpredictably change how subagents behave. Success requires understanding interaction patterns, not just individual agent behavior. Therefore, the best prompts for these agents are not just strict instructions, but frameworks for collaboration that define the division of labor, problem-solving approaches, and effort budgets. Getting this right relies on careful prompting and tool design, solid heuristics, observability, and tight feedback loops. See the open-source prompts in our Cookbook for example prompts from our system.
Production reliability and engineering challenges
In traditional software, a bug might break a feature, degrade performance, or cause outages. In agentic systems, minor changes cascade into large behavioral changes, which makes it remarkably difficult to write code for complex agents that must maintain state in a long-running process.
Agents are stateful and errors compound. Agents can run for long periods of time, maintaining state across many tool calls. This means we need to durably execute code and handle errors along the way. Without effective mitigations, minor system failures can be catastrophic for agents. When errors occur, we can’t just restart from the beginning: restarts are expensive and frustrating for users. Instead, we built systems that can resume from where the agent was when the errors occurred. We also use the model’s intelligence to handle issues gracefully: for instance, letting the agent know when a tool is failing and letting it adapt works surprisingly well. We combine the adaptability of AI agents built on Claude with deterministic safeguards like retry logic and regular checkpoints.
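The sketch below illustrates the kind of deterministic safeguards mentioned here: a checkpoint written after every step so a run can resume where it stopped, bounded retries with backoff, and a tool failure surfaced back into the agent's state so the model can adapt. The file format, retry counts, and the tool itself are illustrative stand-ins, not the production implementation.

```python
# Sketch of durable agent execution: checkpoint after each step, retry with
# backoff, and record tool failures so the agent can route around them.
import json, pathlib, time

CHECKPOINT = pathlib.Path("agent_state.json")

def save_checkpoint(state: dict) -> None:
    CHECKPOINT.write_text(json.dumps(state))

def load_checkpoint() -> dict:
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"step": 0, "results": []}

def flaky_tool(step: int) -> str:
    return f"result-{step}"  # stand-in for a real tool call that may raise

def run_agent(total_steps: int = 5) -> dict:
    state = load_checkpoint()          # resume from wherever the last run stopped
    for step in range(state["step"], total_steps):
        for attempt in range(3):       # bounded retries with exponential backoff
            try:
                state["results"].append(flaky_tool(step))
                break
            except Exception as err:
                time.sleep(2 ** attempt)
                if attempt == 2:
                    # Tell the agent the tool is failing so it can adapt
                    # (e.g. switch tools) instead of silently dying.
                    state["results"].append(f"TOOL_FAILURE at step {step}: {err}")
        state["step"] = step + 1
        save_checkpoint(state)         # durable progress after every step
    return state

print(run_agent())
```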
Debugging benefits from new approaches. Agents make dynamic decisions and are non-deterministic between runs, even with identical prompts. This makes debugging harder. For instance, users would report agents “not finding obvious information,” but we couldn’t see why. Were the agents using bad search queries? Choosing poor sources? Hitting tool failures? Adding full production tracing let us diagnose why agents failed and fix issues systematically. Beyond standard observability, we monitor agent decision patterns and interaction structures—all without monitoring the contents of individual conversations, to maintain user privacy. This high-level observability helped us diagnose root causes, discover unexpected behaviors, and fix common failures.
Deployment needs careful coordination. Agent systems are highly stateful webs of prompts, tools, and execution logic that run almost continuously. This means that whenever we deploy updates, agents might be anywhere in their process. We therefore need to prevent our well-meaning code changes from breaking existing agents. We can’t update every agent to the new version at the same time. Instead, we use rainbow deployments to avoid disrupting running agents, by gradually shifting traffic from old to new versions while keeping both running simultaneously.
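As a toy illustration of the rainbow-deployment idea (not actual deployment tooling), the sketch below pins in-flight sessions to the version they started on while routing a growing share of new sessions to the new version; the weights and version labels are assumptions.

```python
# Sketch of version pinning during a gradual rollout: running agents keep their
# original version, new sessions are split by weight between old and new.
import random

ROLLOUT_WEIGHTS = {"v1": 0.8, "v2": 0.2}   # shift gradually toward v2 over time
_pinned: dict[str, str] = {}               # session_id -> version

def pick_version(session_id: str) -> str:
    if session_id not in _pinned:          # only brand-new sessions get routed
        versions, weights = zip(*ROLLOUT_WEIGHTS.items())
        _pinned[session_id] = random.choices(versions, weights=weights)[0]
    return _pinned[session_id]

print(pick_version("session-123"), pick_version("session-123"))  # stays pinned
```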
Synchronous execution creates bottlenecks. Currently, our lead agents execute subagents synchronously, waiting for each set of subagents to complete before proceeding. This simplifies coordination, but creates bottlenecks in the information flow between agents. For instance, the lead agent can’t steer subagents, subagents can’t coordinate, and the entire system can be blocked while waiting for a single subagent to finish searching. Asynchronous execution would enable additional parallelism: agents working concurrently and creating new subagents when needed. But this asynchronicity adds challenges in result coordination, state consistency, and error propagation across the subagents. As models can handle longer and more complex research tasks, we expect the performance gains will justify the complexity.
Conclusion
When building AI agents, the last mile often becomes most of the journey. Codebases that work on developer machines require significant engineering to become reliable production systems. The compound nature of errors in agentic systems means that minor issues for traditional software can derail agents entirely. One step failing can cause agents to explore entirely different trajectories, leading to unpredictable outcomes. For all the reasons described in this post, the gap between prototype and production is often wider than anticipated.
Despite these challenges, multi-agent systems have proven valuable for open-ended research tasks. Users have said that Claude helped them find business opportunities they hadn’t considered, navigate complex healthcare options, resolve thorny technical bugs, and save up to days of work by uncovering research connections they wouldn’t have found alone. Multi-agent research systems can operate reliably at scale with careful engineering, comprehensive testing, detail-oriented prompt and tool design, robust operational practices, and tight collaboration between research, product, and engineering teams who have a strong understanding of current agent capabilities. We’re already seeing these systems transform how people solve complex problems.

Acknowledgements
Written by Jeremy Hadfield, Barry Zhang, Kenneth Lien, Florian Scholz, Jeremy Fox, and Daniel Ford. This work reflects the collective efforts of several teams across Anthropic who made the Research feature possible. Special thanks go to the Anthropic apps engineering team, whose dedication brought this complex multi-agent system to production. We’re also grateful to our early users for their excellent feedback.
Appendix
Below are some additional miscellaneous tips for multi-agent systems.
End-state evaluation of agents that mutate state over many turns. Evaluating agents that modify persistent state across multi-turn conversations presents unique challenges. Unlike read-only research tasks, each action can change the environment for subsequent steps, creating dependencies that traditional evaluation methods struggle to handle. We found success focusing on end-state evaluation rather than turn-by-turn analysis. Instead of judging whether the agent followed a specific process, evaluate whether it achieved the correct final state. This approach acknowledges that agents may find alternative paths to the same goal while still ensuring they deliver the intended outcome. For complex workflows, break evaluation into discrete checkpoints where specific state changes should have occurred, rather than attempting to validate every intermediate step.
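A minimal sketch of end-state evaluation under simple assumptions: the environment is represented as a plain dictionary and the checkpoints are predicates over it. The task, field names, and checks are hypothetical; the point is that only the final state is asserted, not the path the agent took.

```python
# Sketch of end-state evaluation: assert what must be true after a correct run,
# via a handful of checkpoint predicates, rather than validating each step.
from typing import Callable

def evaluate_end_state(env: dict, checkpoints: list[Callable[[dict], bool]]) -> dict:
    results = [check(env) for check in checkpoints]
    return {"passed": all(results), "checkpoint_results": results}

# Hypothetical example: after a "book a refundable flight" task, only the final
# facts are checked, so any valid path the agent took can pass.
final_env = {"booking": {"status": "confirmed", "refundable": True}, "emails_sent": 1}
checks = [
    lambda e: e["booking"]["status"] == "confirmed",
    lambda e: e["booking"]["refundable"] is True,
    lambda e: e["emails_sent"] <= 1,   # no duplicate confirmation emails
]
print(evaluate_end_state(final_env, checks))
```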
Long-horizon conversation management. Production agents often engage in conversations spanning hundreds of turns, requiring careful context management strategies. As conversations extend, standard context windows become insufficient, necessitating intelligent compression and memory mechanisms. We implemented patterns where agents summarize completed work phases and store essential information in external memory before proceeding to new tasks. When context limits approach, agents can spawn fresh subagents with clean contexts while maintaining continuity through careful handoffs. Further, they can retrieve stored context like the research plan from their memory rather than losing previous work when reaching the context limit. This distributed approach prevents context overflow while preserving conversation coherence across extended interactions.
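The sketch below shows one way this pattern could look, with a made-up token limit, a crude token counter, and a stubbed summarizer: when the context nears its limit, the completed phase is summarized into external memory and the working context is rebuilt from the stored plan plus that summary, ready for a fresh subagent handoff.

```python
# Sketch of long-horizon context management: summarize completed work into
# external memory and rebuild a compact context near the token limit.
TOKEN_LIMIT = 8000          # illustrative, not a real model limit

def count_tokens(messages: list[str]) -> int:
    return sum(len(m.split()) for m in messages)     # crude stand-in for a tokenizer

def summarize(messages: list[str]) -> str:
    return f"[summary of {len(messages)} messages]"  # stub for an LLM summarization call

memory: dict[str, str] = {"research_plan": "1) map the landscape 2) drill down"}

def maybe_compact(messages: list[str]) -> list[str]:
    if count_tokens(messages) < TOKEN_LIMIT:
        return messages
    # Persist the completed phase, then rebuild a minimal context: the plan from
    # memory plus the phase summary, suitable for handing off to a fresh subagent.
    memory["phase_summary"] = summarize(messages)
    return [f"PLAN: {memory['research_plan']}", memory["phase_summary"]]

print(maybe_compact(["finding " * 300] * 40))
```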
Subagent output to a filesystem to minimize the ‘game of telephone.’ Direct subagent outputs can bypass the main coordinator for certain types of results, improving both fidelity and performance. Rather than requiring subagents to communicate everything through the lead agent, implement artifact systems where specialized agents can create outputs that persist independently. Subagents call tools to store their work in external systems, then pass lightweight references back to the coordinator. This prevents information loss during multi-stage processing and reduces token overhead from copying large outputs through conversation history. The pattern works particularly well for structured outputs like code, reports, or data visualizations where the subagent’s specialized prompt produces better results than filtering through a general coordinator.
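A small sketch of the artifact pattern, with hypothetical paths and helper names: the subagent's storage tool persists the full output and returns only a lightweight reference, which the lead agent (or the final report step) can dereference later if it needs the complete text.

```python
# Sketch of an artifact store: subagents write large outputs to shared storage
# and hand the coordinator a short reference instead of the full content.
import pathlib, uuid

ARTIFACT_DIR = pathlib.Path("artifacts")
ARTIFACT_DIR.mkdir(exist_ok=True)

def store_artifact(content: str, kind: str) -> str:
    """Called by a subagent's tool: persist the full output, return a reference."""
    ref = f"{kind}-{uuid.uuid4().hex[:8]}.md"
    (ARTIFACT_DIR / ref).write_text(content)
    return ref                      # only this short string flows back to the lead agent

def load_artifact(ref: str) -> str:
    """Called when the full text is actually needed, e.g. while compiling the report."""
    return (ARTIFACT_DIR / ref).read_text()

report_ref = store_artifact("# Full 5,000-word subagent report...", kind="report")
print("lead agent receives:", report_ref)
print("final compile step reads:", load_artifact(report_ref)[:30], "...")
```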
AI Research
Space technology: Lithuania’s promising space start-ups
Technology Reporter
I’m led through a series of concrete corridors at Vilnius University, Lithuania; the murals give a Soviet-era vibe, and it seems an unlikely location for a high-tech lab working on a laser communication system.
But that’s where you’ll find the headquarters of Astrolight, a six-year-old Lithuanian space-tech start-up that has just raised €2.8m ($2.3m; £2.4m) to build what it calls an “optical data highway”.
You could think of the tech as invisible internet cables, designed to link up satellites with Earth.
With 70,000 satellites expected to launch in the next five years, it’s a market with a lot of potential.
The company hopes to be part of a shift from traditional radio frequency-based communication, to faster, more secure and higher-bandwidth laser technology.
Astrolight’s space laser technology could have defence applications as well, which is timely given Russia’s current aggressive attitude towards its neighbours.
Astrolight is already part of Nato’s Diana project (Defence Innovation Accelerator for the North Atlantic), an incubator, set up in 2023 to apply civilian technology to defence challenges.
In Astrolight’s case, Nato is keen to leverage its fast, hack-proof laser communications to transmit crucial intelligence in defence operations – something the Lithuanian Navy is already doing.
It approached Astrolight three years ago looking for a laser that would allow ships to communicate during radio silence.
“So we said, ‘all right – we know how to do it for space. It looks like we can do it also for terrestrial applications’,” recalls Astrolight co-founder and CEO Laurynas Maciulis, who’s based in Lithuania’s capital, Vilnius.
For the military his company’s tech is attractive, as the laser system is difficult to intercept or jam.
It’s also about “low detectability”, Mr Maciulis adds:
“If you turn on your radio transmitter in Ukraine, you’re immediately becoming a target, because it’s easy to track. So with this technology, because the information travels in a very narrow laser beam, it’s very difficult to detect.”
Worth about £2.5bn, Lithuania’s defence budget is small when you compare it to larger countries like the UK, which spends around £54bn a year.
But if you look at defence spending as a percentage of GDP, then Lithuania is spending more than many bigger countries.
Around 3% of its GDP is spent on defence, and that’s set to rise to 5.5%. By comparison, UK defence spending is worth 2.5% of GDP.
Lithuania is recognised for its strength in niche technologies like Astrolight’s lasers, and 30% of the country’s space projects have received EU funding, compared with an EU national average of 17%.
“Space technology is rapidly becoming an increasingly integrated element of Lithuania’s broader defence and resilience strategy,” says Invest Lithuania’s Šarūnas Genys, the agency’s head of manufacturing and a defence sector expert.
Space tech can often have civilian and military uses.
Mr Genys gives the example of Lithuanian life sciences firm Delta Biosciences, which is preparing a mission to the International Space Station to test radiation-resistant medical compounds.
“While developed for spaceflight, these innovations could also support special operations forces operating in high-radiation environments,” he says.
He adds that Vilnius-based Kongsberg NanoAvionics has secured a major contract to manufacture hundreds of satellites.
“While primarily commercial, such infrastructure has inherent dual-use potential supporting encrypted communications and real-time intelligence, surveillance, and reconnaissance across NATO’s eastern flank,” says Mr Genys.
Going hand in hand with Astrolight’s laser technology is the autonomous satellite navigation system fellow Lithuanian space-tech start-up Blackswan Space has developed.
Blackswan Space’s “vision based navigation system” allows satellites to be programmed and repositioned independently of a human based at a ground control centre who, its founders say, won’t be able to keep up with the sheer volume of satellites launching in the coming years.
In a defence environment, the same technology can be used to remotely destroy an enemy satellite, as well as to train soldiers by creating battle simulations.
But the sales pitch to the Lithuanian military hasn’t necessarily been straightforward, acknowledges Tomas Malinauskas, Blackswan Space’s chief commercial officer.
He’s also concerned that government funding for the sector isn’t matching the level of innovation coming out of it.
He points out that instead of spending $300m on a US-made drone, the government could invest in a constellation of small satellites.
“Build your own capability for communication and intelligence gathering of enemy countries, rather than a drone that is going to be shot down in the first two hours of a conflict,” argues Mr Malinauskas, also based in Vilnius.
“It would be a big boost for our small space community, but as well, it would be a long-term, sustainable value-add for the future of the Lithuanian military.”
Eglė Elena Šataitė is the head of Space Hub LT, a Vilnius-based agency supporting space companies as part of Lithuania’s government-funded Innovation Agency.
“Our government is, of course, aware of the reality of where we live, and that we have to invest more in security and defence – and we have to admit that space technologies are the ones that are enabling defence technologies,” says Ms Šataitė.
The country’s Minister for Economy and Innovation, Lukas Savickas, says he understands Mr Malinauskas’ concern and is looking at government spending on developing space tech.
“Space technology is one of the highest added-value creating sectors, as it is known for its horizontality; many space-based solutions go in line with biotech, AI, new materials, optics, ICT and other fields of innovation,” says Mr Savickas.
Whatever happens with government funding, the Lithuanian appetite for innovation remains strong.
“We always have to prove to others that we belong on the global stage,” says Dominykas Milasius, co-founder of Delta Biosciences.
“And everything we do is also geopolitical… we have to build up critical value offerings, sciences and other critical technologies, to make our allies understand that it’s probably good to protect Lithuania.”
AI Research
How Is AI Changing The Way Students Learn At Business School?
Artificial intelligence is the skill set that employers increasingly want from future hires. Find out how b-schools are equipping students to use AI
Business students are already seeing AI’s value. More than three-quarters of business schools have integrated AI into their curricula—from essay writing to personal tutoring, career guidance to soft-skill development.
BusinessBecause hears from current business students about how AI is reshaping the business school learning experience.
The benefits and drawbacks of using AI for essay writing
Many business school students are gaining firsthand experience of using AI to assist their academic work. At Rotterdam School of Management, Erasmus University in the Netherlands, students are required to use AI tools when submitting essays, alongside a log of their interactions.
“I was quite surprised when we were explicitly instructed to use AI for an assignment,” said Lara Harfner, who is studying International Business Administration (IBA) at RSM. “I liked the idea. But at the same time, I wondered what we would be graded on, since it was technically the AI generating the essay.”
Lara decided to approach this task as if she were writing the essay herself. She began by prompting the AI to brainstorm around the topic, research areas using academic studies and build an outline, before asking it to write a full draft.
However, during this process Lara encountered several problems. The AI-generated sources were either non-existent or inappropriate, and the tool had to be explicitly instructed on which concepts to focus on. It tended to be too broad, touching on many ideas without thoroughly analyzing any of them.
“In the end, I felt noticeably less connected to the content,” Lara says. “It didn’t feel like I was the actual author, which made me feel less responsible for the essay, even though it was still my name on the assignment.”
Despite the result sounding more polished, Lara thought she could have produced a better essay on her own with minimal AI support. What’s more, the grades she received on the AI-related assignments were below her usual average. “To me, that shows that AI is a great support tool, but it can’t produce high-quality academic work on its own.”
AI-concerned employers who took part in the Corporate Recruiters Survey echo this finding, stating that they would rather GME graduates use AI as a strategic partner in learning and strategy than as a source for more and faster content.
How business students use AI as a personal tutor
Daniel Carvalho, a Global Online MBA student, also frequently uses AI in his academic assignments, something encouraged by his professors at Porto Business School (PBS).
However, Daniel treats AI as a personal tutor, asking it to explain complex topics in simple terms and deepen the explanation. On top of this, he uses it for brainstorming ideas, summarizing case studies, drafting presentations and exploring different points of view.
“My MBA experience has shown me how AI, when used thoughtfully, can significantly boost productivity and effectiveness,” he says.
Perhaps one of the most interesting ways Daniel uses AI is by turning course material into a personal podcast. “I convert text-based materials into audio using text-to-speech tools, and create podcast-style recaps to review content in a more conversational and engaging way. This allows me to listen to the materials on the go—in the car or at the gym.”
While studying his financial management course, Daniel even built a custom GPT using course materials. Much like a personal tutor, it would ask him questions about the material, validate his understanding, and explain any questions he got wrong. “This helped reinforce my knowledge so effectively that I was able to correctly answer all multiple-choice questions in the final exam,” he explains.
Similarly, at Villanova School of Business in the US, Master of Science in Business Analytics and AI (MSBAi) students are building personalized AI bots with distinct personalities. Students embed reference materials into the bot which then shape how the bot responds to questions.
“The focus of the program is to apply these analytics and AI skills to improve business results and career outcomes,” says Nathan Coates, MSBAi faculty director at the school. “Employers are increasingly looking for knowledge and skills for leveraging GenAI within business processes. Students in our program learn how AI systems work, what their limitations are, and what they can do better than existing solutions.”
The common limitations of using AI for academic work
Kristiina Esop, who is studying a doctorate in Business Administration and Management at Estonian Business School, agrees that AI in education must always be used critically and with intention. She warns students should always be aware of AI’s limitations.
Kristiina currently uses AI tools to explore different scenarios, synthesize large volumes of information, and detect emerging debates—all of which are essential for her work both academically and professionally.
However, she cautions that AI tools are not 100% accurate. Kristiina once asked ChatGPT to map actors in circular economy governance, and it returned a neat, simplified diagram that ignored important aspects. “That felt like a red flag,” she says. “It reminded me that complexity can’t always be flattened into clean logic. If something feels too easy, too certain—that’s when it is probably time to ask better questions.”
To avoid this problem, Kristiina combines the tools with critical thinking and contextual reading, and connects the findings back to the core questions in her research. “I assess the relevance and depth of the sources carefully,” she says. “AI can widen the lens, but I still need to focus it myself.”
She believes such critical thinking when using AI is essential. “Knowing when to question AI-generated outputs, when to dig deeper, and when to disregard a suggestion entirely is what builds intellectual maturity and decision-making capacity,” she says.
This is also what Wharton management professor Ethan Mollick, author of Co-Intelligence: Living and Working with AI and co-director of the Generative AI Lab, believes. He says the best way to work with generative AI is to treat it like a person. “So you’re in this interesting trap,” he says. “Treat it like a person and you’re 90% of the way there. At the same time, you have to remember you are dealing with a software process.”
Hult International Business School, too, expects its students to use AI in a balanced way, encouraging them to think critically about when and how to use it. For example, Rafael Martínez Quiles, a Master’s in Business Analytics student at Hult, uses AI as a second set of eyes to review his thinking.
“I develop my logic from scratch, then use AI to catch potential issues or suggest improvements,” he explains. “This controlled, feedback-oriented approach strengthens both the final product and my own learning.”
At Hult, students engage with AI to solve complex, real-world challenges as part of the curriculum. “Practical business projects at Hult showed me that AI is only powerful when used with real understanding,” says Rafael. “It doesn’t replace creativity or business acumen, it supports it.”
As vice president of Hult’s AI Society, N-AIble, Rafael has seen this mindset in action. The society’s members explore AI ethically, using it to augment their work, not automate it. “These experiences have made me even more confident and excited about applying AI in the real world,” he says.
The AI learning tools students are using to improve understanding
In other business schools, AI is being used to offer faculty a second pair of hands. Nazarbayev University Graduate School of Business has recently introduced an ‘AI Jockey’. Appearing live on a second screen next to the lecturer’s slides, this AI tool acts as a second teacher, providing real-time clarifications, offering alternate examples, challenging assumptions, and deepening explanations.
“Students gain access to instant, tailored explanations that complement the lecture, enhancing understanding and engagement,” says Dr Tom Vinaimont, assistant professor of finance, Nazarbayev University Graduate School of Business, who uses the AI jockey in his teaching.
Rather than replacing the instructor, the AI enhances the learning experience by adding an interactive, AI-driven layer to traditional teaching, transforming learning into a more dynamic, responsive experience.
“The AI Jockey model encourages students to think critically about information, question the validity of AI outputs, and build essential AI literacy. It helps students not only keep pace with technological change but also prepares them to lead in an AI-integrated world by co-creating knowledge in real time,” says Dr Vinaimont.
How AI can be used to encourage critical thinking among students
So, if you’re looking to impress potential employers, learning to work with AI while a student is a good place to start. But simply using AI tools isn’t enough. You must think critically, solve problems creatively and be aware of AI’s limitations.
Most of all, you must be adaptable. GMAC’s new AI-powered tool, Advancery, helps you find graduate business programs tailored to your career goals, with AI-readiness in mind.
After all, working with AI is a skill in itself. And in 2025, it is a valuable one.
AI Research
MyPillow CEO’s lawyers fined for AI-generated court filing
A federal judge ordered two attorneys representing MyPillow CEO Mike Lindell to pay $3,000 each after they used artificial intelligence to prepare a court filing that was riddled with errors, including citations to nonexistent cases and misquotations of case law.
Christopher Kachouroff and Jennifer DeMaster violated court rules when they filed the motion that had contained nearly 30 defective citations, Judge Nina Y. Wang of the U.S. District Court in Denver ruled Monday.
“Notwithstanding any suggestion to the contrary, this Court derives no joy from sanctioning attorneys who appear before it,” Wang wrote in her ruling, adding that the sanction against Kachouroff and DeMaster was “the least severe sanction adequate to deter and punish defense counsel in this instance.”
The motion was filed in Lindell’s defamation case, which ended last month when a Denver jury found Lindell liable for defamation for pushing false claims that the 2020 presidential election was rigged.
The filing misquoted court precedents and highlighted legal principles that were not involved in the cases it cited, according to the ruling.
During a pretrial hearing after the errors were discovered, Kachouroff admitted to using generative artificial intelligence to write the motion.
Kachouroff initially told the judge that the motion was a draft and was filed by accident. But the “final” version that he said was the correct one was still riddled with “substantive errors,” including some that were not included in the filed version, Wang wrote.
It was the attorneys’ “contradictory statements and the lack of corroborating evidence” that led the judge to believe that the filing of the AI-generated motion was not “an inadvertent error” and deserved a sanction.
The judge also found that Kachouroff’s accusation that the court was trying to “blindside” him over the errors was “troubling and not well-taken.”
“Neither Mr. Kachouroff nor Ms. DeMaster provided the Court any explanation as to how those citations appeared in any draft of the Opposition absent the use of generative artificial intelligence or gross carelessness by counsel,” Wang wrote.
Kachouroff and DeMaster did not immediately return a request for comment Monday.