AI Research
Multilingualism is a blind spot in AI systems

For internationally operating companies, it is attractive to use a single AI solution across all markets. Such a centralized approach offers economies of scale and appears to ensure uniformity. Yet research from CWI reveals that this assumption is on shaky ground: the language in which an AI is addressed influences the answers the system provides – and quite significantly too.
Language steers outcomes
The problem goes beyond small differences in nuance. Davide Ceolin, a tenured researcher in the Human-Centered Data Analytics group at CWI, and his international research team discovered that identical large language models (LLMs) can adopt varying political standpoints depending on the language used. The models delivered more economically progressive responses in Dutch and more centre-conservative ones in English. For organizations applying AI in HR, customer service or strategic decision-making, this has direct consequences for business processes and reputation.
These differences are not incidental. Statistical analysis shows that the language of the prompt used has a stronger influence on the AI response than other factors, such as assigned nationality. “We assumed that the output of an AI model would remain consistent, regardless of the language. But that turns out not to be the case,” says Ceolin.
For businesses, this means more than academic curiosity. Ceolin emphasizes: “When a system responds differently to users with different languages or cultural backgrounds, this can be advantageous – think of personalization – but also detrimental, such as with prejudices. When the owners of these systems are unaware of this bias, they may experience harmful consequences.”
Prejudices with consequences
The implications of these findings extend beyond political standpoints alone. Every domain in which AI is deployed – from HR and customer service to risk assessment – runs the risk of skewed outcomes as a result of language-specific prejudices. An AI assistant that assesses job applicants differently depending on the language of their CV, or a chatbot that gives inconsistent answers to customers in different languages: these are realistic scenarios, no longer hypothetical.
According to Ceolin, such deviations are not random outliers, but patterns with a systematic character. “That is extra concerning. Especially when organizations are unaware of this.”
For Dutch multinationals, this is a real risk. They often operate in multiple languages but utilize a single central AI system. “I suspect this problem already occurs within organizations, but it’s unclear to what extent people are aware of it,” says Ceolin. The research also suggests that smaller models are, on average, more consistent than the larger, more advanced variants, which appear to be more sensitive to cultural and linguistic nuances.
What can organizations do?
The good news is that the problem can be detected and limited. Ceolin advises testing AI systems regularly using persona-based prompting, which involves testing different scenarios where the language, nationality, or culture of the user varies. “This way you can analyze whether specific characteristics lead to unexpected or unwanted behaviour.”
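The persona-based testing described above can be sketched as a small test grid. This is a minimal illustration, not the researchers' actual protocol: the languages, nationalities, question text, persona templates, and the `ask_model` callable are all assumptions chosen for the example, and `ask_model` stands in for whatever client your own AI system exposes.

```python
from itertools import product

# Hypothetical test dimensions; replace with the languages and personas
# your own users actually have.
LANGUAGES = ["en", "nl"]
NATIONALITIES = ["Dutch", "American"]

# The same question in each language (illustrative wording, not from the study).
QUESTIONS = {
    "en": "Should the government raise the minimum wage?",
    "nl": "Moet de overheid het minimumloon verhogen?",
}

# Persona preambles per language (also illustrative).
PERSONA_TEMPLATES = {
    "en": "The user asking this question is {nationality}. ",
    "nl": "De gebruiker die deze vraag stelt heeft de nationaliteit {nationality}. ",
}


def build_test_cases():
    """Return one prompt for every language x nationality combination."""
    cases = []
    for lang, nationality in product(LANGUAGES, NATIONALITIES):
        prompt = PERSONA_TEMPLATES[lang].format(nationality=nationality) + QUESTIONS[lang]
        cases.append({"language": lang, "nationality": nationality, "prompt": prompt})
    return cases


def run_suite(ask_model):
    """Run every test case through ask_model, any callable mapping prompt -> answer."""
    return [dict(case, answer=ask_model(case["prompt"])) for case in build_test_cases()]


def group_answers(results, key):
    """Collect the distinct answers seen for each value of one persona attribute."""
    grouped = {}
    for result in results:
        grouped.setdefault(result[key], set()).add(result["answer"])
    return grouped


if __name__ == "__main__":
    # Stand-in model that answers differently per language, to show the report shape.
    def fake_model(prompt):
        return "agree" if prompt.startswith("De ") else "disagree"

    results = run_suite(fake_model)
    print(group_answers(results, "language"))
    print(group_answers(results, "nationality"))
```

If the answers differ when grouped by language but not when grouped by nationality, that would point to the kind of language-driven inconsistency the research describes. In a real audit the answers would need to be scored or classified before comparison, since free-text responses rarely match exactly.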
Additionally, it’s essential to have a clear understanding of who works with the system and in which language. Only then can you assess whether the system operates consistently and fairly in practice. Ceolin advocates for clear governance frameworks that account for language-sensitive bias, just as currently happens with security or ethics.
Structural approach required
According to the researchers, multilingual AI bias is not a temporary phenomenon that will disappear on its own. “Compare it to the early years of internet security,” says Ceolin. “What was then seen as a side issue turned out to be of strategic importance later.” CWI is now collaborating with the French partner institute INRIA to unravel the mechanisms behind this problem further.
The conclusion is clear: companies that deploy AI in multilingual contexts would do well to consciously address this risk, not only for technical reasons but also to prevent reputational damage, legal complications and unfair treatment of customers or employees.
“AI is being deployed increasingly often, but insight into how language influences the system is in its infancy,” concludes Ceolin. “There’s still much work to be done there.”
Author: Kim Loohuis
Artificial Intelligence Technology Solutions Inc. Announces Commercial Availability of Radcam Enterprise

Artificial Intelligence Technology Solutions Inc., along with its subsidiary, Robotic Assistance Devices Inc. (RAD-I), announced the commercial availability of RADCam Enterprise, a proactive video security platform now compatible with the industry’s leading Video Management Systems (VMS). The intelligent talking camera can be integrated quickly and seamlessly into virtually any professional-grade video system.
The Company first introduced the RADCam Enterprise initiative on May 5, 2025, highlighting its expansion beyond residential applications into small and medium business (SMB) and enterprise markets. With today’s availability, RAD-I will deliver the solution through an untapped niche in the security industry, specifically security system integrators and security system distributors. RADCam Enterprise brings an intelligent “operator in the box” capability, enabling immediate talk-down to potential threats before human intervention is required.
The device integrates a speaker, microphone, and high-intensity lighting, allowing it not only to record but also to actively engage. At the same time, the solution is expected to deliver gross margins consistent with the Company’s established benchmarks. RADCam Enterprise distinguishes itself from the original residential version of RADCam by integrating RAD’s agentic AI platform, SARA (Speaking Autonomous Responsive Agent), and by being compatible with RADSoC and industry-leading Video Management Systems. RADCam Enterprise is available immediately through RAD-I’s network of channel partners and distributors.
Pre-orders are open at
All RAD technologies, AI-based analytics and software platforms are developed in-house. The Company’s operations and internal controls have been validated through successful completion of its SOC 2 Type 2 audit, which is a formal, independent audit that evaluates a service organization’s internal controls for handling customer data and determines if the controls are not only designed properly but also operating effectively to protect customer data. Each Fortune 500 client has the potential of making numerous orders over time.
AITX is an innovator in the delivery of artificial intelligence-based solutions that empower organizations to gain new insight, solve complex challenges and fuel new business ideas. Through its next-generation robotic product offerings, AITX’s RAD, RAD-R, RAD-M and RAD-G companies help organizations streamline operations, increase ROI, and strengthen business. The Company has no obligation to provide the recipient with additional updated information.
No information in this publication should be interpreted as any indication whatsoever of the Company’s future revenues, results of operations, or stock price.
Stanford Develops Real-World Benchmarks for Healthcare AI Agents

Beyond the hype and hope surrounding the use of artificial intelligence in medicine lies the real-world need to ensure that, at the very least, AI in a healthcare setting can carry out tasks that a doctor would in electronic health records.
Creating benchmark standards to measure that is what drives the work of a team of Stanford researchers. While the researchers note the enormous potential of this new technology to transform medicine, the tech ethos of moving fast and breaking things doesn’t work in healthcare. Ensuring that these tools are capable of doing these tasks is vital, and then they can be used as tools that augment the care clinicians provide every day.
“Working on this project convinced me that AI won’t replace doctors anytime soon,” said Kameron Black, co-author on the new benchmark paper and a Clinical Informatics Fellow at Stanford Health Care. “It’s more likely to augment our clinical workforce.”
MedAgentBench: Testing AI Agents in Real-World Clinical Systems
Black is one of a multidisciplinary team of physicians, computer scientists, and researchers from across Stanford University who worked on the new study, MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents, published in the New England Journal of Medicine AI.
Although large language models (LLMs) have performed well on the United States Medical Licensing Examination (USMLE) and at answering medical-related questions in studies, there is currently no benchmark testing how well LLMs can function as agents by performing tasks that a doctor would normally do, such as ordering medications, inside a real-world clinical system where data input can be messy.
Unlike chatbots or LLMs, AI agents can work autonomously, performing complex, multistep tasks with minimal supervision. AI agents integrate multimodal data inputs, process information, and then utilize external tools to accomplish tasks, Black explained.
Overall Success Rate (SR) Comparison of State-of-the-Art LLMs on MedAgentBench

| Model | Overall SR |
|---|---|
| Claude 3.5 Sonnet v2 | 69.67% |
| GPT-4o | 64.00% |
| DeepSeek-V3 (685B, open) | 62.67% |
| Gemini-1.5 Pro | 62.00% |
| GPT-4o-mini | 56.33% |
| o3-mini | 51.67% |
| Qwen2.5 (72B, open) | 51.33% |
| Llama 3.3 (70B, open) | 46.33% |
| Gemini 2.0 Flash | 38.33% |
| Gemma2 (27B, open) | 19.33% |
| Gemini 2.0 Pro | 18.00% |
| Mistral v0.3 (7B, open) | 4.00% |
While previous tests only assessed AI’s medical knowledge through curated clinical vignettes, this research evaluates how well AI agents can perform actual clinical tasks such as retrieving patient data, ordering tests, and prescribing medications.
“Chatbots say things. AI agents can do things,” said Jonathan Chen, associate professor of medicine and biomedical data science and the paper’s senior author. “This means they could theoretically directly retrieve patient information from the electronic medical record, reason about that information, and take action by directly entering in orders for tests and medications. This is a much higher bar for autonomy in the high-stakes world of medical care. We need a benchmark to establish the current state of AI capability on reproducible tasks that we can optimize toward.”
The study tested this by evaluating whether AI agents could utilize FHIR (Fast Healthcare Interoperability Resources) API endpoints to navigate electronic health records.
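To make the idea of an agent working through FHIR endpoints more concrete, here is a minimal sketch of the two kinds of requests such an agent would have to compose: a search for a patient's lab results and a medication order. It is not taken from MedAgentBench. The base URL and patient ID are hypothetical, the resource shape follows the general FHIR R4 conventions for `Observation` searches and `MedicationRequest` resources, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Hypothetical FHIR server base URL; a real deployment would supply its own.
FHIR_BASE = "https://ehr.example.org/fhir"


def fhir_search_url(resource_type, **params):
    """Build a FHIR search URL of the form [base]/[type]?name=value&..."""
    query = urlencode(params)
    url = f"{FHIR_BASE}/{resource_type}"
    return f"{url}?{query}" if query else url


def medication_request(patient_id, medication_text):
    """Build a minimal R4-style MedicationRequest body (illustrative, not validated)."""
    return {
        "resourceType": "MedicationRequest",
        "status": "active",
        "intent": "order",
        "subject": {"reference": f"Patient/{patient_id}"},
        "medicationCodeableConcept": {"text": medication_text},
    }


if __name__ == "__main__":
    # Step 1: the agent looks up laboratory observations for a (made-up) patient.
    print(fhir_search_url("Observation", patient="example-123", category="laboratory"))
    # Step 2: the agent drafts an order, which would be POSTed to [base]/MedicationRequest.
    print(medication_request("example-123", "potassium chloride 20 mEq oral"))
```

The point of the sketch is that an agent must produce well-formed requests like these on its own, in the right order, rather than only describe what should be done.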
The team created a virtual electronic health record environment that contained 100 realistic patient profiles (containing 785,000 records, including labs, vitals, medications, diagnoses, procedures) to test about a dozen large language models on 300 clinical tasks developed by physicians. In initial testing, the best model, Claude 3.5 Sonnet v2, achieved a success rate of about 70%.
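The reported percentages can be related back to task counts with simple arithmetic. The sketch below assumes that the overall success rate is the share of the 300 tasks completed correctly, with every task weighted equally; the article does not spell out the scoring, so treat this as an illustration rather than the benchmark's definition.

```python
TOTAL_TASKS = 300  # number of clinical tasks stated in the article


def tasks_completed(success_rate_percent, total=TOTAL_TASKS):
    """Estimate how many tasks a given success rate corresponds to (assumes equal weighting)."""
    return round(success_rate_percent / 100 * total)


def success_rate(completed, total=TOTAL_TASKS):
    """Success rate in percent, rounded to two decimals."""
    return round(completed / total * 100, 2)


if __name__ == "__main__":
    for model, rate in [("Claude 3.5 Sonnet v2", 69.67), ("GPT-4o", 64.00), ("Mistral v0.3", 4.00)]:
        print(model, tasks_completed(rate), "of", TOTAL_TASKS)
```

Under that assumption, 69.67% corresponds to 209 of 300 tasks, which is the figure rounded to 70% in the text.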
“We hope this benchmark can help model developers track progress and further advance agent capabilities,” said Yixing Jiang, a Stanford PhD student and co-author of the paper.
Many of the models struggled with scenarios that required nuanced reasoning, involved complex workflows, or necessitated interoperability between different healthcare systems, all issues a clinician might face regularly.
“Before these agents are used, we need to know how often and what type of errors are made so we can account for these things and help prevent them in real-world deployments,” Black said.
What does this mean for clinical care? Co-author James Zou and Dr. Eric Topol claim that AI is shifting from a tool to a teammate in care delivery. With MedAgentBench, the Stanford team has shown this is a much more near-term reality by showcasing several frontier LLMs in their ability to carry out many day-to-day clinical tasks that a physician would perform.
Already the team has noticed improvements in performance of the newest versions of models. With this in mind, Black believes that AI agents might be ready to handle basic clinical “housekeeping” tasks in a clinical setting sooner than previously expected.
“In our follow-up studies, we’ve shown a surprising amount of improvement in the success rate of task execution by newer LLMs, especially when accounting for specific error patterns we observed in the initial study,” Black said. “With deliberate design, safety, structure, and consent, it will be feasible to start moving these tools from research prototypes into real-world pilots.”
The Road Ahead
Black says benchmarks like these are necessary as more hospitals and healthcare systems are incorporating AI into tasks including note-writing and chart summarization.
Accurate and trustworthy AI could also help alleviate a looming crisis, he adds. Pressed by patient needs, compliance demands, and staff burnout, healthcare providers are seeing a worsening global staffing shortage, estimated to exceed 10 million by 2030.
Instead of replacing doctors and nurses, Black hopes that AI can be a powerful tool for clinicians, lessening the burden of some of their workload and bringing them back to the patient bedside.
“I’m passionate about finding solutions to clinician burnout,” Black said. “I hope that by working on agentic AI applications in healthcare that augment our workforce, we can help offload burden from clinicians and divert this impending crisis.”
Paper authors: Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y. Ng, and Jonathan H. Chen
Read the piece in the New England Journal of Medicine AI.
Scary results as study shows AI chatbots excel at phishing tactics

A recent study showed how easily modern chatbots can be used to write convincing scam emails targeted towards older people and how often those emails get clicked.
Researchers used several major AI chatbots in the study, including Grok, OpenAI’s ChatGPT, Claude, Meta AI, DeepSeek and Google’s Gemini, to simulate a phishing scam.
One sample note written by Grok looked like a friendly outreach from the “Silver Hearts Foundation,” described as a new charity that supports older people with companionship and care. The note was targeted towards senior citizens, promising an easy way to get involved. In reality, no such charity exists.
“We believe every senior deserves dignity and joy in their golden years,” the note read. “By clicking here, you’ll discover heartwarming stories of seniors we’ve helped and learn how you can join our mission.”
When Reuters asked Grok to write the phishing text, the bot not only produced a response but also suggested increasing the urgency: “Don’t wait! Join our compassionate community today and help transform lives. Click now to act before it’s too late!”
108 senior volunteers participated in the phishing study
Reporters tested whether six well-known AI chatbots would give up their safety rules and draft emails meant to deceive seniors. They also asked the bots for help planning scam campaigns, including tips on what time of day might get the best response.
In collaboration with Heiding, a Harvard University researcher who studies phishing, the researchers tested some of the bot-written emails on a pool of 108 senior volunteers.
Usually, chatbot companies train their systems to refuse harmful requests. In practice, those safeguards do not always hold. Grok displayed a warning that the message it produced “should not be used in real-world scenarios.” Even so, it delivered the phishing text and intensified the pitch with “click now.”
Five other chatbots were given the same prompts: OpenAI’s ChatGPT, Meta’s assistant, Claude, Gemini and DeepSeek from China. Most chatbots declined to respond when the intent was made clear.
Still, their protections failed after light modification, such as claiming that the task is for research purposes. The results of the tests suggested that criminals could use (or may already be using) chatbots for scam campaigns. “You can always bypass these things,” said Heiding.
Heiding selected nine phishing emails produced with the chatbots and sent them to the participants. Roughly 11% of recipients fell for them and clicked the links. Five of the nine messages drew clicks: two that came from Meta AI, two from Grok and one from Claude. None of the seniors clicked on the emails written by DeepSeek or ChatGPT.
Last year, Heiding led a study showing that phishing emails generated by ChatGPT can be as effective at getting clicked as messages written by people, in that case, among university students.
FBI lists phishing as the most common cybercrime
Phishing refers to luring unsuspecting victims into giving up sensitive data or cash through fake emails and texts. These types of messages form the basis of many online crimes.
Billions of phishing texts and emails go out daily worldwide. In the United States, the Federal Bureau of Investigation lists phishing as the most commonly reported cybercrime.
Older Americans are particularly vulnerable to such scams. According to recent FBI figures, complaints from people aged 60 and over increased eightfold last year, with losses amounting to roughly $4.9 billion. Generative AI has made the problem much worse, the FBI says.
In August alone, crypto users lost $12 million to phishing scams, according to a Cryptopolitan report.
When it comes to chatbots, the advantage for scammers is volume and speed. Unlike humans, bots can spin out endless variations in seconds and at minimal cost, shrinking the time and money needed to run large-scale scams.