Learning from other domains to advance AI evaluation and testing

Published June 23, 2025, by Amanda Craig Deckard and Chad Atalla


As generative AI becomes more capable and widely deployed, familiar questions from the governance of other transformative technologies have resurfaced. Which opportunities, capabilities, risks, and impacts should be evaluated? Who should conduct evaluations, and at what stages of the technology lifecycle? What tests or measurements should be used? And how can we know if the results are reliable?  

Recent research and reports from Microsoft, the UK AI Security Institute, The New York Times, and MIT Technology Review have highlighted gaps in how we evaluate AI models and systems. These gaps also form foundational context for recent international expert consensus reports: the inaugural International AI Safety Report (2025) and the Singapore Consensus (2025). Closing these gaps at a pace that matches AI innovation will lead to more reliable evaluations that can help guide deployment decisions, inform policy, and deepen trust. 

Today, we’re launching a limited-series podcast, AI Testing and Evaluation: Learnings from Science and Industry, to share insights from domains that have grappled with testing and measurement questions. Across four episodes, host Kathleen Sullivan speaks with academic experts in genome editing, cybersecurity, pharmaceuticals, and medical devices to find out which technical and regulatory steps have helped to close evaluation gaps and earn public trust.

We’re also sharing written case studies from experts, along with top-level lessons we’re applying to AI. At the close of the podcast series, we’ll offer Microsoft’s deeper reflections on next steps toward more reliable and trustworthy approaches to AI evaluation. 

Lessons from eight case studies 

Our research on risk evaluation, testing, and assurance models in other domains began in December 2024, when Microsoft’s Office of Responsible AI gathered independent experts from the fields of civil aviation, cybersecurity, financial services, genome editing, medical devices, nanoscience, nuclear energy, and pharmaceuticals. In bringing this group together, we drew on our own learnings and feedback received on our e-book, Global Governance: Goals and Lessons for AI, in which we studied the higher-level goals and institutional approaches that had been leveraged for cross-border governance in the past. 

While approaches to risk evaluation and testing vary significantly across the case studies, there was one consistent, top-level takeaway: evaluation frameworks always reflect trade-offs among different policy objectives, such as safety, efficiency, and innovation.  

Experts across all eight fields noted that policymakers have had to weigh trade-offs in designing evaluation frameworks. These frameworks must account for both the limits of current science and the need for agility in the face of uncertainty. They likewise agreed that early design choices, often reflecting the “DNA” of the historical moment in which they’re made, as cybersecurity expert Stewart Baker described it, are consequential because they are difficult to scale back or undo later. 

Strict, pre-deployment testing regimes—such as those used in civil aviation, medical devices, nuclear energy, and pharmaceuticals—offer strong safety assurances but can be resource-intensive and slow to adapt. These regimes often emerged in response to well-documented failures and are backed by decades of regulatory infrastructure and detailed technical standards.  

In contrast, fields marked by dynamic and complex interdependencies between the tested system and its external environment—such as cybersecurity and bank stress testing—rely on more adaptive governance frameworks, where testing may be used to generate actionable insights about risk rather than primarily serve as a trigger for regulatory enforcement.  

Moreover, in pharmaceuticals, where interdependencies are at play and there is emphasis on pre-deployment testing, experts highlighted a potential trade-off with post-market monitoring of downstream risks and efficacy evaluation. 

These variations in approaches across domains—stemming from differences in risk profiles, types of technologies, maturity of the evaluation science, placement of expertise in the assessor ecosystem, and context in which technologies are deployed, among other factors—also inform takeaways for AI.

Applying risk evaluation and governance lessons to AI 

While no analogy perfectly fits the AI context, the genome editing and nanoscience cases offer interesting insights for general-purpose technologies like AI, where risks vary widely depending on how the technology is applied.  

Experts highlighted the benefits of governance frameworks that are more flexible and tailored to specific use cases and application contexts. In these fields, it is challenging to define risk thresholds and design evaluation frameworks in the abstract. Risks become more visible and assessable once the technology is applied to a particular use case and context-specific variables are known.  

These and other insights also helped us distill qualities essential to ensuring that testing is a reliable governance tool across domains, including: 

  1. Rigor in defining what is being examined and why it matters. This requires detailed specification of what is being measured and understanding how the deployment context may affect outcomes.
  2. Standardization of how tests should be conducted to achieve valid, reliable results. This requires establishing technical standards that provide methodological guidance and ensure quality and consistency. 
  3. Interpretability of test results and how they inform risk decisions. This requires establishing expectations for evidence and improving literacy in how to understand, contextualize, and use test results—while remaining aware of their limitations. 
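The three qualities above can be captured as a lightweight evaluation-spec record. The sketch below is purely illustrative — the class, field names, and threshold logic are our own assumptions, not part of any published standard:

```python
from dataclasses import dataclass, field

@dataclass
class EvalSpec:
    """Hypothetical record tying one AI test to the three qualities above."""
    # Rigor: what is measured, in what deployment context, and why it matters.
    capability: str
    deployment_context: str
    rationale: str
    # Standardization: the method that makes results valid and repeatable.
    protocol: str
    metric: str
    # Interpretability: how readers should (and should not) use the result.
    evidence_threshold: float
    known_limitations: list[str] = field(default_factory=list)

    def interpret(self, score: float) -> str:
        """Map a raw score to a risk decision, surfacing limitations alongside it."""
        verdict = "pass" if score >= self.evidence_threshold else "needs review"
        caveats = "; ".join(self.known_limitations) or "none documented"
        return f"{verdict} (score={score:.2f}, threshold={self.evidence_threshold}; limitations: {caveats})"

spec = EvalSpec(
    capability="hazardous code generation",
    deployment_context="consumer chatbot",
    rationale="high-severity misuse risk in unsupervised settings",
    protocol="fixed prompt suite, three independent runs",
    metric="refusal rate",
    evidence_threshold=0.8,
    known_limitations=["prompt suite is English-only"],
)
print(spec.interpret(0.92))
```

The point of the structure is that a score is never reported alone: the protocol that produced it and its known limitations travel with it to the decision-maker.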

Toward stronger foundations for AI testing 

Establishing robust foundations for AI evaluation and testing requires effort to improve rigor, standardization, and interpretability—and to ensure that methods keep pace with rapid technological progress and evolving scientific understanding.  

Taking lessons from other general-purpose technologies, this foundational work must also be pursued for both AI models and systems. While testing models will continue to be important, reliable evaluation tools that provide assurance for system performance will enable broad adoption of AI, including in high-risk scenarios. A strong feedback loop on evaluations of AI models and systems could not only accelerate progress on methodological challenges but also bring focus to which opportunities, capabilities, risks, and impacts are most appropriate and efficient to evaluate at what points along the AI development and deployment lifecycle.

Acknowledgements 

We would like to thank the following external experts who have contributed to our research program on lessons for AI testing and evaluation: Mateo Aboy, Paul Alp, Gerónimo Poletto Antonacci, Stewart Baker, Daniel Benamouzig, Pablo Cantero, Daniel Carpenter, Alta Charo, Jennifer Dionne, Andy Greenfield, Kathryn Judge, Ciaran Martin, and Timo Minssen.  

Case studies 

Civil aviation: Testing in Aircraft Design and Manufacturing, by Paul Alp 

Cybersecurity: Cybersecurity Standards and Testing—Lessons for AI Safety and Security, by Stewart Baker 

Financial services (bank stress testing): The Evolving Use of Bank Stress Tests, by Kathryn Judge 

Genome editing: Governance of Genome Editing in Human Therapeutics and Agricultural Applications, by Alta Charo and Andy Greenfield 

Medical devices: Medical Device Testing: Regulatory Requirements, Evolution and Lessons for AI Governance, by Mateo Aboy and Timo Minssen 

Nanoscience: The regulatory landscape of nanoscience and nanotechnology, and applications to future AI regulation, by Jennifer Dionne 

Nuclear energy: Testing in the Nuclear Industry, by Pablo Cantero and Gerónimo Poletto Antonacci 

Pharmaceuticals: The History and Evolution of Testing in Pharmaceutical Regulation, by Daniel Benamouzig and Daniel Carpenter


AI Research

AI Transformation (AX) using artificial intelligence (AI) is spreading throughout the domestic financial sector

Published September 14, 2025, by The Editors

Image: Getty Images Bank

AI transformation (AX) using artificial intelligence (AI) is spreading throughout the domestic financial sector. Going beyond simple digital transformation (DX), the strategy is to internalize AI across organizations and services, achieving management efficiency, work automation, and customer experience innovation at the same time. Financial companies have judged that it will be difficult to survive unless they raise AI capabilities across the company in an environment of intensifying regulation and competition. At the core of AX are internal process innovation and customer service differentiation. AI can reduce costs and increase speed by quickly and accurately handling traditionally human-dependent tasks such as loan review, risk management, investment product recommendation, and internal counseling support.

At customer contact points, high-quality counseling is provided 24 hours a day through AI bankers, voice robots, and customized chatbots to increase financial service satisfaction. Industry sources say, “AX is not just a matter of technology, but a structural change that determines financial companies’ competitiveness and crisis response.”

First, major domestic banks and financial holding companies have begun introducing in-house AI assistants and private large language models (LLMs), establishing dedicated organizations, and building AI governance systems that cover all affiliates. By setting up strategy centers at the group level and rolling out collaboration tools and AI platforms company-wide, they aim to automate internal work and differentiate customer services at the same time.

KB Financial Group has established a ‘KB AI strategy’ and a ‘KB AI agent roadmap’ to introduce more than 250 AI agents to 39 core business areas of the group. It has established the ‘KB GenAI Portal’ for the first time in the financial sector to create an environment in which all executives and employees can utilize and develop AI without coding, and through this, it is efficiently changing work productivity and how they work.

Shinhan Financial Group is raising work productivity with cloud-based collaboration tools (M365 + Copilot) and introducing AI in the field affiliate by affiliate. Shinhan Bank placed generative AI bankers at the teller window through its “AI Branch,” and in its app “SOL,” the “AI Investment Mate” provides customized information to customers through card news.


Hana Bank operates an “AI departure-prediction system” for foreign-exchange corporate clients, drawing on its foreign exchange expertise. The system analyzes 253 variables from past transaction data to estimate the likelihood that a client will stop transacting and automatically alerts branches so they can respond preemptively.
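Systems of this kind are often logistic models over engineered features: each variable gets a learned weight, and the weighted sum is squashed into a probability. The sketch below is purely illustrative — the feature names, weights, and bias are invented and do not describe Hana Bank's actual model:

```python
import math

def churn_score(features: dict[str, float], weights: dict[str, float], bias: float = 0.0) -> float:
    """Logistic score in (0, 1): higher means the client is more likely to stop transacting."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights learned from historical transaction data.
weights = {"days_since_last_tx": 0.05, "volume_drop_pct": 2.0}

active_client = churn_score({"days_since_last_tx": 1, "volume_drop_pct": 0.0}, weights, bias=-2.0)
fading_client = churn_score({"days_since_last_tx": 60, "volume_drop_pct": 0.9}, weights, bias=-2.0)
print(active_client, fading_client)
```

A production system would threshold such a score to decide which accounts get flagged to a branch; the value of "253 variables" lies in capturing many weak signals that individually would not trigger an alert.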

Woori Financial Group established an AI strategy center within the holding under the leadership of Chairman Lim Jong-ryong and deployed AI-only organizations to all affiliates, including banks, cards, securities, and insurance.

Internet-only banks are trying to differentiate themselves with interactive search and calculators, forgery and alteration detection, customized recommendations, and an in-house AI culture. With no offline branch network, they are actively strengthening customer-facing AI innovation in app and mobile counseling.

Kakao Bank has upgraded its AI organization to a group with more than 500 dedicated staff. K-Bank achieved a 100% recognition rate with its AI-based identification card recognition solution and has begun setting standards by publishing papers in academia. Toss Bank uses AI to detect ID forgery and alteration (99.5% accuracy), automate bulk document optical character recognition (OCR), convert counseling calls to text (STT), and build its own finance-specific language model.

Insurance companies are increasing accuracy, approval rate, and processing speed by introducing AI in the entire process of risk assessment, underwriting, and insurance payment. Due to the nature of the insurance industry, the effect of using AI is remarkable as the screening and payment process is long and complex.

Samsung Fire & Marine Insurance has more than halved the share of reviews handled manually by automating its cancer diagnosis and surgical benefit review process through “AI medical review.” Its machine learning-based “Long-Term Insurance Sickness Screening System” raised the approval rate from 71% to 90% and secured patents.

Industry experts view this AI transformation as a paradigm shift in the financial industry, not merely the adoption of a technology. AI must create new added value and customer experiences beyond cost reduction and efficiency. In particular, they assess that financial companies will differentiate themselves only when AI and data are directly connected to resolving customer inconveniences.

However, preparing for ethical, security, and accountability issues is considered as essential as keeping pace with AI’s spread. Failure to manage risks such as the influence of large language models on financial decision-making, personal information protection, and algorithmic bias can lead to a loss of trust. This makes it paramount to develop accumulated experience into industry standards through small experiments.

[Reporter Lee Soyeon]




AI Research

Study shakes Silicon Valley: Researchers break AI

Published September 14, 2025, by The Editors


The study shows researchers can manipulate chatbots with simple psychology, raising serious concerns about AI’s vulnerability and potential dangers.

Photo caption: ChatGPT encouraged a teenager toward suicide (photo credit: OpenAI)

By Dr. Itay Gal, The Jerusalem Post, September 14, 2025







AI Research

Password1: how scammers exploit variations of your logins | Money

Published September 14, 2025, by Shane Hickey


The first you know about it is when you find out someone has accessed one of your accounts. You’ve been careful with your details so you can’t work out what has gone wrong, but you have made one mistake – recycling part of your password.

Reusing the same word in a password – even if it is altered to include numbers or symbols – gives criminals a way in to your accounts.

Brandyn Murtagh, an ethical “white hat” hacker, says information obtained through data breaches on sites such as Dropbox and Tumblr, and through cyber-attacks, has been circulating on the internet for some time.

Hackers obtain passwords and test them out on other websites – a practice known as credential stuffing – to see whether they can break into accounts.

But in some cases they do not just try the exact passwords from the hacked data: as well as credential stuffing, the fraudsters also attempt to access accounts with derivations of the hacked password.

Research from Virgin Media O2 suggests four out of every five people use the same or nearly identical passwords on online accounts.

Using a slightly altered password – such as Guardian1 instead of Guardian – is almost an open door for hackers to compromise online accounts, Murtagh says.

Working with Virgin Media O2, he has shown volunteers how easy it is to trace their password when they supply their email address, often getting a result within minutes.

A spokesperson for Virgin Media O2 says: “Human behaviour is quite easy to model. [Criminals] know, for example, you might use one password and then add a full stop or an exclamation mark to the end.”

What the scam looks like

The criminals use scripts – automated sets of instructions for the computer – to go through variations of the passwords in an attempt to access other accounts. This can happen on an industrial scale, says Murtagh.

“It’s very rare that you are targeted as an individual – you are [usually] in a group of thousands of people that are getting targeted. These processes scale just like they would in business,” he says.
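The automated scripts described above work by expanding each leaked password into its common derivations before trying them against other sites. A minimal illustration of why small tweaks offer no protection — the transformation rules below are generic examples, not those of any specific tool:

```python
def common_variants(base: str) -> set[str]:
    """Generate the obvious derivations of a leaked password that
    credential-stuffing scripts typically try first."""
    candidates = {base, base.lower(), base.capitalize()}
    variants: set[str] = set()
    for word in candidates:
        variants.add(word)
        # Appended digits, punctuation, and years.
        for suffix in ("1", "123", "!", ".", "2024", "2025"):
            variants.add(word + suffix)
        # Common character substitutions ("leetspeak").
        variants.add(word.replace("a", "@").replace("o", "0").replace("i", "1"))
    return variants

# A handful of rules already covers the pattern from the article:
print("Guardian1" in common_variants("Guardian"))  # True
```

Because each rule is cheap to apply, one breached password effectively exposes its whole family of variations at once — which is why an unrelated, randomly generated password is the only safe replacement.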

You might be alerted by messages saying that you have been trying to change your email address or other details connected to an account.

What to do

Change any passwords that are variations on the same word – Murtagh advises starting with the most important four sets of accounts: banks, email, work accounts and mobile.

Use a password manager – these are often integrated into web browsers. Apple has iCloud Keychain, while Android has Google Password Manager; both can suggest and save complicated passwords.

Put in place two-factor or multi-factor authentication (2FA or MFA), which means you need two steps to log in to a site.



