AI Research
Tencent improves testing creative AI models with new benchmark
Tencent has introduced a new benchmark, ArtifactsBench, that aims to fix current problems with testing creative AI models.
Ever asked an AI to build something like a simple webpage or a chart and received something that works but has a poor user experience? The buttons might be in the wrong place, the colours might clash, or the animations feel clunky. It’s a common problem, and it highlights a huge challenge in the world of AI development: how do you teach a machine to have good taste?
For a long time, we’ve been testing AI models on their ability to write code that is functionally correct. These tests could confirm the code would run, but they were completely “blind to the visual fidelity and interactive integrity that define modern user experiences.”
This is the exact problem ArtifactsBench has been designed to solve. It’s less of a test and more of an automated art critic for AI-generated code.
Getting it right, like a human would
So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.
Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe and sandboxed environment.
To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.
Finally, it hands over all this evidence – the original request, the AI’s code, and the screenshots – to a multimodal LLM (MLLM), which acts as the judge.
This MLLM judge doesn’t just give a vague opinion; instead, it uses a detailed, per-task checklist to score the result across ten different metrics, covering functionality, user experience, and even aesthetic quality. This keeps the scoring fair, consistent, and thorough.
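To make that flow concrete, here is a minimal sketch of the generate, run, capture, and judge loop described above. Every name in it (the stub functions, the three checklist entries) is a hypothetical illustration, not ArtifactsBench’s actual API; only three of the ten metrics mentioned in the article are listed.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical names throughout: a sketch of the described pipeline,
# not ArtifactsBench's actual API.

# Three of the ten per-task metrics named in the article.
CHECKLIST = ["functionality", "user_experience", "aesthetic_quality"]

@dataclass
class Evidence:
    prompt: str                                   # the original creative task
    code: str                                     # the AI-generated artifact
    screenshots: List[bytes] = field(default_factory=list)

def run_in_sandbox(code: str):
    """Build and run the generated artifact in an isolated environment (stub)."""
    return object()  # placeholder handle to the running sandbox

def capture_screenshots(sandbox, timestamps=(0, 2, 5)) -> List[bytes]:
    """Grab frames at several points in time so animations and post-click
    state changes are visible to the judge (stub)."""
    return [b"" for _ in timestamps]

def mllm_judge(evidence: Evidence, checklist: List[str]) -> Dict[str, float]:
    """Score the evidence against a per-task checklist with a multimodal LLM (stub)."""
    return {metric: 0.0 for metric in checklist}

def evaluate(prompt: str, generated_code: str) -> Dict[str, float]:
    sandbox = run_in_sandbox(generated_code)
    shots = capture_screenshots(sandbox)
    evidence = Evidence(prompt=prompt, code=generated_code, screenshots=shots)
    return mllm_judge(evidence, checklist=CHECKLIST)
```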
The big question is, does this automated judge actually have good taste? The results suggest it does.
When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with a 94.4% consistency. This is a massive leap from older automated benchmarks, which only managed around 69.4% consistency.
On top of this, the framework’s judgments showed over 90% agreement with professional human developers.
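The article doesn’t spell out how those consistency figures are computed, but a common way to compare two leaderboards is pairwise ranking agreement: for every pair of models ranked by both, check whether the two rankings order them the same way. A minimal sketch under that assumption, using made-up positions rather than real leaderboard data:

```python
from itertools import combinations

def pairwise_consistency(rank_a: dict, rank_b: dict) -> float:
    """Fraction of model pairs ordered the same way by two leaderboards.

    rank_a / rank_b map model name -> rank position (1 = best).
    Only models present in both rankings are compared.
    """
    shared = sorted(set(rank_a) & set(rank_b))
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0
    agree = sum(
        (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) > 0
        for x, y in pairs
    )
    return agree / len(pairs)

# Toy example: two of three pairs are ordered the same way -> ~0.67
benchmark = {"model_a": 1, "model_b": 2, "model_c": 3}
human_votes = {"model_a": 1, "model_b": 3, "model_c": 2}
print(pairwise_consistency(benchmark, human_votes))
```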
Tencent evaluates the creativity of top AI models with its new benchmark
When Tencent put more than 30 of the world’s top AI models through their paces, the leaderboard was revealing. While top commercial models from Google (Gemini-2.5-Pro) and Anthropic (Claude 4.0-Sonnet) took the lead, the tests unearthed a fascinating insight.
You might think that an AI specialised in writing code would be the best at these tasks. But the opposite was true. The research found that “the holistic capabilities of generalist models often surpass those of specialized ones.”
A general-purpose model, Qwen-2.5-Instruct, actually beat its more specialised siblings, Qwen-2.5-coder (a code-specific model) and Qwen2.5-VL (a vision-specialised model).
The researchers believe this is because creating a great visual application isn’t about coding or visual understanding in isolation; it requires a blend of skills.
The researchers highlight “robust reasoning, nuanced instruction following, and an implicit sense of design aesthetics” as examples of these vital skills. These are the kinds of well-rounded, almost human-like abilities that the best generalist models are beginning to develop.
Tencent hopes its ArtifactsBench benchmark can reliably evaluate these qualities and thus measure future progress in AI’s ability to create things that are not just functional but that users actually want to use.
See also: Tencent Hunyuan3D-PolyGen: A model for ‘art-grade’ 3D assets
AI Research
Senator Wiener Expands AI Bill Into Landmark Transparency Measure Based on Recommendations of Governor’s Working Group
SACRAMENTO – Senator Scott Wiener (D-San Francisco) announced amendments to expand Senate Bill (SB) 53 into a first-in-the-nation transparency requirement for the largest AI companies. The new provisions draw on the recommendations of a working group led by some of the world’s leading AI experts and convened by Governor Newsom. Building on the report’s “trust, but verify” approach, the amended bill requires the largest AI companies to publicly disclose their safety and security protocols and report the most critical safety incidents to the California Attorney General. The requirements codify voluntary agreements made by leading AI developers to boost trust and accountability and establish a level playing field for AI development.
SB 53 retains provisions — called “CalCompute” — that advance a bold industrial strategy to boost AI development and democratize access to the most advanced AI models and tools. CalCompute will be a public cloud compute cluster housed at the University of California that provides free and low-cost access to compute for startups and academic researchers. CalCompute builds on Senator Wiener’s recent legislation to boost semiconductor and other advanced manufacturing in California by streamlining permit approvals for advanced manufacturing plants, and his work to protect democratic access to the internet by authoring the nation’s strongest net neutrality law.
SB 53 also retains its protections of whistleblowers at AI labs who disclose significant risks.
Weeks ago, the U.S. Senate voted 99-1 to remove provisions of President Trump’s “Big Beautiful Bill” that would have prevented states from enacting AI regulations. By boosting transparency, SB 53 builds on this vote for accountability.
“As AI continues its remarkable advancement, it’s critical that lawmakers work with our top AI minds to craft policies that support AI’s huge potential benefits while guarding against material risks,” said Senator Wiener. “Building on the Working Group Report’s recommendations, SB 53 strikes the right balance between boosting innovation and establishing guardrails to support trust, fairness, and accountability in the most remarkable new technology in years. The bill continues to be a work in progress, and I look forward to working with all stakeholders in the coming weeks to refine this proposal into the most scientific and fair law it can be.”
As AI advances, risks and benefits grow
Recent advances in AI have delivered breakthrough benefits across several industries, from accelerating drug discovery and medical diagnostics to improving climate modeling and wildfire prediction. AI systems are revolutionizing education, increasing agricultural productivity, and helping solve complex scientific challenges.
However, the world’s most advanced AI companies and researchers acknowledge that as their models become more powerful, they also pose increasing risks of catastrophic damage. The Working Group report states:
Evidence that foundation models contribute to both chemical, biological, radiological, and nuclear (CBRN) weapons risks for novices and loss of control concerns has grown, even since the release of the draft of this report in March 2025. Frontier AI companies’ [including OpenAI and Anthropic] own reporting reveals concerning capability jumps across threat categories.
To address these risks, AI developers like Meta, Google, OpenAI, and Anthropic have entered voluntary commitments to conduct safety testing and establish robust safety and security protocols. Several California-based frontier AI developers have designed industry-leading safety practices including safety evaluations and cybersecurity protections. SB 53 codifies these voluntary commitments to establish a level playing field and ensure greater accountability across the industry.
Background on the report
Governor Newsom convened the Joint California Policy Working Group on AI Frontier Models in September 2024, following his veto of Senator Wiener’s SB 1047, tasking the group to “help California develop workable guardrails for deploying GenAI, focusing on developing an empirical, science-based trajectory analysis of frontier models and their capabilities and attendant risks.”
The Working Group is led by experts including the “godmother of AI” Dr. Fei-Fei Li, Co-Director of the Stanford Institute for Human-Centered Artificial Intelligence; Dr. Mariano-Florentino Cuéllar, President of the Carnegie Endowment for International Peace; and Dr. Jennifer Tour Chayes, Dean of the UC Berkeley College of Computing, Data Science, and Society.
On June 17, the Working Group released their Final Report. While the report does not endorse specific legislation, it promotes a “trust, but verify” framework to establish guardrails that reduce material risks while supporting continued innovation.
SB 53 balances AI risk with benefits
Drawing on recommendations of the Working Group Report, SB 53:
- Establishes transparency into large companies’ safety and security protocols and risk evaluations. Companies will be required to publish their safety and security protocols and risk evaluations in redacted form to protect intellectual property.
- Mandates reporting of critical safety incidents (e.g., model-enabled CBRN threats, major cyber-attacks, or loss of model control) within 15 days to the Attorney General.
- Protects employees and contractors who reveal evidence of critical risk or violations of the act by AI developers.
The bill’s provisions apply only to a small number of well-resourced companies, and only to the most advanced models. The Attorney General has the power to update the thresholds governing which companies are covered under the bill to ensure the requirements keep up with rapid advancements in the field, but must cover only well-resourced companies at the frontier of AI development.
Under SB 53, the Attorney General imposes civil penalties for violations of the act. SB 53 does not impose any new liability for harms caused by AI systems.
In addition, SB 53 creates CalCompute, a research cluster to support startups and researchers developing large-scale AI. The bill helps California secure its global leadership as states like New York establish their own AI research clusters.
SB 53 is sponsored by Encode AI, Economic Security Action California, and the Secure AI Project.
SB 53 is supported by a broad coalition of researchers, industry leaders, and civil society advocates:
“California has long been the birthplace of major tech innovations. SB 53 will help keep it that way by ensuring AI developers responsibly build frontier AI models,” said Sneha Revanur, president and founder of Encode AI, a co-sponsor of the bill. “This bill reflects a common-sense consensus on AI development, promoting transparency around companies’ safety and security practices.”
“At Elicit, we build AI systems that help researchers make evidence-based decisions by analyzing thousands of academic papers,” said Andreas Stuhlmüller, CEO of Elicit. “This work has taught me that transparency is essential for AI systems that people rely on for critical decisions. SB53’s requirements for safety protocols and transparency reports are exactly what we need as AI becomes more powerful and widespread. As someone who’s spent years thinking about how AI can augment human reasoning, I believe this legislation will accelerate responsible innovation by creating clear standards that make future technology more trustworthy.”
“I have devoted my life to advancing the field of AI, but in recent years it has become clear that the risks it poses could threaten us all,” said Geoffrey Hinton, University of Toronto Professor Emeritus, Turing Award winner, Nobel laureate, and a “godfather of AI.” “Greater transparency requirements into how companies are addressing safety concerns from the most powerful technology of our time is an important step towards addressing those risks.”
“SB 53 is a smart, targeted step forward on AI safety, security, and transparency,” said Bruce Reed, Head of AI at Common Sense Media. “We thank Senator Wiener for reinforcing California’s strong commitment to innovation and accountability.”
“AI can bring tremendous benefits, but only if we steer it wisely. Recent evidence shows that frontier AI systems can resort to deceptive behavior like blackmail and cheating to avoid being shut down or fulfill other objectives,” said Yoshua Bengio, Full Professor at Université de Montréal, Co-President and Scientific Director of LawZero, Turing Award winner and a “godfather of AI.” “These risks must be taken with the utmost seriousness alongside other existing and emerging threats. By advancing SB 53, California is uniquely positioned to continue supporting cutting-edge AI while proactively taking a step towards addressing these severe and potentially irreversible harms.”
“Including safety and transparency protections recommended by Gov. Newsom’s AI commission in SB 53 is an opportunity for California to be on the right side of history and advance commonsense AI regulations while our national leaders dither,” said Teri Olle, Director of Economic Security California Action, a co-sponsor of the bill. “In addition to making sure AI is safe, the bill would create a public option for cloud computing – the critical infrastructure necessary to fuel innovation and research. CalCompute would democratize access to this powerful resource that is currently enjoyed by a tiny handful of wealthy tech companies, and ensure that AI benefits the public. With inaction from the federal government – and on the heels of the defeat of the proposed 10-year moratorium on AI regulations – California should act now and get this done.”
“The California Report on Frontier AI Policy underscored the growing consensus for the importance of transparency into the safety practices of the largest AI developers,” said Thomas Woodside, Co-Founder and Senior Policy Advisor, Secure AI Project, a co-sponsor of the bill. “SB 53 ensures exactly that: visibility into how AI developers are keeping their AI systems secure and Californians safe.”
“Reasonable people can disagree about many aspects of AI policy, but one thing is clear: reporting requirements and whistleblower protections like those in SB 53 are sensible steps to provide transparency, inform the public, and deter egregious practices without interfering with innovation,” said Steve Newman, Technical co-founder of eight technology startups, including Writely – which became Google Docs, and co-creator of Spectre, one of the most influential video games of the 1990s.
###
AI Research
Artificial Intelligence in Cataract Surgery and Optometry at Large with Harvey Richman, OD, and Rebecca Wartman, OD
At the 2025 American Optometric Association Conference in Minneapolis, MN, Harvey Richman, OD, Shore Family Eyecare, and Rebecca Wartman, OD, optometrist chair of AOA Coding and Reimbursement Committee, presented their lecture on the implementation of artificial intelligence (AI) devices in cataract surgery and optometry at large.1
AI has been implemented in a variety of ophthalmology fields already, from analyzing and interpreting ocular imaging to determining the presence of diseases or disorders of the retina or macula. Recent studies have tested AI algorithms in analyzing fundus fluorescein angiography, finding the programs extremely effective at enhancing clinical efficiency.2
However, there are concerns as to the efficacy and reliability of AI programs, given their propensity for hallucination and misinterpretation. To that end, Drs. Richman and Wartman presented a study highlighting the present and future possibilities of AI in cataract surgery, extrapolating its usability to optometry as a whole.
Richman spoke to the importance of research in navigating the learning curve of AI technology. With the rapid advancements and breakneck pace of implementation, Richman points out the relative ease with which an individual can fall behind on the latest developments and technologies available to them.
“The problem is that the technology is advancing much quicker than the people are able to adapt to it,” Richman told HCPLive. “There’s been research done on AI for years and years; unfortunately, the implementation just hasn’t been as effective.”
Wartman warned against the potential for AI to take too much control in a clinical setting. She cautioned that clinicians should be wary of letting algorithms make all of the treatment decisions, and should make sure they have a way to undo those decisions.
“I think they need to be very well aware of what algorithms the AI is using to get to its interpretations and be a little cautious when the AI does all of the decision making,” Wartman said. “Make sure you know how to override that decision making.”
Richman went on to discuss the 3 major levels of AI: assistive technology, augmented technology, and autonomous intelligence.
“Some of those are just bringing out data, some of them bring data and make recommendations for treatment protocol, and the third one can actually make the diagnosis and treatment protocol and implement it without a physician even involved,” Richman said. “In fact, the first artificial intelligence code that was approved by CPT had to do with diabetic retina screening, and it is autonomous. There is no physician work involved in that.”
Wartman also informed HCPLive that a significant amount of surgical technology is already using artificial intelligence, mainly in the form of pattern recognition software and predictive devices.
“A lot of our equipment is already using some form of artificial intelligence, or at least algorithms to give you patterns and tell you whether it’s inside or outside the norm,” Wartman said.
References
1. Richman H, Wartman R. A.I. in Cataract Surgery. Presented at the 2025 American Optometric Association Conference; Minneapolis, MN; June 25-28, 2025.
2. Shao A, Liu X, Shen W, et al. Generative artificial intelligence for fundus fluorescein angiography interpretation and human expert evaluation. NPJ Digit Med. 2025;8(1):396. Published 2025 Jul 2. doi:10.1038/s41746-025-01759-z
AI Research
Artificial intelligence could hire you. Now it could also fire you
Use of artificial intelligence in the job candidate interview and hiring process, at least at some level, is becoming more common at U.S. companies. Proponents say it saves time, filters out candidates that aren’t qualified for the job and presents hiring managers with the most suitable pool of candidates.
Opponents say AI has shown bias in candidate selection, and falls short of judging applicants on softer skills and personality traits.
AI is now finding its way into managing employees long after they’ve been hired, and that too is raising concerns.
A survey of more than 1,300 office managers with direct reports conducted by Resume Builder found a majority are now using AI to make personnel decisions, including promotions, raises and even terminations.
“It’s one thing if you are using it for some sort of transactional thing in your job, but now we’re talking about peoples’ livelihoods and their jobs,” said Stacie Haller, chief career coach at Resume Builder. “My hope is that the human part of the process in Human Resources and overseeing peoples’ careers don’t just become left up to AI.”
Haller said overreliance on artificial intelligence in making high-stakes personnel decisions can become a slippery slope for companies.
“It also leads the organization to have some liabilities if somebody feels they were unfairly fired or didn’t get a raise, and it was AI and the information wasn’t correct,” she said. “I think there are some liabilities there.”
In the survey, six in 10 managers said they rely on AI to make decisions about the employees they manage, including 78% who said they use AI to determine raises, 77% for promotions, 66% for layoffs, and even 64% for terminations.
Most concerning, two-thirds of managers using AI to manage employees said they have not received any formal AI training, according to the survey.
“Organizations need to find some uniformity and training and build this in like they build in any other process,” Haller said. “And it has to be verified. But when it comes to peoples’ careers and lives, I think the human aspect needs to play a bigger piece here.”
An overwhelming majority of HR managers surveyed said they do maintain control over AI recommendations.
“The good news is, most of these folks have told us that if they don’t agree with the decision, they will override it,” Haller said. “But it seems that too many in our surveys are leaning to use it in that direction, and it feels a little Wild West out there.”
When asked which tool they rely on most, ChatGPT was cited by 53% of managers, followed by 29% for Microsoft’s Copilot and 16% for Google’s Gemini.
Most are also using AI for productive personnel tasks that don’t directly affect careers, such as creating training materials, employee development plans, and drafting performance improvement plans.
Results from Resume Builder’s survey on HR manager use of artificial intelligence are online.