FACTS Grounding: A new benchmark for evaluating the factuality of large language models

Responsibility & Safety

Authors: FACTS team

Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations.

Large language models (LLMs) are transforming how we access information, yet their grip on factual accuracy remains imperfect. They can “hallucinate” false information, particularly when given complex inputs. In turn, this can erode trust in LLMs and limit their applications in the real world.

Today, we’re introducing FACTS Grounding, a comprehensive benchmark for evaluating the ability of LLMs to generate responses that are not only factually accurate with respect to given inputs, but also sufficiently detailed to provide satisfactory answers to user queries.

We hope our benchmark will spur industry-wide progress on factuality and grounding. To track progress, we’re also launching the FACTS leaderboard on Kaggle. We’ve already tested leading LLMs using FACTS Grounding and have populated the initial leaderboard with their grounding scores. We will maintain and update the leaderboard as the field advances.

Current leaderboard ranking

FACTS Grounding dataset

To accurately evaluate the factuality and grounding of any given LLM, the FACTS Grounding dataset comprises 1,719 examples, each carefully crafted to require long-form responses grounded in the context document provided. Each example consists of a document, a system instruction requiring the LLM to exclusively reference the provided document, and an accompanying user request.

An example from the FACTS Grounding dataset
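
To make the structure concrete, below is a minimal sketch of how a single benchmark example might be represented in code. The schema, field names, and sample values are our own illustrative assumptions, not the dataset's actual format.

```python
from dataclasses import dataclass

@dataclass
class GroundingExample:
    """One FACTS Grounding example (illustrative schema, not the official format)."""
    document: str            # context document, up to ~32,000 tokens
    system_instruction: str  # pins the model to the provided document
    user_request: str        # e.g. a summarization, Q&A, or rewriting task

example = GroundingExample(
    document="<full text of a finance, technology, retail, medical, or legal document>",
    system_instruction="Answer using only information contained in the provided document.",
    user_request="Summarize the key risk factors discussed in the document.",
)
```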

All examples are divided into a “public” set (860 examples) and a “private” held-out set (859 examples). We are releasing the public set today so anyone can use it to evaluate an LLM. Of course, we know that benchmark contamination and leaderboard hacking are important to protect against, so, following standard industry practice, we are keeping the private evaluation set held out. The FACTS leaderboard scores are the average performance across both public and private sets.

To ensure a diversity of inputs, the FACTS Grounding examples include documents of varying lengths, up to a maximum of 32,000 tokens (roughly 20,000 words), covering domains such as finance, technology, retail, medicine, and law. The user requests are similarly wide-ranging, including requests for summarization, Q&A generation, and rewriting tasks. We did not include any examples that could require creativity, mathematics, or complex reasoning – capabilities that would demand more advanced reasoning from the model in addition to grounding.

Collective judgement by leading LLMs

To succeed on a given example, an LLM must synthesize the complex information in the document and generate a long-form response that is both a comprehensive answer to the user request and fully attributable to that document.

FACTS Grounding evaluates model responses automatically using three frontier LLM judges — namely Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet. We selected a combination of different judges to mitigate any potential bias of a judge giving higher scores to the responses produced by a member of its own model family. The automatic judge models were comprehensively evaluated against a held-out test set to find the best performing judging prompt templates and to verify agreement with human raters.

Each FACTS Grounding example is judged in two phases. First, responses are evaluated for eligibility, and disqualified if they don’t sufficiently address the user’s request. Second, responses are judged as factually accurate if they are fully grounded in information contained in the provided document, with no hallucinations.

Each judge model evaluates the eligibility and grounding accuracy of a given LLM response separately, and the results are then aggregated to determine whether the LLM has dealt with the example successfully. The final score for the overall grounding task is the average of all judge models’ scores across all examples. Find more details of our FACTS Grounding evaluation methodology in our paper.
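
As a rough sketch of this two-phase judging and aggregation, the logic might look like the following (reusing the `GroundingExample` sketch above). The `is_eligible` and `is_grounded` placeholders stand in for the prompted judge-model calls, and the scoring details are our assumption based on the description here, not DeepMind's actual implementation.

```python
from statistics import mean

JUDGES = ["gemini-1.5-pro", "gpt-4o", "claude-3.5-sonnet"]

def is_eligible(judge: str, user_request: str, response: str) -> bool:
    # Phase 1 placeholder: prompt `judge` to decide whether the response
    # sufficiently addresses the user's request.
    raise NotImplementedError

def is_grounded(judge: str, document: str, response: str) -> bool:
    # Phase 2 placeholder: prompt `judge` to decide whether every claim in
    # the response is supported by the provided document.
    raise NotImplementedError

def score_response(judge: str, example: "GroundingExample", response: str) -> float:
    """One judge's score for one example: ineligible responses fail outright;
    eligible responses score 1.0 only if fully grounded."""
    if not is_eligible(judge, example.user_request, response):
        return 0.0
    return 1.0 if is_grounded(judge, example.document, response) else 0.0

def final_grounding_score(examples, responses) -> float:
    """Final task score: the average of all judges' scores across all examples."""
    return mean(
        score_response(judge, ex, resp)
        for ex, resp in zip(examples, responses)
        for judge in JUDGES
    )
```

Gating eligibility before grounding means a response that ignores the user's request scores zero regardless of how faithful it is to the document, matching the two-phase description above.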

A factually correct response that fails to properly address the user’s request fails the benchmark example. Here we see three instances of model responses that the automated LLM judges considered ineligible

FACTS Grounding will continue to evolve

We are mindful that benchmarks can be quickly overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is just the beginning. Factuality and grounding are among the key factors that will shape the future success and usefulness of LLMs and broader AI systems, and we aim to grow and iterate FACTS Grounding as the field progresses, continually raising the bar.

We encourage the AI community to engage with FACTS Grounding, to evaluate their models on the open set of examples, or to submit their models for evaluation. We believe that comprehensive benchmarking methods, coupled with continuous research and development, will continue to improve AI systems.

Acknowledgements

FACTS is a collaboration between Google DeepMind and Google Research.
FACTS Grounding was led by: Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Dipanjan Das, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, and Nate Keating.

We are also very grateful for contributions from: Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, and Sasha Goldshtein.

We would also like to thank Avinatan Hassidim, D. Sculley, Fernando Pereira, Koray Kavukcuoglu, Slav Petrov, Ya Xu, and Yossi Matias for their continued support.



‘AI Learning Day’ spotlights smart campus and ecosystem co-creation

When artificial intelligence (AI) can help you retrieve literature, support your research, and even act as a “super assistant”, university education is undergoing a profound transformation.

On 9 September, XJTLU’s Centre for Knowledge and Information (CKI) hosted its third AI Learning Day, themed “AI-Empowered, Ecosystem-Co-created”. The event showcased the latest milestones of the University’s “Education + AI” strategy and offered in-depth discussions on the role of AI in higher education.

In her opening remarks, Professor Qiuling Chao, Vice President of XJTLU, said: “AI offers us an opportunity to rethink education, helping us create a learning environment that is fairer, more efficient and more personalised. I hope today’s event will inspire everyone to explore how AI technologies can be applied in your own practice.”

Professor Qiuling Chao

In his keynote speech, Professor Youmin Xi, Executive President of XJTLU, elaborated on the University’s vision for future universities. He stressed that future universities would evolve into human-AI symbiotic ecosystems, where learning would be centred on project-based co-creation and human-AI collaboration. The role of educators, he noted, would shift from transmitters of knowledge to mentors for both learning and life.

Professor Youmin Xi

At the event, Professor Xi’s digital twin, created by the XJTLU Virtual Engineering Centre in collaboration with the team led by Qilei Sun from the Academy of Artificial Intelligence, delivered Teachers’ Day greetings to all staff.

(Teachers’ Day message from President Xi’s digital twin)

“Education + AI” in diverse scenarios

This event also highlighted four case studies from different areas of the University. Dr Ling Xia from the Global Cultures and Languages Hub suggested that in the AI era, curricula should undergo de-skilling (assigning repetitive tasks to AI), re-skilling, and up-skilling, thereby enabling students to focus on in-depth learning in critical thinking and research methodologies.

Dr Xiangyun Lu from International Business School Suzhou (IBSS) demonstrated how AI teaching assistants and the University’s Junmou AI platform can offer students a customised and highly interactive learning experience, particularly for those facing challenges such as information overload and language barriers.

Dr Juan Li from the School of Science shared the concept of the “AI amplifier” for research. She explained that the “double amplifier” effect works in two stages: AI first amplifies students’ efficiency by automating tasks like literature searches and coding. These empowered students then become the second amplifier, freeing mentors from routine work so they can focus on high-level strategy. This human-AI partnership allows a small research team to achieve the output of a much larger one.

Jing Wang, Deputy Director of the XJTLU Learning Mall, showed how AI agents are already being used to support scheduling, meeting bookings, news updates and other administrative and learning tasks. She also announced that from this semester, all students would have access to the XIPU AI Agent platform.

Students and teachers are having a discussion at one of the booths

AI education system co-created by staff and students

The event’s AI interactive zone also drew significant attention from students and staff. From the Junmou AI platform to the E-Support chatbot, and from AI-assisted creative design to 3D printing, 10 exhibition booths demonstrated the integration of AI across campus life.

These innovative applications sparked lively discussions and thoughtful reflections among participants. In an interview, Thomas Durham from IBSS noted that, although he had rarely used AI before, the event was highly inspiring and motivated him to explore its use in both professional and personal life. He also shared his perspective on AI’s role in learning, stating: “My expectation for the future of AI in education is that it should help students think critically. My worry is that AI’s convenience and efficiency might make students’ understanding too superficial, since AI does much of the hard work for them. Hopefully, critical thinking will still be preserved.”

Year One student Zifei Xu was particularly inspired by the interdisciplinary collaboration on display at the event, remarking that it offered her a glimpse of a more holistic and future-focused education.

Dr Xin Bi, XJTLU’s Chief Officer of Data and Director of the CKI, noted that, supported by robust digital infrastructure such as the Junmou AI platform, more than 26,000 students and 2,400 staff are already using the University’s AI platforms. XJTLU’s digital transformation is advancing from informatisation and digitisation towards intelligentisation, with AI expected to empower teaching, research and administration, and to help staff and students leap from knowledge to wisdom.

Dr Xin Bi

“Looking ahead, we will continue to advance the deep integration of AI in education, research, administration and services, building a data-driven intelligent operations centre and fostering a sustainable AI learning ecosystem,” said Dr Xin Bi.

By Qinru Liu

Edited by Patricia Pieterse

Translated by Xiangyin Han




Philippine businesses slow to adopt AI, study finds – People Matters Global


Examining Tim Draper’s AI digital twin program – NBC Bay Area

Remember the hologram of Tupac Shakur that made headlines at Coachella in 2012?

It was a digital creation made to sing along on stage.

Now imagine a similar hologram, but one that can use artificial intelligence to bring us all the experience and knowledge of someone’s life — in this case, that of a well-known Silicon Valley venture capitalist.

“It’s going to change the way we think about the world, and we’ll evolve with it,” venture capitalist Tim Draper said.

A digital twin of Draper has been created. The so-called twin is a hologram powered by AI that has scanned everything about Draper.

The twin can answer questions in multiple locations at once; Draper’s twins are currently installed at Kennedy Airport in New York and at a Midwestern university.

Still, the twins have some learning to do, as they get the occasional question wrong.

The box holding each twin reportedly costs about $100,000.

Draper is so well known and regarded in tech circles that he has his own university, where the Silicon Valley venture capitalist now mentors young entrepreneurs.

One of the many things Draper has invested in deeply is AI, after making a splash with other big-name investments like Tesla and Robinhood.

“You’re seeing the excitement period of an industry being created,” Draper said. “We’re in that period of elation. Where wow, it’s blowing my mind.”

Draper is now offering some very simple advice to young techies to make sure they have the right skills to stay employed in the shifting Silicon Valley landscape.


