AI Research
What 300GB of AI Research Reveals About the True Limits of “Zero-Shot” Intelligence
Authors:
(1) Vishaal Udandarao, Tübingen AI Center, University of Tübingen, and University of Cambridge (equal contribution);
(2) Ameya Prabhu, Tübingen AI Center, University of Tübingen, and University of Oxford (equal contribution);
(3) Adhiraj Ghosh, Tübingen AI Center, University of Tübingen;
(4) Yash Sharma, Tübingen AI Center, University of Tübingen;
(5) Philip H.S. Torr, University of Oxford;
(6) Adel Bibi, University of Oxford;
(7) Samuel Albanie, University of Cambridge (equal advising, order decided by a coin flip);
(8) Matthias Bethge, Tübingen AI Center, University of Tübingen (equal advising, order decided by a coin flip).
Table of Links
2 Concepts in Pretraining Data and Quantifying Frequency
3 Comparing Pretraining Frequency & “Zero-Shot” Performance and 3.1 Experimental Setup
3.2 Result: Pretraining Frequency is Predictive of “Zero-Shot” Performance
4.2 Testing Generalization to Purely Synthetic Concept and Data Distributions
5 Additional Insights from Pretraining Concept Frequencies
6 Testing the Tail: Let It Wag!
8 Conclusions and Open Problems, Acknowledgements, and References
Part I
Appendix
A. Concept Frequency is Predictive of Performance Across Prompting Strategies
B. Concept Frequency is Predictive of Performance Across Retrieval Metrics
C. Concept Frequency is Predictive of Performance for T2I Models
D. Concept Frequency is Predictive of Performance across Concepts only from Image and Text Domains
F. Why and How Do We Use RAM++?
G. Details about Misalignment Degree Results
I. Classification Results: Let It Wag!
Abstract
Web-crawled pretraining datasets underlie the impressive “zero-shot” evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable Diffusion for image generation. However, it is unclear how meaningful the notion of “zero-shot” generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted during “zero-shot” evaluation. In this work, we ask: How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets?
We comprehensively investigate this question across 34 models and five standard pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data artifacts. We consistently find that, far from exhibiting “zero-shot” generalization, multimodal models require exponentially more data to achieve linear improvements in downstream “zero-shot” performance, following a sample-inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between pretraining and downstream datasets [79] and when testing on purely synthetic data distributions [51]. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly. We contribute this long-tail test set as the Let It Wag! benchmark to further research in this direction. Taken together, our study reveals an exponential need for training data, which implies that the key to “zero-shot” generalization capabilities under large-scale training paradigms remains to be found.
1 Introduction
Multimodal models like CLIP [91] and Stable Diffusion [96] have revolutionized performance on downstream tasks—CLIP is now the de facto standard for “zero-shot” image recognition [133, 72, 126, 48, 132] and image-text retrieval [46, 64, 24, 117, 129], while Stable Diffusion is now the de facto standard for “zero-shot” text-to-image (T2I) generation [93, 17, 96, 41]. In this work, we investigate this empirical success through the lens of zero-shot generalization [69], which refers to the ability of a model to apply its learned knowledge to new, unseen concepts. Accordingly, we ask: Are current multimodal models truly capable of “zero-shot” generalization?
To address this, we conducted a comparative analysis involving two main factors: (1) the performance of models across various downstream tasks and (2) the frequency of test concepts within their pretraining datasets. We compiled a comprehensive list of 4,029 concepts[1] from 27 downstream tasks spanning classification, retrieval, and image generation, and assessed model performance on these concepts. Our analysis spanned five large-scale pretraining datasets with different scales, data curation methods, and sources (CC-3M [107], CC-12M [27], YFCC-15M [113], LAION-Aesthetics [103], LAION-400M [102]), and evaluated the performance of 10 CLIP models and 24 T2I models spanning different architectures and parameter scales. Across all our experiments, we consistently find that the frequency of a concept in the pretraining dataset is a strong predictor of the model’s performance on test examples containing that concept. Notably, model performance improves linearly as the concept frequency in pretraining data grows exponentially, i.e., we observe a consistent log-linear scaling trend. This log-linear trend is robust to controlling for correlated factors (similar samples in pretraining and test data [79]) and to testing across different concept distributions, including samples generated entirely synthetically [51].
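As a concrete illustration of this analysis, the following is a minimal sketch (not the authors’ actual pipeline) of the two steps described above: counting how often each evaluation concept appears in a pretraining corpus’s captions, and fitting a log-linear trend between per-concept accuracy and log-frequency. The crude substring matching, toy data, and numpy regression are assumptions for illustration only; the paper’s own frequency estimation (Sec. 2) additionally tags images with RAM++.

```python
# Minimal sketch, assuming captions are a plain list of strings and per-concept
# accuracies come from a separate "zero-shot" evaluation. Substring matching is
# a stand-in for the paper's part-of-speech + RAM++ frequency estimation.
import numpy as np

def concept_frequencies(captions, concepts):
    """Count how many captions mention each concept (crude substring match)."""
    freqs = {c: 0 for c in concepts}
    for cap in captions:
        lowered = cap.lower()
        for c in concepts:
            if c.lower() in lowered:
                freqs[c] += 1
    return freqs

def fit_log_linear(freqs, accuracies, eps=1.0):
    """Fit accuracy ~ a * log(frequency) + b across concepts."""
    concepts = list(accuracies)
    x = np.log(np.array([freqs[c] for c in concepts], dtype=float) + eps)
    y = np.array([accuracies[c] for c in concepts])
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

# Toy usage: higher-frequency concepts tend to have higher accuracy.
captions = ["a photo of a dog on grass", "a cat sleeping", "a dog and a cat"]
accuracies = {"dog": 0.92, "cat": 0.90, "axolotl": 0.35}
freqs = concept_frequencies(captions, ["dog", "cat", "axolotl"])
print(fit_log_linear(freqs, accuracies))
```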
Our findings indicate that the impressive empirical performance of multimodal models like CLIP and Stable Diffusion can be largely attributed to the presence of test concepts within their vast pretraining datasets, thus their reported empirical performance does not constitute “zero-shot” generalization. Quite the contrary, these models require exponentially more data on a concept to linearly improve their performance on tasks pertaining to that concept, highlighting extreme sample inefficiency.
In our analysis, we additionally document the distribution of concepts encountered in pretraining data and find that:
• Concept Distribution: Across all pretraining datasets, the distribution of concepts is long-tailed (see Fig. 5 in Sec. 5), which indicates that a large fraction of concepts are rare. However, given the extreme sample inefficiency observed, what is rare is not properly learned during multimodal pretraining.
• Concept Correlation across Pretraining Datasets: The distributions of concepts across different pretraining datasets are strongly correlated (see Tab. 4 in Sec. 5), which suggests that web crawls yield surprisingly similar concept distributions across different pretraining data curation strategies, necessitating explicit rebalancing efforts [11, 125] (see the sketch after this list).
• Image-Text Misalignment between Concepts in Pretraining Data: Concepts often appear in one modality but not the other, which implies significant misalignment (see Tab. 3 in Sec. 5). Our released data artifacts can help image-text alignment efforts at scale by precisely indicating the examples in which modalities misalign. Note that the log-linear trend across both modalities is robust to this misalignment.
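The cross-dataset correlation finding in the second bullet can be illustrated with a short, hedged sketch: given per-concept counts from two pretraining corpora, compute a rank correlation over their shared concept vocabulary. The use of Spearman correlation and the toy counts below are assumptions for illustration; the actual correlation values are reported in Tab. 4.

```python
# Illustrative sketch of the cross-dataset correlation check (not the paper's code).
from scipy.stats import spearmanr

def frequency_correlation(freqs_a, freqs_b):
    """Spearman rank correlation of per-concept frequencies across two datasets."""
    shared = sorted(set(freqs_a) & set(freqs_b))
    rho, pvalue = spearmanr([freqs_a[c] for c in shared],
                            [freqs_b[c] for c in shared])
    return rho, pvalue

# Toy counts for two hypothetical corpora with similar long-tailed shapes.
small_corpus = {"dog": 50_000, "cat": 42_000, "axolotl": 12, "quokka": 3}
large_corpus = {"dog": 9_000_000, "cat": 7_500_000, "axolotl": 1_800, "quokka": 400}
print(frequency_correlation(small_corpus, large_corpus))  # rho close to 1.0
```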
To provide a simple benchmark of generalization performance for multimodal models that controls for concept frequency in the training set, we introduce a new long-tailed test dataset called “Let It Wag!”. Current models trained on both openly available datasets (e.g., LAION-2B [103], DataComp-1B [46]) and closed-source datasets (e.g., OpenAI-WIT [91], WebLI [29]) show significant drops in performance on it, providing evidence that our observations may also transfer to closed-source datasets. We publicly release all our data artifacts (over 300GB), amortising the cost of analyzing the pretraining datasets of multimodal foundation models for a more data-centric understanding of their properties in the future.
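As a toy sketch of how such a long-tailed benchmark could be seeded (this is not the authors’ construction procedure, which is described in Sec. 6), one could aggregate per-concept counts across the analyzed pretraining datasets and keep only the rarest concepts as the evaluation vocabulary:

```python
# Hypothetical sketch: pick the k least frequent concepts across datasets as
# candidates for a long-tailed test set. freqs_per_dataset would be the
# per-dataset concept counts computed earlier.
def tail_concepts(freqs_per_dataset, k=200):
    """Return the k concepts with the lowest total frequency across datasets."""
    totals = {}
    for freqs in freqs_per_dataset:
        for concept, count in freqs.items():
            totals[concept] = totals.get(concept, 0) + count
    return sorted(totals, key=totals.get)[:k]
```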
Several prior works [91, 46, 82, 42, 83, 74] have investigated the role of pretraining data in affecting performance. Mayilvahanan et al. [79] showed that CLIP’s performance is correlated with the similarity between training and test datasets. In other studies on specific areas like question-answering [62] and numerical reasoning [94] in large language models, high train-test similarity did not fully account for observed performance levels [127]. Our comprehensive analysis of several pretraining image-text datasets significantly adds to this line of work by (1) showing that concept frequency determines zero-shot performance and (2) pinpointing the exponential need for training data as a fundamental issue for current large-scale multimodal models. We conclude that the key to “zero-shot” generalization capabilities under large-scale training paradigms remains to be found.
[1] Class categories for classification tasks, objects in the text captions for retrieval tasks, and objects in the text prompts for generation tasks; see Sec. 2 for more details on how we define concepts.
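For the caption-side concepts described in this footnote, a minimal sketch of one possible extraction step is shown below; it assumes the spaCy English model as a stand-in for the paper’s part-of-speech tagging, and it does not cover the image-side RAM++ tagging discussed in Appendix F.

```python
# Illustrative only: extract candidate concepts (noun lemmas) from a caption.
# Assumes spaCy and the "en_core_web_sm" model are installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_concepts(caption):
    """Return lowercased noun lemmas from a caption as candidate concepts."""
    doc = nlp(caption)
    return sorted({tok.lemma_.lower() for tok in doc if tok.pos_ == "NOUN"})

print(extract_concepts("A golden retriever chasing a frisbee on the beach"))
# e.g. ['beach', 'frisbee', 'retriever'] (exact output depends on the model version)
```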
AI Research
Amadeus announces Demand360® and MeetingBroker® to be enhanced with artificial intelligence
Amadeus has partnered with Microsoft and is leveraging OpenAI’s models on Azure to develop a suite of AI integrations that enhance its Hospitality portfolio. The two latest AI tools will give hoteliers of any background easy access to industry-leading insights and dramatically improve the efficiency of group bookings.
Amadeus Advisor chat is coming to Demand360: Making sophisticated insights instantly available
To help hoteliers stay agile and respond quickly to the fast-changing travel industry, Amadeus is integrating Advisor Chat, its Gen AI chatbot, into its industry-leading Demand360 data product. Powered by Azure OpenAI, Advisor chat offers immediate and intuitive access to crucial insights for teams across various functions, including sales, operations, marketing, and distribution.
Demand360 currently captures the most comprehensive view of the hospitality market to inform hotel strategies. Based on insights from 44,000 hotels and 35 million short-term rental properties, Demand360 provides a 12-month, forward-looking view of a hotel’s occupancy and its market ranking as well as two years of retrospective data.
Amadeus Advisor chat was rolled out to Amadeus Agency360® in 2024. In the year since, customers have enjoyed instantaneous insights. In some cases, Amadeus Advisor has saved analysts approximately a day each week as the bulk of requests can now be handled directly by the wider team.
Amadeus plans to make Advisor available within Microsoft Teams, making it easier than ever to understand performance and make informed decisions.
Transforming group sales with AI: Email to RFP
Amadeus is introducing new AI functionality, Email to RFP, within MeetingBroker to help hotels streamline the handling of inbound group booking requests, a valuable, growing segment of the market.
With Email to RFP, customers will be able to email inbound RFPs directly to MeetingBroker, where AI is then used to evaluate them and create instant responses. To provide accurate, up-to-date information that is specific to each location, Email to RFP will be trained to retrieve additional, relevant information from reliable sources. Email to RFP is powered by Azure OpenAI.
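The article does not describe how Email to RFP is implemented, but a hypothetical sketch of the kind of Azure OpenAI call such a feature might make is shown below; the endpoint, deployment name, and field list are assumptions for illustration only, not Amadeus’ actual integration.

```python
# Hypothetical sketch only: turn a free-text group-booking email into structured
# RFP fields with Azure OpenAI. Endpoint, deployment name, and fields are assumed.
import json
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<your-key>",
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

def email_to_rfp(email_body: str) -> dict:
    """Extract group-booking details from an email as a JSON dictionary."""
    response = client.chat.completions.create(
        model="<your-deployment>",  # name of an Azure chat model deployment (assumed)
        messages=[
            {"role": "system",
             "content": ("Extract group booking details as JSON with keys: "
                         "event_name, dates, attendees, rooms_needed, meeting_space.")},
            {"role": "user", "content": email_body},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```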
Omni Atlanta Hotel, the first pilot customer, has seen significant returns with faster responses and near autonomous RFP handling.
This builds on the current functionalities of Amadeus MeetingBroker, a centralized hub for managing all group inquiries, no matter how or where they originate. By consolidating leads into a single workflow, MeetingBroker helps hotel sales teams respond faster, reduce missed opportunities, and convert more business.
Amadeus plans to introduce individual AI agents for each of its products, helping travel companies to gain more value by answering queries more easily and more quickly. Amadeus is also working to develop AI agents that will draw on multiple sources when responding to queries, unlocking new levels of insight from across Amadeus’ portfolio.
“As an industry, we’re at an important juncture where the next year of AI development and implementation will shape decades of travel and hospitality. It’s becoming increasingly clear that AI is here to make sense of complexity and support productivity in order to enhance efficiency, return on investment and ultimately increase conversions,” says Francisco Pérez-Lozao Rüter, President of Hospitality, Amadeus.
AI Research
Daily Research News Online no. 38503
Funds for Ambitious AI Research Tech Firm GetWhy
July 8 2025
Danish GenAI-based consumer research tech company GetWhy has raised EUR 17m in an additional Series A round of funding, with which to pursue its goal of ‘becoming the standard in AI-driven consumer insights’ – and specifically to open a US office this year.
The company taps AI to help clients conduct studies and extract insights from video-based interviews. Based in Copenhagen, it was established in 2011 as UserTribe, and pivoted from an MR services company to a software-based model, with the new name Sonar, after current CEO Casper Henningsen joined in 2017. In January 2024 it changed its name to GetWhy to avoid confusion with another company, and a year ago it secured $34.5m in Series A funding from PeakSpan Capital, which took a ‘significant’ minority stake in the company.
The new funding was also led by PeakSpan, along with Arbejdernes Landsbank. Co-founder and CEO Casper Henningsen (pictured) comments: ‘This increased investment from PeakSpan Capital underscores how rapidly the field of consumer research is evolving and gives GetWhy the fuel to accelerate our mission and cement our leadership position globally. We see a clear opportunity to lead the enterprise market and drive AI-fueled transformation for a vast universe of global brands.’
Of the forthcoming office launch, Henningsen says, ‘It’s crucial for GetWhy to establish a presence in the US… Some 60% of our revenue comes from here, so this is a natural and necessary step for us to get closer to our current and potential American customers and solidify our position as a market leader driving the industry forward.’
The firm is online at www.getwhy.io.
AI Research
Instagram wrongly says some users breached child sex abuse rules
Technology Reporter
Instagram users have told the BBC of the “extreme stress” of having their accounts banned after being wrongly accused by the platform of breaching its rules on child sexual exploitation.
The BBC has been in touch with three people who were told by parent company Meta that their accounts were being permanently disabled, only to have them reinstated shortly after their cases were highlighted to journalists.
“I’ve lost endless hours of sleep, felt isolated. It’s been horrible, not to mention having an accusation like that over my head,” one of the men told BBC News.
Meta declined to comment.
BBC News has been contacted by more than 100 people who claim to have been wrongly banned by Meta.
Some talk of a loss of earnings after being locked out of their business pages, while others highlight the pain of no longer having access to years of pictures and memories. Many point to the impact it has had on their mental health.
Over 27,000 people have signed a petition that accuses Meta’s moderation system, powered by artificial intelligence (AI), of falsely banning accounts and then having an appeal process that is unfit for purpose.
Thousands of people are also in Reddit forums dedicated to the subject, and many users have posted on social media about being banned.
Meta has previously acknowledged a problem with Facebook Groups but denied its platforms were more widely affected.
‘Outrageous and vile’
The BBC has changed the names of the people in this piece to protect their identities.
David, from Aberdeen in Scotland, was suspended from Instagram on 4 June. He was told he had not followed Meta’s community standards on child sexual exploitation, abuse and nudity.
He appealed that day, and his Instagram account, along with his associated Facebook and Facebook Messenger accounts, was then permanently disabled.
David found a Reddit thread, where many others were posting that they had also been wrongly banned over child sexual exploitation.
“We have lost years of memories, in my case over 10 years of messages, photos and posts – due to a completely outrageous and vile accusation,” he told BBC News.
He said Meta was “an embarrassment”, with AI-generated replies and templated responses to his questions. He still has no idea why his account was banned.
“I’ve lost endless hours of sleep, extreme stress, felt isolated. It’s been horrible, not to mention having an accusation like that over my head.
“Although you can speak to people on Reddit, it is hard to go and speak to a family member or a colleague. They probably don’t know the context that there is a ban wave going on.”
The BBC raised David’s case to Meta on 3 July, as one of a number of people who claimed to have been wrongly banned over child sexual exploitation. Within hours, his account was reinstated.
In a message sent to David, and seen by the BBC, the tech giant said: “We’re sorry that we’ve got this wrong, and that you weren’t able to use Instagram for a while. Sometimes, we need to take action to help keep our community safe.”
“It is a massive weight off my shoulders,” said David.
Faisal was banned from Instagram on 6 June over alleged child sexual exploitation and, like David, found his Facebook account suspended too.
The student from London is embarking on a career in the creative arts, and was starting to earn money via commissions on his Instagram page when it was suspended. He appealed, feeling he had done nothing wrong, and his account was then banned a few minutes later.
He told BBC News: “I don’t know what to do and I’m really upset.
“[Meta] falsely accuse me of a crime that I have never done, which also damages my mental state and health and it has put me into pure isolation throughout the past month.”
His case was also raised with Meta by the BBC on 3 July. About five hours later, his accounts were reinstated. He received the exact same email as David, with the apology from Meta.
He told BBC News he was “quite relieved” after hearing the news. “I am trying to limit my time on Instagram now.”
Faisal said he remained upset over the incident, and is now worried the account ban might come up if any background checks are made on him.
A third user, Salim, told BBC News that he also had accounts falsely banned for child sexual exploitation violations.
He highlighted his case to journalists, stating that appeals are “largely ignored”, business accounts were being affected, and AI was “labelling ordinary people as criminal abusers”.
Almost a week after he was banned, his Instagram and Facebook accounts were reinstated.
What’s gone wrong?
When asked by BBC News, Meta declined to comment on the cases of David, Faisal, and Salim, and did not answer questions about whether it had a problem with wrongly accusing users of child abuse offences.
It seems in one part of the world, however, it has acknowledged there is a wider issue.
The BBC has learned that the chair of the Science, ICT, Broadcasting, and Communications Committee at the National Assembly in South Korea said last month that Meta had acknowledged the possibility of wrongful suspensions for people in her country.
Dr Carolina Are, a blogger and researcher into social media moderation at Northumbria University, said it was hard to know what the root of the problem was because Meta was not being open about it.
However, she suggested it could be due to recent changes to the wording of some of its community guidelines and an ongoing lack of a workable appeal process.
“Meta often don’t explain what it is that triggered the deletion. We are not privy to what went wrong with the algorithm,” she told BBC News.
In a previous statement, Meta said: “We take action on accounts that violate our policies, and people can appeal if they think we’ve made a mistake.”
Meta, in common with all big technology firms, has come under increased pressure in recent years from regulators and authorities to make its platforms safe spaces.
Meta told the BBC it used a combination of people and technology to find and remove accounts that broke its rules, and was not aware of a spike in erroneous account suspensions.
Meta says its child sexual exploitation policy relates to children and “non-real depictions with a human likeness”, such as art, content generated by AI or fictional characters.
Meta also told the BBC a few weeks ago it uses technology to identify potentially suspicious behaviours, such as adult accounts being reported by teen accounts, or adults repeatedly searching for “harmful” terms.
Meta states that when it becomes aware of “apparent child exploitation”, it reports it to the National Center for Missing and Exploited Children (NCMEC) in the US. NCMEC told BBC News it makes all of those reports available to law enforcement around the world.