AI Research
What 300GB of AI Research Reveals About the True Limits of “Zero-Shot” Intelligence

Authors:
(1) Vishaal Udandarao, Tübingen AI Center, University of Tübingen, and University of Cambridge (equal contribution);
(2) Ameya Prabhu, Tübingen AI Center, University of Tübingen, and University of Oxford (equal contribution);
(3) Adhiraj Ghosh, Tübingen AI Center, University of Tübingen;
(4) Yash Sharma, Tübingen AI Center, University of Tübingen;
(5) Philip H.S. Torr, University of Oxford;
(6) Adel Bibi, University of Oxford;
(7) Samuel Albanie, University of Cambridge (equal advising, order decided by a coin flip);
(8) Matthias Bethge, Tübingen AI Center, University of Tübingen (equal advising, order decided by a coin flip).
Table of Links
2 Concepts in Pretraining Data and Quantifying Frequency
3 Comparing Pretraining Frequency & “Zero-Shot” Performance and 3.1 Experimental Setup
3.2 Result: Pretraining Frequency is Predictive of “Zero-Shot” Performance
4.2 Testing Generalization to Purely Synthetic Concept and Data Distributions
5 Additional Insights from Pretraining Concept Frequencies
6 Testing the Tail: Let It Wag!
8 Conclusions and Open Problems, Acknowledgements, and References
Part I
Appendix
A. Concept Frequency is Predictive of Performance Across Prompting Strategies
B. Concept Frequency is Predictive of Performance Across Retrieval Metrics
C. Concept Frequency is Predictive of Performance for T2I Models
D. Concept Frequency is Predictive of Performance across Concepts only from Image and Text Domains
F. Why and How Do We Use RAM++?
G. Details about Misalignment Degree Results
I. Classification Results: Let It Wag!
Abstract
Web-crawled pretraining datasets underlie the impressive “zero-shot” evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable Diffusion for image generation. However, it is unclear how meaningful the notion of “zero-shot” generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted during “zero-shot” evaluation. In this work, we ask: How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets?
We comprehensively investigate this question across 34 models and five standard pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data artifacts. We consistently find that, far from exhibiting “zero-shot” generalization, multimodal models require exponentially more data to achieve linear improvements in downstream “zero-shot” performance, following a sample-inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between pretraining and downstream datasets [79] and when testing on purely synthetic data distributions [51]. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly. We contribute this long-tail test set as the Let It Wag! benchmark to further research in this direction. Taken together, our study reveals an exponential need for training data, which implies that the key to “zero-shot” generalization capabilities under large-scale training paradigms remains to be found.
1 Introduction
Multimodal models like CLIP [91] and Stable Diffusion [96] have revolutionized performance on downstream tasks—CLIP is now the de-facto standard for “zero-shot” image recognition [133, 72, 126, 48, 132] and image-text retrieval [46, 64, 24, 117, 129], while Stable Diffusion is now the de-facto standard for “zero-shot” text-to-image (T2I) generation [93, 17, 96, 41]. In this work, we investigate this empirical success through the lens of zero-shot generalization [69], which refers to a model’s ability to apply its learned knowledge to new, unseen concepts. Accordingly, we ask: Are current multimodal models truly capable of “zero-shot” generalization?
To address this, we conducted a comparative analysis involving two main factors: (1) the performance of models across various downstream tasks and (2) the frequency of test concepts within their pretraining datasets. We compiled a comprehensive list of 4,029 concepts[1] from 27 downstream tasks spanning classification, retrieval, and image generation, and assessed model performance against these concepts. Our analysis spanned five large-scale pretraining datasets with different scales, data curation methods, and sources (CC-3M [107], CC-12M [27], YFCC-15M [113], LAION-Aesthetics [103], LAION-400M [102]), and evaluated 10 CLIP models and 24 T2I models spanning different architectures and parameter scales. Across all our experiments, we consistently find that the frequency of a concept in the pretraining dataset is a strong predictor of the model’s performance on test examples containing that concept. Notably, model performance scales linearly as the concept frequency in pretraining data grows exponentially, i.e., we observe a consistent log-linear scaling trend. We find that this log-linear trend is robust to controlling for correlated factors (similar samples in pretraining and test data [79]) and to testing across different concept distributions, including samples generated entirely synthetically [51].
Our findings indicate that the impressive empirical performance of multimodal models like CLIP and Stable Diffusion can be largely attributed to the presence of test concepts within their vast pretraining datasets, thus their reported empirical performance does not constitute “zero-shot” generalization. Quite the contrary, these models require exponentially more data on a concept to linearly improve their performance on tasks pertaining to that concept, highlighting extreme sample inefficiency.
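To make the scaling claim concrete, the following sketch (not the authors’ released code; the frequencies and accuracies are illustrative placeholders) fits per-concept “zero-shot” accuracy against the logarithm of pretraining concept frequency, i.e., the log-linear relationship described above.

```python
# A minimal sketch (not the paper's released code) of the log-linear fit described
# above: per-concept "zero-shot" accuracy regressed on log10(pretraining frequency).
# All frequencies and accuracies below are illustrative placeholders.
import numpy as np
from scipy.stats import pearsonr

pretraining_frequency = np.array([50, 500, 5_000, 50_000, 500_000], dtype=float)
zero_shot_accuracy = np.array([0.22, 0.35, 0.48, 0.61, 0.74])

# Log-linear scaling: accuracy improves roughly linearly in log(frequency),
# so each fixed additive gain in accuracy demands ~10x more pretraining occurrences.
log_freq = np.log10(pretraining_frequency)
slope, intercept = np.polyfit(log_freq, zero_shot_accuracy, deg=1)
corr, _ = pearsonr(log_freq, zero_shot_accuracy)

print(f"accuracy ~ {slope:.3f} * log10(frequency) + {intercept:.3f} (Pearson r = {corr:.2f})")
```

Under such a fit, moving a concept from 5,000 to 500,000 pretraining occurrences buys the same additive accuracy gain as moving it from 50 to 5,000, which is exactly the sample inefficiency highlighted above.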
In our analysis, we additionally document the distribution of concepts encountered in pretraining data and find that:
• Concept Distribution: Across all pretraining datasets, the distribution of concepts is long-tailed (see Fig. 5 in Sec. 5), which indicates that a large fraction of concepts are rare. However, given the extreme sample inefficiency observed, what is rare is not properly learned during multimodal pretraining.
• Concept Correlation across Pretraining Datasets: The distributions of concepts across different pretraining datasets are strongly correlated (see Tab. 4 in Sec. 5), which suggests that web crawls yield surprisingly similar concept distributions across different pretraining data curation strategies, necessitating explicit rebalancing efforts [11, 125].
• Image-Text Misalignment between Concepts in Pretraining Data: Concepts often appear in one modality but not the other, which implies significant misalignment (see Tab. 3 in Sec. 5, and the counting sketch after this list). Our released data artifacts can help image-text alignment efforts at scale by precisely indicating the examples in which modalities misalign. Note that the log-linear trend across both modalities is robust to this misalignment.
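As a rough illustration of how per-concept statistics of this kind can be tallied, the sketch below counts, for each concept, how often it appears on the text side, on the image side, and in only one of the two. It is a simplified stand-in for the paper’s actual pipeline: caption matching here is plain lowercase substring search rather than the paper’s indexing, and tag_image is a hypothetical placeholder for an open-set image tagger such as RAM++ (see Appendix F).

```python
# A simplified sketch, not the paper's pipeline: tally per-concept frequencies on the
# text and image sides of an image-text dataset, plus a per-concept misalignment count.
# `tag_image` is a hypothetical stand-in for an open-set image tagger (e.g., RAM++).
from collections import Counter
from typing import Callable, Iterable, List, Set, Tuple

def concept_frequencies(
    pairs: Iterable[Tuple[str, str]],        # (image_path, caption) pairs
    concepts: List[str],
    tag_image: Callable[[str], Set[str]],    # returns concepts detected in the image
) -> Tuple[Counter, Counter, Counter]:
    text_freq, image_freq, misaligned = Counter(), Counter(), Counter()
    for image_path, caption in pairs:
        caption_lc = caption.lower()
        image_tags = tag_image(image_path)
        for concept in concepts:
            in_text = concept.lower() in caption_lc   # naive caption matching
            in_image = concept in image_tags
            text_freq[concept] += in_text
            image_freq[concept] += in_image
            if in_text != in_image:                   # present in one modality only
                misaligned[concept] += 1
    return text_freq, image_freq, misaligned
```

Sorting either frequency counter by count makes the long-tailed shape from the first bullet visible, and the misaligned counter gives a crude per-concept misalignment degree.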
To provide a simple benchmark of generalization performance for multimodal models that controls for concept frequency in the training set, we introduce a new long-tailed test dataset called “Let It Wag!”. Current models trained on both openly available datasets (e.g., LAION-2B [103], DataComp-1B [46]) and closed-source datasets (e.g., OpenAI-WIT [91], WebLI [29]) show significant drops in performance, providing evidence that our observations may also transfer to closed-source datasets. We publicly release all our data artifacts (over 300GB), amortizing the cost of analyzing the pretraining datasets of multimodal foundation models for a more data-centric understanding of their properties in the future.
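For readers who want to run this kind of evaluation themselves, the sketch below performs standard CLIP-style “zero-shot” classification with an openly available checkpoint via the open_clip library; the checkpoint name, image path, and class names are placeholders, not the actual Let It Wag! class list.

```python
# A minimal sketch of CLIP-style "zero-shot" classification using open_clip.
# The checkpoint, image path, and class names are placeholders, not the
# Let It Wag! class list itself.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"  # a LAION-400M-trained checkpoint
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

class_names = ["anemone fish", "ox cart", "lorikeet"]       # placeholder concepts
text = tokenizer([f"a photo of a {c}" for c in class_names])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs.squeeze(0).tolist())))
```

Aggregating per-class accuracy from such a loop and grouping classes by their pretraining frequency is, in essence, how the frequency-versus-performance comparison in Sec. 3 is set up.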
Several prior works [91, 46, 82, 42, 83, 74] have investigated the role of pretraining data in affecting performance. Mayilvahanan et al. [79] showed that CLIP’s performance is correlated with the similarity between training and test datasets. In other studies on specific areas like question-answering [62] and numerical reasoning [94] in large language models, high train-test set similarity did not fully account for observed performance levels [127]. Our comprehensive analysis of several pretraining image-text datasets significantly adds to this line of work by (1) showing that concept frequency determines zero-shot performance and (2) pinpointing the exponential need for training data as a fundamental issue for current large-scale multimodal models. We conclude that the key to “zero-shot” generalization capabilities under large-scale training paradigms remains to be found.
[1] Class categories for classification tasks, objects in the text captions for retrieval tasks, and objects in the text prompts for generation tasks; see Sec. 2 for more details on how we define concepts.
AI Research
Researchers ‘polarised’ over use of AI in peer review

Researchers appear to be becoming more divided over whether generative artificial intelligence should be used in peer review, with a survey showing entrenched views on either side.
A poll by IOP Publishing found that there has been a big increase in the number of scholars who are positive about the potential impact of new technologies on the process, which is often criticised for being slow and overly burdensome for those involved.
A total of 41 per cent of respondents now see the benefits of AI, up from 12 per cent in a similar survey carried out last year. But this is almost equal to the proportion with negative opinions, which stands at 37 per cent after a 2 per cent year-on-year increase.
This leaves only 22 per cent of researchers neutral or unsure about the issue, down from 36 per cent, which IOP said indicates a “growing polarisation in views” as AI use becomes more commonplace.
Women tended to have more negative views about the impact of AI compared with men, while junior researchers tended to have a more positive view than their more senior colleagues.
Nearly a third (32 per cent) of those surveyed say they have already used AI tools to support them with peer reviews in some form.
Half of these say they apply it in more than one way, with the most common use being to assist with editing grammar and improving the flow of text.
A minority used it in more questionable ways such as the 13 per cent who asked the AI to summarise an article they were reviewing – despite confidentiality and data privacy concerns – and the 2 per cent who admitted to uploading an entire manuscript into a chatbot so it could generate a review on their behalf.
IOP – which currently does not allow AI use in peer reviews – said the survey showed a growing recognition that the technology has the potential to “support, rather than replace, the peer review process”.
But publishers must find ways to “reconcile” the two opposing viewpoints, the publisher added.
A solution could be developing tools that operate within peer review software, it said, which could support reviewers without posing security or integrity risks.
Publishers should also be more explicit and transparent about why chatbots “are not suitable tools for fully authoring peer review reports”, IOP said.
“These findings highlight the need for clearer community standards and transparency around the use of generative AI in scholarly publishing. As the technology continues to evolve, so too must the frameworks that support ethical and trustworthy peer review,” Laura Feetham-Walker, reviewer engagement manager at IOP and lead author of the study, said.
AI Research
Amazon Employing AI to Help Shoppers Comb Reviews

Amazon earlier this year began rolling out AI-voiced product descriptions for select customers and products.
AI Research
Nubank To Continue Leveraging AI To Enhance Digital Financial Services In Latin America

Nubank (NYSE: NU) reportedly serves millions of customers across Latin America. Recently, the company’s Chief Technology Officer, Eric Young, shared his vision for leveraging artificial intelligence to fuel Nubank’s global expansion and improve financial services.
During a recent discussion, Young outlined how AI is not just a tool but a cornerstone for operational efficiency, customer-centric growth, and democratizing access to personalized finance.
With a career that includes work at Amazon in the early 2000s, Young brings a philosophy of prioritizing customer experience.
At Amazon, he witnessed firsthand how technology could transform user experiences, a mindset he now applies to Nubank’s mission. “If not us, then who?” Young posed rhetorically during the videocast, underscoring Nubank’s unique position to disrupt traditional banking.
Founded in Brazil in 2013, Nubank has positively impacted the financial sector by prioritizing financial inclusion and superior customer service, challenging legacy banks with its digital-first approach.
Under Young’s leadership, Nubank’s priorities are clear: enhance agility, expand internationally, and harness AI to serve customers better.
He emphasized the need for cross-functional collaboration, particularly with the product and design teams.
This includes partnering with Nubank’s recently appointed Chief Design Officer (CDO), Ethan Eismann, to iterate quickly on new features.
By fostering a culture of testing and learning, Young aims to deliver products that not only meet but exceed user expectations, ultimately capturing a larger market share.
This involves deepening engagement with existing users, attracting new ones, and venturing into underserved markets where financial services remain inaccessible.
Central to Young’s strategy is AI’s transformative potential.
Nubank’s 2024 acquisition of Hyperplane, an AI-focused startup, marks a pivotal step in this direction.
Young highlighted how advanced language models—such as those powering ChatGPT and Google Gemini—can bridge the gap between everyday users and elite financial advisory services.
These models excel at processing vast amounts of data, including transaction histories, to offer hyper-personalized recommendations.
Imagine an AI that automates budgeting, predicts spending patterns, and suggests investment opportunities tailored to an individual’s financial profile, all without the hefty fees of traditional private banking.
Young drew a parallel to the exclusivity of high-end services.
Historically, such hyper-personalized private banking was reserved for the ultra-wealthy, but Nubank’s vision is to make it ubiquitous.
“We’re democratizing access to hyper-personalized financial experiences,” he said.
By analyzing user data ethically and securely, AI can empower customers from all segments—whether a small business owner in Mexico or a young professional in Colombia—to manage their finances with the precision once afforded only to elites.
This aligns with Nubank’s core ethos of inclusion, ensuring that technology serves as an equalizer rather than a divider.
Looking ahead, Young sees AI as the engine for Nubank’s platformization efforts, enabling scalable solutions that support international growth.
As Nubank eyes further expansion beyond Brazil, Mexico, and Colombia, AI will streamline operations, from fraud detection to customer support chatbots, reducing costs while enhancing reliability.
Yet, Young cautioned that success hinges on responsible implementation—prioritizing privacy, transparency, and human oversight to build trust.
In an era where fintechs aggressively compete for market share, Eric Young’s insights position Nubank not just as a bank, but as a key player in AI-powered financial services.
By blending technological prowess with a focus on the customer, Nubank is set to transform money management, making various services more accessible to consumers.
As Young put it, the question isn’t whether AI will change finance—it’s how Nubank will aim to make a positive impact.