

What 300GB of AI Research Reveals About the True Limits of “Zero-Shot” Intelligence


Authors:

(1) Vishaal Udandarao, Tübingen AI Center, University of Tübingen, and University of Cambridge (equal contribution);

(2) Ameya Prabhu, Tübingen AI Center, University of Tübingen, and University of Oxford (equal contribution);

(3) Adhiraj Ghosh, Tübingen AI Center, University of Tübingen;

(4) Yash Sharma, Tübingen AI Center, University of Tübingen;

(5) Philip H.S. Torr, University of Oxford;

(6) Adel Bibi, University of Oxford;

(7) Samuel Albanie, University of Cambridge (equal advising, order decided by a coin flip);

(8) Matthias Bethge, Tübingen AI Center, University of Tübingen (equal advising, order decided by a coin flip).

Abstract and 1. Introduction

2 Concepts in Pretraining Data and Quantifying Frequency

3 Comparing Pretraining Frequency & “Zero-Shot” Performance and 3.1 Experimental Setup

3.2 Result: Pretraining Frequency is Predictive of “Zero-Shot” Performance

4 Stress-Testing the Concept Frequency-Performance Scaling Trend and 4.1 Controlling for Similar Samples in Pretraining and Downstream Data

4.2 Testing Generalization to Purely Synthetic Concept and Data Distributions

5 Additional Insights from Pretraining Concept Frequencies

6 Testing the Tail: Let It Wag!

7 Related Work

8 Conclusions and Open Problems, Acknowledgements, and References

Part I

Appendix

A. Concept Frequency is Predictive of Performance Across Prompting Strategies

B. Concept Frequency is Predictive of Performance Across Retrieval Metrics

C. Concept Frequency is Predictive of Performance for T2I Models

D. Concept Frequency is Predictive of Performance across Concepts only from Image and Text Domains

E. Experimental Details

F. Why and How Do We Use RAM++?

G. Details about Misalignment Degree Results

H. T2I Models: Evaluation

I. Classification Results: Let It Wag!

Abstract

Web-crawled pretraining datasets underlie the impressive “zero-shot” evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable Diffusion for image generation. However, it is unclear how meaningful the notion of “zero-shot” generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted during “zero-shot” evaluation. In this work, we ask: How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets?

We comprehensively investigate this question across 34 models and five standard pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data artifacts. We consistently find that, far from exhibiting “zero-shot” generalization, multimodal models require exponentially more data to achieve linear improvements in downstream “zero-shot” performance, following a sample-inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between pretraining and downstream datasets [79], and when testing on purely synthetic data distributions [51]. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly. We contribute this long-tail test set as the Let It Wag! benchmark to further research in this direction. Taken together, our study reveals an exponential need for training data, which implies that the key to “zero-shot” generalization capabilities under large-scale training paradigms remains to be found.

1 Introduction

Multimodal models like CLIP [91] and Stable Diffusion [96] have revolutionized performance on downstream tasks—CLIP is now the de-facto standard for “zero-shot” image recognition [133, 72, 126, 48, 132] and image-text retrieval [46, 64, 24, 117, 129], while Stable Diffusion is now the de-facto standard for “zero-shot” text-to-image (T2I) generation [93, 17, 96, 41]. In this work, we investigate this empirical success through the lens of zero-shot generalization [69], which refers to the ability of a model to apply its learned knowledge to new, unseen concepts. Accordingly, we ask: Are current multimodal models truly capable of “zero-shot” generalization?

To address this, we conducted a comparative analysis involving two main factors: (1) the performance of models across various downstream tasks and (2) the frequency of test concepts within their pretraining datasets. We compiled a comprehensive list of 4,029 concepts[1] from 27 downstream tasks spanning classification, retrieval, and image generation, and assessed model performance on these concepts. Our analysis spanned five large-scale pretraining datasets with different scales, data curation methods, and sources (CC-3M [107], CC-12M [27], YFCC-15M [113], LAION-Aesthetics [103], LAION-400M [102]), and evaluated 10 CLIP models and 24 T2I models spanning different architectures and parameter scales. We consistently find, across all our experiments, that the frequency of a concept in the pretraining dataset is a strong predictor of a model’s performance on test examples containing that concept. Notably, model performance scales linearly as the concept frequency in pretraining data grows exponentially, i.e., we observe a consistent log-linear scaling trend. We find that this log-linear trend is robust to controlling for correlated factors (similar samples in pretraining and test data [79]) and to testing across different concept distributions, including samples generated entirely synthetically [51].
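The log-linear relationship described above can be made concrete with a small curve fit. The following is a minimal sketch, not the authors’ released code; the frequency and accuracy arrays are hypothetical placeholders standing in for the per-concept statistics the paper computes.

```python
import numpy as np

# Hypothetical per-concept statistics: pretraining frequency of each concept
# and the model's "zero-shot" accuracy on test examples containing it.
concept_frequency = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
zero_shot_accuracy = np.array([0.12, 0.24, 0.37, 0.50, 0.63])

# Log-linear trend: accuracy grows linearly in log10(frequency), i.e. each
# constant gain in accuracy costs exponentially more pretraining data.
slope, intercept = np.polyfit(np.log10(concept_frequency), zero_shot_accuracy, deg=1)
print(f"accuracy ≈ {slope:.3f} * log10(frequency) + {intercept:.3f}")
```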

Our findings indicate that the impressive empirical performance of multimodal models like CLIP and Stable Diffusion can be largely attributed to the presence of test concepts within their vast pretraining datasets, thus their reported empirical performance does not constitute “zero-shot” generalization. Quite the contrary, these models require exponentially more data on a concept to linearly improve their performance on tasks pertaining to that concept, highlighting extreme sample inefficiency.

In our analysis, we additionally document the distribution of concepts encountered in pretraining data and find that:

• Concept Distribution: Across all pretraining datasets, the distribution of concepts is long-tailed (see Fig. 5 in Sec. 5), which indicates that a large fraction of concepts are rare. However, given the extreme sample inefficiency observed, what is rare is not properly learned during multimodal pretraining.

• Concept Correlation across Pretraining Datasets: The distributions of concepts across different pretraining datasets are strongly correlated (see Tab. 4 in Sec. 5), which suggests that web crawls yield surprisingly similar concept distributions across different pretraining data curation strategies, necessitating explicit rebalancing efforts [11, 125].

• Image-Text Misalignment between Concepts in Pretraining Data: Concepts often appear in one modality but not the other, which implies significant misalignment (see Tab. 3 in Sec. 5). Our released data artifacts can help image-text alignment efforts at scale by precisely indicating the examples in which modalities misalign. Note that the log-linear trend across both modalities is robust to this misalignment.
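As a toy illustration of this misalignment check, one could compare, for each pretraining sample, the concepts matched in its caption against the concepts tagged in its image. The sketch below is a simplified, hypothetical stand-in for the paper’s pipeline (which uses RAM++ for image-side tagging; see Appendix F); the example data are invented.

```python
# Toy sketch (not the paper's pipeline): flag concepts that appear in only one
# modality of an image-text pair. Image-side tags would come from a tagging
# model such as RAM++; here both sides are supplied directly as hypothetical inputs.
def misaligned_concepts(caption: str, image_tags: set[str], vocabulary: set[str]) -> set[str]:
    """Return vocabulary concepts present in exactly one of the two modalities."""
    caption_concepts = {c for c in vocabulary if c in caption.lower()}
    image_concepts = image_tags & vocabulary
    return caption_concepts ^ image_concepts  # symmetric difference

vocab = {"dog", "frisbee", "beach"}
print(misaligned_concepts("A dog catching a frisbee", {"dog", "beach"}, vocab))
# {'frisbee', 'beach'}: each of these concepts appears in only one modality
```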

To provide a simple benchmark of generalization performance for multimodal models that controls for concept frequency in the training set, we introduce a new long-tailed test dataset called “Let It Wag!”. Current models trained on both openly available datasets (e.g., LAION-2B [103], DataComp-1B [46]) and closed-source datasets (e.g., OpenAI-WIT [91], WebLI [29]) show significant drops in performance on it, providing evidence that our observations may also transfer to closed-source datasets. We publicly release all our data artifacts (over 300GB), amortising the cost of analyzing the pretraining datasets of multimodal foundation models for a more data-centric understanding of their properties in the future.

Several prior works [91, 46, 82, 42, 83, 74] have investigated the role of pretraining data in affecting performance. Mayilvahanan et al. [79] showed that CLIP’s performance is correlated with the similarity between training and test datasets. In other studies on specific areas like question-answering [62] and numerical reasoning [94] in large language models, high train-test set similarity did not fully account for observed performance levels [127]. Our comprehensive analysis of several pretraining image-text datasets significantly adds to this line of work by (1) showing that concept frequency determines zero-shot performance and (2) pinpointing the exponential need for training data as a fundamental issue for current large-scale multimodal models. We conclude that the key to “zero-shot” generalization capabilities under large-scale training paradigms remains to be found.

[1] Class categories for classification tasks, objects in the text captions for retrieval tasks, and objects in the text prompts for generation tasks; see Sec. 2 for more details on how we define concepts.
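Under this definition, a concept’s text-side pretraining frequency can be estimated by matching it against captions. The snippet below is a deliberately simplified, hypothetical sketch (plain substring matching over a few invented captions), not the indexing pipeline the paper builds in Sec. 2.

```python
from collections import Counter

def concept_frequencies(captions: list[str], concepts: list[str]) -> Counter:
    """Count how many captions mention each concept (naive substring matching)."""
    counts = Counter()
    for caption in captions:
        text = caption.lower()
        for concept in concepts:
            if concept in text:
                counts[concept] += 1
    return counts

captions = [
    "a photo of a golden retriever on the beach",
    "a cat sleeping on a sofa",
    "a retriever fetching a ball",
]
print(concept_frequencies(captions, ["retriever", "cat", "zebra"]))
# Counter({'retriever': 2, 'cat': 1})
```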





Atari Video Chess checkmates Copilot after knocking over ChatGPT’s king




  • Microsoft Copilot has lost a game of chess to an Atari 2600.
  • The loss follows ChatGPT’s similar loss in Atari’s Video Chess.
  • The AIs repeatedly lost track of the board state, demonstrating a key weakness in LLMs.

AI chatbot developers often boast about the logic and reasoning abilities of their models, but that doesn’t mean the LLMs behind the chatbots are any good at chess. An experiment pitting Microsoft Copilot against the “AI” powering the 1979 Atari 2600 game Video Chess just ended in an embarrassing failure for Microsoft’s pride and joy. Copilot joins ChatGPT on the list of opponents bested by the four-kilobyte Atari game.

Both AI models claimed to have the game all but wrapped up before it began because they could think multiple moves ahead, but the results fell far short of those boasts, as documented by Citrix engineer Robert Caruso, who set up both experiments.





How has AI affected your technology job — or job hunt?



Not too many years ago, a degree in computer science was considered a guarantee of stable, high-paying employment. But in recent months, demand for computer science graduates has slumped.

A recent report from the Federal Reserve Bank of New York found an unemployment rate of 6 percent for CS grads. That’s higher than the unemployment rate for art history majors.

Much of the blame has fallen upon the rise of artificial intelligence systems like ChatGPT, which are capable of writing original computer programs on request, with no need for formally trained coders. And even for those computer scientists who have found steady work, the nature of their work is changing, as they use AI tools to increase their productivity.

The Globe is looking to speak to technology workers and job seekers in Greater Boston who are being affected by this new normal in the world of software development. Fill out the survey below and a reporter may be in touch.


Hiawatha Bray can be reached at hiawatha.bray@globe.com. Follow him @GlobeTechLab.







SoundHound AI, or This Other Magnificent Artificial Intelligence Stock?



  • SoundHound AI is a rapidly growing specialist in conversational artificial intelligence (AI), and it has amassed an impressive list of customers.

  • DigitalOcean provides cloud services to small and mid-sized businesses, and now it’s helping those customers tap into the AI revolution.

  • There are positives and negatives for both, but one clearly looks like the better investment right now.


SoundHound AI (NASDAQ: SOUN) is a leading developer of conversational artificial intelligence (AI) software, and its revenue is growing at a lightning-fast pace. Its stock soared by 835% in 2024 after Nvidia revealed a small stake in the company, although the chip giant has since sold its entire position.

DigitalOcean (NYSE: DOCN) is another up-and-coming AI company. It operates a cloud computing platform designed specifically for small and mid-sized businesses (SMBs), which features a growing portfolio of AI services, including data center infrastructure and a new tool that allows them to build custom AI agents.

With the second half of 2025 officially underway, which stock is the better buy between SoundHound AI and DigitalOcean?


SoundHound AI has amassed an impressive customer list that includes automotive giants like Hyundai and Kia and quick-service restaurant chains like Chipotle and Papa John’s. All of them use SoundHound’s conversational AI software to deliver new and unique experiences for their customers.

Automotive manufacturers are integrating SoundHound’s Chat AI product into their new vehicles, where it can teach drivers how to use different features or answer questions about gas mileage and even the weather. Manufacturers can customize Chat AI’s personality to suit their brand, which differentiates the user experience from the competition.

Restaurant chains use SoundHound’s software to autonomously take customer orders in-store, over the phone, and in the drive-thru. They also use the company’s voice-activated virtual assistant tool called Employee Assist, which workers can consult whenever they need instructions for preparing a menu item or help understanding store policies.

SoundHound generated $84.7 million in revenue during 2024, which was an 85% increase from the previous year. However, management’s latest guidance suggests the company could deliver $167 million in revenue during 2025, which would represent accelerated growth of 97%. SoundHound also has an order backlog worth over $1.2 billion, which it expects to convert into revenue over the next six years, so that will support further growth.

But there are a couple of caveats. First, SoundHound continues to lose money at the bottom line. It burned through $69.1 million on a non-GAAP (adjusted) basis in 2024 and a further $22.3 million in the first quarter of 2025 (ended March 31). The company only has $246 million in cash on hand, so it can’t afford to keep losing money at this pace forever — eventually, it will have to cut costs and sacrifice some of its revenue growth to achieve profitability.

The second caveat is SoundHound’s valuation, which we’ll explore further in a moment.

The cloud computing industry is dominated by trillion-dollar tech giants like Amazon and Microsoft, but they mostly design their services for large organizations with deep pockets. SMB customers don’t really move the needle for them, but that leaves an enormous gap in the cloud market for other players like DigitalOcean.

DigitalOcean offers clear and transparent pricing, attentive customer service, and a simple dashboard, which is a great set of features for small- and mid-sized businesses with limited resources. The company is now helping those customers tap into the AI revolution in a cost-efficient way with a growing portfolio of services.

DigitalOcean operates data centers filled with graphics processing units (GPUs) from leading suppliers like Nvidia and Advanced Micro Devices, and it offers fractional capacity, which means its customers can access between one and eight chips. This is ideal for small workloads like deploying an AI customer service chatbot on a website.

Earlier this year, DigitalOcean launched a new platform called GenAI, where its clients can create and deploy custom AI agents. These agents can do almost anything, whether an SMB needs them to analyze documents, detect fraud, or even autonomously onboard new employees. The agents are built on the latest third-party large language models from leading developers like OpenAI and Meta Platforms, so SMBs know they are getting the same technology as some of their largest competitors.

DigitalOcean expects to generate $880 million in total revenue during 2025, which would represent modest growth of 13% compared to the prior year. However, during the first quarter, the company said its AI revenue surged by an eye-popping 160%. Management doesn’t disclose exactly how much revenue is attributable to its AI services, but it says demand for GPU capacity continues to outstrip supply, which means the significant growth is likely to continue for now.

Unlike SoundHound AI, DigitalOcean is highly profitable. It generated $84.5 million in generally accepted accounting principles (GAAP) net income during 2024, which was up by a whopping 335% from the previous year. It carried that momentum into 2025, with its first-quarter net income soaring by 171% to $38.2 million.

For me, the choice between SoundHound AI and DigitalOcean mostly comes down to valuation. SoundHound AI stock is trading at a sky-high price-to-sales (P/S) ratio of 41.4, making it even more expensive than Nvidia, which is one of the highest-quality companies in the world. DigitalOcean stock, on the other hand, trades at a very modest P/S ratio of just 3.5, which is actually near the cheapest level since the company went public in 2021.

[Chart: SOUN price-to-sales (P/S) ratio; data by YCharts]

We can also value DigitalOcean based on its earnings, which can’t be said for SoundHound because the company isn’t profitable. DigitalOcean stock is trading at a price-to-earnings (P/E) ratio of 26.2, which makes it much cheaper than larger cloud providers like Amazon and Microsoft (although they also operate a host of other businesses):

[Chart: MSFT price-to-earnings (P/E) ratio; data by YCharts]
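For reference, both ratios compared above are simple quotients of market value over a trailing financial figure. The sketch below uses invented placeholder numbers, not the companies’ actual market caps or filings.

```python
# Illustrative valuation arithmetic with hypothetical inputs (not real filings).
def price_to_sales(market_cap: float, ttm_revenue: float) -> float:
    return market_cap / ttm_revenue

def price_to_earnings(market_cap: float, ttm_net_income: float) -> float:
    return market_cap / ttm_net_income

# A P/S near 41 versus one near 3.5 means paying roughly 12x more per dollar of revenue.
print(round(price_to_sales(market_cap=4.1e9, ttm_revenue=1.0e8), 1))  # 41.0
print(round(price_to_sales(market_cap=3.5e9, ttm_revenue=1.0e9), 1))  # 3.5
```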

SoundHound’s rich valuation might limit further upside in the near term. When we combine that with the company’s steep losses at the bottom line, its stock simply doesn’t look very attractive right now, which might be why Nvidia sold it. DigitalOcean stock looks like a bargain in comparison, and it has legitimate potential for upside from here thanks to the company’s surging AI revenue and highly profitable business.


Better Buy in 2025: SoundHound AI, or This Other Magnificent Artificial Intelligence Stock? was originally published by The Motley Fool


