Authors:
(1) Vishaal Udandarao, Tübingen AI Center, University of Tübingen, University of Cambridge, and equal contribution;
(2) Ameya Prabhu, Tübingen AI Center, University of Tübingen, University of Oxford, and equal contribution;
(3) Adhiraj Ghosh, Tübingen AI Center, University of Tübingen;
(4) Yash Sharma, Tübingen AI Center, University of Tübingen;
(5) Philip H.S. Torr, University of Oxford;
(6) Adel Bibi, University of Oxford;
(7) Samuel Albanie, University of Cambridge and equal advising, order decided by a coin flip;
(8) Matthias Bethge, Tübingen AI Center, University of Tübingen and equal advising, order decided by a coin flip.
Table of Links
Abstract and 1. Introduction
2 Concepts in Pretraining Data and Quantifying Frequency
3 Comparing Pretraining Frequency & “Zero-Shot” Performance and 3.1 Experimental Setup
3.2 Result: Pretraining Frequency is Predictive of “Zero-Shot” Performance
4 Stress-Testing the Concept Frequency-Performance Scaling Trend and 4.1 Controlling for Similar Samples in Pretraining and Downstream Data
4.2 Testing Generalization to Purely Synthetic Concept and Data Distributions
5 Additional Insights from Pretraining Concept Frequencies
6 Testing the Tail: Let It Wag!
7 Related Work
8 Conclusions and Open Problems, Acknowledgements, and References
Part I
Appendix
A. Concept Frequency is Predictive of Performance Across Prompting Strategies
B. Concept Frequency is Predictive of Performance Across Retrieval Metrics
C. Concept Frequency is Predictive of Performance for T2I Models
D. Concept Frequency is Predictive of Performance across Concepts only from Image and Text Domains
E. Experimental Details
F. Why and How Do We Use RAM++?
G. Details about Misalignment Degree Results
H. T2I Models: Evaluation
I. Classification Results: Let It Wag!
Abstract
Web-crawled pretraining datasets underlie the impressive “zero-shot” evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable Diffusion for image generation. However, it is unclear how meaningful the notion of “zero-shot” generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted during “zero-shot” evaluation. In this work, we ask: How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets?
We comprehensively investigate this question across 34 models and five standard pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data artifacts. We consistently find that, far from exhibiting “zero-shot” generalization, multimodal models require exponentially more data to achieve linear improvements in downstream “zero-shot” performance, following a sample-inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between pretraining and downstream datasets [79] and when testing on purely synthetic data distributions [51]. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly. We contribute this long-tail test set as the Let It Wag! benchmark to further research in this direction. Taken together, our study reveals an exponential need for training data, which implies that the key to “zero-shot” generalization capabilities under large-scale training paradigms remains to be found.
1 Introduction
Multimodal models like CLIP [91] and Stable Diffusion [96] have revolutionized performance on downstream tasks: CLIP is now the de facto standard for “zero-shot” image recognition [133, 72, 126, 48, 132] and image-text retrieval [46, 64, 24, 117, 129], while Stable Diffusion is now the de facto standard for “zero-shot” text-to-image (T2I) generation [93, 17, 96, 41]. In this work, we investigate this empirical success through the lens of zero-shot generalization [69], which refers to the ability of a model to apply its learned knowledge to new, unseen concepts. Accordingly, we ask: Are current multimodal models truly capable of “zero-shot” generalization?
To address this, we conducted a comparative analysis involving two main factors: (1) the performance of models across various downstream tasks and (2) the frequency of test concepts within their pretraining datasets. We compiled a comprehensive list of 4,029 concepts[1] from 27 downstream tasks spanning classification, retrieval, and image generation, and assessed model performance on these concepts. Our analysis spanned five large-scale pretraining datasets with different scales, data curation methods, and sources (CC-3M [107], CC-12M [27], YFCC-15M [113], LAION-Aesthetics [103], LAION-400M [102]), and evaluated 10 CLIP models and 24 T2I models spanning different architectures and parameter scales. Across all our experiments, we consistently find that the frequency of a concept in the pretraining dataset is a strong predictor of the model’s performance on test examples containing that concept. Notably, model performance scales linearly as the concept frequency in pretraining data grows exponentially, i.e., we observe a consistent log-linear scaling trend. We find that this log-linear trend is robust to controlling for correlated factors (similar samples in pretraining and test data [79]) and holds when testing on different concept distributions, including purely synthetically generated samples [51].
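To make the log-linear trend concrete, the following is a minimal, illustrative sketch (not the paper’s released analysis code) of fitting such a relationship between per-concept pretraining frequencies and downstream “zero-shot” accuracies; all array values and variable names are hypothetical placeholders.

```python
# Minimal sketch of a log-linear fit: downstream accuracy is regressed on the
# (log-scaled) pretraining frequency of each concept. The arrays below are
# illustrative placeholders, not the paper's released data artifacts.
import numpy as np

# Hypothetical per-concept statistics: pretraining frequency counts and the
# corresponding "zero-shot" accuracies (one entry per concept).
concept_frequency = np.array([10, 100, 1_000, 10_000, 100_000], dtype=float)
zero_shot_accuracy = np.array([0.05, 0.18, 0.31, 0.44, 0.57])

# Fit accuracy = slope * log10(frequency) + intercept. A log-linear trend means
# the relationship is (approximately) linear in log-frequency, i.e., exponentially
# more data is needed for each fixed gain in accuracy.
log_freq = np.log10(concept_frequency)
slope, intercept = np.polyfit(log_freq, zero_shot_accuracy, deg=1)

# Correlation in log-space quantifies how predictive frequency is of performance.
r = np.corrcoef(log_freq, zero_shot_accuracy)[0, 1]
print(f"accuracy ~ {slope:.3f} * log10(freq) + {intercept:.3f}  (r = {r:.3f})")
```

Under such a fit, each fixed gain in accuracy requires roughly a tenfold increase in a concept’s pretraining frequency, which is exactly the sample inefficiency described above.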
Our findings indicate that the impressive empirical performance of multimodal models like CLIP and Stable Diffusion can be largely attributed to the presence of test concepts within their vast pretraining datasets; thus, their reported empirical performance does not constitute “zero-shot” generalization. Quite the contrary: these models require exponentially more data on a concept to linearly improve their performance on tasks pertaining to that concept, highlighting extreme sample inefficiency.
In our analysis, we additionally document the distribution of concepts encountered in pretraining data and find that:
• Concept Distribution: Across all pretraining datasets, the distribution of concepts is long-tailed (see Fig. 5 in Sec. 5), which indicates that a large fraction of concepts are rare. However, given the extreme sample inefficiency observed, what is rare is not properly learned during multimodal pretraining.
• Concept Correlation across Pretraining Datasets: The distributions of concepts across different pretraining datasets are strongly correlated (see Tab. 4 in Sec. 5), which suggests that web crawls yield surprisingly similar concept distributions across different pretraining data curation strategies, necessitating explicit rebalancing efforts [11, 125].
• Image-Text Misalignment between Concepts in Pretraining Data: Concepts often appear in one modality but not the other, which implies significant misalignment (see Tab. 3 in Sec. 5). Our released data artifacts can help image-text alignment efforts at scale by precisely indicating the examples in which modalities misalign. Note that the log-linear trend across both modalities is robust to this misalignment.
To provide a simple benchmark of generalization performance for multimodal models that controls for concept frequency in the training set, we introduce a new long-tailed test dataset called “Let It Wag!”. Current models trained on both openly available datasets (e.g., LAION-2B [103], DataComp-1B [46]) and closed-source datasets (e.g., OpenAI-WIT [91], WebLI [29]) show significant drops in performance on it, providing evidence that our observations may also transfer to closed-source datasets. We publicly release all our data artifacts (over 300GB), amortizing the cost of analyzing the pretraining datasets of multimodal foundation models and enabling a more data-centric understanding of their properties in the future.
Several prior works [91, 46, 82, 42, 83, 74] have investigated the role of pretraining data in affecting performance. Mayilvahanan et al. [79] showed that CLIP’s performance is correlated with the similarity between training and test datasets. In other studies on specific areas like question-answering [62] and numerical reasoning [94] in large language models, high train-test set similarity did not fully account for observed performance levels [127]. Our comprehensive analysis of several pretraining image-text datasets significantly adds to this line of work by (1) showing that concept frequency determines zero-shot performance and (2) pinpointing the exponential need for training data as a fundamental issue for current large-scale multimodal models. We conclude that the key to “zero-shot” generalization capabilities under large-scale training paradigms remains to be found.
[1] Class categories for classification tasks, objects in the text captions for retrieval tasks, and objects in the text prompts for generation tasks; see Sec. 2 for more details on how we define concepts.