AI Research

A major AI training data set contains millions of examples of personal data

Published

July 18, 2025

The bottom line, says William Agnew, a postdoctoral fellow in AI ethics at Carnegie Mellon University and one of the coauthors, is that “anything you put online can [be] and probably has been scraped.”

The researchers found thousands of instances of validated identity documents—including images of credit cards, driver’s licenses, passports, and birth certificates—as well as over 800 validated job application documents (including résumés and cover letters), which were confirmed through LinkedIn and other web searches as being associated with real people. (In many more cases, the researchers did not have time to validate the documents or were unable to because of issues like image clarity.)

A number of the résumés disclosed sensitive information including disability status, the results of background checks, birth dates and birthplaces of dependents, and race. When résumés were linked to people with online presences, researchers also found contact information, government identifiers, sociodemographic information, face photographs, home addresses, and the contact information of other people (like references).

Examples of identity-related documents found in CommonPool’s small-scale data set show a credit card, a Social Security number, and a driver’s license. For each sample, the type of URL site is shown at the top, the image in the middle, and the caption in quotes below. All personal information has been replaced, and text has been paraphrased to avoid direct quotations. Images have been redacted to show the presence of faces without identifying the individuals.

COURTESY OF THE RESEARCHERS

When it was released in 2023, DataComp CommonPool, with its 12.8 billion data samples, was the largest existing data set of publicly available image-text pairs, which are often used to train generative text-to-image models. While its curators said that CommonPool was intended for academic research, its license does not prohibit commercial use.

CommonPool was created as a follow-up to the LAION-5B data set, which was used to train models including Stable Diffusion and Midjourney. It draws on the same data source: web scraping done by the nonprofit Common Crawl between 2014 and 2022.

While commercial models often do not disclose what data sets they are trained on, the shared data sources of DataComp CommonPool and LAION-5B mean that the data sets are similar, and that the same personally identifiable information likely appears in LAION-5B, as well as in other downstream models trained on CommonPool data. CommonPool researchers did not respond to emailed questions.

And since DataComp CommonPool has been downloaded more than 2 million times over the past two years, it is likely that “there [are] many downstream models that are all trained on this exact data set,” says Rachel Hong, a PhD student in computer science at the University of Washington and the paper’s lead author. Those would duplicate similar privacy risks.

Good intentions are not enough

“You can assume that any large-scale web-scraped data always contains content that shouldn’t be there,” says Abeba Birhane, a cognitive scientist and tech ethicist who leads Trinity College Dublin’s AI Accountability Lab—whether it’s personally identifiable information (PII), child sexual abuse imagery, or hate speech (which Birhane’s own research into LAION-5B has found).

AI revolutionizes weather prediction to help farmers in India

Published

September 16, 2025

Robert Sanders

Artificial intelligence is revolutionizing weather prediction around the world, as evidenced by the successful prediction this spring of a delayed onset of the monsoon in northeastern India.

The prediction gave millions of smallholder farmers the option of postponing planting to take better advantage of the rains or to plant different crops. Based on a preliminary phone survey, many farmers adjusted their planting as a result.

This AI-based weather model — a collaboration between the University of California, Berkeley, and the University of Chicago — paves the way for much better forecasts for hundreds of millions of farmers across the tropics and global south whose livelihoods depend on timing crop planting with the monsoon’s arrival. Nearly two-thirds of the world’s population live in regions of the tropics impacted by monsoon rains, whose arrival each year is being affected by climate change.

“This program harnesses the revolution in AI-based weather forecasting to predict the arrival of continuous rains, empowering farmers to plan agricultural activities with greater confidence and manage risks. We look forward to continuing to improve this effort in future years,” said Pramod Kumar Meherda, additional secretary at the Indian Ministry of Agriculture and Farmers’ Welfare.

The success of this AI prediction project — the largest targeted dissemination of AI weather forecasts to date — required a herculean effort by atmospheric scientists, AI experts, India’s Ministry of Agriculture and Farmers’ Welfare and a global nonprofit that supports smallholder farmers. Key to these predictions were daily climate data compiled and made publicly available by the U.S. National Oceanic and Atmospheric Administration.

To make the actual predictions, UChicago AI expert Pedram Hassanzadeh teamed up with Berkeley atmospheric scientist William Boos to evaluate and use global AI weather models that were developed independently by Google and the European Centre for Medium-range Weather Forecasts (ECMWF). Both of those models have been trained on 40 years of global climate data. To localize the models to India and correct biases in their predictions, the UC Berkeley and UChicago teams used statistics from 100 years of rainfall data from the India Meteorological Department.

The monsoon-onset forecasts, which differed for different regions, were delivered weekly to about 38 million farmers across 13 states in central and northeastern India — most of the core monsoon zone. These forecasts provided predictions up to four weeks in advance for the arrival of monsoon rains in particular regions, something that had not been done before in 150 years of monsoon forecasting, Boos said. Current numerical models, based on the physics of the atmosphere, typically provide reasonably accurate rainfall predictions no more than five days out.

When the monsoon hit southern India in early June, the AI-based model predicted that it would stop temporarily, something that was not predicted by any other available forecast. That’s what actually happened — it stalled for 20 days.

“Demonstrating that the long lead-time precipitation forecasts made by these AI models are of practical use in a tropical region where people live is a major step forward — no one really knew that before we did this work,” said Boos, a UC Berkeley professor of earth and planetary science.

Parasnath Tiwari, a farmer from Madhya Pradesh, received the forecast on his phone and was able to prepare earlier, he said. He decided to switch the types of crops he planted to more lucrative ones because the message gave him confidence that the season would be long enough.

“Before this, I mostly relied on my own experience and local knowledge to know when the monsoon would arrive,” said Tiwari. “The forecast about the arrival of the monsoon was accurate…. I have increased trust in the forecast, and I will rely on the information shared by scientists in the future.”

A false monsoon could mean disaster

Farmers in each region of northeastern India received updates roughly weekly between May and July about the probability that the true monsoon would start within a certain window of time. In a typical year, the monsoon arrives between June 15 and June 30 in the south and proceeds northward, bringing steady rain to most of the country by July. The AI model predicted the nearly three-week stall, which the Indian government communicated to the farmers.

The AI-based weather prediction model produced monsoon forecast maps like these every week, beginning May 20. The model divides India into a grid and estimates the likelihood that the monsoon rains will start in the next 1, 2, 3 or 4 weeks (bar chart) in each grid square. Each square is color coded according to which 2-week period had the highest combined probability of rain onset. A simpler message – which 2-week period is most likely to see the onset of rains – was communicated to smallholder farmers each week. The May 27 forecast, for example, shows that the monsoon rains have already arrived in the 3 southernmost regions (gray) but predicts that it will take 3 weeks — until June 18 — to move farther north (light orange) and at least one more week after that to reach the northernmost regions (yellow). The normal monsoon usually proceeds steadily northward, but this unexpected 20-day pause was accurately called by the AI model.

Courtesy of the Human-Centered Weather Forecasts Initiative at the University of Chicago

“We actually gave farmers probabilistic forecasts, telling them how likely it was that monsoon rains would start in a particular week,” Boos said. “By field-testing the SMS messages with farmers in advance, our team was able to tailor the language of the message so that they understood what was being predicted and the level of certainty of the prediction.”
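The simplified message described in the map caption, which two-week period is most likely to see the onset of rains, can be sketched in a few lines. This is a minimal illustration, not the project's code: it assumes the model supplies one onset probability per week for a grid square and that two-week periods are overlapping pairs of consecutive weeks. The function name and the example probabilities are hypothetical.

```python
def most_likely_two_week_window(weekly_probs):
    """Return (start_week, combined_probability) for the best 2-week window.

    weekly_probs[i] is the forecast probability that monsoon onset falls
    in week i + 1. Windows are consecutive pairs of weeks (an assumption).
    """
    if len(weekly_probs) < 2:
        raise ValueError("need at least two weekly probabilities")
    best_start = 1
    best_prob = weekly_probs[0] + weekly_probs[1]
    for i in range(1, len(weekly_probs) - 1):
        combined = weekly_probs[i] + weekly_probs[i + 1]
        if combined > best_prob:
            best_start, best_prob = i + 1, combined
    return best_start, best_prob


# Hypothetical probabilities for one grid square: onset in week 1, 2, 3, 4.
example = [0.10, 0.15, 0.40, 0.25]
start, prob = most_likely_two_week_window(example)
print(f"Most likely onset: weeks {start}-{start + 1} (p = {prob:.2f})")
```

Collapsing four weekly probabilities into a single window loses detail, but it gives a message that can fit in an SMS while still conveying a level of certainty.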

Boos studies atmospheric dynamics, primarily the atmospheric wind patterns that deliver water in the form of monsoon rains to Central America, South America, Africa, Northern Australia and South Asia. The onset of these monsoons is important to farmers because, unlike in the U.S., the majority of farmers planting wheat, rice and other staple crops have small plots and cannot afford to irrigate if the rains fail.

“The classic catastrophe scenario is that you get a wet spell, it rains for a few days, they plant their seeds, they’re like, ‘Hooray, the rainy season has arrived,’ and then there’s 15 days of dryness afterward and all the seeds dry out and die,” Boos said. “They just spent an enormous amount of their savings to buy seed stock and plant it, and it died, and that’s a huge loss.”

Based on an analysis led by UChicago Nobel Prize-winning economist Michael Kremer, one of the leaders of the project, the researchers concluded that farmers in rural India could benefit economically from a better prediction of when the annual rains would truly begin. AI-based weather prediction models seemed like the place to start.

“We have been going through an AI-driven revolution since 2022, and AI models have shown promise for many one- to two-week forecasting applications. But their ability to predict complex phenomena, like the monsoon, was unclear, and frankly, unexpected,” Hassanzadeh said. The first revolution in weather forecasting, beginning in the 1950s, focused on physics-based models and numerical simulations on supercomputers. This second revolution is being powered by AI models trained on observation-based data and capable of being run on a laptop.

Boos and Hassanzadeh tested more than half a dozen of the current AI weather prediction models that make predictions a month out and also predict rainfall, and chose the two best: Google’s NeuralGCM, for neural general circulation model, and the AI Forecasting System (AIFS) created by ECMWF.

Boos said that many of these models have been shown to predict global aspects of the climate as well as or better than earlier physics-based models, but few have been tasked with predictions of the seasonal onset of rains in a specific region.

Because each model had different strengths and weaknesses, the team mathematically blended Google’s NeuralGCM, ECMWF’s AIFS and historical rainfall data from the India Meteorological Department.

This blend produced a probabilistic model with a 30-day lead time, “merging multiple AI models and statistical methods to produce useful forecasts targeted at agriculture,” Boos said. “Forecasts of the start of sustained monsoon rains have historically been difficult or impossible to deliver locally with this much lead time, especially on such a large scale.”
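The article does not say how the blending was done. As a rough illustration only, one simple way to combine several probabilistic forecasts is a weighted average of their onset-week distributions. Everything below is an assumption made for the sketch: the function name, the weights and the example numbers are hypothetical, and the team's actual statistical method may differ substantially.

```python
def blend_forecasts(forecasts, weights):
    """Weighted average of several probability distributions over onset weeks.

    forecasts: list of equal-length lists, each summing to 1.
    weights:   one non-negative weight per forecast (normalized internally).
    """
    if len(forecasts) != len(weights):
        raise ValueError("need exactly one weight per forecast")
    n_weeks = len(forecasts[0])
    total_weight = sum(weights)
    blended = [0.0] * n_weeks
    for forecast, weight in zip(forecasts, weights):
        if len(forecast) != n_weeks:
            raise ValueError("all forecasts must cover the same weeks")
        for week, p in enumerate(forecast):
            blended[week] += weight * p / total_weight
    return blended


# Hypothetical onset probabilities for weeks 1-4 from two AI models
# and from a historical (climatological) baseline.
model_a = [0.10, 0.20, 0.40, 0.30]
model_b = [0.20, 0.30, 0.30, 0.20]
climatology = [0.25, 0.25, 0.25, 0.25]

blended = blend_forecasts([model_a, model_b, climatology], [0.4, 0.4, 0.2])
print([round(p, 2) for p in blended])
```

Including a historical baseline in the mix pulls the result toward long-run behavior whenever the models disagree, which is one common reason to blend rather than pick a single model.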

Delivering the message

The Ministry of Agriculture and Farmers’ Welfare delivered the forecasts to the farmers directly using its SMS texting platform. The Government of Odisha also partnered with the research team to reach nearly 1 million more through a voice messaging platform. Precision Development (PxD), a global nonprofit supporting smallholder farmers in digital advisory services, led message design and testing.

Farmers throughout northwestern India received weekly forecasts about the arrival of the monsoon in the spring of 2025. Planting crops with the arrival of the monsoon is an annual ritual that can be upended when rains suddenly stop.

Photo courtesy of Precision Development, PxD

The project leaders concluded that farmers responded to these weather forecast messages. Based on early results from a phone survey, up to 55% recalled receiving weather forecasts on their phones, and among those who remembered specifically the monsoon onset forecasts, nearly half reported using the information to adjust their planting decisions. A majority of farmers also shared these messages with other farmers, suggesting an even greater reach and impact.

“I shared the monsoon arrival forecasts with other farmers in my locality. We usually talk to each other and share useful information that we come across,” Tiwari said. “Some farmers have benefited from the information I shared about the arrival of the monsoon. I feel that others will also start relying on this information and trust it for their agricultural decision-making.”

“Disseminating AI weather forecasts has an incredibly high return on investment, likely generating more than $100 for farmers for each dollar invested by the government,” said Kremer, co-director of UChicago’s Human-Centered Weather Forecasts Initiative. “India is leading the way in using AI to improve people’s lives across many sectors, including agriculture.”

The effort was partially supported by catalytic funding from AIM for Scale, a global initiative backed by the Gates Foundation and the United Arab Emirates, which works to scale up evidence-backed, cost-effective agricultural innovations for the benefit of farmers in low- and middle-income countries. The researchers behind the project are now working with AIM for Scale to start similar programs in other low- and middle-income countries and to train government meteorologists in the global south on how to use AI models effectively.

“One of the things we would like to do for future years, hopefully for next year, is to be able to predict dry spells throughout the entire summer, issuing predictions of the likelihood of a dry period occurring within the next two to three weeks,” Boos said.

Machine learning unravels quantum atomic vibrations in materials

Published

September 16, 2025

Robert Egan

Credit: Rosa Romano, EAS Communications/Caltech

Caltech scientists have developed an artificial intelligence (AI)–based method that dramatically speeds up calculations of the quantum interactions that take place in materials. In new work, the group focuses on interactions among atomic vibrations, or phonons—interactions that govern a wide range of material properties, including heat transport, thermal expansion, and phase transitions. The new machine learning approach could be extended to compute all quantum interactions, potentially enabling encyclopedic knowledge about how particles and excitations behave in materials.

Scientists like Marco Bernardi, professor of applied physics, physics, and materials science at Caltech, and his graduate student Yao Luo (MS ’24) have been trying to find ways to speed up the gargantuan calculations required to understand such particle interactions from first principles in real materials—that is, beginning with only a material’s atomic structure and the laws of quantum mechanics.

Last year, Bernardi and Luo developed a data-driven method based on a technique called singular value decomposition (SVD) to simplify the enormous mathematical matrices scientists use to represent the interactions between electrons and phonons in a material.
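To give a sense of how SVD can simplify a large matrix, here is a minimal NumPy sketch of generic truncated SVD. It is not the Caltech group's code: the matrix is synthetic random data standing in for an interaction matrix, and the sizes and the retained rank `k` are arbitrary assumptions.

```python
import numpy as np

# Synthetic stand-in for an interaction matrix: nearly rank 5, plus small noise.
rng = np.random.default_rng(0)
rows, cols, hidden_rank = 200, 150, 5
matrix = rng.standard_normal((rows, hidden_rank)) @ rng.standard_normal((hidden_rank, cols))
matrix += 1e-3 * rng.standard_normal((rows, cols))

# Full SVD, then keep only the k largest singular values and their vectors.
U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
k = 5
approx = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Compare accuracy and storage of the compressed form against the original.
rel_error = np.linalg.norm(matrix - approx) / np.linalg.norm(matrix)
full_storage = rows * cols
compressed_storage = k * (rows + cols + 1)
print(f"relative error: {rel_error:.2e}")
print(f"stored numbers: {compressed_storage} instead of {full_storage}")
```

When most of a matrix's information sits in a few singular values, the truncated form needs far fewer numbers while reproducing the original closely.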

The case of phonon interactions is even more complex. These interactions are encoded in multidimensional objects called tensors, generalizations of vectors and matrices in higher dimensions. The complexity of these tensors grows exponentially with the number of particles involved, limiting scientists’ understanding of interactions involving three or more phonons.

Now, inspired by recent advances in machine learning, Bernardi and Luo have developed an AI-based technique that sifts through the high-order tensors that encode phonon interactions in a material and extracts only the crucial bits needed to complete the calculations that explain thermal transport. They describe the work in a paper that appears in the journal Physical Review Letters.

Using current state-of-the-art techniques, a supercomputer takes hours or days to calculate the interactions between three or four phonons in a material. The new method enables computers to complete the same thermal transport and phonon dynamics calculations 1,000 to 10,000 times faster, all while maintaining accuracy.

“The calculations for four-phonon interactions are a nightmare,” Bernardi says. “For complex materials, this task would involve weekslong calculations. Now we can do them in 10 seconds.”

Bernardi explains more about the method:

“We use a machine learning technique called CANDECOMP/PARAFAC tensor decomposition, but we had to adapt it to satisfy the symmetry of this specific physical problem. We first set up a neural network and then run it on GPUs and ask: ‘What are the best functions to approximate the actual tensor that describes these phonon interactions?’

“Once we fix the number of product terms we want to keep, the machine learning process returns the best functions to approximate the full tensor. We typically only need a few of these products, saving orders of magnitude in computational complexity compared to using the full tensor. This method allows us to learn the compressed form of phonon interactions, and we can still use these highly compressed tensors to compute all the observables of interest with the same accuracy.”
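The idea of approximating a tensor by a fixed number of product terms can be illustrated with a textbook CANDECOMP/PARAFAC (CP) decomposition. The sketch below is an assumption-laden stand-in, not the method in the paper: it fits a synthetic three-way tensor with plain alternating least squares in NumPy, whereas Bernardi describes a neural network run on GPUs and adapted to the symmetry of the physical problem. All function names, sizes and the rank are hypothetical.

```python
import numpy as np


def khatri_rao(B, C):
    """Column-wise Kronecker product; rows are ordered as (j, k) pairs."""
    J, R = B.shape
    K, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)


def cp_als(T, rank, n_iter=300, seed=0):
    """Fit T[i, j, k] ~ sum_r A[i, r] * B[j, r] * C[k, r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    # Unfold the tensor along each of its three modes.
    T1 = T.reshape(I, J * K)
    T2 = np.moveaxis(T, 1, 0).reshape(J, I * K)
    T3 = np.moveaxis(T, 2, 0).reshape(K, I * J)
    for _ in range(n_iter):
        # Update one factor at a time, holding the other two fixed.
        A = T1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = T2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = T3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C


# Synthetic tensor that is exactly a sum of three product terms.
rng = np.random.default_rng(1)
dim, rank = 20, 3
true_factors = [rng.standard_normal((dim, rank)) for _ in range(3)]
tensor = np.einsum("ir,jr,kr->ijk", *true_factors)

A, B, C = cp_als(tensor, rank)
approx = np.einsum("ir,jr,kr->ijk", A, B, C)
rel_error = np.linalg.norm(tensor - approx) / np.linalg.norm(tensor)
full_storage = tensor.size
compressed_storage = rank * sum(tensor.shape)
print(f"relative error: {rel_error:.2e}")
print(f"stored numbers: {compressed_storage} instead of {full_storage}")
```

The storage comparison shows where the savings come from: the full tensor grows with the product of its dimensions, while the compressed form grows only with their sum times the number of product terms kept.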

Bernardi adds that the new method is well suited for high-throughput screening of thermal physics and heat transport in large material databases, a major effort in the materials community. As for future work, he says, “My vision right now is to compress all different types of quantum interactions and high-order processes in materials with similar techniques. The key will be to bypass the formation of large tensors altogether and to learn the interactions directly in compressed form.”

The paper is titled “Tensor Learning and Compression of N-phonon Interactions.” Additional authors are Dhruv Mangtani, who worked on the project as a SURF student in Bernardi’s lab; Shiyu Peng, a postdoctoral scholar research associate; and Caltech graduate students Jia Yao (MS ’25) and Sergei Kliavinek.

More information:
Yao Luo et al, Tensor Learning and Compression of N-Phonon Interactions, Physical Review Letters (2025). DOI: 10.1103/nmgj-yq1g link.aps.org/doi/10.1103/nmgj-yq1g. On arXiv: DOI: 10.48550/arxiv.2503.05913

Provided by
California Institute of Technology

Citation:
Machine learning unravels quantum atomic vibrations in materials (2025, September 16)
retrieved 16 September 2025
from https://phys.org/news/2025-09-machine-unravels-quantum-atomic-vibrations.html

YouTube Unveils AI-Powered Tools for Creators

Published

September 16, 2025

PYMNTS

An artificial intelligence (AI)-powered “creative partner” for creators is one of several AI tools unveiled Tuesday (Sept. 16) by YouTube.

The company announced these new offerings during its Made on YouTube event.

YouTube’s AI-powered creative partner, a new YouTube Studio tool called Ask Studio, can answer questions about things like how the creator’s latest video is performing and what is being said about their editing style, according to a Tuesday blog post.

“It’ll provide personalized and actionable strategic insights based on knowledge of you as a Creator, your channel and how YouTube works,” Amjad Hanif, vice president of creator products at YouTube, said in the post. “We’ll keep adding more capabilities in the future.”

Hanif also said in the post that YouTube has expanded the availability of its AI-powered likeness detection tool in open beta to all YouTube Partner Program creators. This tool helps creators safeguard their identity by detecting, managing and requesting the removal of unauthorized videos made with their facial likeness.

For its livestreaming platform YouTube Live, the company has added AI-powered highlights, a tool that creates lasting content from live content, according to another Tuesday blog post.

“It finds the most compelling moments from the livestream and automatically creates ready-to-share Shorts,” Aaron Filner, senior director, product management at YouTube, said in the post.

YouTube also announced new creation tools for Shorts that can generate video with sound, bring photos to life by applying motion from a video, give video footage new looks with styles like pop art or origami, and add objects to videos via a text description, per another Tuesday blog post.

The company is also experimenting with a feature called Edit with AI that will be added to Shorts and the YouTube Create app and will generate a first draft of a video from the user’s raw camera roll footage, according to the post.

“This gives you a solid starting point so you can jump straight to the fun part: personalizing your video and bringing your unique vision to life,” Dina Berrada, director of product, generative AI creation, at YouTube, said in the post.

To help creators earn more, YouTube has introduced an AI-powered system in YouTube Shopping that tags products in videos, according to another Tuesday blog post.

“We know tagging products can be time-consuming, so to make the experience better for creators, we’re leaning on an AI-powered system to identify the optimal moment a product is mentioned and automatically display the product tag at that time, capturing viewer interest when it’s highest,” Todd Sherman, senior director, product management, and Michael Beckmann, director, product management, data and creator earnings, said in the post.

YouTube parent company Alphabet said in October 2024 that it wants its products—from Google to Android to YouTube—to be synonymous with AI.
