At Least 15 Million YouTube Videos Have Been Snatched by AI Companies

Editor’s note: This analysis is part of The Atlantic’s investigation into how YouTube videos are taken to train AI tools. You can use the search tool directly here, to see whether videos you’ve created or watched are included in the data sets. This work is part of AI Watchdog, The Atlantic’s ongoing investigation into the generative-AI industry.
When Jon Peters uploaded his first video to YouTube in 2010, he had no idea where it would lead. He was a professional woodworker running a small business who decided to film himself making a dining table with some old legs he had found in a barn. It turned out that people liked his candid style, and as he posted more videos, a fan base began to grow. “All of a sudden there’s people who appreciate the work I’m doing,” he told me. “The comments were a motivator.” Fifteen years later, his channel has more than 1 million subscribers. Sometimes he gets photos of people in their shops, following his guidance from a big TV on the wall—most of his viewers, Peters told me, are woodworkers looking to him for instruction.
But Peters’s channel could soon be obsolete, along with millions of other videos created by people who share their expertise and advice on YouTube. Over the past few months, I’ve discovered more than 15.8 million videos from more than 2 million channels that tech companies have, without permission, downloaded to train AI products. Nearly 1 million of them, by my count, are how-to videos. You can find these videos in at least 13 different data sets distributed by AI developers at tech companies, universities, and research organizations, through websites such as Hugging Face, an online AI-development hub.
In most cases the videos are anonymized, meaning that titles and creator names are not included. I was able to identify the videos by extracting unique identifiers from the data sets and looking them up on YouTube—similar to the process I followed when I revealed the contents of the Books3, OpenSubtitles, and LibGen data sets. You can search the data sets using the tool below, typing in channel names like “MrBeast” or “James Charles,” for example.
(A note for users: Just because a video appears in these data sets does not mean it was used for training by AI companies, which could choose to omit certain videos when developing their products.)
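For the technically curious, the lookup itself is simple in principle: pull the 11-character YouTube identifiers out of a data set's metadata and rebuild the public watch URLs. The sketch below illustrates the idea; the file layout and the "video_id" field name are assumptions for illustration, not details of any particular data set.

```python
# Minimal sketch (hypothetical metadata layout): many video data sets store one
# JSON record per line. Pulling out anything that looks like an 11-character
# YouTube ID is enough to rebuild the watch URL and check the video's title,
# channel, and availability.
import json
import re

YOUTUBE_ID = re.compile(r"^[A-Za-z0-9_-]{11}$")  # YouTube's ID alphabet

def ids_from_jsonl(path, field="video_id"):
    """Yield plausible YouTube IDs found in one metadata file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            candidate = str(record.get(field, ""))
            if YOUTUBE_ID.match(candidate):
                yield candidate

if __name__ == "__main__":
    for vid in ids_from_jsonl("dataset_metadata.jsonl"):  # hypothetical file name
        print(f"https://www.youtube.com/watch?v={vid}")
```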
To create AI products capable of generating video, developers need huge quantities of videos, and YouTube has become a common source. Although YouTube does offer paying subscribers the ability to download videos and watch them through the company’s app whenever they’d like, this is something different: Video files are being ripped from YouTube en masse and saved in files that are then fed to AI algorithms. This kind of downloading violates the platform’s terms of service, but many tools allow AI developers to download videos in this way. YouTube appears to have done little, if anything, to stop the mass downloading, and the company did not respond to my request for comment.
Not all YouTube videos are copyrighted (and some are uploaded by people who don’t own the copyrights), but many are. Unauthorized copying or distribution of those videos is illegal, but whether AI training constitutes a form of copying or distribution is still a question being debated in many ongoing lawsuits. Tech companies have argued that training is a “fair use” of copyrighted work, and some judges have disagreed with that position. How the courts ultimately apply the law to this novel technology could have massive consequences for creators’ motivations to post their work on YouTube and similar platforms—if tech companies are able to continue taking creators’ work to build AI products that compete with them, then creators may have little choice but to stop sharing.
Generative-AI tools are already producing videos that compete with human-made work on YouTube. AI-generated history videos with hundreds of thousands of views and many inaccuracies are drowning out fact-checked, expert-produced content. Popular music-remix videos are frequently created using this technology, and many of them perform better than human-made videos.
The problem extends far beyond YouTube, however. Most modern chatbots are “multimodal,” meaning they can respond to a question by creating relevant media. Google’s Gemini chatbot, for instance, will produce short clips for paying users. Soon, you may be able to ask ChatGPT or another generative-AI tool about how to build a table from found legs and get a custom how-to video in response. Even if that response isn’t as good as any video Peters would make, it will be immediate, and it will be tailor-made to your specifications. The online-publishing business has already been decimated by text-generation tools; video creators should expect similar challenges from generative-AI tools in the near future.
Many major tech companies have used these data sets to train AI, according to research papers I’ve read and AI developers I’ve spoken with. The group includes Microsoft, Meta, Amazon, Nvidia, Runway, ByteDance, Snap, and Tencent. I reached out to each of these companies to ask about their use of these data sets. Only Meta, Amazon, and Nvidia responded. All three said they “respect” content creators and believe that their use of the work is legal under existing copyright law. Amazon also shared that, where video is concerned, it is currently focused on developing ways to generate “compelling, high-quality advertisements from simple prompts.”
We can’t be certain whether all these companies will use the videos to create for-profit video-generating tools. Some of the work they’ve done may be simply experimental. But a few of these companies have an obvious interest in pursuing commercial products: Meta, for instance, is developing a suite of tools called Movie Gen that creates videos from text prompts, and Snap offers “AI Video Lenses” that allow users to augment their videos with generative AI. Videos such as the ones in these data sets are the raw material for products like these; much as ChatGPT couldn’t write like Shakespeare without first “reading” Shakespeare, a video generator couldn’t construct a fake newscast without “watching” tons of recorded broadcasts. In fact, a large number of the videos in these data sets are from news and educational channels, such as the BBC (which has at least 33,000 videos in the data sets, across its various brands) and TED (nearly 50,000). Hundreds of thousands of others—if not more—are from individual creators, such as Peters.
AI companies are more interested in some videos than others. A spreadsheet leaked to 404 Media by a former employee at Runway, which builds AI video-generation tools, shows what the company valued about certain channels: “high camera movement,” “beautiful cinematic landscapes,” “high quality scenes from movies,” “super high quality sci-fi short films.” One channel was labeled “THE HOLY GRAIL OF CAR CINEMATICS SO FAR”; another was labeled “only 4 videos but they are really well done.”
Developers seek out high-quality videos in a variety of ways. Curators of two of the data sets collected here—HowTo100M and HD-VILA-100M—prioritized videos with high view counts on YouTube, equating popularity with quality. The creators of another data set, HD-VG-130M, noted that “high view count does not guarantee video quality,” and used an AI model to select videos of high “aesthetic quality.” Data-set creators often try to avoid videos that contain overlaid text, such as subtitles and logos, so these identifying features don’t appear in videos generated by their model. So, some advice for YouTubers: Putting a watermark or logo on your videos, even a small one, makes them less desirable for training.
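As a rough illustration of those curation heuristics, the sketch below keeps a clip only if it is popular or scores well on an aesthetic model, and drops anything with overlaid text. The field names and thresholds are illustrative assumptions, not values drawn from HowTo100M, HD-VILA-100M, or HD-VG-130M.

```python
# Illustrative filter combining the heuristics described above: favor popular or
# "aesthetic" clips and reject clips with subtitles, logos, or watermarks.
# All field names and thresholds here are assumptions, not real data-set values.
def keep_for_training(record, min_views=100_000, min_aesthetic=5.0):
    """Return True if a clip record passes simple quality filters."""
    if record.get("has_overlay_text"):  # subtitles, logos, watermarks
        return False
    popular = record.get("view_count", 0) >= min_views
    aesthetic = record.get("aesthetic_score", 0.0) >= min_aesthetic
    return popular or aesthetic

candidates = [
    {"view_count": 2_400_000, "aesthetic_score": 4.1, "has_overlay_text": False},
    {"view_count": 900, "aesthetic_score": 6.3, "has_overlay_text": True},
]
print([c for c in candidates if keep_for_training(c)])  # only the first survives
```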
To prepare the videos for training, developers split the footage into short clips, in many cases cutting wherever there is a scene or camera change. Each clip is then given an English-language description of the visual scene so the model can be trained to correlate words with moving images, and to generate videos from text prompts. AI developers have a few methods of writing these captions. One way is to pay workers to do it. Another is to use separate AI models to generate a description automatically. The latter is more common, because of its lower cost.
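Here is a minimal sketch of that preparation step. It uses the open-source PySceneDetect library to cut a local video file at scene changes (ffmpeg writes the clips); the captioning function is a placeholder standing in for whatever annotation method a developer chooses, not a real API.

```python
# Minimal sketch: split a video at scene changes and pair each clip with a
# text description. Requires PySceneDetect (pip install scenedetect[opencv])
# and ffmpeg on the system path. The caption step is a placeholder.
from scenedetect import detect, ContentDetector, split_video_ffmpeg

def describe_clip(start, end):
    """Stand-in for an automatic captioning model (assumption, not a real API)."""
    return f"caption for clip {start.get_timecode()}-{end.get_timecode()}"

video = "downloaded_video.mp4"                 # hypothetical local file
scenes = detect(video, ContentDetector())      # find cuts and hard camera changes
split_video_ffmpeg(video, scenes)              # write one clip file per scene

# Each (description, clip) pair becomes one training example for a video model.
training_pairs = [(describe_clip(start, end), (start, end)) for start, end in scenes]
print(f"{len(training_pairs)} clips prepared")
```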
AI video tools aren’t yet as mainstream as chatbots or image generators, but they are already in wide use. You may already have seen AI-manipulated video without realizing it. For example, TED has been using AI to dub speakers’ talks in different languages. This includes the video as well as the audio: Speakers’ mouths are lip-synched with the new words so it looks like they’re speaking Japanese, French, or Russian. Nishat Ruiter, TED’s general counsel, told me this is done with the speakers’ knowledge and consent.
There are also consumer-facing products for tweaking videos with AI. If your face doesn’t look right, for example, you can try a face-enhancer such as Facetune, or ditch your mug entirely with a face-swapper such as Facewow. With Runway’s Aleph, you can change the colors of objects, or turn sunshine into a snowstorm.
Then there are tools that generate new videos based on an image you provide. Google encourages Gemini users to animate their “favorite photos.” The result is a clip that extrapolates eight seconds of movement from an initial image, making a person dance, cook, or swing a golf club. These are often both amazing and creepy. “Talking head generation”—for employee-orientation videos, for example—is also advancing. Vidnoz AI promises to generate “Realistic AI Spokespersons of Any Style.” A company called Arcads will generate a complete advertisement, with actors and voiceover. ByteDance, the company that operates TikTok, offers a similar product called Symphony Creative Studio. Other applications of AI video generation include virtual try-on of clothes, generating custom video games, and animating cartoon characters and people.
Some companies are simultaneously working with AI and fighting to defend their content from being pilfered by AI companies. This reflects the Wild West mentality in AI right now—companies exploiting legal gray areas to see how they can profit. As I investigated these data sets, I learned about an incident involving TED—again, one of the most-pilfered organizations in the data sets captured here, and one that is attempting to employ AI to advance its own business. In June, the Cannes Lions international advertising festival gave one of its Grand Prix awards to an ad that included deepfaked footage from a TED talk by DeAndrea Salvador, currently a state senator in North Carolina. The ad agency, DM9, “used AI cloning to change her talk and repurposed it for a commercial ad campaign,” Ruiter told me on a video call recently. When the manipulation was discovered, the Cannes Lions festival withdrew the award. Last month, Salvador sued DM9 along with its clients—Whirlpool and Consul—for misappropriation of her likeness, among other things. DM9 apologized for the incident and cited “a series of failures in the production and sending” of the ad. A spokesperson from Whirlpool told me the company was unaware the senator’s remarks had been altered.
Others in the film industry have filed lawsuits against AI companies for training with their content. In June, Disney and Universal sued Midjourney, the maker of an image-generating tool that can produce images containing recognizable characters (Warner Brothers joined the lawsuit last week). The lawsuit called Midjourney a “bottomless pit of plagiarism.” The following month, two adult-film companies sued Meta for downloading (and distributing through BitTorrent) more than 2,000 of their videos. Neither Midjourney nor Meta has responded to the allegations, and neither responded to my request for comment. One YouTuber filed their own lawsuit: In August of last year, David Millette sued Nvidia for unjust enrichment and unfair competition with regard to the training of its Cosmos AI, but the case was voluntarily dismissed months later.
The Disney characters and the deepfaked Salvador ad are just two instances of how these tools can be damaging. The floodgates may soon be opening further. Thanks to the enormous amount of investment in the technology, generated videos are beginning to appear everywhere. One company, DeepBrain AI, pays “creators” to post AI-generated videos made with its tools on YouTube. It currently offers $500 for a video that gets 10,000 views, a relatively low threshold. Companies that run social-media platforms, such as Google and Meta, also pay users for content, through ad-revenue sharing, and many directly encourage the posting of AI-generated content. Not surprisingly, a coterie of gurus has arrived to teach the secrets of making money with AI-generated content.
Google and Meta have also trained AI tools on large quantities of videos from their own platforms: Google has taken at least 70 million clips from YouTube, and Meta has taken more than 65 million clips from Instagram. If these companies succeed in flooding their platforms with synthetic videos, human creators could be left with the unenviable task of competing with machines that churn out endless content based on their original work. And social media will become even less social than it is.
I asked Peters if he knew his videos had been taken from YouTube to train AI. He said he didn’t, but he wasn’t surprised. “I think everything’s gonna get stolen,” he told me. But he didn’t know what to do about it. “Do I quit, or do I just keep making videos and hope people want to connect with a person?”
Rising cost of school uniform is scary, says mum from Luton

Julita Waleskiewicz, East of England

A mother-of-three said she has found it “scary” trying to keep up with the cost of sending her children to school.
Lauren Barford-Dowling, 27, from Luton, described the price of uniforms, shoes, meals and trips as “daunting”.
Level Trust, a Luton-based charity that provides free school supplies to families, said demand for its services had risen by up to 20% compared with last year.
“You want them to look their best, but it’s hard to keep up,” Ms Barford-Dowling added.

Ms Barford-Dowling has three children aged 10, six and five – and a fourth on the way.
She said branded jumpers and tops have risen in price, adding: “I worry about having enough money for all the essentials like shoes, trainers, trousers, dresses, tops.
“Three pairs of trainers cost over £100 – and they’ll be ruined in a couple of months. It’s scary.”
School meals also add to the pressure, she said, and her eldest child’s lunches cost £44 a month.
“When all three move up to Key Stage 2, I’ll be paying nearly £100 a month just so they can eat,” she added.

Ms Barford-Dowling said the Level Trust provided her children with free school shoes and trainers for PE.
Kerri Porthouse, the deputy chief executive of the charity, said demand for the organisation’s services had risen.
“We’ve already seen an increase of between 15% and 20% compared with last year.
“That’s 200 more families in July and August alone. It’s a huge increase for a charity to cope with.
“Parents with children moving into reception or secondary often don’t realise how much uniform is needed until school begins. Then they come to us in a panic,” she said.
Research by the Child Poverty Action Group found it cost £1,000 a year to send a child to primary school and £2,300 for secondary.
Kate Anstey, the group’s head of education policy, said children from low-income families were dropping subjects because of the price of trips and equipment.
“Too many children are growing up in poverty, and it’s having a stark impact on their school day,” she said.
A Department for Education spokesperson said: “No child should face barriers to their education because of their family’s finances.
“We are capping the number of branded uniform items schools can require, and from 2026 all children in households on Universal Credit will be entitled to free school meals.”
Millions missing out on benefits and government support, analysis suggests

Dan Whitworth, Reporter, Radio 4 Money Box

New analysis suggests seven million households are missing out on £24bn of financial help and support because of unclaimed benefits and social tariffs.
The research from Policy in Practice, a social policy and data analytics company, says awareness, complexity and stigma are the main barriers stopping people claiming.
This analysis covers benefits across England, Scotland and Wales such as universal credit and pension credit, local authority help including free school meals and council tax support, as well as social tariffs from water, energy and broadband providers.
The government said it ran public campaigns to promote benefits and pointed to the free Help to Claim service.
Andrea Paterson in London persuaded her mum, Sally, to apply for attendance allowance on behalf of her dad, Ian, last December after hearing about the benefit on Radio 4’s Money Box.
Ian, who died in May, was in poor health at the time and he and Sally qualified for the higher rate of attendance allowance of £110 per week, which made a huge difference to their finances, according to Andrea.
“£110 per week is a lot of money and they weren’t getting the winter fuel payment anymore,” she said.
“So the first words that came out of Mum’s mouth were ‘well, that will make up for losing the winter fuel payment’, which [was] great.
“All pensioners worry about money, everyone in that generation worries about money. I think it eased that worry a little bit and it did allow them to keep the house [warmer].”
Unclaimed benefits increasing
In its latest report, Policy in Practice estimates that £24.1bn in benefits and social tariffs will go unclaimed in 2025-26.
It previously estimated that £23bn would go unclaimed in 2024-25, and £19bn the year before that, although this year’s calculations are more detailed than ever before.
“There are three main barriers to claiming – awareness, complexity and stigma,” said Deven Ghelani, founder and chief executive of Policy in Practice.
“With awareness people just don’t know these benefits exist or, if they do know about them, they just immediately assume they won’t qualify.
“Then you’ve got complexity, so being able to complete the form, being able to provide the evidence to be able to claim. Maybe you can do that once but actually you have to do it three, four, five, six, seven times depending on the support you’re potentially eligible for and people just run out of steam.
“Then you’ve got stigma. People are made to feel it’s not for them or they don’t trust the organisation administering that support.”
Although a lot of financial support is going unclaimed, the report does point to progress being made.
More older people are now claiming pension credit, with that number expected to continue to rise.
Some local authorities are reaching 95% of students eligible for free school meals because of better use of data.
Gateway benefits
Official figures show the government is forecast to spend £316.1bn in 2025-26 on the social security system in England, Scotland and Wales, accounting for 10.6% of GDP and 23.5% of total government spending.
Responding to criticism that the benefits bill is already too large, Mr Ghelani said: “The key thing is you can’t rely on the system being too complicated to save money.
“On the one hand you’ve designed these systems to get support to people and then you’re making it hard to claim. That doesn’t make any sense.”
A government spokesperson said: “We’re making sure everyone gets the support they are entitled to by promoting benefits through public campaigns and funding the free Help to Claim service.
“We are also developing skills and opening up opportunities so more people can move into good, secure jobs, while ensuring the welfare system is there for those who need it.”
The advice, if you think you might be eligible, is to claim, especially for support like pension credit, known as a gateway benefit, which can lead to other financial help for those who are struggling.
Robin, from Greater Manchester, told the BBC that being able to claim pension credit was vital to his finances.
“Pension credit is essential to me to enable me to survive financially,” he said.
“[But] because I’m on pension credit I get council tax exemption, I also get free dental treatment, a contribution to my spectacles and I get the warm home discount scheme as well.”
Free Training for Small Businesses

Google’s latest initiative in Pennsylvania is set to transform how small businesses harness artificial intelligence, marking a significant push by the tech giant to democratize AI tools across the Keystone State. Announced at the AI Horizons Summit in Pittsburgh, the Pennsylvania AI Accelerator program aims to equip local entrepreneurs with essential skills and resources to integrate AI into their operations. This move comes amid a broader effort by Google to foster economic growth through technology, building on years of investment in the region.
Drawing from insights in a recent post on Google’s official blog, the accelerator offers free workshops, online courses, and hands-on training tailored for small businesses. Participants can learn to use AI for tasks like customer service automation and data analysis, potentially boosting efficiency and competitiveness. The program is part of Google’s Grow with Google initiative, which has already trained thousands in digital skills nationwide.
Strategic Expansion in Pennsylvania
Recent web searches reveal that Google’s commitment extends beyond training, with plans for substantial infrastructure investments. According to a report from GovTech, the company intends to pour about $25 billion into Pennsylvania’s data centers and AI facilities over the next two years. This investment underscores Pennsylvania’s growing role as a hub for tech innovation, supported by its proactive government policies on AI adoption.
Posts on X highlight the buzz around this launch, with users noting Google’s long-standing presence in the state, including digital skills programs that have generated billions in economic activity. For instance, sentiments from local business communities emphasize the accelerator’s potential to level the playing field for small enterprises against larger competitors.
Impact on Small Businesses
A deeper look into news from StartupHub.ai analyzes Google’s strategy, suggesting the accelerator could accelerate AI adoption among small and medium-sized businesses (SMBs), fostering innovation and job creation. The program includes access to tools like Gemini AI, enabling businesses to automate routine tasks and gain insights from data without needing extensive technical expertise.
Industry insiders point out that this initiative aligns with Pennsylvania’s high ranking in government AI readiness, as detailed in a City & State Pennsylvania analysis. The state’s forward-thinking approach, including pilots with technologies like ChatGPT in government operations, creates a fertile environment for such private-sector programs.
Collaborations and Broader Ecosystem
Partnerships are key to the accelerator’s success. News from Editor and Publisher reports on collaborations with entities like the Pennsylvania NewsMedia Association and Google News Initiative, extending AI benefits to media and other sectors. These alliances aim to sustain local industries through targeted accelerators.
Moreover, X posts from figures like Governor Josh Shapiro showcase the state’s enthusiasm, citing time savings from AI in public services that mirror potential gains for businesses. Google’s broader efforts, such as the AI for Education Accelerator involving Pennsylvania universities, indicate a holistic approach to building an AI-savvy workforce.
Future Prospects and Challenges
While the accelerator promises growth, challenges remain, including ensuring equitable access for rural businesses and addressing AI ethics. Insights from Google’s blog on AI training emphasize responsible implementation, with resources to mitigate biases and privacy concerns.
As Pennsylvania positions itself as an AI leader, Google’s program could serve as a model for other states. With ongoing updates from web sources and social media, the initiative’s evolution will likely reveal its true economic impact, potentially reshaping how small businesses thrive in an AI-driven era.