AI Insights
LLM training and copyright: fair use to boost generative artificial intelligence with a first-of-its-kind decision | by Raffaella Aghemo | Jul, 2025

The case Andrea Bartz, et al. v. Anthropic PBC represents a watershed moment for the training of generative artificial intelligence systems: on summary judgment, a federal judge in the Northern District of California ruled favourably on the use of copyrighted material to train LLMs.
But the nuances of this result are far subtler, and I will try to analyze them in this contribution.
In the Order of 23 June 2025, we read in the introduction: “An artificial intelligence company downloaded millions of copyrighted books in digital format for free from pirate sites on the Internet. The company also purchased copyrighted books (some overlapping with those acquired from pirate sites), tore off the bindings, scanned every page and stored them in digitised, searchable files. All the above was done to accumulate a central library of “all the books in the world” to be preserved “forever”. From this central library, the artificial intelligence company selected various sets and subsets of digitised books to train various large language models under development to feed its artificial intelligence services. Some of these books were written by plaintiff authors, who are now suing for copyright infringement. On summary judgment, the issue is the extent to which any of the uses of the works in question qualify as ‘fair use’ under Section 107 of the Copyright Act.”
Anthropic acquired over seven million books, legally and otherwise, to train its AI assistant, Claude. According to the Order, each work selected for training was then copied in four main ways:
1. Each selected book was copied from the library to create a working copy for training.
2. Each book was “cleaned up” by removing low-value or repetitive content (e.g. footers).
3. The cleaned books were converted into “tokenized” versions by simplifying them and breaking them down into short character sequences, then translated into numeric tokens using Anthropic’s custom dictionary. These tokens were repeatedly used during training, enabling the model to discover statistical relationships between huge amounts of textual data.
4. Each fully trained LLM retained “compressed” copies of the books, or so the authors contend and the Order assumes.
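The tokenization described in step 3 can be illustrated with a toy sketch. This is not Anthropic's tokenizer: the fixed three-character chunking and the function names `chunk`, `build_vocab`, `encode` and `decode` are assumptions made purely for illustration, and production LLM tokenizers typically rely on learned subword vocabularies instead.

```python
def chunk(text: str, size: int = 3) -> list[str]:
    """Break text into short character sequences of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_vocab(chunks: list[str]) -> dict[str, int]:
    """Build a toy 'dictionary': each distinct chunk gets an integer ID,
    assigned in order of first appearance."""
    vocab: dict[str, int] = {}
    for piece in chunks:
        vocab.setdefault(piece, len(vocab))
    return vocab


def encode(text: str, vocab: dict[str, int], size: int = 3) -> list[int]:
    """Translate text into numeric tokens (raises KeyError on unknown chunks)."""
    return [vocab[piece] for piece in chunk(text, size)]


def decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Map numeric tokens back to text using the inverted dictionary."""
    inverse = {token_id: piece for piece, token_id in vocab.items()}
    return "".join(inverse[token_id] for token_id in ids)


text = "the cat sat on the mat"
pieces = chunk(text)
vocab = build_vocab(pieces)
ids = encode(text, vocab)
print(ids)  # the repeated chunk "the" maps to the same ID both times
```

The point of the sketch is only that the model is trained on sequences of integers rather than on the text itself, and that repeated character sequences share one ID.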
“Plaintiffs Andrea Bartz, Charles Graeber and Kirk Wallace Johnson are authors of books that Anthropic copied from pirated and purchased sources, and which it assembled into a central library of its own, further copying various sets and subsets of those library copies to include in various “data mixes” used to train various LLMs.”
“In 2021, another co-founder of Anthropic, Ben Mann, downloaded Books3, an online library of 196,640 books that he knew had been assembled from unauthorised copies of copyrighted books, i.e. pirated. Anthropic’s subsequent pirated acquisitions involved downloading distributed and shared copies of other pirated libraries. In June 2021, Mann downloaded at least five million copies of books from Library Genesis, or LibGen, which he knew had been pirated. And, in July 2022, he downloaded at least two million copies of books from Pirate Library Mirror, or PiLiMi, which Anthropic knew had been pirated.”
The document explains Anthropic’s acquisition procedure for training its LLMs in detail, but it also stresses that the training copies went no further: “However, the training copies did not go any further and spread to the outside world. When each LLM was included in a public-facing version of Claude, it was complemented by other software that filtered user input to the LLM and filtered output from the LLM to the user.”
In the Order there follows the ANALYSIS, which we can summarize by drawing on the Order’s own conclusions, highlighted in bold: “the use of the books in question to train Claude and its precursors was extremely transformative and constituted fair use within the meaning of Section 107 of the Copyright Act. And the digitization of the books purchased in paper form by Anthropic was also a fair use, but not for the same reason that applies to training copies. Instead, it was a fair use because all Anthropic did was replace the printed copies it had purchased for its central library with more convenient, space-saving and searchable digital copies for its central library, without adding new copies, creating new works or redistributing existing copies. However, Anthropic had no right to use pirated copies for its central library. The creation of a permanent, general-purpose library did not in itself constitute fair use justifying Anthropic’s piracy.”
Fair use is assessed on the basis of the following four factors:
(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) the effect of the use upon the potential market for or value of the copyrighted work.
Here, the parties dispute which use or uses are at issue. Anthropic claims to have copied the authors’ books for only one use: to train LLMs. The authors, by contrast, argue that it did so for at least two uses: first to build a large central library of potentially useful content, and second to train specific LLMs using varying sets and subsets of that content, over time selecting the most well-organised and best-expressed works for training. The authors also complained that changing the format from print to digital was in itself an infringement.
The authors argue that using the works to train Claude’s underlying LLM was like using works to train any person to read and write, so the authors should be able to exclude Anthropic from this use. But the authors cannot legitimately exclude anyone from using their works for training or learning as such. Everyone reads texts, then writes new ones. To charge anyone specifically for the use of a book every time they read it, every time they recall it from memory, and every time they later draw on it to write new things in new ways would be unthinkable. For centuries we have read and re-read books. We have admired, memorised and internalised their sweeping themes, their substantive points and their stylistic solutions to writing problems.
“In short, the purpose and nature of using copyrighted works to train LLMs to generate new text was essentially transformative. Like any reader who aspires to become a writer,”
On the conversion of legally acquired printed texts into digital copies, the Order further states: “Anthropic purchased its printed copies honestly and fairly. With each purchase came Anthropic’s right to ‘dispose[ ]’ of each copy as it saw fit. Thus, Anthropic had the right to keep the copies in its central library for all ordinary uses.” After all, destroying the binding in order to scan the pages was a necessary step, just as, to cite the document’s own example, in Sony Betamax the Supreme Court ruled that recording a television programme in order to watch it later created a copy but did not usurp any legitimate interest of the copyright holder!
In short, on the authors’ first two contentions the Court found fair use: using the works as training input (not for direct replication, but to enable the generation of new content) was legitimate, as was the switch from paper to digital format. Anthropic’s downloading and storage of more than seven million pirated books, without payment, was a different matter.
This is a first-of-its-kind case. Just two days after the ruling was issued, another judge in the Northern District of California ruled in Kadrey et al. v. Meta Platforms Inc. and likewise concluded that the artificial intelligence technology at issue in his case was transformative. However, the basis of his ruling in favour of Meta on the issue of fair use was not transformation, but rather the failure of the plaintiffs to “present significant evidence that Meta’s use of their works to create [a generative artificial intelligence engine] had an impact on the market” for books.
Anthropic will now face judgment for the use of pirated copies.
All Rights Reserved
Raffaella Aghemo, Lawyer
Cegid Retail refines its approach to the challenges of artificial intelligence for commerce

Published
September 14, 2025
Last year, Lyon-based management solutions specialist Cegid announced that it was bringing its Forward 2026 development strategy into line with generative artificial intelligence. Renamed Forward.ia, the development program continues, in particular for the Cegid Retail division, which is stepping up deployment and experimentation of retail solutions, while taking care not to fall into the trap of multiplying useless generative tools.
“Our approach has always been to make innovation useful and not to build overcomplicated systems that serve no one,” Nathalie Echinard, general manager of the retail division, tells FashionNetwork.com. “Nevertheless, we know that AI and generative AI will transform the business, both for our team internally and for our customers.”
Last year’s biennial Cegid Connections Retail event in Rome featured eight AI use cases anticipating the needs of commerce to 2030. Four have since been delivered, starting with an enhancement to the Livestore checkout solution, which now enables a sales assistant to interact, via a split screen, with customers with whom they do not share a common language.
The Cegid Retail Store Excellence tool, used for communication between a head office and its store network, has also been enhanced. “It can now translate and send messages to each store, for example to explain a new collection, in the case of fashion brands”, explained Echinard. The manager also points out that AI can now directly generate message bases or visuals to guide sales teams.
In the apparel business, these teams are subject to high turnover and have no time for training. In the past, Cegid has responded to this need with simplified dashboards that are easy to learn. But AI now lets salespeople describe their problem to the system out loud and be guided through the task.
“We’re gradually moving towards an augmented sales assistant,” explained Echinard. “Augmented vis-à-vis the customer, but also in terms of efficiency and time optimization. Applications will help them to choose their priorities according to what’s happening in the store, but above all to navigate from one task to another without really realizing it.”

Personalization has not been forgotten. A tool presented to Cegid’s partners last year, based on customer data and product-recommendation learning, is currently in development. AI will reinforce the Livestore tool by helping the sales assistant identify the tastes and needs of existing customers.
“This takes the form of a 6-8 word cloud that gives maximum information in a short space of time to the salesperson. After all, no one wants a sales assistant who remains immersed in the tablet”, explained the Cegid Retail manager.
The manager also notes the growing demands of brands in terms of security. Security breaches, cyber-attacks and ransomware are on everyone’s mind. “When our customers are major players in the luxury goods sector and the CAC 40, we take this very seriously,” says Echinard. She points out that, by contract, security updates are the only ones Cegid can deploy without first notifying its customers.
“And, given what’s at stake, no brand has a problem with that,” said Echinard.
After the NRF (formerly Paris Retail Week) trade show, to be held in Paris from September 16 to 18, she will be preparing the 2026 edition of Cegid Connections Retail, to be held in Prague in the spring. Perhaps by then, the market will have taken a more rational look at AI.
“The enthusiasm it generates needs to stabilize, and everyone needs to stop going off in all directions,” concluded Echinard. She is mindful of the Gartner study which, last June, estimated that 40% of AI tools in development could be abandoned by 2027.
This article is an automatic translation.
Copyright © 2025 FashionNetwork.com All rights reserved.
IAB Europe unveils framework for AI publisher compensation

According to IAB Europe Data Analyst Dimitris Beis, the framework addresses “a paradigm of publisher remuneration for content ingestion” through three core mechanisms: content access controls, discovery protocols, and monetization APIs. The 11-page document establishes technical specifications for AI platforms accessing publisher content.
The framework emerges from documented traffic disruptions affecting digital publishers. According to Similarweb data cited in the report, referrals from AI platforms increased 357% year-over-year in June 2025, reaching 1.13 billion visits, against 191 billion visits from organic Google search. Growth was steeper still in the news and media sector, where traffic from AI platforms rose 770% over the same period.
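To put those Similarweb figures in proportion, here is a small back-of-the-envelope calculation. It assumes that "increased 357%" means growth of 357% over the June 2024 level (rather than growth to 357% of it) and that both visit counts cover the same month; the variable names are illustrative.

```python
# Figures as cited in the article (billions of visits, June 2025).
ai_referrals_bn = 1.13
google_organic_bn = 191.0
yoy_growth = 3.57  # a 357% increase, read as growth over the prior-year level

# AI referrals relative to organic Google search referrals.
ratio_to_google = ai_referrals_bn / google_organic_bn

# Implied June 2024 AI referral volume under the reading stated above.
implied_prior_year_bn = ai_referrals_bn / (1 + yoy_growth)

print(f"AI referrals vs Google organic: {ratio_to_google:.2%}")
print(f"Implied June 2024 AI referrals: {implied_prior_year_bn:.2f} bn")
```

Under these assumptions, AI referrals remain well under 1% of Google's organic volume despite the rapid growth.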
Cloudflare CEO Matthew Prince, speaking at a Cannes event, described shifting economics in content crawling. According to the framework, Prince reported the ratio of pages crawled to visitors referred increased from 2:1 a decade ago to 6:1 at the beginning of 2025 and 18:1 in June. OpenAI’s ratio reportedly grew from 250:1 to 1,250:1 during this timeframe.
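The crawl-to-referral ratios quoted by Prince can be turned into "visitors referred per 1,000 pages crawled", which makes the shift easier to compare. This is only arithmetic on the numbers above; the helper name `visitors_per_1000_crawls` is an illustrative assumption.

```python
def visitors_per_1000_crawls(pages_crawled_per_visitor: float) -> float:
    """Convert an N:1 crawl-to-referral ratio into visitors per 1,000 crawled pages."""
    return 1000 / pages_crawled_per_visitor


# Ratios as cited in the article (pages crawled per visitor referred).
cited_ratios = {"a decade ago": 2, "early 2025": 6, "June 2025": 18}
openai_ratios = {"start": 250, "end": 1250}

for label, ratio in cited_ratios.items():
    print(f"{label}: {visitors_per_1000_crawls(ratio):.1f} visitors per 1,000 crawls")

# How many times more crawling now stands behind each referred visitor.
six_month_factor = cited_ratios["June 2025"] / cited_ratios["early 2025"]
openai_factor = openai_ratios["end"] / openai_ratios["start"]
print(six_month_factor, openai_factor)
```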
The framework contradicts Google’s August rebuttal claiming stable year-over-year referrals from organic search. According to Chartbeat research covering 565 US and UK news websites, search referrals have held steady over the past year. Google acknowledged certain query types may not generate clicks, similar to previous features like sports scores.
Adobe research conducted between July 2024 and February 2025 revealed AI-referred visitors stayed 8% longer on sites, viewed 12% more pages, and showed 23% lower bounce rates. However, these visitors lagged 9% behind non-AI-referred users in conversion rates.
The IAB framework proposes blocking unauthorised scraping through robots.txt files and Web Application Firewall methods. According to the document, unauthorised scraping increased 40% from Q3 to Q4 2024, with robots.txt compliance declining significantly.
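As a minimal sketch of the robots.txt mechanism, the snippet below uses Python's standard `urllib.robotparser` to show how a compliant crawler would interpret a rule set that blocks one AI crawler and allows everyone else. The rule set and the `example.com` URL are illustrative assumptions, not taken from the framework, and robots.txt remains advisory: it only stops crawlers that choose to honour it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block one AI crawler user agent, allow all others.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the parsed robots.txt rules permit this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("GPTBot", "https://example.com/news/story"))        # blocked agent
print(may_fetch("SomeOtherBot", "https://example.com/news/story"))  # falls under "*"
```

Enforcement against non-compliant scrapers would need a server-side control such as the Web Application Firewall methods the framework mentions, which this sketch does not cover.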
Three content discovery mechanisms form the framework’s second component. Publishers would implement content access rules pages containing usage terms, scraper instructions, contact information, and content metadata. JSON-based content metadata would provide site summaries and IAB content taxonomy mappings. An llms.txt markdown file would contain information digestible by large language models.
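The discovery files could look roughly like the output of the sketch below. The JSON field names (`site`, `summary`, `iab_content_taxonomy`), the placeholder taxonomy IDs and the helper functions are hypothetical, since the article does not reproduce the framework's schema; the llms.txt layout (a title, a short summary and a list of links in markdown) is likewise an assumption about how such a file might be laid out.

```python
import json


def build_content_metadata(site: str, summary: str, taxonomy_ids: list[str]) -> str:
    """Serialise a hypothetical site-level metadata record as JSON."""
    record = {
        "site": site,
        "summary": summary,
        "iab_content_taxonomy": taxonomy_ids,  # placeholder IDs, not real mappings
    }
    return json.dumps(record, indent=2)


def build_llms_txt(site: str, summary: str, links: list[tuple[str, str]]) -> str:
    """Render a small markdown file intended to be read by LLM-based agents."""
    lines = [f"# {site}", "", f"> {summary}", "", "## Key pages", ""]
    lines += [f"- [{title}]({url})" for title, url in links]
    return "\n".join(lines) + "\n"


metadata_json = build_content_metadata(
    "Example News", "Daily coverage of retail technology.", ["placeholder-1", "placeholder-2"]
)
llms_txt = build_llms_txt(
    "Example News",
    "Daily coverage of retail technology.",
    [("About", "https://example.com/about"), ("Usage terms", "https://example.com/ai-terms")],
)
print(metadata_json)
print(llms_txt)
```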
The monetization component introduces Cost-per-crawl (CPCr) APIs featuring tiered pricing based on content type, bot classification, and access frequency. According to the framework, a more sophisticated LLM ingest content API would support per-query pricing through bid-response exchanges, enabling real-time content valuation.
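A tiered cost-per-crawl price could be computed along the lines of the sketch below. All rates, tier boundaries and category names are invented for illustration, and whether higher access frequency should earn a discount (as assumed here) or a surcharge is a design choice the article does not settle. Prices are kept in integer micro-dollars to avoid floating-point rounding.

```python
# Hypothetical base prices per crawl, in micro-dollars (1_000_000 = 1 USD).
BASE_PRICE_MICROS = {"news": 10_000, "archive": 2_000}

# Hypothetical multipliers by bot classification, in percent.
BOT_MULTIPLIER_PCT = {"search": 100, "training": 200}


def frequency_discount_pct(crawls_this_month: int) -> int:
    """Hypothetical volume tiers: heavier crawlers pay a lower per-crawl rate."""
    if crawls_this_month >= 100_000:
        return 80
    if crawls_this_month >= 10_000:
        return 90
    return 100


def price_per_crawl_micros(content_type: str, bot_class: str, crawls_this_month: int) -> int:
    """Combine content type, bot classification and access frequency into one price."""
    base = BASE_PRICE_MICROS[content_type]
    multiplier = BOT_MULTIPLIER_PCT[bot_class]
    discount = frequency_discount_pct(crawls_this_month)
    return base * multiplier * discount // 10_000  # undo the two percent scalings


print(price_per_crawl_micros("news", "training", 150_000))
print(price_per_crawl_micros("archive", "search", 500))
```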
The per-query model addresses retrieval-augmented generation, where AI platforms query publisher content directly rather than using pre-trained datasets. According to the document, this approach “more closely tracks value extracted from using publisher content and facilitates a fairer deal than cost-per-crawl.”
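One possible shape for a per-query bid-response exchange is sketched below: the AI operator submits a bid for a piece of content relevant to a query, and the publisher accepts or rejects it against a floor price. The `Bid` and `BidResponse` types, the floor-price rule and the decision to charge the bid amount are all assumptions for illustration; the article does not describe the framework's actual message format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    query: str
    content_id: str
    bid_micros: int  # offered price in micro-dollars


@dataclass(frozen=True)
class BidResponse:
    accepted: bool
    price_micros: int
    reason: str


def respond(bid: Bid, floors: dict[str, int], default_floor: int) -> BidResponse:
    """Accept the bid if it meets the publisher's floor for that content item."""
    floor = floors.get(bid.content_id, default_floor)
    if bid.bid_micros >= floor:
        # With a single bidder there is no second price, so charge the bid itself.
        return BidResponse(True, bid.bid_micros, "bid meets floor")
    return BidResponse(False, 0, f"bid below floor of {floor}")


floors = {"article-123": 5_000}  # hypothetical per-item floor prices
accepted = respond(Bid("retail AI trends", "article-123", 6_000), floors, default_floor=1_000)
rejected = respond(Bid("retail AI trends", "article-123", 4_000), floors, default_floor=1_000)
print(accepted)
print(rejected)
```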
The framework identifies three implementation challenges. Controlling content access requires commitment from AI operators beyond technical measures, as multiple investigations suggest robots.txt compliance varies significantly. Auction dynamics differ from advertising markets, with single AI operators typically bidding rather than multiple competing buyers.
Content valuation presents complexity in determining marginal benefits of additional content for LLM responses. According to the framework, pricing decisions become probabilistic when based solely on metadata, potentially requiring verification mechanisms before content licensing.
Alternative models include revenue-sharing subscriptions, where Perplexity distributes 80% of user fees to participating publishers based on engagement metrics. Bilateral licensing agreements between major publishers and AI platforms provide direct compensation but concentrate benefits among large content creators.
Collective licensing schemes, similar to music rights societies, would create central compensation pools distributed according to usage measurements. According to the framework, this model requires regulatory action and allocation consensus.
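Both the revenue-sharing and the collective-licensing models come down to splitting a pool in proportion to some usage or engagement measure. The sketch below shows one way to do that split in whole cents; the `allocate` function, the publisher names and the usage counts are illustrative assumptions, and the 80% figure is the Perplexity share mentioned above.

```python
def allocate(pool_cents: int, usage: dict[str, int]) -> dict[str, int]:
    """Split a pool pro rata by integer usage counts, in whole cents.

    Leftover cents from rounding down go to the largest fractional shares,
    so the allocations always sum to the pool.
    """
    total = sum(usage.values())
    if total <= 0:
        raise ValueError("total usage must be positive")
    shares = {name: pool_cents * count // total for name, count in usage.items()}
    leftover = pool_cents - sum(shares.values())
    by_fraction = sorted(usage, key=lambda name: (pool_cents * usage[name]) % total, reverse=True)
    for name in by_fraction[:leftover]:
        shares[name] += 1
    return shares


user_fees_cents = 100_000                 # hypothetical fees collected: 1,000.00
pool_cents = user_fees_cents * 80 // 100  # 80% distributed to publishers
payouts = allocate(pool_cents, {"publisher_a": 3, "publisher_b": 1})
print(payouts)
```

The hard part the framework points to is not this arithmetic but agreeing on what counts as usage and who measures it.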
The framework establishes three requirements for viable compensation models. Effective content access control must reliably block unauthorised scraping. Purpose-limited use assurance must prevent content licensed for a single query from being repurposed into training datasets. Transparency in pricing and trade-offs must give publishers visibility into content usage and valuation.
Current conditions fail to meet these requirements. According to the document, unauthorised scraping continues rising as the root cause of publisher concerns. Most publishers lack visibility into content usage after access, with only large publishers securing protections through bespoke AI operator agreements.
Cloudflare recently introduced AI crawler blocking capabilities and is piloting systems where AI platforms declare content access purposes while publishers control permissions. According to the framework, the company is developing signed requests and mTLS technologies to strengthen crawler identification.
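The idea behind signed requests with a declared purpose can be illustrated with a deliberately simplified sketch. It uses a shared-secret HMAC from Python's standard library as a stand-in; it is not Cloudflare's scheme, which the article does not detail, and the message layout and function names are assumptions.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, purpose: str) -> str:
    """Crawler side: sign the request line together with its declared purpose."""
    message = "\n".join([method, path, purpose]).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret: bytes, method: str, path: str, purpose: str, signature: str) -> bool:
    """Publisher side: recompute the signature and compare in constant time."""
    expected = sign_request(secret, method, path, purpose)
    return hmac.compare_digest(expected, signature)


secret = b"shared-secret-for-illustration-only"
signature = sign_request(secret, "GET", "/news/story", "search")

print(verify_request(secret, "GET", "/news/story", "search", signature))    # purpose matches
print(verify_request(secret, "GET", "/news/story", "training", signature))  # purpose altered
```

Because the purpose is part of the signed message, a crawler cannot obtain access under one declared purpose and later present the same signature for another.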
IAB Tech Lab CEO Tony Katsur has advocated for regulatory intervention, urging publishers to advocate for their interests. According to the document, structural solutions enforcing access control, transparency, and verifiable usage represent prerequisites before remuneration models can function at scale.
The marketing community faces significant implications from these developments. Publishers experiencing declining traffic revenues must evaluate alternative monetization strategies beyond traditional advertising models. AI-powered search features reduce click-through rates while maintaining content dependency for training and inference processes.
Campaign strategies may require adaptation as zero-click searches increase and publisher content appears in AI summaries without corresponding traffic. Performance measurement frameworks need updating to account for content usage in AI responses rather than website visit metrics.
The framework represents industry-wide momentum toward formalised compensation structures. According to the document, remuneration models will likely diverge rather than converge on a single mechanism, with publishers anticipating patchwork approaches depending on market position and jurisdiction.
IAB Europe’s Artificial Intelligence Working Group seeks European publisher collaboration. The working group can be contacted through Dimitris Beis at beis [at] iabeurope [dot] eu for participation information.
Summary
Who: IAB Europe Data Analyst Dimitris Beis authored the framework. The initiative involves publishers, AI platforms, and the IAB Tech Lab working group seeking European publisher collaboration.
What: A technical framework establishing three mechanisms for AI platform compensation to publishers: content access controls, discovery protocols, and monetization APIs including Cost-per-crawl and LLM ingest content APIs.
When: Published in September 2025, following industry discussions throughout 2025 including the July IAB Tech Lab summit and August working group launch.
Where: The framework applies globally but emphasises European implementation through IAB Europe’s Artificial Intelligence Working Group collaboration with European publishers.
Why: Addresses declining publisher revenues from increased AI content scraping (357% growth year-over-year) and zero-click searches (rising from 56% to 69% in May 2025) while establishing fair compensation for content used in AI training and inference.