AI Insights

LLM training and copyright: fair use to boost generative artificial intelligence with a first-of-its-kind decision | by Raffaella Aghemo | Jul, 2025

Published July 9, 2025

The case Andrea Bartz, et al. v. Anthropic PBC represents a watershed moment for the training of generative artificial intelligence systems: on summary judgment, a federal judge of the Northern District of California ruled favourably on the use of copyrighted material to train LLMs.

But the nuances of this ruling are more subtle than the headline suggests, and I will try to analyze them in this contribution.

In the Order of 23 June 2025, we read in the introduction: “An artificial intelligence company downloaded millions of copyrighted books in digital format for free from pirate sites on the Internet. The company also purchased copyrighted books (some overlapping with those acquired from pirate sites), tore off the bindings, scanned every page and stored them in digitised, searchable files. All the above was done to accumulate a central library of ‘all the books in the world’ to be preserved ‘forever’. From this central library, the artificial intelligence company selected various sets and subsets of digitised books to train various large language models under development to power its artificial intelligence services. Some of these books were written by plaintiff authors, who are now suing for copyright infringement. On summary judgment, the issue is the extent to which any of the uses of the works in question qualify as ‘fair use’ under Section 107 of the Copyright Act.”

Anthropic acquired over seven million books, legally and otherwise, to train its AI assistant, Claude. The Order describes four stages in which the books were copied and used for training:

1. Each selected book was copied from the library to create a working copy for training.

2. Each book was “cleaned up” by removing low-value or repetitive content (e.g. footers).

3. The cleaned books were converted into “tokenized” versions by simplifying them and breaking them down into short character sequences, then translated into numeric tokens using Anthropic’s custom dictionary. These tokens were repeatedly used during training, enabling the model to discover statistical relationships between huge amounts of textual data.

4. Each fully trained LLM stored “compressed” copies of the books, or so the authors contend and the Order assumes for the purposes of its analysis.
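
The first three stages above can be illustrated with a minimal Python sketch. It is only a toy under stated assumptions: the sample text, the repeated-line heuristic for "cleaning", and the word-level vocabulary are all hypothetical stand-ins. The Order does not describe Anthropic's actual code or dictionary, and production tokenizers typically work on sub-word character sequences rather than whole words.

```python
import re
from collections import Counter

PIECE = re.compile(r"\w+|[^\w\s]")  # words or single punctuation marks

def clean(text, min_repeats=2):
    """Stage 2 (toy heuristic): drop blank lines and any line that repeats,
    as a running footer would."""
    lines = [ln.strip() for ln in text.splitlines()]
    counts = Counter(ln for ln in lines if ln)
    return "\n".join(ln for ln in lines if ln and counts[ln] < min_repeats)

def build_vocab(text):
    """Hypothetical dictionary: one integer id per distinct piece,
    assigned in order of first appearance."""
    vocab = {}
    for piece in PIECE.findall(text.lower()):
        vocab.setdefault(piece, len(vocab))
    return vocab

def tokenize(text, vocab):
    """Stage 3: translate the text into a sequence of numeric tokens."""
    return [vocab[p] for p in PIECE.findall(text.lower()) if p in vocab]

# Hypothetical one-book "central library".
library = {"book-1": "Call me Ishmael.\nPage footer\nSome years ago.\nPage footer\n"}

working_copy = "".join(library["book-1"])  # Stage 1: working copy for training
cleaned = clean(working_copy)
vocab = build_vocab(cleaned)
tokens = tokenize(cleaned, vocab)
print(tokens)
```

Note that the full stop receives a single id and therefore appears twice in the token sequence; repeated exposure to such sequences is what lets a model pick up statistical relationships in the text. The fourth stage, the trained model itself, is not modelled here.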

“Plaintiffs Andrea Bartz, Charles Graeber and Kirk Wallace Johnson are authors of books that Anthropic copied from pirated and purchased sources, and which it assembled into a central library of its own, further copying various sets and subsets of those library copies to include in various ‘data mixes’ used to train various LLMs.”

“In 2021, another co-founder of Anthropic, Ben Mann, downloaded Books3, an online library of 196,640 books that he knew had been assembled from unauthorised copies of copyrighted books, i.e. pirated. Anthropic’s subsequent pirated acquisitions involved downloading distributed and shared copies of other pirated libraries. In June 2021, Mann downloaded at least five million copies of books from Library Genesis, or LibGen, which he knew had been pirated. And, in July 2022, he downloaded at least two million copies of books from Pirate Library Mirror, or PiLiMi, which Anthropic knew had been pirated.”

Anthropic’s acquisition procedure for training its LLMs is well explained in the document, which nonetheless specifies: “However, the training copies did not go any further and spread to the outside world. When each LLM was included in a public-facing version of Claude, it was complemented by other software that filtered user input to the LLM and filtered output from the LLM to the user.”

In the Order, there follows the ANALYSIS, which we can summarize in the words of the Order itself, where the conclusions are highlighted in bold: “the use of the books in question to train Claude and its precursors was extremely transformative and constituted fair use within the meaning of Section 107 of the Copyright Act. And the digitization of the books purchased in paper form by Anthropic was also a fair use, but not for the same reason that applies to training copies. Instead, it was a fair use because all Anthropic did was replace the printed copies it had purchased for its central library with more convenient, space-saving and searchable digital copies for its central library, without adding new copies, creating new works or redistributing existing copies. However, Anthropic had no right to use pirated copies for its central library. The creation of a permanent, general-purpose library did not in itself constitute fair use justifying Anthropic’s piracy.”

Fair use is assessed on the basis of the following four factors:

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for non-profit educational purposes;

(2) the nature of the copyrighted work;

(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

(4) the effect of the use on the potential market or value of the copyrighted work.

Here, the parties dispute which use or uses are at issue. Anthropic claims to have copied the authors’ books for only one use: to train LLMs. The authors, on the contrary, argue that it did so for at least two uses: first, to build a large central library of potentially useful content, and second, to train specific LLMs using varying sets and subsets of that content, over time selecting the most well-organised and best-expressed works for training. The authors also complained that changing the format from print to digital was in itself an infringement not covered by fair use.

The authors argue that using the works to train Claude’s underlying LLM was like using works to train any person to read and write, so the authors should be able to exclude Anthropic from this use. But the authors cannot legitimately exclude anyone from using their works for training or learning as such. Everyone reads texts, then writes new ones. To charge anyone specifically for the use of a book every time they read it, every time they recall it from memory, and every time they later draw on it to write new things in new ways would be unthinkable. For centuries we have read and re-read books. We have admired, memorised and internalised their sweeping themes, their substantive points and their stylistic solutions to writing problems.

“In short, the purpose and character of using copyrighted works to train LLMs to generate new text was essentially transformative. Like any reader who aspires to become a writer, […]”

Furthermore, on the conversion of legally acquired printed texts into digital copies, the Order states: “Anthropic purchased its printed copies honestly and fairly. With each purchase came Anthropic’s right to ‘dispose[ ]’ of each copy as it saw fit. Thus, Anthropic had the right to keep the copies in its central library for all ordinary uses.” After all, destroying the binding in order to scan the pages was a necessary step, just as, in the Sony Betamax case cited in the document, the Supreme Court ruled that recording a television programme in order to watch it later made a copy but did not usurp any legitimate interest of the copyright holder.

Therefore, on the authors’ first two contentions the Court found fair use: using the works as input for training, not for direct replication but to enable the generation of new content, was legitimate, as was the switch from paper to digital format. The downloading and storage by Anthropic of more than seven million pirated books, without payment, was a different matter.

This is a first-of-its-kind decision. Two days after it was issued, another judge in the Northern District of California ruled in Kadrey et al. v. Meta Platforms Inc. and likewise concluded that the artificial intelligence technology at issue in his case was transformative. However, the basis of his ruling in favour of Meta on the issue of fair use was not transformation, but rather the failure of the plaintiffs to “present significant evidence that Meta’s use of their works to create [a generative artificial intelligence engine] had an impact on the market” for books.

Anthropic will now face judgment for the use of pirated copies.

Raffaella Aghemo, Lawyer
