Adobe is facing a proposed class-action lawsuit that accuses the software giant of using pirated books to train one of its AI algorithms, escalating a widening legal battle over how generative AI systems are built and what data they can lawfully learn from.
The complaint was filed on behalf of Elizabeth Lyon, an Oregon-based author known for nonfiction writing guidebooks. The lawsuit alleges that Adobe’s language model, SlimLM, was trained, at least in part, on copyrighted works that were copied without permission—potentially affecting a broad class of authors whose books were included in the same training sources.
What the lawsuit alleges about Adobe’s SlimLM training data
According to the filing, the core allegation is that Adobe used pirated versions of numerous books—including Lyon’s works—to train SlimLM. Adobe has described SlimLM as a small language model series designed to be “optimized for document assistance tasks on mobile devices,” a category that typically includes summarization, rewriting, drafting, and other text-centric features.
The lawsuit focuses on the provenance of the data used to pre-train SlimLM. Adobe has stated that SlimLM was pre-trained on SlimPajama-627B, which it characterizes as a “deduplicated, multi-corpora, open-source dataset” released by Cerebras in June 2023. The plaintiff argues that, despite being described as open-source, the dataset contains material derived from sources that included pirated copies of copyrighted books.
The chain: SlimPajama, RedPajama, and Books3
The filing argues that SlimPajama was created by copying and manipulating the RedPajama dataset, and that RedPajama, in turn, incorporated Books3—a large corpus of approximately 191,000 books that has repeatedly surfaced in disputes over AI training data. The complaint claims that because SlimPajama is a derivative dataset, it allegedly “contains the Books3 dataset,” including copyrighted works belonging to Lyon and other authors.
At the heart of the dispute is a question that has become central to the generative AI era: whether training on copyrighted text without consent, attribution, or compensation is permissible, and whether “open” datasets can still embed unlawful copies when their upstream sources are contested.
Why Books3 keeps showing up in AI copyright fights
Books3 has become a flashpoint because it is widely discussed as a training source for multiple generative AI systems, yet it is also frequently described by rightsholders as a repository of pirated books. As AI developers race to build more capable models, large-scale text datasets have been assembled from many sources—some licensed, some public domain, and some alleged to be scraped or copied without authorization.
The Lyon complaint underscores a growing tension: even when a company points to an intermediary dataset that is labeled “open-source,” authors argue that the presence of copyrighted works in upstream components can still create liability for downstream users—especially if the copyrighted works were unlawfully obtained in the first place.
Adobe joins a crowded field of AI training lawsuits
The proposed class-action arrives amid a broader wave of litigation targeting how AI models are trained. In recent months, lawsuits have increasingly cited shared datasets and common pipelines that are used across the industry, alleging that multiple companies benefited from the same questionable sources.
RedPajama, specifically, has been referenced in other high-profile cases. A September lawsuit against Apple alleged that the company used copyrighted material in training for its Apple Intelligence efforts, claiming protected works were copied “without consent and without credit or compensation.” In October, a similar lawsuit against Salesforce also alleged use of RedPajama for training purposes. The Adobe complaint follows this pattern by focusing less on a single file or a single book and more on the training supply chain that can propagate disputed content across models.
Settlements raise the stakes for the industry
Legal pressure has also produced expensive outcomes. In September, Anthropic agreed to pay $1.5 billion to authors who accused the company of using pirated versions of their works to train its chatbot, Claude. That resolution was widely viewed as a marker of how costly these disputes can become—and as a signal that courts and plaintiffs may increasingly test the boundaries of copyright law as applied to machine learning.
What this could mean for Adobe’s AI strategy
Adobe has been one of the most prominent creative software companies to embrace generative AI since 2023, rolling out multiple AI features and services, including its Firefly media-generation suite. While the present lawsuit centers on SlimLM and book datasets rather than image generation, it touches a similar reputational nerve: creative professionals and rightsholders want clarity on what data is used, who gets credited, and whether creators are compensated.
If the case proceeds, it may force deeper scrutiny into Adobe’s documentation around model training, dataset selection, and internal governance—particularly how the company evaluates third-party datasets that are described as open-source or publicly available. It could also intensify calls for standardized data audits and provenance tracking, especially as more AI products move onto mobile devices and into everyday workflows.
Key questions likely to shape the case
While the lawsuit’s merits will be argued in court, the allegations raise several questions that have become central across the AI sector:
- Whether using copyrighted books for model training constitutes infringement or can be defended under doctrines like fair use (depending on jurisdiction and specific facts).
- How liability should be assigned when a model is trained on a dataset that is itself derived from other datasets.
- What “open-source dataset” should mean when upstream components may include disputed or unlawful material.
- Whether authors are entitled to compensation, attribution, opt-out mechanisms, or other remedies when their works are used in training.
For now, the proposed class-action adds another major name to the list of companies being challenged over AI training data practices, and it reinforces a message rippling through the tech industry: dataset provenance is no longer a back-office detail—it is becoming a defining legal and public-trust issue.