
Generative AI and Copyright Infringement: Lessons from Past Fair Use Cases

The ChatGPT AI app icon appears alongside other AI chatbot applications in this photo illustration in Brussels, Belgium, on February 8, 2025. (Jonathan Raa via Getty Images)

Introduction

The development of generative artificial intelligence (AI) models in recent years is transforming digital technology, with some even asking if current AI advancements represent a “fourth industrial revolution.”1 However, as we enter this new era of technological advancement, there are unanswered questions about how generative AI models are developed and what effect they could have on society. Specifically, copyright owners and creator communities have significant concerns about what materials are being ingested for training and whether AI companies will be held liable for the mass unauthorized use of copyrighted works to build their generative models.

Seeking answers and accountability, copyright owners have now brought over forty copyright infringement lawsuits against AI companies.2 These cases, which have mostly been filed over the past two years, are winding their way through various federal courts and are all leading to one pivotal question: Does the ingestion of copyrighted works for generative AI training constitute direct infringement of copyright owners’ reproduction rights, or does it qualify as fair use?3

Thus, fair use is not just a big question—it is the only question that really matters in generative AI copyright infringement litigation. AI companies and their supporters argue that copying protected works to train AI models constitutes a transformative purpose that tips the scales in favor of fair use and that past fair use cases clearly support their position. However, as this policy memo will show (and as courts and the United States Copyright Office are already recognizing), the fair use cases AI companies rely upon (1) are significantly undermined by the Supreme Court’s recent Warhol v. Goldsmith decision, (2) are, regardless of Warhol v. Goldsmith, readily distinguishable and do not set a precedent that generative AI training is fair use, and (3) in fact demonstrate that in most cases generative AI training does not qualify as fair use.

Background and Rise of Generative AI Litigation 

While AI has been incorporated into a variety of technologies for years, the generative capabilities of large language, image, music, and motion picture models are now progressing at a remarkable pace, ushering in a dynamic new AI era.4 But as more and more companies entered the market and the capabilities of their models became public, questions and concerns arose about what materials were being used for training, how datasets were being compiled and curated, and what permission (if any) developers had to scrape and copy works from the internet. We have learned in the last few years that most of the leading generative AI developers trained their models by ingesting massive amounts of copyright-protected works without authorization, which were scraped or downloaded from the internet and compiled—sometimes by a third party—into datasets that often contain copies of pirated works.5

Once copyright owners confirmed that their works were being fed into generative AI models for training purposes, the lawsuits were not far behind. While a couple of AI-related copyright infringement suits preceded the highly publicized launch of various generative models in late 2022, a steady stream of infringement actions filed by copyright owners against AI companies began in early 2023, and infringement actions continue to be filed to this day.6 Roughly half of the active lawsuits are class actions involving creators of the same types of works against a common generative AI developer, with most of them brought by groups of authors of literary works such as books, articles, newspapers, and journals against large language model (LLM) developers. There are also several cases involving musical compositions and song lyrics, visual artworks, sound recordings, computer code, photographs, videos, and databases.

The cases against AI developers are all in different stages procedurally, and some of the plaintiffs in the earlier-filed cases have amended their complaints to narrow down the claims—for instance, after their infringement claims related to AI outputs and derivative works were rejected. Remaining in nearly all of the cases are claims for direct infringement against the AI developer for the unauthorized reproduction of copyright-protected material at the ingestion, training, and/or fine-tuning stages of generative AI model development.7 These claims have survived the early stages of the lawsuits and are becoming the focal point of the litigation as they progress through the courts.

Because there is no real dispute that generative AI companies are reproducing copyrighted works without permission, the copyright infringement claims will inevitably be decided by federal courts applying the fair use defense. The purpose of fair use is to excuse otherwise infringing conduct where the imposition of liability would thwart the very expression that copyright law is intended to promote. The doctrine was initially developed through case law, with judges balancing the exclusive rights of authors to profit from their works with the public’s legitimate need to borrow protected expression without authorization where the use does not unfairly impinge upon those rights. To assess the fair use defense, courts must apply the four factors enumerated in Section 107 of the Copyright Act:

  • The purpose and character of the use, including whether such use is of a commercial nature or is for non-profit educational purposes;
  • The nature of the copyrighted work;
  • The amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
  • The effect of the use upon the potential market for or value of the copyrighted work.8

Section 107 also provides that typical fair uses include copying for benign purposes such as criticism, comment, news reporting, teaching, scholarship, or research. However, it is important to understand that these purposes are only illustrative of the type of uses that may qualify as fair use and are not examples of what will always qualify. In fact, there are no bright-line rules in determining fair use because it is necessarily determined on a case-by-case basis. Courts will need to evaluate fair use defenses involving generative AI systems the same way they evaluate fair use in all contexts—by applying the four factors listed above to the specific use at issue. And given that fair use is an affirmative defense, the defendant AI companies will bear the burden of proof for each of the four factors.

While the statute does not use the term, courts often consider whether the purpose of a particular use is transformative under the first factor. The idea is that transformative uses are less likely to interfere with the exclusive rights of authors, and they are thus more likely to promote the progress of the arts. With generative AI training, it is doubtful that the purpose of ingestion will be found to be transformative because the outputs produced by these AI systems serve the same purpose as the works that are ingested to create them. Indeed, in Part 3 of its comprehensive study of copyright and AI that was released in May 2025, the U.S. Copyright Office addresses generative AI training and confirms that there is nothing inherently transformative about it.9 While the report acknowledges that there are potentially some instances in which training could be transformative, it ultimately concludes that “making commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets, especially where this is accomplished through illegal access, goes beyond established fair use boundaries.”10

As the Copyright Office’s report explains, the Supreme Court recently made clear in Warhol v. Goldsmith that “although transformativeness often leads to a finding of fair use, not every transformative use is a fair one.”11 The question is not merely whether a particular use is for a distinct purpose; it is whether, under the fourth factor, that purpose differs enough such that it offsets the harm to the market that it causes for the original work. The first and fourth fair use factors are thus intertwined—an increase in transformativeness decreases the significance of market harm. But the problem for the generative AI companies is that the inverse is also true, and “less transformative uses are more likely to serve as market substitutes.”12 This is especially true with generative AI models, where not only are the uses not transformative because they serve the same purpose, but copyright owners have made licenses available for AI training. Indeed, the Copyright Office’s report indicates that these lost sales and licensing opportunities, combined with the market dilution that may occur from the resulting outputs, can all properly be considered as part of a fourth-factor market harm analysis that weighs against whatever claim of transformative use the AI companies may assert.13

Early Decisions Question the Applicability of Past Fair Use Cases

The unique nature of AI technologies and the limited applicability of past fair use cases in assessing their legality have already been recognized by federal district courts in multiple AI infringement cases. Judges overseeing these disputes have issued several preliminary orders rejecting the AI companies’ attempts to equate generative AI training to uses that have been found to be fair in other contexts.14 Moreover, in the first AI training decision to assess the fair use defense on the merits—Thomson Reuters v. Ross—the district court granted summary judgment in favor of the plaintiff, Thomson Reuters, finding that the defendant AI company’s unauthorized use of its copyrighted works for training purposes was not fair.15 The court recognized that the technology at issue differs somewhat from the AI models at the center of dozens of other lawsuits because it is not “AI that writes new content itself” and instead merely “spits back relevant judicial opinions that have already been written.”16 And without a generative component to the AI technology, the court easily held that the use was not transformative because Ross’s purpose was to create a legal research tool that directly competes with Thomson Reuters.

However, it is difficult to see how other courts will not look to Thomson Reuters v. Ross for guidance because its fourth-factor analysis is applicable to most of the generative AI cases. The opinion acknowledges that the fourth factor—the effect of the unauthorized use on the market for the original works—has historically been treated as the most important in the fair use analysis. It then explains that courts must consider not only existing markets, but also potential markets. Confirming a critical point that could have a significant impact on all AI litigation, the opinion states that there is an “obvious” potential market for using copyrighted material for AI training.17 The fact that there is an established and growing market for the use of copyrighted works for AI training—including the works of the plaintiffs in many of the ongoing lawsuits—is something that alleged infringers will have a difficult time overcoming.18 With such weight rightly afforded to factor four, the obvious market (and potential market) harm to copyright owners when their works are used without permission to train large language models, image and music generators, and motion picture models could swing the ongoing lawsuits against the generative AI companies.

These early decisions do not seem to have deterred AI companies, which are sticking to their argument that existing case law sets some sort of fair use precedent for the unauthorized ingestion of copyrighted material for training purposes. In court filings, congressional hearings, and comments to the Copyright Office and the press, AI companies and their supporters have averred that scraping and copying protected material qualifies as fair use and that past fair use cases clearly support their position.19 They most frequently invoke the Second Circuit’s Authors Guild v. Google decision from 2015, but they also rely on various other fair use cases concerning truly transformative uses that are easily distinguishable from AI training. These new generative AI cases involve novel technologies that utilize copyrighted works on an unprecedented scale, raising serious doubts about the merits of the fair use defense. Furthermore, the Supreme Court’s recent decision in Warhol v. Goldsmith has changed the fair use calculus by swinging the pendulum back to its intended balance, thereby seriously weakening the AI companies’ chief argument: transformativeness.20

Warhol v. Goldsmith Reins in Transformative Fair Use

Before looking at the cases AI companies point to in support of their training-as-fair-use argument, it is critical to understand how the Supreme Court’s Warhol v. Goldsmith decision reset the boundaries of the fair use doctrine and how it will likely impact generative AI litigation.21 In May 2023, the Supreme Court, in a 7–2 opinion written by Justice Sonia Sotomayor, found that the purpose and character of the Andy Warhol Foundation’s use of Lynn Goldsmith’s photograph did not favor the fair use defense under the first factor. The landmark decision reaffirmed a critical tenet of the fair use doctrine: whether a use is transformative controls neither the overall fair use determination nor even the factor-one analysis. The Supreme Court’s thoughtful examination of the boundaries of fair use properly reined in the expansive notions of transformativeness put forth by the lower courts and others who misinterpreted the doctrine that it first handed down in Campbell v. Acuff-Rose almost thirty years earlier.

In Campbell, the Supreme Court held that the unauthorized use of portions of a Roy Orbison song in a parody by the rap group 2 Live Crew qualified as fair use, in part because of the transformative purpose of the use.22 The Supreme Court adopted a nuanced approach to the first fair use factor that looks “to see whether the new work merely supersedes the objects of the original expression, supplanting the original, or instead adds something new, with a further purpose or different character, altering the first with new expression, meaning, or message; it asks, in other words, whether and to what extent the new work is transformative.”23 And while Campbell solidified the concept of transformativeness, it is important to recognize that the decision simply provided that it is considered as one part of a factor-one analysis. It did not suggest that transformativeness should be dispositive of an ultimate finding in favor of fair use (or, for that matter, the factor-one analysis). Equally as significant, Campbell addressed (and Warhol later confirmed) the central importance of the justification of the use under the first factor. That is, whether an unauthorized use is reasonably necessary, as it was found to be in the case of the parody at issue in Campbell, must also be considered when assessing the purpose and character of the use.

Unfortunately, in the intervening years between the Campbell and Warhol decisions, the lower courts took a very broad view of transformativeness that often eclipsed the other fair use factors. They incorrectly read Campbell to hold that practically any new expression, message, or meaning would tilt the first factor in favor of the alleged infringer. In fact, for many years, if a court concluded that the purpose of the use was transformative under the first factor, that determination would control the entire fair use analysis. An empirical study by Professor Jiarui Liu examined the fair use decisions that were handed down before Warhol and showed that a transformative use finding on the first factor ultimately led to a fair use conclusion in 94 percent of the cases.24 In other words, in the years following Campbell but before Warhol, lower courts created a distorted transformative use standard that diminished other considerations—including the justification for the use and its effect on the market for the original—and thereby improperly swallowed up the entirety of the statutory fair use analysis.

Warhol’s clarification of the proper weight afforded to transformative use and the importance of a valid justification for the copying is not just an indictment of some lower courts’ past wrongs;25 it also serves as a warning to the courts now faced with questions about whether the use of copyrighted works for AI training purposes is transformative and whether the use qualifies as fair use. A review of the post-Warhol cases that have applied the Supreme Court’s restraint of transformative fair use reveals that the factor-one analyses have become more nuanced and findings of transformativeness are correctly understood to no longer drive the other relevant considerations.26 It is a welcome result that returns the first fair use factor to its intended impartiality, but it remains to be seen whether generative AI companies focusing on transformative use arguments will truly grasp Warhol’s impact or will instead continue to ignore it.

Many comments submitted by generative AI companies in response to the Copyright Office’s AI study focused their fair use analyses on what they claim to be the transformative purpose of generative AI training. For example, in suggesting that the training of its Claude model qualifies as fair use, Anthropic stated that “this sort of transformative use has been recognized as lawful in the past and should continue to be considered lawful in this case.”27 This statement may be accurate in that a finding of transformativeness in the years before the Warhol decision often resulted in an ultimate holding of fair use, but that is far less of a certainty today after the Supreme Court limited transformative use in Warhol. Indeed, once the clear commercial purpose of many of the generative AI models and the lack of any valid justification for using massive amounts of copyrighted works for training are properly taken into consideration, a finding of transformativeness would, at best, result in the first fair use factor being neutral.

Case Law Does Not Support a Categorical Fair Use Exception for Generative AI Training

The cases that AI companies and their defenders most frequently reference in support of their position that the ingestion of copyrighted works for training purposes categorically qualifies as fair use largely fall into three categories: searchable database cases, reverse engineering cases, and search engine cases. As the following summaries demonstrate, these cases involve distinguishable fact patterns, technologies, and purposes that clearly limit their applicability to the unauthorized use of copyrighted works for developing AI models.

  1. Searchable Database Cases

Authors Guild v. Google—commonly known as Google Books—is the case most AI developers rely on when they claim that AI ingestion of copyrighted materials qualifies as fair use.28 However, the decision is highly distinguishable because, unlike with generative AI training, the use of the copyrighted materials in Google Books was genuinely for a different purpose. This case involved Google’s unauthorized digitization of millions of copyright-protected literary works for its Google Books project, which used the digital copies to create a database that members of the public could access through a search engine. In response to queries, users of Google Books received basic information about the books, such as the number of times a particular word appeared within the text, as well as brief passages—snippets—of the copyrighted text itself.

The Second Circuit held that this was fair use because “Google’s making of a digital copy to provide a search function is a transformative use, which augments public knowledge by making available information about Plaintiffs’ books without providing the public with a substantial substitute for matter protected by the Plaintiffs’ copyright interests in the original works or derivatives of them.”29 Significantly, the decision made clear that it “tests the boundaries of fair use” and would likely have come out differently if the “snippet view could provide a significant substitute” for the original works.30 The Second Circuit held that Google’s use was transformative because it served a different purpose that justified the wholesale copying while making sure it did not create a competing substitute for consumers of the original works. In contrast, generative AI models utilize entire copyrighted works so that users can make the same types of works in a way that is much more likely to usurp the market for the underlying works.

Unlike the activity at issue in Google Books, the purpose of generative AI currently has nothing to do with providing factual information about copyrighted works to users. Instead, generative AI systems typically reproduce and use the expressive elements from ingested works as part of a process that results in the creation of AI-generated works that compete in the same market as the original copyrighted works. Another important element that the Second Circuit cited in Google Books, with regard to the fourth factor, was the absence of an actual or potential market for the licensing of copyrighted works to create searchable databases.31 This is significantly different than generative AI training where, as noted above, there is a burgeoning market for the licensing of copyrighted works for ingestion purposes.

AI companies also cite Authors Guild v. HathiTrust, in which the Second Circuit analyzed the HathiTrust Digital Library (HDL), a digital repository comprising millions of copyrighted works that a group of research universities assembled from Google’s scans of their library collections.32 In addition to providing information about the works (the issue in Google Books), the HDL allowed library patrons with print disabilities to access the full text of the works in accessible formats. The Second Circuit found that making the works available to print-disabled patrons was not a transformative use under the first fair use factor since authors “write books to be read” and “the underlying purpose of the HDL’s use is the same as the author’s original purpose.”33 Nevertheless, the court concluded that providing such access was a valid purpose under the first factor because “making a copy of a copyright work for the convenience of a blind person is expressly identified by the House Committee Report as an example of fair use.”34 Moreover, the Second Circuit noted the insignificance of any market harm because “publishers did not usually make their books available in specialized formats for the blind.”35

The Second Circuit’s analysis in the HathiTrust case is wholly inapplicable to generative AI training because the purpose of the HDL searchable database—to point users to information about the copyrighted works while not competing with them—is entirely different than the purpose of AI models, which is to generate images, text, or music in a way that competes with the ingested works. As to making the full text available to disabled patrons, there is no legislative history suggesting that Congress considers AI training to be fair use. Furthermore, the market for specialized formats is completely different than the generative AI training market in that it serves a much smaller-scale, specific community. Thus, any market harm that might occur from the HDL is negligible in comparison to that which results from the unauthorized use of massive amounts of copyrighted works for training purposes. Finally, it should also be noted that HathiTrust is representative of the first factor having an undue influence over the entire fair use analysis—an approach that, as discussed earlier, the Supreme Court recently rejected in Warhol v. Goldsmith.

  2. Reverse Engineering Cases

AI companies also wrongly attempt to analogize their conduct to the facts of two Ninth Circuit reverse engineering cases involving computer software—Sega v. Accolade and Sony v. Connectix. In Sega, Accolade copied small portions of object code from Sega’s games, converted it to source code—a form of reverse engineering—and used what it learned to write its own computer code enabling its own games to work on Sega’s Genesis console.36 The Ninth Circuit held that Accolade’s reverse engineering of the computer code, which was a necessary step for compatibility purposes, constituted fair use.37 The court found that the use was justified because it was the only means available for examining the unprotected and functional aspects of the computer code at issue. In contrast, generative AI systems make unauthorized use of highly expressive, nonfunctional works of authorship for purposes that are themselves expressive. In fact, the Ninth Circuit in Sega was clear that its analysis would be different if the works at issue were more expressive and less functional. The court explained that because “Sega’s video game programs contain unprotected aspects that cannot be examined without copying, we afford them a lower degree of protection than more traditional literary works.”38

Similarly, in Sony, Connectix was sued for copying the software program that operated Sony’s PlayStation gaming console in order to emulate its functionality on a regular desktop computer.39 The Ninth Circuit concluded that Connectix’s intermediate copying qualified as fair use because it was necessary for making its own gaming software compatible with Sony’s PlayStation games.40 The Sony case is similar to Sega in that it involved the copying of protected computer code for the purpose of reverse engineering it in order to develop noninfringing competitive products. Following Sega’s precedent, the Ninth Circuit found that Connectix’s copying was justified because it was the only means of accessing the unprotected, functional elements of Sony’s copyrighted computer code. This factor-one analysis was central to the court’s ultimate fair use holding, and it would be inapplicable to the highly expressive works that are ingested by generative AI systems for the purpose of exploiting their expressive aspects.

It should also be noted that both Sega and Sony are based on the understanding that interoperability exceptions to copyright law are justifiable when they support legitimate forms of competition. The focus on copying protected elements of a work in order to gain access to its unprotected aspects through reverse engineering for the purpose of developing legitimate competitive products is nothing like generative AI’s ingestion of expressive works for purely nonfunctional purposes that directly compete with the originals. This is an important distinction that was recently confirmed in the aforementioned first case deciding whether AI training qualifies as fair use. In Thomson Reuters v. Ross, the court addressed the computer software reverse engineering cases head-on, explaining that “here, though, there is no computer code whose underlying ideas can be reached only by copying their expression.”41 The opinion goes on to quote from the Supreme Court’s Warhol decision, finding that in the context of the defendant’s development of an AI tool, the “copying is not reasonably necessary to achieve the user’s new purpose.”42

  3. Search Engine Cases

The third category of cases that AI companies rely on in arguing that ingestion for training purposes is fair use involves search engines. A leading example is Kelly v. Arriba, where the Ninth Circuit analyzed whether Arriba’s search engine infringed the plaintiff’s copyrighted photographs by scraping them from the internet and then displaying smaller, lower-resolution thumbnail versions in its search results.43 The court held that Arriba’s copying qualified as transformative fair use because the thumbnail images served an entirely different purpose than the originals. Specifically, the Ninth Circuit found that the plaintiff’s photographs were “artistic works intended to inform and to engage the viewer in an aesthetic experience” in contrast to Arriba’s search engine that “functions as a tool to help index and improve access to images on the internet and their related web sites.”44 When a work is used “for the same intrinsic purpose,” the court reasoned, “such use seriously weakens a claimed fair use.”45 However, Arriba’s use was transformative because it served an entirely different function by “improving access to information on the internet versus artistic expression.”46

Contrary to those who argue that Arriba stands for the proposition that AI ingestion is categorically fair use, the Ninth Circuit’s decision actually suggests the opposite. In the context of generative AI, the purpose of the works that are ingested in developing the AI model and the purpose of the works that it ultimately creates are the same. Unlike Arriba’s search engine that utilized copyrighted works for reasons that were unrelated to their aesthetic purpose, generative AI models ingest protected expression in order to provide users with an artistic experience. The Ninth Circuit held that the “thumbnails do not stifle artistic creativity because they are not used for illustrative or artistic purposes and therefore do not supplant the need for the originals.”47 And it found that Arriba’s use “does not harm the market” for the original photographs or the plaintiff’s “ability to sell or license his full-sized images.”48 Generative AI training, by contrast, threatens to displace artistic creativity and the market for copyrighted works because its nontransformative purpose supersedes the objects of the original creations that it ingests without permission or compensation.

Importantly, the Arriba decision demonstrates how the fair use analysis must holistically consider the ultimate aim of the unauthorized copying, and not just discrete parts of the process. Those claiming that generative AI training qualifies as fair use disaggregate the purpose of AI systems in their analyses by focusing heavily on the ingestion component without acknowledging the final result. However, just as the Ninth Circuit found it essential to consider the functionality of Arriba’s search engine in determining whether its copying of protected photographs was justified, the ingestion of copyrighted works for generative AI training cannot be examined in isolation from the resulting outputs. The Arriba court’s fair use analysis did not end with Arriba’s scraping of the internet or its creation of the thumbnail images; the Ninth Circuit went on to consider the eventual purpose of that intermediate copying and whether it supplanted the need for the original. Thus, when assessing whether generative AI training is fair use, it is necessary to consider the copying that occurs at the ingestion stage as well as the outputs that the AI model ultimately produces.

Conclusion

As the Copyright Office’s recent report on generative AI training explains, the Supreme Court’s decision in Warhol v. Goldsmith confirmed the necessity of considering the ultimate purpose of the use: “Warhol requires examining not just the immediate act of copying but its ultimate goal. Accordingly, whether copying a work to compile a training dataset is transformative depends on whether the dataset will be used for a transformative purpose.”49 To be sure, the Copyright Office recognizes that generative AI training might be transformative if the ultimate purpose of the use is distinct from that of the copyrighted works that are ingested. For example, “a language model can be used to help learn a foreign language by chatting with users on diverse topics and offering corrective feedback.”50 However, the Copyright Office concludes that when the purpose of the use is to generate outputs that serve the same purpose as the works that are ingested, then “it is hard to see the use as transformative.”51

With generative AI training, expressive works are ingested for the purpose of developing an AI model that generates expressive works. The fair use question for generative AI turns on the ultimate purpose of the copying, not solely on the intermediate steps taken in training the AI models. Searchable databases and search engines make use of expressive works, but the copying is justified by the functional ends (or the unique circumstances of the print-disabled audience). So too with reverse engineering, in which protected computer code is copied in order to access its unprotected aspects. Generative AI technologies have the potential to disrupt human creativity and the markets for copyrighted works, and this weighs heavily against the AI companies in the fair use analysis. The current copyright infringement cases against the AI companies involve clearly commercial uses for nontransformative purposes where there is no credible justification for the use. Not only do the past cases fail to demonstrate that generative AI training is categorically fair use, but they also suggest that the opposite is true. Indeed, without a purpose that is truly transformative, it is difficult to see how these AI technologies are not clear-cut infringements.