Chances are you have already interacted with generative AI – whether by prompting ChatGPT to write a poem about the unbearable lightness of being on a Thursday night, or by generating images of “Pikachu riding a horse in Tuscany, ukiyo-e style” with DALL-E. Generative AI systems like ChatGPT, DALL-E, and many others like them work on the same principle: they are trained on existing creative works (e.g., images, videos, text, software code) and then remix them to derive more works of the same kind.
2022 was the year when generative AI gained wide popularity, not only for entertainment but also for its potential professional applications. Nonetheless, considerable legal uncertainty still surrounds AI inputs and outputs and – in particular – their compliance with copyright law. This article focuses on two class-action lawsuits in the US against two different types of generative AI systems. These lawsuits are important because they could clarify how existing legal requirements apply to generative AI.
On 13 January 2023, a class action was filed in San Francisco, CA against three companies, Stability AI (Stable Diffusion), Midjourney, and DeviantArt, Inc. (DreamUp) on behalf of artists whose works were used to train AI algorithms.
According to the complaint, Stable Diffusion relies on a mathematical process called diffusion to store compressed copies of its training images, which are in turn recombined to generate new images. The main contention of the complainants is that Stable Diffusion therefore contains unauthorized copies of millions (and possibly billions) of copyrighted images, made without the knowledge or consent of the artists.
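To make the “diffusion” terminology concrete, the toy sketch below (not Stability AI’s actual code – all names and parameters are purely illustrative) shows the forward half of the process: an image is progressively mixed with Gaussian noise. A diffusion model is then trained to run this process in reverse, so that it can generate plausible images starting from pure noise.

```python
import math
import random

def forward_diffusion(pixels, steps, beta=0.05, seed=0):
    """Toy forward-diffusion process: progressively mix an "image"
    (here just a flat list of pixel values) with Gaussian noise.
    After enough steps, almost no trace of the original remains."""
    rng = random.Random(seed)
    x = list(pixels)
    for _ in range(steps):
        # Each step keeps sqrt(1 - beta) of the signal and adds
        # sqrt(beta) worth of fresh Gaussian noise.
        x = [math.sqrt(1 - beta) * v + math.sqrt(beta) * rng.gauss(0, 1)
             for v in x]
    return x

image = [0.2, 0.8, 0.5, 0.9]               # a tiny hypothetical "image"
noised = forward_diffusion(image, steps=50)
# Training teaches the model to reverse this noising step by step;
# generation then runs the learned reverse process from random noise.
```

Whether this process amounts to storing “compressed copies” of the training images, as the complaint contends, is precisely one of the technical characterizations the court will have to assess.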
The complaint dives deep into the details of how the technology behind Stable Diffusion works. It also highlights how Stability AI paid LAION (“Large-Scale Artificial Intelligence Open Network”) to put together LAION-5B, a dataset of 5.85 billion images. LAION’s image datasets are built from data gathered by Common Crawl, a non-profit that scrapes billions of web pages monthly and releases them publicly as massive datasets. Some of the websites most commonly scraped for content by Common Crawl include Pinterest, Flickr, Tumblr, Wikimedia, DeviantArt, and WordPress sites (https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/).
The claims in the Stable Diffusion class action include allegations of:
- direct copyright infringement;
- vicarious copyright infringement;
- violations of the Digital Millennium Copyright Act (DMCA);
- violations of the class members’ rights of publicity;
- unfair competition.
The complaint in Andersen v. Stability AI Ltd. (3:23-cv-00201, District Court, N.D. California) and further information can be found at stablediffusionlitigation.com.
Several months earlier, on 3 November 2022, a class-action lawsuit was filed at the US federal court in San Francisco against GitHub’s Copilot AI coding assistant. GitHub, Microsoft (the owner of GitHub), and OpenAI are being sued for allegedly violating copyright law by reproducing open-source code using AI.
Copilot is an AI system trained on publicly available sources and allegedly on public GitHub repositories, which aims to make coding easier by accepting a code “prompt” from a programmer and generating a possible completion of that code as an output. However, Copilot appears to occasionally reproduce verbatim code from existing codebases, even those under restrictive licenses. The main contention of the complaint is that GitHub has violated the legal rights of a vast number of creators who posted code or other work under open-source licenses on GitHub.
The alleged infringement covers a set of 11 popular open-source licenses that all require attribution of the author’s name and copyright notice, including the MIT license, the GPL, and the Apache license. Copilot, however, does not provide the end user with any attribution of the original author of the code, nor with information about the applicable license requirements.
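To illustrate the kind of verbatim reproduction at issue, here is a naive sketch – not GitHub’s or the plaintiffs’ actual tooling, and with purely hypothetical file paths and snippets – of how one might flag generated code that copies a long substring from a licensed codebase:

```python
def find_verbatim_matches(generated, corpus, min_len=40):
    """Naive illustration: flag files in `corpus` (a mapping of file
    path to licensed source text) that share a verbatim substring of
    at least `min_len` characters with `generated`. Real duplication
    detection is far more sophisticated; this only sketches the idea."""
    matches = []
    for path, licensed_text in corpus.items():
        for start in range(len(generated) - min_len + 1):
            snippet = generated[start:start + min_len]
            if snippet in licensed_text:
                matches.append((path, snippet))
                break  # one hit is enough to flag this file
    return matches

# Hypothetical licensed file and a "completion" that reproduces part of it.
corpus = {
    "licensed/matrix.py":
        "def transpose(m):\n    return [list(row) for row in zip(*m)]\n",
}
generated = ("# suggested completion\n"
             "def transpose(m):\n    return [list(row) for row in zip(*m)]\n")
flagged = find_verbatim_matches(generated, corpus)
```

When such a match is found, the attribution and license-notice obligations of the original file would arguably travel with the copied code – which is exactly what the plaintiffs say Copilot fails to honour.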
The claims in the GitHub Copilot class action include allegations of:
- violations of the DMCA, notably the removal of copyright management information;
- breach of contract, i.e., violation of the open-source licenses under which the code was published;
- violations of the California Consumer Privacy Act (CCPA);
- unlawful and anti-competitive business conduct;
- unjust enrichment.
The complaint in Doe 1 v. GitHub Inc. (3:22-cv-06823, District Court, N.D. California) and further information can be found at githubcopilotlitigation.com.
The Stable Diffusion and GitHub Copilot cases go to the heart of many legal uncertainties related to the training and use of generative AI. In the US specifically, one of the main questions to be clarified by the courts is whether using copyrighted content to train AI systems and generate new output can be considered “fair use”. While the GitHub litigation tries to side-step a fair use defence by focusing on other claims such as violations of the DMCA, the CCPA, contracts, and unlawful anti-competitive conduct, the Stable Diffusion litigation alleges both direct and vicarious copyright infringement, inviting the court to decide on the applicability of the fair use doctrine to generative AI training.
Similarly to the system of copyright exceptions and limitations in the EU, the US fair use doctrine aims to promote freedom of expression by permitting the unlicensed use of copyright-protected works in certain circumstances. Several factors need to be taken into account when assessing whether a use is ‘fair’, but when it comes to generative AI, two are likely to carry the greatest weight in the legal analysis:
- the purpose and character of the use, in particular whether the use is “transformative”;
- the effect of the use upon the potential market for, or value of, the copyrighted work.
The output of generative AI systems often doesn’t outwardly resemble the training data – largely due to the huge amounts of information the algorithm is trained on – and is therefore highly likely to be considered transformative. Nevertheless, AI outputs derived from copies of the training data could still compete with the original works in the marketplace, especially when they result from prompts “in the style of” a particular artist. These are complicated issues to tackle, and it will be interesting to follow the development of the two cases: even though they fall outside the European legal system, they will surely influence the reasoning of European judges.
This naturally leads to the question: how is this issue regulated in the EU, and could we see a wave of similar (class-action) litigation by content creators in the EU? (For the purposes of this article, we leave aside the comparative difficulty of bringing class-action lawsuits in the US versus in EU Member States.)
Unlike the United States, the European legislator provides for copyright exceptions based on the numerus clausus principle. Accordingly, Directive (EU) 2019/790 on Copyright and Related Rights in the Digital Single Market (DCDSM) attempts to regulate copyright issues with respect to AI inputs, i.e. the training datasets, with two text and data mining (“TDM”) exceptions in Articles 3 and 4. The broader Article 3 is limited to scientific research by research and cultural heritage institutions, making Article 4 the main exception for businesses to rely on for their AI training. Article 4 permits TDM by anyone; however, it also allows rightsholders to limit its applicability contractually, including by technical means. In other words, rightsholders in Europe can “opt out” of the TDM exception and demand that the use of their works be licensed for the purpose of training generative (and other types of) AI systems.
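As an illustration of such an opt-out “by technical means”, the sketch below checks an HTTP response for the ‘tdm-reservation’ header defined by the W3C TDM Reservation Protocol – a Community Group draft and only one possible technical convention among others; rightsholders may express reservations by different means entirely.

```python
def tdm_reserved(headers):
    """Illustrative check for a machine-readable TDM reservation.
    The 'tdm-reservation' HTTP header comes from the W3C TDM
    Reservation Protocol (a Community Group draft): a value of "1"
    signals that the rightsholder reserves text-and-data-mining
    rights, so the work may not be mined under the Article 4
    exception without a license."""
    lowered = {k.lower(): v for k, v in headers.items()}
    value = lowered.get("tdm-reservation")
    return value is not None and value.strip() == "1"

# A crawler assembling a training dataset would skip reserved works:
opted_out = tdm_reserved({"TDM-Reservation": "1"})   # rightsholder opted out
```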
This legislative solution has been under scrutiny for putting the European AI sector at a competitive disadvantage due to the substantially higher costs involved in negotiating licenses for the large amounts of works needed as training data. Indeed, if courts in the US clarify and confirm the applicability of the fair use doctrine to generative AI, in most cases US businesses wouldn’t need any licensing for their input datasets. EU-based companies, on the other hand, could be forced to frequently enter into negotiations with any rightsholders that have imposed contractual limitations on the TDM exception, or exclude their works from training databases altogether. It remains to be seen whether and how these differences between US and EU copyright law will impact future AI development.
Do you have questions about the copyright protection of AI inputs and outputs? Please contact Timelex.