11/10/24

Artificial intelligence and copyright: what are AI companies allowed to do?

An earlier article addressed copyright protection of the output of generative AI tools (such as whether AI-generated content can be protected by copyright). This article explores the use of copyright protected material as input data for training these tools.

Generative AI is a type of artificial intelligence that can create new content, such as text, images, videos, audio or software code, in response to user prompts. Generative AI models (like those used by OpenAI’s ChatGPT) are trained with a large amount of publicly available data. A common method of collecting this data is “data scraping”, i.e. an automated process of extracting large amounts of data from websites using software tools or scripts.

However, the content that is freely available on the internet may include copyright protected works, raising the question of whether such use is permissible.

Copyright infringement

Copyright protects original works created by authors, such as articles, images and songs. The copyright holder has the exclusive right to reproduce a protected work, create derivative works from it, distribute it and communicate it to the public. Third parties, on the other hand, are not entitled to reproduce protected content or communicate it to the public without the right holder’s permission, unless the use is covered by one of the legal exceptions. In the event of copyright infringement, right holders may initiate injunctive relief proceedings against the infringer and/or claim damages.

In a case that predates the emergence of generative AI, the Brussels Court of Appeal found that a widely used search engine had infringed newspapers’ copyrights by displaying their headlines and snippets and linking to content from their websites in its news service without the right holders’ permission.

In the case of generative AI tools that are trained on copyright protected input material, the copies of the input data made by such AI tools may be seen as “reproductions” of such data. Consequently, when these copies are made without having obtained the right holder’s prior authorization, such use can constitute copyright infringement.

Exceptions to the exclusive rights

In Europe, the Directive on Copyright in the Digital Single Market (Directive (EU) 2019/790) introduced two exceptions to right holders’ exclusive rights for “text and data mining”, defined as “any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations”.

More specifically, there are two scenarios in which using copyright protected content is considered lawful without the right holder’s prior consent:

  1. for scientific research purposes: a research organization or cultural heritage institution may reproduce or extract protected content to which it has lawful access in a text and data mining process if it is for the purposes of scientific research;
     
  2. for other purposes: other users may reproduce or extract protected lawfully accessible content for the purposes of text and data mining, provided that the right holders have not expressly reserved the use of the protected content “in an appropriate manner, such as machine-readable means in the case of content made publicly available online”.
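
To make the notion of a “machine-readable” reservation more concrete, the following is a minimal sketch of how a scraping tool could check a website’s robots.txt file before collecting content. It rests on several assumptions that do not come from the Directive: that robots.txt is the means used to express the reservation (whether this is legally sufficient is not addressed here), and that the crawler name ExampleAIBot, the example.com URL and the helper function may_mine are hypothetical names chosen for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: the site blocks one (made-up) AI crawler
# and allows every other user agent.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_mine(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt does not block user_agent from fetching url.

    Illustrative helper only: it reads robots.txt rules and makes no
    statement about whether a use is lawful.
    """
    parser = RobotFileParser()
    # parse() accepts the lines of an already downloaded robots.txt file.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"
    print(may_mine(EXAMPLE_ROBOTS_TXT, "ExampleAIBot", url))
    print(may_mine(EXAMPLE_ROBOTS_TXT, "OtherBot", url))
```

In this sketch, the hypothetical ExampleAIBot is blocked from the whole site while other crawlers remain allowed. A tool that respects such signals would skip the content in the first case.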

Recently, on 27 September 2024, the Regional Court of Hamburg dismissed a claim brought by the German photographer Robert Kneschke against LAION e.V., the non-profit Large-scale Artificial Intelligence Open Network. Kneschke argued that the scraping of his copyrighted images by LAION to create a dataset for AI training infringed his copyright. The Court found that LAION could rely on the copyright exception that permits reproductions of copyrighted content for text and data mining for non-commercial scientific research purposes without the right holder’s consent. This decision clarifies that creators of datasets who use text and data mining and make such datasets available for AI training purposes can fall within the scope of the text and data mining exceptions for non-commercial scientific research and commercial purposes.

These exceptions will be covered in more detail in a subsequent article.

AI Act

The new AI Act contains references to the interaction between AI technologies and copyright protection. Providers of general-purpose AI models are required to implement a policy to comply with Union law on copyright and related rights, in particular to identify and comply with the reservation of rights expressed by right holders. They are also required to provide a detailed summary of the content used for training, in a comprehensive way that allows right holders to enforce their rights.

The AI Act will be discussed in more detail in a subsequent article.

US cases

In the United States, several significant cases have emerged where copyright holders are challenging AI companies over the use of copyright protected content to train their AI tools, including the following:

  • The New York Times has filed a lawsuit against OpenAI and Microsoft, alleging that they used its articles without permission to train their generative AI tools ChatGPT and Bing Chat (Copilot). The New York Times argues that its copyright protected content was used unlawfully to develop these AI tools, and it is seeking substantial damages.
  • Getty Images has sued Stability AI for allegedly using millions of copyright protected images without authorization to train its AI models. Getty Images claims that Stability AI “scraped” these images from its website and used them to develop and train the AI, which then produced images that infringed Getty’s copyrights and trademarks.

These cases highlight the tension between AI development and copyright protection, with potential implications for how AI companies operate globally.

Conclusion

As stated in the AI Act, generative AI models “present unique innovation opportunities but also challenges to artists, authors, and other creators and the way their creative content is created, distributed, used and consumed”.

Balancing these interests will be crucial in fostering a creative landscape that embraces both technology and intellectual property.
