OpenAI and Microsoft Face Copyright Lawsuit from The New York Times Over AI Training Data

Introduction

In 2024 [6], OpenAI and Microsoft faced significant legal challenges, particularly from a lawsuit filed by The New York Times. The case centers on allegations of copyright infringement, with claims that the companies used copyrighted articles from the Times to train their AI models without permission. This legal battle is part of a broader wave of copyright actions against AI developers, raising important questions about the use of copyrighted material in AI training [6].

Description

OpenAI's legal challenges in 2024 [6] centered on a lawsuit that The New York Times filed on December 27, 2023, against both the company and Microsoft in the US District Court for the Southern District of New York. The lawsuit alleges copyright infringement and related claims [4], contending that the companies used millions of the Times’ copyrighted articles [4], including paywalled news pieces, to train their AI models without obtaining consent [4]. The Times argues that the resulting AI tools, such as ChatGPT [1], provide “nearly verbatim” excerpts of its articles in response to specific prompts, constituting a violation of copyright law [1]. According to the Times, this unauthorized use has caused economic harm by diverting users from its paywalled content, reducing its advertising revenue, and harming its relationships with other newspapers and readers [1]. The Times is seeking billions of dollars in statutory and actual damages [4], with multiple causes of action including copyright infringement [4], unfair competition [4], and trademark dilution [4], and asserts that the defendants’ actions exceed the bounds of “fair use.”

This legal scrutiny follows a broader wave of copyright actions against OpenAI since the launch of ChatGPT in 2022, including a separate lawsuit filed by eight major daily newspapers against OpenAI and Microsoft [6], which also alleges unauthorized use of news articles for AI model training [6]. In response to the New York Times’ claims, OpenAI has asserted that the lawsuit lacks merit, accusing the publication of incomplete reporting [6]. The company contends that instances of content regurgitation from the Times are based on older articles available on third-party sites and suggests that the Times may have manipulated prompts to elicit specific responses from ChatGPT [6]. OpenAI mounted a fair use defense [5], arguing that the use of publicly available materials did not infringe copyright and that the outputs generated by the chatbot were distinct from the original articles [5]. OpenAI and Microsoft sought dismissal of the lawsuit [1], arguing that it is an attempt to hinder technological advancement [1].

Amid these developments [5], Suchir Balaji [7], a former researcher at OpenAI who worked on training the ChatGPT and GPT-4 models, raised concerns about potential copyright violations before his death [7]. He is linked to the New York Times lawsuit because the Times’ legal team sought to name him as a “custodian” in the case [7], citing his possession of unique documents relevant to their copyright infringement claims [7]. Other former OpenAI employees have also been proposed as custodians [7]. Separately, OpenAI sought information from the New York Times regarding its use of generative artificial intelligence tools [2], the creation and use of its own generative AI products [2], and its stance on generative AI [2]. However, Magistrate Judge Ona T. Wang ruled that OpenAI had not adequately demonstrated the relevance of this information to the fair use analysis [2]. Discovery disputes are ongoing [4], particularly over the denial of discovery that OpenAI sought in support of its fair-use defense [4].

In a related development [1], OpenAI reportedly deleted data relevant to its litigation with the New York Times and other newspapers [3]. According to a court filing [3], the newspapers’ legal teams discovered that OpenAI’s engineers had erased all relevant programs and search result data associated with the News Plaintiffs [3]. OpenAI disclosed its training data [1], but the New York Times claimed that evidence was compromised when data was deleted during the review process [1], a deletion that OpenAI attributed to a system error [1]. On September 13, 2024 [4], the court consolidated the New York Times’ case with one filed by the Daily News and other publications [4], with Judge Sidney Stein presiding [4]. The fact discovery deadline has been extended to April 30, 2025, and the deadline for amending the complaint to April 15, 2025 [4].

The outcomes of these lawsuits could significantly influence the future of AI model training [6], raising critical questions about the use of copyrighted material in AI development [6]. While some view large language models as merely reproducing learned information [6], AI developers argue that the training process is akin to learning through imitation [6], similar to a musician learning by playing riffs [6]. The legal debate continues [6], with both sides presenting compelling arguments regarding copyright and the ethical implications of AI training practices [6]. Additionally, other copyright disputes have emerged [5], including an open letter from musicians in April 2024 accusing AI vendors of intellectual property infringement for training models on their work without consent [5], and a cease-and-desist letter sent by the Times to Perplexity AI in October 2024 for unauthorized use of its content [5]. A ruling in favor of the Times could have significant financial implications for AI companies and restrict the datasets available for training AI models [7], with the lawsuit asserting that OpenAI and Microsoft could be liable for “billions of dollars” in damages [7].

Conclusion

The legal challenges faced by OpenAI and Microsoft underscore the complex intersection of AI development and copyright law. The outcomes of these cases could have far-reaching implications for the AI industry, potentially reshaping how AI models are trained and the legal frameworks governing the use of copyrighted material. As the debate continues [6], the balance between technological advancement and intellectual property rights remains a critical issue for stakeholders across various sectors.

References

[1] https://www.lexology.com/library/detail.aspx?g=75fb28a8-1528-480b-b1bb-0c822f5d0143
[2] https://chatgptiseatingtheworld.com/2024/12/24/openai-microsoft-object-to-mag-judge-wangs-denial-of-discovery-in-authors-guild-new-york-times-suits/
[3] https://www.businessinsider.com/openais-biggest-moments-in-2024-2024-12
[4] https://www.mckoolsmith.com/newsroom-ailitigation-3
[5] https://www.techtarget.com/searchEnterpriseAI/feature/The-year-in-AI-Catch-up-on-the-top-AI-news-of-2024
[6] https://www.laptopmag.com/ai/the-new-york-times-made-headlines-by-suing-openai-this-year-but-the-case-against-ai-isnt-so-black-and-white
[7] https://news.yahoo.com/news/former-openai-employee-died-suicide-071412811.html
