Introduction

The ongoing legal battle between the New York Times and tech giants OpenAI and Microsoft centers on allegations of copyright infringement. The New York Times accuses these companies of unlawfully using its articles to train AI models, such as ChatGPT, that now compete directly with its content [7]. The case highlights significant issues surrounding the use of copyrighted material in AI training and the complexities of fair use defenses.

Description

On November 23 [2], a motion by OpenAI to compel discovery of information regarding the New York Times’s use of AI was denied [2]. The same decision applied to the lawsuits filed against OpenAI and Microsoft by book authors, including the Authors Guild [1] [2] [6]. The motion concerned the defendants’ fair use defense and was denied for the reasons outlined in a related case [2]. The request was deemed premature because it consisted of interrogatories asking the plaintiffs to identify documents relevant to that defense [2].

The New York Times’s lawsuit alleges that OpenAI and Microsoft illegally used its articles to train AI tools like ChatGPT [1]. The Times claims that OpenAI’s actions have resulted in direct competition with its content, and it is seeking billions of dollars in damages for the alleged copyright infringement [7].

During the discovery phase [1] [3], the Times reported that OpenAI’s engineers inadvertently deleted crucial evidence related to the AI’s training data [3], evidence that the Times’s legal team had meticulously compiled [3]. Although OpenAI recovered some of the data [1], the recovered material was disorganized [1] and lacked the original file names and folder structure [1] [3], which complicates the Times’s ability to trace how its articles were incorporated into OpenAI’s models [1].

The publishers have spent significant time searching OpenAI’s training data for their content [4], but the unreliability of the recovered data has hindered those efforts. Key elements that demonstrate the AI’s pattern of copying are still missing [3], and OpenAI’s recovery efforts were deemed incomplete and unreliable [7], which further complicates tracing how the news organizations’ articles were used in developing its AI tools [7].

In a related development, a federal magistrate judge denied requests from OpenAI and Microsoft for discovery from the New York Times [6], including demands for readership and performance data concerning the Times’s website content [6]. The court has also required OpenAI to disclose its training data, a significant step because the company has not publicly revealed this information before [1].

The Times has requested additional communications from OpenAI’s key figures [1], including former employees and current executives [1], to support its case [1]. Microsoft, in turn, has sought documents from the Times regarding the Times’s own use of generative AI [1] [2] [5] [6] [7], arguing that this information could be relevant to its defense [1]. The parties are expected to adhere to the agreed-upon scope of document production [2]. If the defendants still require document identification after reviewing the produced materials [2], they may renew their motion after conferring with the plaintiffs [2], providing justification for their request [2].

OpenAI has acknowledged the data deletion [3], attributing it to a “glitch.” It denies intentionally deleting any evidence and claims that the issue arose from a misconfiguration caused by the publishers. The company asserts that it has made training data available for inspection and contends that the plaintiffs’ actions led to the technical difficulties [4]. As the proceedings continue, OpenAI appears to be preparing a rebuttal. It maintains that its use of publicly available data for training falls under fair use and requires no licensing or compensation [4], despite the commercial application of its models [4].

Alongside the litigation, OpenAI has formed partnerships with other media outlets [7], which indicates that some publishers prefer to collaborate rather than litigate [7].

Conclusion

The legal proceedings between the New York Times, OpenAI, and Microsoft underscore the complexities of copyright law in the digital age, particularly concerning AI development [1] [2] [3] [4] [5] [6] [7]. The case could set significant precedents for how copyrighted materials are used in training AI models and for the boundaries of fair use. As it unfolds, it may also influence future collaborations and legal strategies between media companies and technology firms.

References

[1] https://www.wired.com/story/new-york-times-openai-erased-potential-lawsuit-evidence/
[2] https://chatgptiseatingtheworld.com/2024/12/03/magistrate-judge-wang-denies-openais-discovery-motion-in-book-authors-cases-re-fair-use-on-same-grounds-as-denial-in-new-york-times-lawsuit/
[3] https://www.engadget.com/ai/the-new-york-times-says-openai-deleted-evidence-in-its-copyright-lawsuit-231805285.html
[4] https://www.medianama.com/2024/11/223-new-york-times-openai-accidentally-deleted-evidence-copyright-lawsuit/
[5] https://futurism.com/openai-nyt-lawsuit-evidence-deleted
[6] https://news.bloomberglaw.com/ip-law/openai-microsoft-bid-for-nyt-revenue-info-shot-down-by-judge
[7] https://www.theverge.com/2024/11/21/24302606/openai-erases-evidence-in-training-data-lawsuit