Introduction
HarperCollins has partnered with Microsoft to license select nonfiction works for AI model training [3], aiming to enhance the accuracy of Microsoft’s AI programs [2]. This collaboration highlights the evolving relationship between publishing and AI technology, raising questions about revenue distribution, intellectual property rights [1] [3] [4], and the future role of human creators in an AI-driven landscape [5].
Description
HarperCollins has entered into a three-year partnership with Microsoft to license select nonfiction works for training AI models [3], with the aim of improving the accuracy of Microsoft’s AI programs [2]. Under the agreement, Microsoft will pay $5,000 per title, split evenly between the author and HarperCollins [3] [4]. Payments will be made directly to authors and will not be deducted from their advances. Participation is opt-in [5]: works by authors who decline are excluded from the training datasets [4]. Authors have reportedly been offered $2,500 for a three-year license [1], a figure consistent with the author’s half of the $5,000 fee. The licensed content will be used for a yet-to-be-announced Microsoft model [5]. The model’s output is restricted to reproducing no more than 200 consecutive words or 5% of a book’s text [5], a limit intended to preserve authors’ control over their works [1].
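The payment split and the output limits described above reduce to simple arithmetic, which can be sketched as follows. This is a minimal illustration, not anything taken from the agreement itself. The function names are hypothetical, and the sketch assumes that both output caps apply at once and that the 5% threshold is measured in words, neither of which the reporting confirms.

```python
from decimal import Decimal

# Figures reported in the text above.
FEE_PER_TITLE = Decimal("5000")       # per-title fee paid by Microsoft
AUTHOR_SHARE = Decimal("0.5")         # even split between author and publisher
MAX_CONSECUTIVE_WORDS = 200           # cap on consecutive words in model output
MAX_BOOK_FRACTION = Decimal("0.05")   # cap of 5% of a book's text


def split_fee(fee=FEE_PER_TITLE, author_share=AUTHOR_SHARE):
    """Return (author_payment, publisher_payment) for one licensed title."""
    author_payment = fee * author_share
    return author_payment, fee - author_payment


def within_output_limits(excerpt_words, book_words):
    """Hypothetical check of an excerpt against the reported caps.

    Assumption: both caps must hold, and both are measured in words.
    """
    if book_words <= 0:
        raise ValueError("book_words must be positive")
    fraction = Decimal(excerpt_words) / Decimal(book_words)
    return excerpt_words <= MAX_CONSECUTIVE_WORDS and fraction <= MAX_BOOK_FRACTION


author_payment, publisher_payment = split_fee()
print(f"Author: ${author_payment:,.2f}, publisher: ${publisher_payment:,.2f}")
print(within_output_limits(150, 80_000))  # short excerpt from a full-length book
print(within_output_limits(150, 2_000))   # same excerpt from a very short work
```

Under these assumptions, an even split of the $5,000 fee gives each party $2,500, which matches the per-author figure reported in [1]. The second check shows why the percentage cap matters for short works: a 150-word excerpt is under the 200-word cap but exceeds 5% of a 2,000-word text.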
The Authors Guild has raised concerns about the agreement, arguing that the 50-50 revenue split disproportionately benefits the publisher [4]. It asserts that authors, as the rights holders, should receive a larger share of the revenue generated from AI licensing [4]. The question of compensation is becoming more pressing as the pool of public content available for generative AI training shrinks: projections indicate that the remaining 300 trillion tokens of public data could be exhausted between 2026 and 2032. Incorporating books that are not in the public domain into the training data may extend this timeline [5], but the compensation offered to authors may still be inadequate, given that the publisher retains half of the earnings [5].
In response to growing concerns about the use of copyrighted material in AI, publishers are taking steps to protect their authors’ intellectual property [4]. HarperCollins emphasizes its commitment to providing authors with opportunities while safeguarding the value of their works and ensuring the integrity of revenue and royalty streams [2]. The agreement includes defined limitations and protections regarding the output of the AI model [2], thereby respecting authors’ rights [1] [2]. Academic publishers, including Wiley and Taylor & Francis, have also formed partnerships with AI developers to supply content for training purposes [3]. Penguin Random House, by contrast, has updated its copyright language to prohibit the use of its books for AI training without explicit consent [4]. News Corp, the parent company of HarperCollins [3] [4], has previously entered into agreements with OpenAI to allow the use of its digital content, including articles from The Wall Street Journal and the New York Post, for AI training [4]. Similarly, the Financial Times has reached an agreement with OpenAI for the use of its archived content in the development of generative AI technology [4].
Public trust in companies to manage AI responsibly is low: only 47% of US adults express confidence in companies’ ability to prevent unauthorized derivative works [5]. Because full texts of books are less commonly available online, book content may be less vulnerable to unauthorized data scraping than news content [5]. Even so, the future role of human creators in an AI-dominated landscape remains uncertain [5]. The Transparency Coalition advocates for the disclosure of training data to protect creators’ rights and ensure ethical sourcing of data used in AI development [4]. For authors who are open to generative AI using their work, this partnership could provide additional revenue while imposing clear limits on AI outputs [5].
Conclusion
The partnership between HarperCollins and Microsoft underscores the complexities of integrating AI into the publishing industry. It raises significant issues regarding fair compensation for authors, the protection of intellectual property, and the ethical use of copyrighted material in AI training. As the availability of public data for AI diminishes, the industry must navigate these challenges to ensure that authors’ rights are upheld and that AI development proceeds responsibly.
References
[1] https://www.theverge.com/2024/11/18/24299882/harpercollins-authors-license-books-ai-training
[2] https://mybroadband.co.za/news/ai/571054-microsoft-signs-ai-training-deal-with-major-book-publisher.html
[3] https://observer.com/2024/11/harpercollins-asking-authors-sell-books-ai-training/
[4] https://www.transparencycoalition.ai/news/harpercollins-ai-deal-with-microsoft-sets-first-public-price-for-training-data
[5] https://www.emarketer.com/content/microsoft-harpercollins-sign-ai-licensing-deal--author-opt-in-still-required