Introduction

The Global Partnership on AI (GPAI) has published a report on the intellectual property (IP) implications of data scraping in the development of AI systems [1]. The report examines how AI models rely on large datasets, often acquired through data scraping, and the IP concerns that follow [1]. It highlights the legal challenges and economic trade-offs associated with regulating AI-generated content, and surveys litigation and policy responses around the world.

Description

The GPAI report addresses the IP implications of data scraping in the context of AI system development [1]. It highlights the reliance of AI models, particularly large language models, on extensive datasets [1] [4]. These datasets are often obtained through data scraping, an automated method of extracting information from online sources [1]. The practice raises significant IP concerns involving copyright, database rights, trademarks, trade secrets, and moral rights, and the applicable legal frameworks vary across jurisdictions [1] [2]. The report specifically notes the potential for copyright infringement arising from the use of these datasets, and emphasizes the economic trade-offs associated with regulating AI-generated content [4].

Litigation over data scraping is on the rise globally, with notable cases in the United States and the European Union [1]. Ongoing US lawsuits are examining whether the fair use exception to copyright applies to AI training data [5]. The report discusses the legal challenges posed by AI-generated outputs that may infringe individual rights, which have prompted diverse legal responses to safeguard those rights [1]. The International Scientific Report on the Safety of Advanced AI identifies threats to the protection of IP rights as a critical risk associated with advanced AI systems [1].

The report notes that there is no universally accepted definition of data scraping and that different stakeholders in the data scraping ecosystem face distinct legal challenges [1]. While data scraping is often associated with commercial applications, it also plays a vital role in research and academia, which points to a need for tailored policy tools [1]. The European Union’s 2019 Directive on Copyright in the Digital Single Market provides exceptions for text and data mining, particularly for research and cultural heritage organizations, while enabling copyright owners to restrict commercial use [5]. In the UK, a proposed broad exception for commercial uses has been postponed [5]. Singapore has established an exception for computational data analysis that requires lawful access to the data, while China is focusing on excluding content that infringes intellectual property rights from training data [5].

There is a growing demand for transparency in AI, with calls for clearer disclosure of the sources of training data [4]. In this context, the European AI Office has proposed a template for the training data summary that providers of general-purpose AI models must publish under the EU AI Act [1] [2] [3] [5]. (In the sources on the Act, "GPAI" denotes general-purpose AI, not the Global Partnership on AI.) The initiative aims to balance the rights of GPAI model developers with the interests of rightsholders, particularly in light of developers' increasing reluctance to disclose training data [2] [3]. The Act requires providers to publish a “sufficiently detailed summary” of the training data used for GPAI models marketed in the EU, so that copyright holders can exercise their rights while developers’ trade secrets are taken into account [2].

The summary should include general information about the general-purpose AI model provider, the date the model was placed on the market, the overall training data size, and the characteristics of the data modalities (text, image, video, audio) [2]. Providers are required to break down the types of data within each modality and list the sources used for training, which may include publicly accessible datasets, private datasets, and data scraped from online sources [2] [3]. Specific disclosure requirements apply to different kinds of dataset [2]. For instance, publicly accessible datasets must be listed with unique identifiers and collection periods, while for data scraped from online sources providers must list the top 10% of domain names per data modality, with a reduced requirement for small and medium-sized enterprises (SMEs) [3]. The template also addresses relevant data processing measures, such as respecting rightsholder opt-outs and removing data subject to rights reservations [2] [3].
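The per-modality domain listing described above can be illustrated with a short Python sketch. It is not taken from the template or from the cited sources: the function name, the (url, modality, size_bytes) record format, the choice to rank domains by total scraped bytes, and the rounding-up rule are all assumptions made for illustration. The cutoff is left as a parameter because the reduced SME threshold is not specified here.

```python
from collections import defaultdict
from urllib.parse import urlparse


def top_domains_by_modality(records, percent=10):
    """Return, per modality, the top `percent` of domains by scraped volume.

    `records` is an iterable of (url, modality, size_bytes) tuples.
    Hypothetical helper: ranking by bytes is an assumption, not a quoted rule.
    """
    # Sum scraped bytes per domain within each modality.
    totals = defaultdict(lambda: defaultdict(int))
    for url, modality, size_bytes in records:
        domain = urlparse(url).netloc.lower()
        totals[modality][domain] += size_bytes

    result = {}
    for modality, by_domain in totals.items():
        # Largest domains first; ties are broken alphabetically.
        ranked = sorted(by_domain.items(), key=lambda kv: (-kv[1], kv[0]))
        # Integer ceiling of len * percent / 100, listing at least one domain.
        keep = max(1, (len(ranked) * percent + 99) // 100)
        result[modality] = [domain for domain, _ in ranked[:keep]]
    return result
```

A provider's pipeline would pass its own crawl log as `records`; the integer arithmetic avoids floating-point rounding surprises when computing the cutoff.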

To address the challenges surrounding data scraping, the report proposes several policy approaches, including a voluntary data scraping code of conduct for AI developers, standardized technical tools, and awareness-raising initiatives [1]. These measures aim to provide a flexible framework for addressing IP, privacy, and related issues in a coordinated manner [1] [5]. Policymakers are encouraged to consider these approaches when contemplating legal reforms, with an emphasis on developing standardized definitions and measures that balance innovation with effective rights protection [1].

A key recommendation is the establishment of a voluntary code of conduct for data scraping, which would provide guidelines for actors across the AI ecosystem, including data aggregators and users [1]. The code could promote consistency through standard terminology and include mechanisms for monitoring compliance and transparency [1].

The report underscores the need for standardized technical tools that help rights holders manage access to their data and protect their IP rights [1]. Such tools could facilitate compliance and streamline rights protection across platforms [1]. The report also suggests developing standard contract terms to address the legal and operational challenges of data scraping and to make negotiations more efficient [1].
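As an illustration of what a machine-readable access signal can look like, the sketch below checks a crawler against a robots.txt file using Python's standard `urllib.robotparser` module. The report as summarized here does not name robots.txt, so treating it as an example of such a tool is an assumption, as are the crawler name `ExampleAIBot`, the sample rules, and the helper `is_allowed`. Note that robots.txt is an advisory convention and does not by itself settle any question of IP rights.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI crawler is excluded from the whole site,
# while all other crawlers are only kept out of /private/.
EXAMPLE_ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    # Parse the rules from text instead of fetching them over the network.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A scraper that honors such signals would call `is_allowed` before each request and skip any URL for which it returns False.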

The report also emphasizes raising awareness among stakeholders about the legal implications of data scraping, with the aim of empowering rights holders and educating AI system users on responsible practices [1]. Ethical considerations in AI-driven web scraping must be addressed to ensure compliance with legal standards and respect for individual rights, particularly privacy and confidentiality when sensitive information is input into AI systems [5]. While some jurisdictions may seek to update IP laws in the long term, the report advocates flexible, voluntary measures to address immediate challenges and accommodate diverse legal approaches [1].

Overall, the report serves as a guide for policymakers, developers, rights holders, and other stakeholders, aiming to create a framework that respects IP rights while fostering technological advancement [1] [2] [3] [4] [5]. These insights can inform future regulatory decisions as governments explore their options and consider international collaboration in this dynamic legal landscape [4]. Separately, with EU AI Act compliance deadlines approaching and potential fines for non-compliance, providers of general-purpose AI models should begin gathering the necessary information and ensure that legal and technical teams collaborate on implementation [3].

Conclusion

The GPAI report underscores the complex interplay between data scraping and intellectual property rights in AI development. It highlights the need for a balanced approach that considers the interests of developers, rights holders, and policymakers [1] [2] [3] [4]. By proposing a voluntary code of conduct, standardized tools, and awareness initiatives, the report aims to foster a collaborative environment that respects IP rights while promoting innovation [1] [4]. As the legal landscape evolves, these insights will be crucial for shaping future regulatory frameworks and ensuring responsible AI development.

References

[1] https://oecd.ai/en/wonk/ip-data-scraping
[2] https://www.lexology.com/library/detail.aspx?g=6ce0f32f-2d40-4efc-b3b1-623abc2483c2
[3] https://thelens.slaughterandmay.com/post/102jyxg/how-to-train-your-gpai-model-a-first-look-at-the-eus-data-summary-requirements
[4] https://barrysookman.com/2025/02/15/ai-copyright-understanding-recent-reports-and-implications/
[5] https://www.restack.io/p/ai-in-legal-tech-answer-legal-considerations-web-scraping-cat-ai