Introduction

The European Commission has developed a Template under the EU AI Act (Regulation (EU) 2024/1689) to mandate the public disclosure of training content summaries for general-purpose AI models. This initiative is designed to enhance transparency, protect intellectual property rights [3], and build trust among stakeholders by providing insight into the origins of training data.

Description

The Template [1] [2], developed by the European Commission, establishes a mandatory framework for the public disclosure of training content summaries for general-purpose AI models, in accordance with Article 53(1)(d) of the EU AI Act (Regulation (EU) 2024/1689) [1] [4]. It aims to enhance transparency regarding the origins of training data and to help parties with legitimate interests, such as copyright holders, enforce their rights under Union law [1]. As stakeholders increasingly seek insight into how AI models are built, transparency in AI development and deployment is essential for respecting intellectual property rights, building trust among partners and users, ensuring regulatory compliance, and fostering responsible innovation [3].

Effective from 2 August 2025, all providers of general-purpose AI models available in the EU, including those releasing models under free and open-source licenses, are required to publish detailed summaries of the data used to train their models [1] [4]. This covers data from publicly available datasets, privately licensed archives, and data scraped from the internet [2]. Providers must also make certain essential technical documentation, copyright policies, and risk assessment strategies publicly available to assist downstream users [1] [5]. For models launched before that date, summaries must be made available by 2 August 2027 [1]. Where required information is unavailable or difficult to obtain, providers must clearly state and justify these gaps in their summaries [1]; information shared with regulators can be marked as sensitive and handled under strict cybersecurity protocols [3].

The Template seeks to balance the need for transparency with the protection of trade secrets and confidential business information [1]. It defines a uniform baseline for disclosures, organized into three main sections: General information, List of data sources, and Relevant data processing aspects [1]. In the General information section, providers must identify the model and its versions, the data modalities used (such as text, image, audio, and video), and the size of the datasets [2] [3]. The List of data sources section requires providers to disclose publicly available third-party datasets, privately licensed datasets, and data scraped from online sources [2]. For scraped data, providers must list the top 10% of domain names by volume of data scraped, reduced to the top 5% for small and medium-sized enterprises (SMEs) [2]. For domains outside that threshold, providers are encouraged to voluntarily inform rightsholders, upon request, whether their data was scraped [2].
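To make the three-part structure concrete, the sketch below shows how a provider might organize a summary internally before publication. This is a hypothetical illustration only: the field names and placeholder values are assumptions, not the Template's official schema; only the three top-level sections and the disclosure items they contain mirror those described above.

```python
# Hypothetical sketch of a training-content summary organized along the
# Template's three sections. Field names and values are illustrative
# assumptions, not the official schema published by the Commission.

training_content_summary = {
    "general_information": {
        "model_name": "ExampleModel",          # placeholder identifier
        "model_versions": ["1.0", "1.1"],
        "data_modalities": ["text", "image", "audio", "video"],
        "dataset_size": "approx. sizes per modality",  # illustrative
    },
    "list_of_data_sources": {
        "public_datasets": ["example-public-corpus"],        # placeholders
        "privately_licensed_datasets": ["example-archive"],
        "scraped_sources": {
            # Top 10% of scraped domains by data volume (top 5% for SMEs).
            "top_domains_by_volume": ["example.com", "example.org"],
        },
    },
    "relevant_data_processing": {
        # Described at a general level, e.g. measures taken to respect
        # copyright reservations or to filter unwanted content.
        "notes": "General description of processing measures.",
    },
}

# The three mandatory sections of the Template:
print(sorted(training_content_summary))
```

The nesting makes the SME distinction easy to audit: the domain list lives in one place, so trimming it to the 5% threshold touches a single field.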

Furthermore, the Template includes a chapter dedicated to safety and security, providing tools for documentation and risk management [5]. Certain sensitive information, such as technical specifications and impact assessments, can be shared confidentially with regulators and benefits from confidentiality protections under the AI Act [1] [5]. This structured approach allows providers to include additional voluntary information while ensuring that the rights of all parties, including those relating to copyright, data protection, and consumer protection, are respected [1]. As the deadline approaches, compliance with the AI Act is becoming increasingly urgent for developers; the Template serves as a foundational guideline for these obligations, facilitating interactions with regulators and minimizing legal and operational risks [5]. Transparency is not merely a regulatory requirement but a fundamental aspect of operational legitimacy, underscoring the need for AI systems to be accountable and auditable [3]. Developers and deployers should audit their documentation and refine model pipelines in preparation for a future where accountability and understanding of AI tools are paramount [3].

Conclusion

The implementation of the Template under the EU AI Act represents a significant step towards greater transparency and accountability in AI development. By mandating the disclosure of training data and related documentation, the initiative aims to protect intellectual property rights and foster trust among stakeholders. As compliance deadlines approach, developers must prioritize adherence to these guidelines to ensure regulatory compliance and minimize risks, ultimately contributing to responsible AI innovation.

References

[1] https://digital-strategy.ec.europa.eu/en/faqs/template-general-purpose-ai-model-providers-summarise-their-training-content
[2] https://www.pymnts.com/cpi-posts/eu-publishes-mandatory-template-for-disclosing-ai-training-data/
[3] https://www.linkedin.com/pulse/how-stay-compliant-eu-ai-acts-transparency-rules-r-pillai-fd1jf/
[4] https://www.lexisnexis.co.uk/legal/news/commission-publishes-template-for-ai-training-data-transparency
[5] https://perkinscoie.com/insights/update/delayed-eu-code-practice-provides-compliance-framework-general-purpose-ai-models