Copyright and Text and Data Mining

What is Text and Data Mining?

Text and data mining (TDM) is a process involving computer-based extraction and analysis of large amounts of text and data in order to gain new knowledge. TDM works by applying sophisticated computational techniques to enable meaningful patterns and links to be made between otherwise unconnected documents, generating new insights and understandings.

TDM has been used to advance research and to enable new discoveries in fields such as bioscience and medicine, agriculture and climate change, literature and the social sciences. For example, in 2007 a new link between genes and osteoporosis was discovered by analyzing PubMed, a database of 30 million citations for biomedical literature.

High-performing TDM models rely on high-volume, high-quality data. The sources can range from books, journals and images to web pages, databases and giant datasets. Since the data usually involves copyright-protected works, copyright law and licensing come into play. For example, effective TDM engages the right of reproduction (entire works are copied to create a database for the mining process), the right of communication to the public (the data is shared with other researchers for review and replicability) and, if the project involves international collaboration, cross-border uses (the data will need to be sent to fellow researchers in other countries). Access to databases of interest to TDM researchers, such as subscription-based e-resources and institutional repositories, is often managed by the library in an institution. Unless the database is open access or Creative Commons licensed, these activities are likely to need authorization through a publisher licence or a copyright exception.

TDM and artificial intelligence

TDM technologies and research methods underpin the development of artificial intelligence (AI) systems. Sometimes the terms are used interchangeably. For example, in European copyright law, text and data mining covers all AI and machine learning technologies (Article 2(2) Directive on Copyright in the Digital Single Market).

Alongside ‘traditional’ AI (such as TDM), new generative AI systems have emerged. The main function of traditional AI is to analyze data, for example by classifying text and identifying patterns in it (known as non-expressive outputs). Traditional AI is a driver for modern science, research and scholarship. The main function of generative AI is to generate new content such as music, art and books (known as expressive outputs). Generative AI systems are increasingly used for a host of applications (commercial and non-commercial), and generative AI tools, such as ChatGPT and Google Bard, have become popular with the general public. While all AI systems use similar types of input (training) data, the outputs from generative AI can raise additional copyright issues, as the AI-generated content may infringe the rights of rightsholders whose works have been used to train the AI system.

What is EIFL’s position on Text and Data Mining?

  • At national level, EIFL advocates for a copyright exception for text and data mining to unambiguously authorize TDM for science and research. The exception should be open to all types of works, all protected uses, and all types of users. It could be a specific exception for TDM, an existing research exception or a general exception that is flexible enough to accommodate TDM (such as fair use or an open fair dealing provision).
  • At international level at WIPO, EIFL supports an international legal framework to ensure that all countries permit TDM research, and to enable cross-border collaboration between researchers.
  • As regulators and courts of law around the world grapple with copyright issues raised by generative AI systems, it is essential that ‘traditional’ AI systems (such as TDM) used for science, research and scholarship are not hampered or curtailed. Policy makers should understand the differences between TDM and generative AI outputs to ensure that when addressing one problem (not harming creators), they are not inadvertently creating another (harming research).

 
