Crawl Cautiously: Examining the Legal Landscape for Text and Data Mining in India – Part II

We are pleased to bring you a two-part guest post by Viraj Ananth examining the legal landscape of TDM in India, an issue we have covered on this blog previously here. Viraj is a fourth-year B.A. LL.B. (Hons.) student at the National Law School of India University, Bangalore.

Part I of this post studied the question of copyright infringement liability for TDM use in India, after introducing TDM technology and its popular techniques. Part II first explores international developments on copyright exceptions for TDM use — specifically in the European Union, Singapore, Japan and at the WIPO. It then examines the prospect of contractual liability for TDM use in India, and finally lays down some guiding principles for entities using TDM, to minimise contractual liability.

Viraj Ananth

Recent International Developments on Copyright Exceptions for TDM

The European Union’s (‘EU’) Directive on Copyright in the Digital Single Market (‘the DSM Directive’) came into force on June 6, 2019, giving European Union Member States until June 7, 2021 to transpose it into their national laws. Its provisions on TDM emerged in response to growing concerns from the scientific and research communities about the uncertainty surrounding copyright infringement liability for TDM, and dissatisfaction with the licensing requirements imposed by several Member States under the existing Information Society Directive of 2001. Most notably, licensing requirements were criticised for increasing transaction costs by requiring negotiations with a range of publishers/authors, who often imposed restrictive conditions on access to data. This, in turn, placed EU researchers at a competitive disadvantage as compared to their US counterparts.

Articles 3 and 4 of the DSM Directive grant exceptions for TDM in certain cases. Article 3 requires Member States to provide exceptions for “reproductions and extractions made by research organisations and cultural heritage institutions in order to carry out… text and data mining of works”. ‘Research organisation’ is defined broadly in Article 2(1) to include universities, libraries and “any other entity” whose primary purpose is to “conduct scientific research or to carry out educational activities”. However, the research organisation must perform these activities either “on a not-for-profit basis” or “pursuant to a public interest mission recognised by a Member State”. ‘Cultural heritage institution’ is defined under Article 2(3) as a publicly accessible library or museum, an archive, or a film or audio heritage institution.

This exception extends only to works to which the organisation/institution has “lawful access”. As per Recital 14, ‘lawful access’ means access to content through contractual arrangements, an open access policy or other lawful means, including content freely available online. Significantly, publishers/authors cannot opt out of this exception, since Article 7 declares that any contractual provision contrary to Article 3 (which mandates Member States to carve out exceptions for TDM) is unenforceable.

Article 4, on the other hand, requires Member States to provide narrower TDM exceptions for a wider range of stakeholders, including for-profit/commercial organisations. Here too, the organisation must have lawful access to the work. Unlike Article 3, however, which does not permit right-holders to contractually bar TDM, Article 4 provides that the use of TDM may be expressly reserved by right-holders. In other words, right-holders have the option to opt out of TDM use on their websites. Recital 18 clarifies that rights may be reserved via machine-readable means, including Terms of Service or metadata. Right-holders may also use robots.txt files (discussed below), which are machine-readable, to condition or restrict the use of TDM under Article 4. Thus, the exception under Article 4 weighs against for-profit research institutions, labs, journalists and software developers, who may find themselves subject to the whims of right-holders.

Discussions surrounding copyright exceptions for TDM have not been limited to the EU. Singapore’s Ministry of Law recently published the Singapore Copyright Review, which reflects forthcoming amendments to the Copyright Act 1987. It proposes an exception for both non-profit and commercial TDM for the purpose of data analysis, where the user has lawful access. In May 2018, Japan passed a bill to amend its Copyright Act. Although TDM for commercial and non-commercial purposes had been permissible since 2009, the 2018 Amendment’s provisions on TDM seek to eradicate copyright-related barriers to AI innovation. Specifically, the Amendment permits the storage of electronic incidental copies of works and the use of copyrighted works for verification, both of which are essential to AI/ML research and development. It also recognises that copyrighted expressions are not perceived while feeding raw data to AI/ML algorithms, and accordingly, that the harm to right-holders is minimal.

In September 2019, the World Intellectual Property Organization (‘WIPO’) convened a multi-state and stakeholder discussion on the intellectual property challenges of AI. Shortly after, it released a Draft Issues Paper detailing questions for comment. These include: whether use of ML for mining data in copyrighted works infringes copyright; whether separate exceptions should be made for this purpose; and how existing TDM exceptions would interact with such infringement.

Looking ahead, it is essential for India to consider including express exceptions not only for commercial and non-commercial uses of TDM, but also with respect to the specific challenges arising at the intersection of AI/ML technologies and TDM.

Breach of Contract — Terms of Service and Robot Exclusion Protocols

Beyond potential copyright concerns, unauthorised TDM may give rise to contractual liability, even if the scraper has not ‘signed in’ or explicitly agreed to the terms of the website. This is because many websites include ‘browse-wrap’ clauses in their Terms of Service (‘ToS’), due to which the mere browsing or scraping of data binds a scraper to the terms of the website. The imposition of restrictions on TDM, by way of such browse-wrap clauses, is considered legally tenable.

In Facebook, Inc. v. Power Ventures, Inc., the United States Court of Appeals for the Ninth Circuit underlined the agency of websites to regulate web-robots and crawlers through their ToS. It also observed that scrapers must utilise the application programming interfaces (‘APIs’) provided by websites (if any) to scrape data, and that non-use of the APIs may amount to a copyright violation. This agency to regulate TDM was, however, qualified in the 2019 decision of HiQ Labs Inc v. LinkedIn Corporation. Here, the same court drew a distinction between ‘private information’ over which LinkedIn enjoyed copyright protection and information that users knowingly made public, in which case LinkedIn lacked ownership interest. It accepted HiQ’s reasoning that authorisation was not required to access information that was open to the general public. Thus, automated TDM of publicly available information is lawful in the US and websites may not restrict access to such information.

In addition to ToS, websites commonly make use of robot exclusion protocols and robots.txt files (i.e., standards and guidelines that specify how scrapers are to interact with the website and its contents) to ‘regulate’ TDM. Robots.txt files may be used to prescribe restrictions and limits on TDM, such as conservative request rates and visit times. EU website owners may exercise their right to opt out through robots.txt by requiring that certain privileged contents of the website not be mined. These protocols too, by virtue of enabling provisions in the ToS, often serve as contractually binding agreements between the website and the scraper. Internationally, there is limited case law dealing with the legal uncertainties arising from the use of such protocols. However, the non-use of ‘no-archive meta-tags’ (i.e., the industry standard to inform scrapers to refrain from caching) on a website’s pages has been interpreted as an implicit license to cache and index the website.

While most of these cases are concerned with violations under the US Computer Fraud and Abuse Act, 1986, they nonetheless crystallise important principles for the Indian judiciary’s consideration.
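To make the robots.txt mechanism described above concrete, the following is a minimal sketch of how a scraper might read such a file using Python’s standard-library parser. The file contents, the crawler name and the example.com URLs are invented for illustration; Crawl-delay and Request-rate are non-standard directives that only some websites publish, and the sketch says nothing about whether a given file is contractually binding.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
Request-rate: 1/10
"""

# Hypothetical crawler name.
AGENT = "example-tdm-bot"


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

# Paths under /private/ are disallowed for every crawler; other paths are not.
blocked = rules.can_fetch(AGENT, "https://example.com/private/report.html")
allowed = rules.can_fetch(AGENT, "https://example.com/articles/1.html")

# Optional pacing hints; each call returns None if the site does not set one.
delay = rules.crawl_delay(AGENT)   # seconds to wait between requests
rate = rules.request_rate(AGENT)   # named tuple of (requests, seconds)

print(blocked, allowed, delay, rate)
```

In practice the file would be fetched from the website’s root (for instance with the parser’s `set_url` and `read` methods) rather than hard-coded as it is here.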

Steps for Businesses to Mitigate Risk

Considering the uncertain position in India regarding contractual liability for TDM use, entities conducting such operations are advised to abide by the following guiding principles to minimise liability:

Inspect the ToS of a website to determine its stance on TDM. Some websites will limit TDM to certain classes of data or sections of the website. Others may bar TDM and require that businesses obtain explicit permission from webmasters prior to use.
Examine the website’s robot exclusion protocols to understand the website’s internal mechanisms and guidelines on TDM. This could include limits on crawl rate, request rate and visit time, as well as other general restrictions on use. In the absence of such specifications, use conservative rates (1 request every 10-15 seconds). Further, make use of the APIs provided by websites for scraping data, if any.
Identify your website scraper with a legitimate ‘user agent string’ and link this back to a ‘scraping policy’ that details the scope of your activities, objectives, compliances and grievance redressal mechanisms.
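The second and third principles above can be sketched in code. The example below is an illustration under stated assumptions, not a compliance tool: the bot name, the policy URL and the 15-second default are hypothetical choices (the default simply takes the conservative end of the 10-15 second range suggested above), and the site’s own delay, if any, is assumed to have been read from its robot exclusion protocol beforehand.

```python
import time
import urllib.request
from typing import Iterable, Iterator, Optional

# Hypothetical identifiers, for illustration only.
BOT_NAME = "example-tdm-bot/1.0"
POLICY_URL = "https://example.org/scraping-policy"
DEFAULT_DELAY_SECONDS = 15.0  # conservative end of the 10-15 second range


def user_agent(bot_name: str = BOT_NAME, policy_url: str = POLICY_URL) -> str:
    """Build a User-Agent string that names the bot and links to its policy."""
    return f"{bot_name} (+{policy_url})"


def choose_delay(site_delay: Optional[float],
                 default: float = DEFAULT_DELAY_SECONDS) -> float:
    """Follow the site's stated delay if it has one, else the default."""
    if site_delay is None or site_delay <= 0:
        return default
    return float(site_delay)


def build_request(url: str) -> urllib.request.Request:
    """Attach the identifying User-Agent header to an outgoing request."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent()})


def polite_fetch(urls: Iterable[str],
                 site_delay: Optional[float] = None) -> Iterator[bytes]:
    """Fetch pages one at a time, pausing between consecutive requests."""
    delay = choose_delay(site_delay)
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # wait before every request after the first
        with urllib.request.urlopen(build_request(url), timeout=30) as response:
            yield response.read()
```

Where a website offers an API for the data in question, the first choice would be to call that API rather than run `polite_fetch` against its pages.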

Please click here to view Part I of this two-part post.


About The Author

Divij Joshi
