Crawl Cautiously: Examining the Legal Landscape for Text and Data Mining in India – Part I

We are pleased to bring you a two-part guest post by Viraj Ananth, examining the legal landscape of TDM in India, an issue we have covered on this blog previously here. Viraj is a fourth-year B.A. LL.B. (Hons.) student at the National Law School of India University, Bangalore.

Part I of this post first introduces text and data mining (TDM) and contrasts popular TDM techniques, namely website scraping, website crawling or indexing, and website archiving. It then studies the question of copyright infringement liability for TDM in India. Part II of the post explores international developments on copyright exceptions for TDM use. It then examines the scope for website owners in India to contractually limit/condition TDM use on their websites, and concludes with some guiding principles for scrapers to minimise contractual liability.

Viraj Ananth


The emergence of the Internet has led to vast quantities of data being stored online. Website scraping and other forms of text and data mining (‘TDM’) have arisen in response to the corresponding need for researchers, scientists and developers to access such information. TDM has long been subject to opposition from right-holders on grounds of copyright infringement and other business-related concerns. However, recent years have witnessed a growing recognition of the immense benefits of TDM for research and innovation, and calls for incorporating copyright exceptions for its use. Notwithstanding this, in India, there is presently neither legislation nor case-law that substantively engages with TDM and its associated legal concerns. This piece examines the prospective legality of TDM under Indian law and specifically focuses on copyright and contractual liability. It also studies recent international developments on providing copyright exceptions for TDM.

Understanding Text and Data Mining

‘Website scraping’ refers to the process of using technological tools to extract specific or relevant information from websites and convert this information into a usable form. This may be done either manually or autonomously, i.e., using pre-written programs. Website scraping is used by a variety of stakeholders including scientists, researchers, journalists and companies (such as news aggregators and AI software developers) to access online data. It may be contrasted with ‘website crawling’ or ‘website indexing’, which is predominantly conducted by search engines and involves systematically visiting websites using technological tools and indexing, among other information, their Uniform Resource Locators (URLs). That said, website crawlers commonly use scraping technology to supplement their efforts to extract hyperlinked webpages.
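To make the distinction concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not a description of how any particular scraper or search engine works: the page content, the address `https://example.org/news/` and the names `PageParser` and `parse_page` are all hypothetical. The headline text it pulls out stands in for scraping (extracting specific content), while the list of resolved links stands in for crawling (recording the URLs to index and visit next).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page address and content, embedded so the sketch runs offline.
PAGE_URL = "https://example.org/news/"
PAGE_HTML = """
<html><body>
  <h1>Court rules on database rights</h1>
  <p>Full story text.</p>
  <a href="/archive/2020">Archive</a>
  <a href="story-2.html">Next story</a>
</body></html>
"""


class PageParser(HTMLParser):
    """Collects headline text (scraping) and outgoing links (crawling)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.headlines = []  # content a scraper would extract
        self.links = []      # URLs a crawler would index and visit next
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        # Keep only text that appears inside an <h1> element.
        if self._in_h1 and data.strip():
            self.headlines.append(data.strip())


def parse_page(html, base_url):
    """Return (scraped headlines, discovered links) for one page."""
    parser = PageParser(base_url)
    parser.feed(html)
    return parser.headlines, parser.links


headlines, links = parse_page(PAGE_HTML, PAGE_URL)
print(headlines)
print(links)
```

A real crawler would go on to fetch each discovered link in turn, which is where the server-load and access concerns discussed later in this post arise.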

Website scraping and crawling are closely related to ‘website archiving’, which is concerned with the preservation of both the content and the ‘look and feel’ of webpages. Unlike website scrapers, which focus more on text, website archivers often collect additional information such as images, videos and the underlying code of archived webpages, in order to preserve them as is. Website archiving is commonly used by libraries and open access organisations, such as the Internet Archive, which was established in 1996 to provide “universal access to all knowledge”. These organisations claim copyright exceptions broadly on grounds of non-commercial use for research and educational purposes.

These concepts may collectively be referred to under the umbrella term ‘text and data mining’ (‘TDM’). In recent years, some forms of TDM have been met with growing opposition, due to their ‘non-reciprocal nature’ and the novel business and intellectual property challenges they pose. TDM may, for instance, be used by businesses to obtain a real-time overview of a competitor’s publicly disclosed decisions, updates and product information. It may, in some cases, also be used to access poorly protected gated content on such websites and thus, reveal confidential information. Further, excessive rates of TDM may overload and inhibit a website’s server performance and in turn, potentially lead to a loss of reputation and revenue, and increased website infrastructure costs. TDM is also commonly challenged as violative of copyright, due to its potential to restrict the economic rights of authors.

More recently, TDM has increasingly been used to create the data-sets that train artificial intelligence and machine learning (‘AI/ML’) algorithms. This, too, creates novel copyright-related issues, given the scale of data mined and the potentially derivative nature of works produced. Such considerations of copyright ownership and infringement by AI have been discussed extensively on SpicyIP: here, here and here.

The Question of Copyright Infringement by TDM in India

India is party to the Berne Convention, 1886 (‘the Convention’), which prescribes certain minimum protections for the works and rights of authors from contracting states. Notably, it vests authors of ‘literary and artistic works’ with several exclusive rights of authorisation — a principle mirrored in Section 14 of the Copyright Act, 1957 (‘the Act’). The Convention also empowers contracting states to make exceptions for reproduction in certain cases — which is reflected in Section 52 of the Act.

In order to determine whether TDM infringes copyright in India, three central questions must be examined: first, whether website contents qualify as a ‘literary work’ for protection under the Act; second, whether TDM restricts an exclusive right of authors; and third, whether any such infringement is saved by an exception under the Act.

The Act provides copyright protection to original literary and artistic works, including compilations and computer databases. However, it is unclear whether this protection extends to creative arrangements of unoriginal facts — as may often be the case with website-hosted content. In contrast, under the Convention, ‘literary and artistic works’ includes not only original works featured on websites, but also works which “by reason of the selection and arrangement of their contents, constitute intellectual creations”. Accordingly, any website arrangement which discloses sufficient creativity and originality in the “selection and arrangement” of its contents will qualify as a ‘literary work’ and enjoy copyright protection.

The implications of “selection and arrangement” may be understood by reference to the United States (‘US’) Supreme Court case of Feist Publications Inc. v. Rural Telephone Service Co. Here, the Court applied the ‘Modicum of Creativity’ doctrine and held that an arrangement is sufficiently original if it involves independent choices by the compiler as to selection and arrangement, and entails a sufficient degree of creativity. Conversely, mere facts and factual arrangements based on ordinary or objective criteria, such as a directory listed in alphabetical order, will not qualify as a ‘literary work’ capable of protection. In Metcalf v. Bochco, the US Court of Appeals for the Ninth Circuit observed that a creative arrangement of individually unprotectable elements may qualify as a protectable element in itself.

This ‘Modicum of Creativity’ test was affirmed by the Supreme Court of India in Eastern Book Company v. D.B. Modak. While acknowledging that mere factual compilations do not attract copyright protection, the court held that a compilation of unoriginal material is itself original, and thus protected, if “by virtue of selection, co-ordination or arrangement of pre-existing data contained in the work” it is “somewhat different in character” from the existing work. The court explicitly rejected the ‘sweat of the brow’ doctrine, which protects a derivative work as long as the author has spent time, effort and skill creating it. It observed that the requisite standard was not “creativity in the sense that it is novel or non-obvious” but that the work must merely have “some distinguishable features and flavour”. As a consequence of this decision, non-original databases remain unprotected in India and owners of such databases have little incentive to either disclose or regularly update them.

Thus, both original content and the original selection and arrangement of unprotected content on websites will constitute ‘literary works’ and enjoy copyright protection in India.

Next, Section 51 of the Act provides that the copyright in a work is deemed to be infringed when a person, without a licence, does anything that falls within an exclusive right of the author. These exclusive rights are enumerated in Section 14 of the Act: authors of literary and artistic works enjoy, among others, the rights to reproduce the work in any material form, to communicate it to the public and to make any adaptation of the work. Some forms of TDM may infringe these exclusive rights. For instance, website archiving by for-profit organisations would restrict the author’s exclusive right to reproduce the work in a material form as well as to communicate it to the public. Similarly, the use of TDM by news aggregators to adapt copyrighted news articles may inhibit the author’s exclusive right to make any adaptation of the work. At the same time, however, this could be considered lawful where only some text is accessed, and not the expression of the work itself. This determination is fact-specific, both in terms of whether and which exclusive rights are restricted, and in terms of whether the use falls within the exceptions under Section 52 of the Act.

Numerous cases have underlined that the ‘fair dealing’ exception in India is limited to the grounds enumerated under Section 52. The use of TDM is most likely to be protected under Section 52(1)(a) of the Act, which permits the use of copyrighted works for “private or personal use, including research”, “criticism or review” or “the reporting of current events and current affairs”. Beyond this, however, the section only accommodates TDM in narrow circumstances — such as to reproduce a current economic, political or social topic, unless the author has reserved the right to reproduce (Section 52(1)(m)); or to incidentally store a work for the purpose of providing electronic links or access, provided this is not prohibited by the right-holder (Section 52(1)(c)).

As explained here, ‘private or personal use’ refers to use of the works, and not the making available of the works. Accordingly, public libraries and organisations — like the JNU Data Depot — may make works publicly available for ‘private or personal use’ by researchers. Section 52(1)(a)(i) does not specify whether ‘private or personal use’ extends to commercial purposes — and since the provision was introduced in 2012, there has been little judicial engagement with it. That said, in 2013, the Calcutta High Court considered the question of infringement by a website which provided access to copyrighted works in exchange for “a fee or revenue from its sponsors or from third parties”. The Court held that this amounts to ‘commercial exploitation’ and falls beyond the ambit of ‘private or personal use’. This position was reaffirmed by the Bombay High Court in 2018, where it held that the use of sound recordings for commercial benefit is not permitted under Section 52(1)(a)(i).

Thus, the use of TDM is lawful as long as it falls within the boundaries of Section 52. As clarified by the Delhi High Court in the Delhi University Photocopying Case, the extent or quality of data accessed is irrelevant as long as this is justified by the specific purpose of TDM use and does not unreasonably prejudice the author’s rights. In all other cases, however, it is likely that TDM use will infringe copyright law in India. While literary works by Indian nationals are directly protectable under the Act, the International Copyright Order, 1999 extends the applicability of the Act to foreign works by nationals of parties to the Convention. The Act provides civil remedies including damages, permanent injunction, recovery of possession, enhanced penalty for repeat offences and seizure. It also classifies the infringement of copyright as a cognizable offence punishable by fine and imprisonment.

Please click here to view Part II of this two-part post.

1 thought on “Crawl Cautiously: Examining the Legal Landscape for Text and Data Mining in India – Part I”

  1. 2 clarifications.
    1) Did India adopt the American “modicum of creativity” test or the Canadian “skill and judgment” test? Because there is a clear difference in the sense of requiring a creative spark. Hence, in Canada the protection is a tad more liberal as against the US, in the sense that as long as there exists an intellectual choice involved (with or without a creative spark), it fulfills the skill and judgment standard.
    2) Do copyrights actually incentivise creation? For example, has the Sui Generis database right in the EU actually helped incentivise creations? Factually it doesn’t seem so. You should check out Zimmerman’s and Prof. Indranath Gupta’s work. Copyright as incentives is a very superficial utilitarian belief fostered by proponents of over protectionism. Its factual correctness has been questioned at various levels. Also, again relying on WIPO might not be extremely helpful, given its polarised structure towards the American utilitarian structure of Copyright. So re-think these maybe?
