Continuing from his earlier post, where he explained the technical workings of Large Language Models vis-a-vis where different copyright questions arise, in this post Shivam Kaushik argues that LLMs are in effect interacting with non-expressive parts of the works in question. Further, he questions whether even a Text Data Mining Exception is required in the Indian Copyright Act. Part 1 of this two-part series can be accessed here. Shivam is an LLM candidate at NUS Law specializing in IP and Tech Laws and a Research Assistant at the Lumens Machine Learning Project. He is interested in exploring the legal issues posed by emerging technologies. His previous posts can be accessed here.
Part II- Applying Natural Intelligence (NI) to Artificial Intelligence (AI): Understanding ‘Why’ Training ChatGPT Transcends the Contours of Copyright
Shivam Kaushik
When I say copyright, it means just what I choose it to mean- nothing more nor less.
In a nutshell, during training, the LLM decomposes, abstracts, and constructs not text, but representations of relationships common to the tokens it generated from the earlier text! Now, one obvious question that arises from the copyright infringement perspective is this: once the text is converted into tokens, each token is given a token ID, and the IDs are abstracted into numeric representations (vectors and word embeddings), is any ‘expression’, which copyright ostensibly seeks to protect, left in the work? There seems to be little doubt that ChatGPT has ‘used’ the copyrighted text. But is all use of the text protected by copyright? For instance, any text embodies the following:

It is well established that copyright over a work does not give exclusive rights over the idea imbued in the text. Similarly, the meaning of the words used in the text (semantics) and the grammatical arrangement of words (syntax) fall beyond the ambit of copyright protection. The only thing copyright protects is the ‘expression’ (I don’t think a source is needed for this). Now, when the LLM devours the text’s semantics, syntax, conceptual relationships and other underlying features, doesn’t it seem too far-fetched for any author to argue that she has ‘exclusive rights’ over them? Aren’t these elements, ideally speaking, “non-expressive” parts of the work?
A language model merely compresses and cross-references linguistic information to identify predictable patterns and reduce redundancies by representing meaning probabilistically. During the pre-training, the ‘identity’ and ‘wholesomeness’ of copyrighted works are lost, and they are stripped of everything but their raw linguistic essence, functionality and utility. The compression only captures the relational meaning in mathematical form. The aspects of the copyrighted work ‘used’ by the model during pre-training are the mathematical representation of word relationships. Thus, pre-training ‘transcends’ the limit of copyright as it abstracts text into multi-dimensional numeric representations and patterns. Copyright can only protect the original expression, not the statistical relationships between words irrespective of the source being copyright-protected. In an article published way back in 2019 in the Journal of the Copyright Society, Prof. Matthew Sag, while giving the examples of non-expressive use, cited using software to identify patterns of speech, relationships, or frequency of particular words as possible instances of non-expressive use (pp.301 & 302).
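To make the abstraction described above a little more concrete, here is a deliberately tiny Python sketch. It is only an illustration under loud assumptions, not how ChatGPT or any production model works: real LLMs use subword tokenizers and learned dense embeddings, whereas this toy splits on whitespace and simply counts which token IDs appear next to each other. The function names (tokenize, build_vocab, cooccurrence_counts) and the sample sentence are hypothetical and chosen purely for the example.

```python
from collections import Counter


def tokenize(text):
    """Toy tokenizer: lowercase and split on whitespace (an assumption;
    real LLMs use subword tokenizers)."""
    return text.lower().split()


def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def cooccurrence_counts(ids, window=1):
    """Count how often one token ID is followed by another within `window` positions."""
    counts = Counter()
    for i, left in enumerate(ids):
        for right in ids[i + 1 : i + 1 + window]:
            counts[(left, right)] += 1
    return counts


text = "the cat sat on the mat"   # hypothetical sample sentence
tokens = tokenize(text)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
counts = cooccurrence_counts(ids)

print(ids)     # [0, 1, 2, 3, 0, 4]
print(counts)  # pairs of token IDs with the number of times they are adjacent
```

What the later stages of this toy pipeline work with is the list of integers and the table of pair counts, not the sentence as written. A real model goes much further and learns vectors from such statistical regularities across an enormous corpus, but the sketch shows the kind of numeric form the text takes once it is tokenized.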
The 2nd Circuit Court of the US comes to a similar-ish finding, though from a different perspective, in Authors Guild, Inc. v. HathiTrust (2014). The Court holds that the copyright owner cannot assert his copyright against a text-searchable database holding the copyrighted work without authorisation, as “the result of word search is different in purpose, character, expression, meaning and message from the page (and the book) from which it is drawn.” Pertinently, the 2nd Circuit gave a finding on fair use in the case, holding the use to be transformative, without examining whether there was copyright infringement in the first place. Calling an act fair use when its purpose is changed is one thing, but my point is that when the ‘expression’ and ‘meaning’ of the copyrighted work are changed, such use ‘transcends’ the ambit of the right called copyright.
Certain jurisdictions, such as Singapore, have a Computational Data Analysis (CDA) exception (s.243) allowing identifying, extracting, and analysing information to improve the functioning of a computer program. For this particular ‘use’, the statute even allows making a copy of the work in question (s.244). India has no Text and Data Mining (TDM) or CDA exception. However, the use of non-expressive elements of copyrighted work is within the concept of copyright, and there is no per se need for a statutory exception. Prof. Tim Dornis echoes the view that TDM or CDA does not require an exception in his recent paper. He further adds that the issue of infringement crops up only because non-protectable non-expressive information is embedded in a copyright ‘container’ or ‘shell’ (p.7). He also says that the underlying aim of the exceptions is to legalise copies and reproductions that precede TDM. However, Prof. Dornis has a more fundamental beef with the proposition being canvassed here (p.11). He argues that since AI does not differentiate between semantic (non-expressive) and syntactic (apparently expressive) information during training, it infringes copyright. However, “somewhat surprisingly” (to quote Mikolov et al.), in his entire paper, he does not explain the basis for calling syntax a subject matter of copyright. It is inconceivable that anyone could monopolize the grammatical structure and arrangement of words in a language.
Cautions and Disclaimer
There will be no conclusion to this post. The final word on LLMs is yet to be spoken. However, it is important to put out a few words of caution. The word AI is an attention grabber (that is why it was used in the title of the post) but has little substance to offer (that is why it has not been used in the body of the post). Instead of making broad-stroke arguments about AI, it would be more instructive for legal academicians to deal with specifics. For example, what I say here is very specific to text and language and would certainly not be applicable to Midjourney, Stable Diffusion and DALL-E. Judges, lawyers and policy makers will have to appreciate the nitty-gritty and not generalize. Justice cannot be based on assumptions. Generalizations create prejudice, not fairness. The discussion on LLMs and copyright cannot and should not be resolved with a superficial understanding of AI. Saying it is a black box is just not enough.
Also, a point worth reiterating is that the discussion has been limited to the pre-training stage. It does not delve into fine-tuning, RLHF, storage and most importantly, the output stage. Subsequent posts ‘might’ follow dealing with these topics building upon the ideas discussed here.

Great post but an observation on the Linguistic Functionality image. Semantics coming before syntax doesn’t make sense from a linguistic perspective, and I think it should be the reverse. The idea has to be established in some form (syntax) before it makes sense (semantics).
Thanks Anon for this wonderful insight. My idea of putting Semantic before Syntax was on account of taking tokens as the basic unit instead of sentences because that is how LLMs work. They look at the ‘meaning’ (if I can say it that way) of a token in the vector space first and on the basis of that meaning come up with something coherent so they apply syntactic rules.
However, I agree that the human mind doesn’t work in that manner. In the human mind it would go something like this: you have an idea > then you articulate it in words (meaning of words is a subject of semantics) > then you arrange them as per the grammatical rules (that is syntax) > you look at the meaning of the phrase/sentence (meaning of phrases is again a subject of semantics) > then you arrive at the expression of the sentence. So semantics can sit after syntax as well, as you suggested, when we talk about human reasoning.
But then I will also say that I don’t think that there is any chronological sequence to human reasoning, because in our mind both syntax and semantics are being looked at simultaneously. The meaning of the words affects the sentence structure that we choose, and sentence structure can also affect the meaning of the words or even of the entire sentence itself.
Also, the framing that you have suggested would not fit in the scale/arrow of expressiveness that I have plotted. The arrangement of words (i.e., a phrase/sentence) is, comparatively speaking, more expressive than the meaning of the said sentence. I must hasten to add, though, that it is still not expressive enough to be copyrightable.
Thanks for the reply Shivam! The justification in your last para makes sense; perhaps from a copyright perspective the ordering makes sense.
Also, kudos for the research behind these posts. I have been monitoring this space for a long time now as an IP lawyer, but the arguments you have made are truly original and it was a pleasure reading them. I agree – it doesn’t seem like pre-training LLMs constitutes copyright infringement. However, as you mentioned in the first post, ANI is also claiming the output to be infringement.
Would the same arguments apply to the second claim as well? My thinking is no. The process to reach a similar work may be different, but if the output is substantially similar, it could still be copyright infringement. Would love to hear your thoughts on this though.
Thank you for your kind words! While writing and publishing may come across like a product made for mass consumption, i.e., for an audience, when authors write, they usually have ‘a’ reader in mind. An engaged reader with whom an author is in constant conversation for refining the text and better articulating his/her views (I read about this in an essay called ‘Text as a site for interaction’ by Michael Hoey). You are the physical embodiment of that reader. I think I have a much better understanding of copyright, semantics and syntax after reflecting on your question. It is really refreshing to hear back and get into a discussion with a reader who actively thinks about and pushes back on what he/she reads. Kudos to you!
Coming to output now (sigh!). I see that you say ‘if’ the output is substantially similar. What if it is similar 1 time, and 99 times it is not? What if there is no way of ensuring that, after training, it produces only non-infringing text with 100% certainty? Will you injunct all LLMs? A model is in the shoes of the infringer (a human), and copyright has no mechanism/mandate to say to a human that since you sometimes produce infringing content, you are hereby injuncted from producing any content. Infringement is a very case-specific finding on something that has already happened. Even injunctions are in respect of identifiable works. A court can decide that a model has infringed in a specific case, but how will it decide that it ‘will’ infringe another work in the future? How will a court decide things like quantum of damages when the entire output of LLMs is based on probabilities?
I have some thoughts on output (the tech side of it as well) and hopefully, if I get time in the near future, I will pen another post on it. What I said above in this comment are my very initial musings and I may feel tomorrow that I was wrong here. But the debate on AI and IP needs much more thought and depth of understanding than there currently is. I don’t think it is sufficient to say that everyone should get a license from everyone without devising a robust legal mechanism to ‘judicially examine’ a party’s claim, just because I ‘feel’ it is infringing. Heuristics won’t cut it.
But even before I write anything, I would love to hear your thoughts in detail if you have time and are available. You can reach out to me on social media apps like LinkedIn.
Appreciate the kind words Shivam 🙂
I have two points I wanted to make. First, on the question of injunction, there is little evidence of any ‘irreparable harm’ to ANI in this case, as the two companies do not compete with each other. If the Court does grant an interim here I would be shocked.
As for your point that injuncting an LLM only because it ‘may’ put out infringing content seems unfair, I agree completely. It doesn’t make sense and it doesn’t seem like that is what the Courts are inclined to do. But what if we look at the larger picture? What if the LLM sometimes outputs infringing content related to one copyright holder, sometimes for another copyright holder, a few times for a third one, and so on?
For the Courts, jurisprudence-wise, this would be completely path-breaking. What I am asking them to do is to consider on merit the case for firms which are not party to the suit but still affected by it. Will any Court do so? I have my doubts, but if it is just ANI or any other right holder being heard, OpenAI should win every time.
Brilliant analysis Shivam! It was truly refreshing to get such well-articulated, researched and nuanced techno-legal insights on this issue.
It would be further interesting to explore the possibilities of evaluating whether or not the syntactic – semantic dichotomy should be approached from the idea – expression lens. Would ANI be able to make a robust-enough case that the ‘syntax’ which OpenAI has derived after converting semantics gathered from internet into tokens, by its very nature, when processed and exposed to queries, through the probabilistic model you’ve explained, is prone/programmed to deliver outputs likely to be ‘substantially similar’ in terms of copyright law jurisprudence?
Per contra, would OpenAI be able to advocate a case for application of the ‘merger doctrine’, stating that the cases of substantial similarity are more likely to be cases where the idea and expression (in this case semantic and syntactic contents) are so intertwined and entangled that the ways of expressing such ideas would be likely to be quite limited? In such cases, creation of content (albeit artificially), despite there being a degree of similarity, may not be stifled.
There might be instances to delve into these contours on either side, qualitatively and quantitatively.
Thank you for your reply Anon. I think it would be difficult to apply idea-expression dichotomy and merger doctrine when we talk about AI. We will have to refine, evolve, or expand (expand as it usually happens) these concepts. A couple of days back I was talking to someone and I heard about something called “detailed scenes de affair” which I did not know about.
I don’t think OpenAI will have the audacity to say that only ‘idea’ or something similar is taken. To be honest, these models cannot understand what an idea is. Everyone knows that something more than an idea is taken. But is that something more covered by copyright? That is the question. Also, I don’t think syntax constitutes expression. Language is not copyrightable. Copyrighted works are like a container. They have a lot of content, in addition to ideas, which is not copyrightable. It is a bit circular, because now we have come back to exactly what I say in the second part of the post.
Thank you Shivam. Apparently, yesterday, OpenAI (In the ANI – Open AI case (Delhi HC)) has, indeed, submitted (as per a third person account of the hearing I found online) that LLM converts data/text into non-expressive syntactical content. It was stated that works of ANI though used, were not used as expression (or for “expressive attributes”) but for statistical relevance, thereby not amounting to copyright infringement. Eager to see such submission being evaluated and countered in times to come. Let’s see how it goes. Interesting!
Heartiest Congrats Shivam! What a wonderfully lucid, articulate article with brilliant analysis. Thanks especially for decoding for us novices what the pre-trained LLM model of ChatGPT is. It was my sense too that, given the amount of transformation in the final output of ChatGPT, it would be difficult to assert copyright infringement.
Would also like to know your views on artistic expressions in the context of the recent controversy of ChatGPT 4.0 producing ‘Ghibli’-like art. As I am given to understand, the output is not really Ghibli, but for the not so discerning viewer, arguments of ‘substantial similarity’ can be raised. Given that, is there copyright infringement?
Thanks