In 2022, the world first met OpenAI’s ChatGPT, and people started to hear about large language models (LLMs) more regularly. There have been several leaps forward in AI since the 1950s; one of particular importance was the 2017 Google paper ‘Attention Is All You Need’. Its insight was that textual understanding improves when a model considers wider context while focusing attention on the most relevant elements; it meant AIs could be developed with a broader understanding drawn from a greater range of input learning. From this, generative AI took off. Google pioneered the transformer architecture, which in turn enabled ChatGPT – the ‘GPT’ stands for Generative Pre-trained Transformer. LLMs work from parameters: internal numerical values learned during training, which enable the model to understand and generate content.
Transformer models are trained on colossal amounts of input data. The first GPT model was trained on a corpus of around 10,000 books and had 117 million parameters. By contrast, GPT-3 has 175 billion parameters, and these models have not stopped growing. LLMs are now being used to generate literature of all kinds, as well as images. Whilst this is a rapid improvement, future gains depend on ever larger data sets, and this is why there is such interest in copyright and in what LLM developers are scraping from the work of others.
What are the issues with AI training?
The UK and other countries are struggling to determine how copyright should operate in the age of AI. They know large data sets are needed for training, but is it right to trample on the rights of others without rewarding them? The UK Government says it wants a plan to deliver a copyright and AI framework that rewards human creativity, incentivises innovation, and provides the legal certainty required for long-term growth in both sectors – fine words, but so far there is no proposed law or policy to achieve this.
Around the world there is concern that the work of creative people is being used without permission by those training LLMs. In the UK, we await a decision of the English High Court in Getty Images v Stability AI, which may guide us on how the current law applies to AI training by US-based developers.
In the US, the first big news case is now with us – Bartz v Anthropic – in the form of a settlement. Bartz and other authors brought a copyright infringement case against Anthropic, the company behind the AI service Claude. The case concerned Anthropic’s unauthorised use of both pirated and purchased copies of books to train the LLMs that underpin Claude. Recently, Anthropic agreed to a settlement of around US$1.5 billion. With approximately 500,000 works in dispute, that amounts to roughly US$3,000 per work, signalling both the need to compensate creatives and the cost of doing so.
Other rights holders, large and small, are concerned about their work, and we have now seen litigation begun over the unauthorised use of a range of content in the training of AI models. For example:
- Disney and Universal have filed a case in California, claiming that Midjourney is exploiting their creative investment by selling an AI image-generating service “that functions as a virtual vending machine, generating endless unauthorized copies of Disney’s and Universal’s copyrighted works”;
- Warner Bros. Discovery (WBD) has also sued Midjourney for using illegal copies of WBD’s copyrighted works, including Superman, Batman, Wonder Woman, Flash, Tweety, Bugs Bunny, and Scooby-Doo;
- Another lawsuit has been filed against MiniMax by WBD, Disney, and NBCUniversal.
Movie industry workers have also spoken out about the use of AI, which was part of the reason for the Hollywood strikes of 2023–25. There has been real concern over the use of actors’ images and voices by model developers, and over the perceived threat that studios’ adoption and integration of AI poses to professional livelihoods.
Balancing AI innovation with copyright protection
Rewarding human creativity is very important, and it must be protected, or we may cease to enjoy new human-created works. There is, though, huge and growing pressure to allow LLMs to ignore copyright protections and train on the work of others. Those with an interest in training their models on the creative work of others have misled people into thinking the law lacks clarity. It does not: it is very clear that you cannot use other people’s copyright works without consent. Yet AI firms persist with this argument because it supports their desire for unrestricted training, and they claim copyright holds back AI adoption. That is not true; the law in the UK is clear, and the issues raised are economic, not issues of legal clarity. The use of copyright works under licence is perfectly possible, with remuneration to those who own them and, ultimately, to those who created them.
An alternative way to encourage AI training that respects copyright could be a collecting society model. This enables creatives to register their work for use via a society, in doing so creating a valuable, accessible database of works for AI to train upon. Such a society gives the AI developer a single point of contact, making licensing easier, while securing a licence fee and fair remuneration for copyright holders. Existing collecting societies for music and written works have operated successfully on the basis of modest remuneration to copyright holders. This offers an easier and less risky route for those training models, and a fair balance for creatives. Perhaps this will be the model for legal and equitable training in future.
If you have questions or concerns about AI, please contact James Tumbridge and Robert Peake.
This article is for general information purposes only and does not constitute legal or professional advice. It should not be used as a substitute for legal advice relating to your particular circumstances. Please note that the law may have changed since the date of this article.