The New York Times’ (NYT) legal proceedings against OpenAI and Microsoft have opened a new frontier in the ongoing legal challenges brought on by the use of copyrighted data to “train”, or improve, generative AI.
There are already a variety of lawsuits against AI companies, including one brought by Getty Images against Stability AI, which makes the Stable Diffusion online text-to-image generator. Authors George R.R. Martin and John Grisham have also brought legal cases against ChatGPT owner OpenAI over copyright claims. But the NYT case is not “more of the same” because it throws interesting new arguments into the mix.
The legal action focuses on the value of the training data and a new question relating to reputational damage. It is a potent mix of trade marks and copyright, and one which may test the fair use defences typically relied upon.
It will, no doubt, be watched closely by media organisations looking to challenge the usual “let’s ask for forgiveness, not permission” approach to training data. Training data is used to improve the performance of AI systems and generally consists of real world information, often drawn from the internet.
The lawsuit also presents a novel argument – not advanced by other, similar cases – that’s related to something called “hallucinations”, where AI systems generate false or misleading information but present it as fact. This argument could in fact be one of the most potent in the case.
The NYT case in particular raises three interesting takes on the usual approach. First, that due to its reputation for trustworthy news and information, NYT content has enhanced value and desirability as training data for use in AI.
Second, that due to its paywall, the reproduction of articles on request is commercially damaging. Third, that ChatGPT “hallucinations” are causing reputational damage to the New York Times through, effectively, false attribution.
This is not just another generative AI copyright dispute. The first argument presented by the NYT is that the training data used by OpenAI is protected by copyright, and so it claims the training phase of ChatGPT infringed copyright. We have seen this type of argument run before in other disputes.
The challenge for this type of attack is the fair use shield. In the US, fair use is a doctrine in law that permits the use of copyrighted material under certain circumstances, such as in news reporting, academic work and commentary.
OpenAI’s response so far has been very cautious, but a key tenet in a statement released by the company is that its use of online data does indeed fall under the principle of “fair use”.
Anticipating some of the difficulties that such a fair use defence could potentially cause, the NYT has adopted a slightly different angle. In particular, it seeks to differentiate its data from standard data. The NYT intends to use what it claims to be the accuracy, trustworthiness and prestige of its reporting. It claims that this creates a particularly desirable dataset.
It argues that as a reputable and trusted source, its articles have additional weight and reliability in training generative AI and are part of a data subset that is given additional weighting in that training.
It argues that by largely reproducing articles upon prompting, ChatGPT is able to deny the NYT, which is paywalled, visitors and revenue it would otherwise receive. This introduction of some aspect of commercial competition and commercial advantage seems intended to head off the usual fair use defence common to these claims.
It will be interesting to see whether the assertion of special weighting in the training data has an impact. If it does, it sets a path for other media organisations to challenge the use of their reporting in the training data without permission.
The final element of the NYT’s claim presents a novel angle to the challenge. It suggests that damage is being done to the NYT brand through the material that ChatGPT produces. While almost presented as an afterthought in the complaint, it may yet be the claim that causes OpenAI the most difficulty.
This is the argument related to AI “hallucinations”. The NYT argues that the harm is compounded because ChatGPT presents the false information as having come from the NYT.
The newspaper further suggests that consumers may act based on the summary given by ChatGPT, thinking the information comes from the NYT and is to be trusted. The reputational damage is caused because the newspaper has no control over what ChatGPT produces.
This is an interesting challenge to conclude with. “Hallucination” is a recognised issue with AI-generated responses and the NYT is arguing that the reputational harm may not be easy to rectify.
The NYT claim opens a number of lines of novel attack which move the focus from copyright on to how the copyrighted data is presented to users by ChatGPT and the value of that data to the newspaper. This is much trickier for OpenAI to defend.
This case will be watched closely by other media publishers, especially those behind paywalls, and with particular regard to how it interacts with the usual fair use defence.
If the NYT dataset is recognised as having the “enhanced value” it claims to have, it may pave the way for monetisation of that dataset in training AI rather than the “forgiveness, not permission” approach prevalent today.