Tuesday, June 24, 2025

New Study Suggests OpenAI Models Memorized Copyrighted Content, Rekindling Legal Debate

A recent academic study has added fuel to the ongoing legal fire surrounding OpenAI, offering fresh evidence that some of its AI models may have been trained on copyrighted material without permission.

Researchers from the University of Washington, Stanford, and the University of Copenhagen developed a method to detect whether AI models like GPT-4 and GPT-3.5 have memorized specific pieces of text. This matters because OpenAI is currently facing lawsuits from authors, programmers, and rights-holders who accuse the company of illegally using their protected works—ranging from books to source code—as training data.

The core of the study hinges on the concept of “high-surprisal” words: unusual words that are statistically unlikely to appear in a given context. For example, “radar” in the sentence “Jack and I sat perfectly still with the radar humming” qualifies as high-surprisal. The researchers removed these words from text excerpts and asked OpenAI's models to fill in the blanks. A model that guesses correctly is likely to have seen the passage during training, since the missing word is hard to infer from context alone.
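To make the mechanics concrete, here is a minimal sketch of how such a probe might work in Python. This is an illustration, not the researchers' code: it assumes a local GPT-2 as the reference model for scoring surprisal, and the [MASK] convention and prompt wording are invented for the demo; the study's actual scoring model, word-selection thresholds, and querying protocol may differ.

```python
# Illustrative sketch of a high-surprisal fill-in probe (not the study's code).
# Assumes: pip install torch transformers openai, and OPENAI_API_KEY set.
import math

import torch
from openai import OpenAI
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumption: GPT-2 stands in for whatever reference model scores surprisal.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
scorer = GPT2LMHeadModel.from_pretrained("gpt2")
scorer.eval()


def token_surprisals(text: str) -> list[tuple[str, float]]:
    """Score each token's surprisal in bits, given its left context."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(scorer(ids).logits, dim=-1)
    scores = []
    for pos in range(1, ids.size(1)):  # token 0 has no left context
        token_id = ids[0, pos].item()
        logp = log_probs[0, pos - 1, token_id].item()
        scores.append((tokenizer.decode(token_id), -logp / math.log(2)))
    return scores


def probe_fill_in(masked_passage: str, model: str = "gpt-4") -> str:
    """Ask the target model to guess the single masked word."""
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": "One word in this passage was replaced with [MASK]. "
                       "Reply with only your best guess for that word.\n\n"
                       + masked_passage,
        }],
    )
    return resp.choices[0].message.content.strip()


passage = "Jack and I sat perfectly still with the radar humming."

# Rank tokens by surprisal; a real probe would mask words above a threshold.
for token, bits in sorted(token_surprisals(passage), key=lambda p: -p[1])[:3]:
    print(f"{token!r}: {bits:.1f} bits")

# Mask the article's example word and see whether the model recovers it.
masked = passage.replace("radar", "[MASK]")
guess = probe_fill_in(masked)
print(f"model guessed {guess!r}; "
      f"memorization signal: {guess.strip('.').lower() == 'radar'}")
```

A single correct guess proves little on its own; an audit of this kind would aggregate fill-in accuracy over many excerpts and compare it against baseline texts the model is unlikely to have seen.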

The study's findings? Both GPT-4 and GPT-3.5 showed signs of having memorized portions of copyrighted fiction and New York Times articles, particularly content from BookMIA, a dataset containing samples of copyrighted ebooks. Memorization of news content was less prevalent, but still present.

Abhilasha Ravichander, a University of Washington doctoral student and study co-author, emphasized the implications:

“In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically... there is a real need for greater data transparency in the whole ecosystem.”

OpenAI, meanwhile, maintains its defense of fair use and has advocated for looser copyright restrictions on AI model training. The company has licensing deals in place and offers opt-out mechanisms for rights-holders, but critics argue these measures are insufficient given the mass ingestion of potentially protected works.

As courts weigh the legality of training AI on copyrighted content, this study offers concrete tools for auditing opaque models, and its findings could help tip the legal balance.
