Generative AI systems like ChatGPT are trained on people’s data. Naturally, few of those people are amused. Getty Images was one of the first out of the lawsuit gate, claiming that Stability AI was training its systems on millions of unlicensed images for which it usually charges up to $499 per use. Since then, angry authors — both fiction and non- — have sued for copyright infringement, and others have sought to block companies from, as they see it, using their intellectual property to train the systems that will eventually render them obsolete.
But the lawsuit that caught my eye over the holidays was The New York Times’ allegation that OpenAI and Microsoft’s LLMs “can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style,” and that the defendants “seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment.” Even worse for The Times, these chatbot-generated summaries rarely link to the original articles, costing news outlets visitors and, with them, advertising revenue.
Some quick background: In April, The Times approached the companies to explore an “amicable resolution,” which might include technological guardrails, but was unsatisfied with the response. In mid-August, just weeks after OpenAI introduced GPTBot, the crawler that scrapes pages for content to train its models, The Times blocked it and was already considering a case.
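For context, blocking GPTBot happens in a site’s robots.txt file: OpenAI publishes the crawler’s user-agent string, and a publisher that wants to opt out adds a couple of lines like these. (A minimal example of the documented mechanism, not The Times’s actual file.)

```
# Tell OpenAI's crawler not to fetch anything on this site.
User-agent: GPTBot
Disallow: /
```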
So, is what they’re saying true? It’s hard for an outsider to tell, as OpenAI has not disclosed which data sets it used to train its large language models. (Yikes!) Researchers have tried to determine whether certain books were included in the data used to train ChatGPT, as The Atlantic reported. The Times took a stab at dissecting Common Crawl and WebText2 (see page 26 of the complaint posted on its site), contending that The Times is one of the most highly represented proprietary sources. The complaint also provides examples of ChatGPT reproducing Times articles almost verbatim with “minimal prompting.” However, I would like to see what those prompts were, as I have my doubts…
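If you want to poke at the “highly represented” claim yourself, Common Crawl’s public CDX index can be queried per domain. Here’s a rough Python sketch; the crawl name (CC-MAIN-2023-50), the query parameters, and the use of index-record counts as a crude proxy for representation are my assumptions, not the methodology in the complaint.

```python
# Rough sketch: gauge how often a domain appears in one Common Crawl
# index via the public CDX API at index.commoncrawl.org.
# Assumptions (mine, not from the lawsuit): the CC-MAIN-2023-50 crawl and
# the standard pywb parameters (url, matchType, output, page, showNumPages).
import json
import urllib.request

INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"
DOMAIN = "nytimes.com"

def fetch(params):
    url = INDEX + "?" + "&".join(f"{k}={v}" for k, v in params.items())
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

# How many pages of index results does this crawl hold for the domain?
meta = json.loads(fetch({"url": DOMAIN, "matchType": "domain",
                         "output": "json", "showNumPages": "true"}))
print("index pages for", DOMAIN, "=", meta.get("pages"))

# Count the records on the first page as a (very rough) sample.
first_page = fetch({"url": DOMAIN, "matchType": "domain",
                    "output": "json", "page": "0"})
records = [json.loads(line) for line in first_page.splitlines() if line]
print("records on page 0:", len(records))
```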
What are they hoping to achieve? Basically, they want the LLMs and the training data destroyed. But those models cost a lot of money and are being used in so many places. Could OpenAI survive it? Probably. Would it be suuuuuuuuper annoying? Absolutely. Looked at another way, plenty of publishers are already licensing their data to OpenAI, and not for much money. Could this be a negotiating position to get a better deal? Maybe. At least The Times is learning from the past by asking right away.
Should media companies consider a paywall for crawlers like GPTBot? The tech companies could pay out of the billions in funding they’ve been getting. This idea highlights the delicious contradiction of companies like OpenAI proclaiming that this technology will be a huge economic boon for all of us, yet refusing to pay a decent amount for the data to train it.
How big a deal is this? If The Times succeeds, it calls a bunch of things into question: What can these models train on? Isn’t it better for them to train on the highest-quality data rather than on unverified sources? And if the suit fails, what does that spell for the countless newspapers and magazines that have already been hurt by the sea of free content? What would it mean for the future of nonprofits like Common Crawl, which makes its data accessible to researchers?
Are there any solutions other than destroying the models and starting again? Sorta. People are working on machine unlearning and on ways to delete data without retraining from scratch. Decremental learning updates a model to unlearn specific data points, while data sharding and model reweighting divide the knowledge in a model into shards, so when you need to unlearn something, you only have to retrain (or drop) the affected shard rather than the whole model. But these are all still research questions.
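To make the sharding idea concrete, here’s a toy sketch of the approach (often called SISA): train one small model per data shard, ensemble their votes, and when a deletion request arrives, retrain only the shard that held the offending example. Everything here, from the shard count to the scikit-learn classifier, is an illustrative assumption; real LLM unlearning is nowhere near this tidy.

```python
# Toy SISA-style sharded training and unlearning (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))               # toy features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # toy labels

NUM_SHARDS = 4
shard_ids = np.arange(len(X)) % NUM_SHARDS  # assign each example to a shard

def train_shard(s):
    """Fit one constituent model on the examples assigned to shard s."""
    mask = shard_ids == s
    return LogisticRegression().fit(X[mask], y[mask])

models = [train_shard(s) for s in range(NUM_SHARDS)]

def predict(x):
    """Ensemble by majority vote across the per-shard models."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return int(np.round(np.mean(votes)))

def unlearn(example_index):
    """'Forget' one training example by retraining only its shard,
    leaving the other shards (and their training cost) untouched."""
    s = shard_ids[example_index]
    keep = (shard_ids == s) & (np.arange(len(X)) != example_index)
    models[s] = LogisticRegression().fit(X[keep], y[keep])

unlearn(42)            # retrain only the shard that contained example 42
print(predict(X[0]))
```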
When it comes to something like GPT-4 right now, it seems like the Chat is out of the bag. [SORRY!] At this point, it might not be a bad idea to set up a betting pool and a lawsuit tracker.
But honestly: Why should you care? Think of it from both sides: It’s important that these systems are trained on data that is accurate, balanced and inclusive. There’s enough disinformation and bias in the world as it is. (Speaking of bias, keep in mind that The Times covers OpenAI and Microsoft almost daily. It was almost shocking to open its Morning newsletter yesterday to read a Q&A with its lead AI reporters, who basically said that ChatGPT really isn’t that great yet.) We also need to ensure that intellectual property is respected, and that those (humans) who create it are compensated so that they can continue to do their work even after AI is integrated into our society. (Remember when it was called journalism — back before it became “content,” and now just “data”?) Because accuracy and trust will always be an issue…
Two things I know for sure: We need some real regulation here, ASAP. With heavyweights such as The Times and Getty Images leading the well-funded charge, this could help make it happen a little sooner.
Also: 2024 is going to be a wild year! I’m looking forward to covering it for you, as well as bringing you Season Two of the Technically Optimistic podcast starting next month.
Write and let me know which AI topics are on your mind (this minute): us@technicallyoptimistic.com.
Worth the Read
In his year-end report on the federal judiciary, Supreme Court Chief Justice Roberts weighs in on AI in the courtroom. He says that while AI will no doubt affect judicial work, it will not replace it, explaining that “Nuance matters: Much can turn on a shaking hand, a quivering voice, a change of inflection, a bead of sweat, a moment’s hesitation, a fleeting break in eye contact.”
In this Foreign Affairs op-ed, Jen Easterly, Director of the Cybersecurity and Infrastructure Security Agency at the U.S. Department of Homeland Security, and her co-authors lay out the threat that AI poses to democracy in the upcoming year, and offer advice on how to safeguard U.S. elections.
The New York Times reports that abusive partners are using cars’ apps for remote control and tracking in some pretty creepy ways — and the car companies are doing nothing to stop it. Just one more example of smart devices collecting data in ways that we don’t know about — and can’t control.
Some good news: AI is being employed to combat wildfires. Among the fascinating “firetech” startups is Pano, which uses mountaintop cameras to spot fires in real time, filtering the images through AI software trained to detect the difference between smoke and non-smoke images.
A thoughtful point of view from an illustrator who showed up on the list of artists whose work was used to train MidJourney's AI. https://catandgirl.com/4000-of-my-closest-friends/