Hi, I’m Raffi, and welcome to my weekly newsletter. Each Friday, I break down the ever-changing conversation about technology and AI. If you’re not a subscriber, here’s what you missed this month:
A look at what the New York Times’ lawsuit against OpenAI and Microsoft could mean for what data is used for training algorithms.
What the players of Go can teach us about learning to love AI.
Subscribe to get access to these and all future posts, as well as previews of Season Two of the Technically Optimistic podcast.
As I mentioned in last week’s newsletter on The New York Times lawsuit against OpenAI and Microsoft, one of the datasets feeding Times content to GPT-4 and other large language models is Common Crawl. The Times’ complaint states that “The Common Crawl dataset includes at least 16 million unique records of content from The Times across News, Cooking, Wirecutter, and The Athletic, and more than 66 million total records of content from The Times.” That’s a lot of copyrighted material to use without consent.
Let’s step back and look at the problem, or rather the bigger issue. If you want to build a product or do research on something that requires data from the web, well, you have to get that data. How? You have to “crawl” the web and record what you find. But high-quality web data is often expensive and/or restricted, limiting innovation and research. And if you’re building a large language model, you need especially diverse examples of human writing.
Quick primer: “Crawl” refers to the process used by automated bots, known as web crawlers, to systematically browse the web for the purpose of indexing and archiving content. These crawlers visit web pages to capture content, links, and metadata such as keywords, descriptions, and page titles. Common Crawl's bots conduct monthly crawls, capturing billions of web pages, which are then processed and stored in a publicly accessible archive.
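To make the primer concrete, here’s a minimal sketch, in Python with just the standard library, of what a crawler does with a single page: pull out the title, the meta description, and the outgoing links it would follow next. The page content here is a made-up example; a real crawler like Common Crawl’s adds politeness rules (robots.txt), deduplication, and storage at massive scale.

```python
from html.parser import HTMLParser

class PageScanner(HTMLParser):
    """Collects the title, meta description, and links from one page --
    the same basic fields a web crawler records before following links."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and "href" in attrs:
            # Each link becomes a candidate for the next crawl step.
            self.links.append(attrs["href"])
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# A stand-in for one fetched page (a real crawler would download this).
html = ('<html><head><title>Example Page</title>'
        '<meta name="description" content="A demo page."></head>'
        '<body><a href="https://example.com/about">About</a>'
        '<a href="/contact">Contact</a></body></html>')

scanner = PageScanner()
scanner.feed(html)
print(scanner.title)        # Example Page
print(scanner.description)  # A demo page.
print(scanner.links)        # ['https://example.com/about', '/contact']
```

Multiply this by billions of pages per month, plus fetching, queueing, and storage, and you have the scale Common Crawl operates at.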
Crawling the web is annoying and hard. You have to set up those crawlers. You have to deal with all the nuisances of clicking on links and following them. You have to store the data, and it takes time, even for computers, to read hundreds, thousands, or millions of websites. So there is a need for extensive, accessible web data for AI research and development. This impacts smaller organizations and independent researchers who can't afford extensive data collection. But what can be said about tech companies like OpenAI that are getting billions in funding?
Common Crawl is one solution to this web data problem. It was conceived in 2007 by Gil Elbaz, who is also known for co-founding Applied Semantics, the company behind the technology that came to underpin Google’s AdSense advertising systems after Google purchased it in 2003. The Common Crawl Foundation is a nonprofit that provides free access to a petabyte-scale database of the web (read: a really large amount of data), and is supported by such influential advisors as Peter Norvig, in keeping with its mission of openness and leveling the data-access playing field. Anyone visiting the Common Crawl site will find web pages, links, and metadata, including text and location data, with datasets available in various formats.
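As a sketch of what “publicly accessible archive” means in practice: each monthly crawl is searchable through Common Crawl’s CDX index API, which returns JSON records pointing into the compressed WARC archive files where the captured pages live. The crawl label and the sample record below are illustrative stand-ins, not live data.

```python
import json
from urllib.parse import urlencode

# Build a query against Common Crawl's CDX index API. The crawl label
# below is illustrative; real crawl IDs are listed at index.commoncrawl.org.
CRAWL = "CC-MAIN-2023-50"
params = urlencode({"url": "nytimes.com/*", "output": "json"})
query = f"https://index.commoncrawl.org/{CRAWL}-index?{params}"

# Each line of the API's response is one JSON record locating a captured
# page inside a WARC archive file. This sample mimics that shape.
sample_line = json.dumps({
    "url": "https://www.nytimes.com/section/food",
    "timestamp": "20231201120000",
    "filename": "crawl-data/CC-MAIN-2023-50/segments/example.warc.gz",
    "offset": "123456",
    "length": "7890",
})

record = json.loads(sample_line)
# offset/length say where in the gzipped WARC file this capture sits,
# so you can fetch just that byte range instead of the whole archive.
byte_range = (int(record["offset"]),
              int(record["offset"]) + int(record["length"]) - 1)
print(query)
print(record["url"], byte_range)
```

That byte-range trick is what makes petabyte-scale data usable: researchers pull only the records they need rather than downloading entire crawls.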
And people have built upon it! For example, the cybersecurity tool Attack Surface Discovery; the search engine and information discovery tool Alexandria Search; academics who are researching such fields as web content evolution, linguistics, and cultural content analysis; and natural language processing models for various applications, like sentiment analysis, topic modeling, and language translation services. Those last few sound like the underpinnings of large language models.
That brings us to the Times and OpenAI. The Times’ complaint cites Common Crawl as a significant dataset used by OpenAI and other AI initiatives. The debates around the ethics of profiting from publicly accessible data escalated all the way to Congress, which hosted a Judiciary Committee hearing on AI’s impact on journalism on Wednesday. Wired reports that the hearing, in which media leaders made a case for imposing mandatory licensing fees, was a rare example of political alignment, with both sides agreeing that companies like OpenAI should pay for training data. (Senator Richard Blumenthal went as far as to describe Big Tech’s effect on local media as “literally eating away at the lifeblood of our democracy.”)
The sole dissenter, journalism professor Jeff Jarvis, countered that licensing fees would not protect the information ecosystem, but damage it. (Outside the hearing, critics claim that such fees could be payable only by the big companies…putting the startups that could diversify — and challenge — the giants at a disadvantage, and impacting the availability of public data for innovation.)
Crawling Toward the Future
In terms of the bigger picture with regard to data collection and AI, here are a bunch of things to think about:
How do the massive datasets exacerbate or mitigate issues of bias, misinformation, and the digital divide?
Is it ethical to use unlicensed data?
What should the guardrails be around informed consent, privacy, and the potential misuse of data, and who should regulate it?
How does the commodification of web data impact market competition?
Where does responsibility lie? Is it with data-gathering organizations like Common Crawl? Is it with players that use the data, like OpenAI? Is it with legislators for not legislating? Is it with users for turning a blind eye?
I would love to have a conversation with you about it. Please kick it off by emailing me at us@technicallyoptimistic.com.
I’ve been thinking about the need for proactive policies that balance innovation with protection of the public interest. Here’s where I’ve landed: We should explore compensation models for data creators (this rings of micropayments, which, I will say, have been tried before and never work. Sigh.) We should also consider the potential need for legislation on equitable data use, which this week’s hearing touched upon with surprisingly impressive speed.
Will this lawsuit get us there? I’m not so sure. If The Times and OpenAI settle, then The Times will figure out compensation, and maybe that will become a pattern for the other big publishers. But there won’t be a template for smaller content creators. And if the case has to be decided by the courts, we’ll be waiting a very long time for a ruling or new laws, so nothing will really change in the short run.
But if free data providers such as Common Crawl cease to exist, the implications for innovation are huge. The biggest risk is that the best-resourced entities, those that can both spend billions on development and fight court cases brought by the likes of The Times, will be the ones that innovate and grow fastest and therefore gain the most users. Do we really want an information monopoly when we don’t trust the players (or their board games)?
Worth the Read
Following up on last week’s newsletter, OpenAI is claiming that a) The Times tricked ChatGPT into copying its articles (!) (see my comment last week that I want to see the prompts that the Times used), and b) It would still like to partner with the media giant.
Hollywood has yet to join the AI lawsuit trend. Are they waiting to see what happens with other media properties…or waiting to strike a deal on IP? As The Hollywood Reporter says, it’s now or never.
In response to Americans’ fear of deepfakes (7 in 10 say they make it harder to trust what they hear and see), McAfee has created a deepfake voice detector. Project Mockingbird can help stop scammers and cybercriminals from obtaining or using sensitive information. Is the software itself AI-generated? Of course!
The latest deepfake celeb ad scam? Taylor Swift offering Le Creuset giveaways on Meta and TikTok. (Fake Martha Stewart and Fake Oprah were used, too.) While the singer is publicly a fan of the French cookware (and I am a huge fan of the singer), the whole thing was fake, down to the linked sites, designed to mimic legit outlets like the Food Network. The best factoid in this article? In 1990, Tom Waits sued Frito-Lay for using his voice without his permission and won $2.5M.