Hi, I’m Raffi, and welcome to my newsletter. Each Friday, I break down the ever-changing conversation about technology and AI. This week, I want to look at what exactly makes big tech poised to control AI, and how a potential workaround may have emerged. It’s nerdy — but important — stuff. I promise not to go too deep in the (data) weeds.
When it comes to large language models (LLMs), which require vast resources to train (talent, data, power and computers), one of the biggest open bets is whether this will be the domain of companies with a lot of money and a lot of compute, thereby consolidating power, or whether there will be a diversity of smaller models. In other words, are we headed for a “few AIs rule them all” situation, or will more players have their own personal, smaller versions? Will there be only a handful of big AIs in the world, or will a million flowers bloom?
I want to believe we are headed toward the latter. In fact, the whole basis of the Technically Optimistic newsletter and podcast is to equip you to advocate for the latter. If we want to avoid a consolidation of power, we as citizens have to demand it.
The technical barrier to entry further entrenches the bigger players, who are buying up the expensive graphics processing units (GPUs) required for training, locking up compute and edging out smaller players. (If you think about it, how will open source flourish in this world?) In recent months, however, one barrier has fallen, making large language models slightly less cost-prohibitive to train. Warning: this gets deeply technical, but I’ll try to break it down.
Quick context for those who may be new to the topic: LLMs are trained on huge corpora of human-written text (see last week’s newsletter on Common Crawl) to learn patterns that let them “complete” a statement. (If X comes first, then Y comes second.) Take a question like, “Which state is New York City in?” A valid completion could be, “New York City is in New York State.” But so could “Which state is San Francisco in?”, because lists of such questions are fairly common in training corpora.
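If you want to see this in action, here’s a minimal sketch using the open-source Hugging Face transformers library and GPT-2, a small, freely available model. It’s far tinier than the LLMs discussed here, but the completion mechanism is the same:

```python
from transformers import pipeline

# A toy illustration of "completion": the model just continues the text
# with whatever next words are most likely given its training data.
generator = pipeline("text-generation", model="gpt2")
result = generator("Which state is New York City in?", max_new_tokens=20)
print(result[0]["generated_text"])
# Sometimes the continuation is an answer; sometimes it's another question
# from a similar-looking list, exactly as described above.
```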
Toward Eliminating the People Who Make LLMs Sound Like People…
One of the biggest things that makes these LLM products (such as ChatGPT, Claude and Bard) actually sound like people is a technique called reinforcement learning from human feedback (RLHF). After you train your LLM on a large amount of text, you have it generate multiple outputs, or completions, for each question. Then you hire humans to compare pairs (or more) of those completions and rate which one is best. (That second step is quite expensive, and we’ll talk about it in Season Two of the podcast, starting next month.)
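For the curious, here’s a hypothetical sketch of what collecting that feedback looks like in code. The generator and the human rating are simulated stand-ins so the snippet runs on its own; in a real pipeline, a trained model produces the candidates and paid raters pick the winners:

```python
# Hypothetical stand-in for a real model's generator.
def fake_generate(prompt: str, i: int) -> str:
    return f"{prompt} Candidate answer #{i}."

preference_pairs = []
for prompt in ["Which state is New York City in?"]:
    # Generate multiple completions for the same question.
    candidates = [fake_generate(prompt, i) for i in range(4)]
    # In real RLHF, a human rater makes this choice; that judgment is
    # exactly the expensive ingredient described above.
    chosen, rejected = candidates[0], candidates[1]
    preference_pairs.append(
        {"prompt": prompt, "chosen": chosen, "rejected": rejected}
    )
```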
As their name indicates, LLMs are, well, large: they take enormous memory and computational power just to store and run. Then, in order to do RLHF, you have to:

1. Build an entirely new model, a reward model, to learn from the feedback.
2. Train this reward model on what counts as good and bad.
3. Use this new “reward” model to train the original LLM to behave and answer more like the humans who gave it feedback.
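At its heart, step 2 is a simple pairwise objective: reward the answer humans preferred. Here’s a minimal sketch in PyTorch, with a placeholder scorer standing in for a real reward network (the names are mine, not any particular library’s):

```python
import torch
import torch.nn.functional as F

# Placeholder reward model: in reality, this is a neural network that maps
# a tokenized response to a scalar "how good is this answer" score.
def reward_model(response_ids: torch.Tensor) -> torch.Tensor:
    return response_ids.float().mean(dim=-1)

def preference_loss(chosen_ids: torch.Tensor, rejected_ids: torch.Tensor):
    # Bradley-Terry-style objective: the human-preferred response should
    # score higher than the rejected one.
    r_chosen = reward_model(chosen_ids)
    r_rejected = reward_model(rejected_ids)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```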
So if you think training the LLM is hard, RLHF may require even more computational power and skill to pull off. Think about it: not only are you building the large language model, you have to train a whole other model as well! What RLHF is trying to do is get the model to learn the language and then actually write down the rules (the reward model) for how to put that language together. What if you could skip sitting down to write out the rules, and instead just play the game over and over and over again, and learn by playing?
The 2023 paper “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” does just that. It was a runner-up for an outstanding-paper award at NeurIPS last month (arguably the most important machine-learning academic conference of the year), but it may be a game-changer. The paper outlines a method to skip training that separate reward model (the expensive step) and instead directly optimize the original LLM.
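The core trick fits in a few lines. Here’s a minimal sketch of the DPO objective, assuming you already have the summed log-probabilities of each response under the model being tuned and under a frozen copy from before tuning (the variable names are mine, not the paper’s):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # How far the tuned model has shifted toward each response, relative
    # to the frozen pre-tuning copy ("the reference").
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    # Push the preferred response's shift above the rejected one's.
    # No separate reward model is ever built or trained.
    return -F.logsigmoid(beta * (chosen_shift - rejected_shift)).mean()
```

Notice that nothing here trains a second network: the “reward” is implicit in how far the tuned model drifts from its frozen reference. That is the paper’s “secretly a reward model” insight.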
Direct preference optimization (DPO) requires not only less computation, but potentially less human intervention in the learning process. In short, it may take less talent, skill and money to create these models. That could be part of a real unlock, allowing more than just the big companies to create new models, and a real step toward that world of personal, smaller models. Maybe a boon to open source, too. It will be very interesting to watch.
Sounds great, but here’s what gets lost with DPO: that reward-model step that got skipped? It encoded some higher-level rules for how language should be put together, and it gave humans a place to fine-tune things. The reward model is where we can iron out biases, hate speech and the like. DPO folds all of this directly into the model, so it may be more opaque about precisely how it works. I’ve complained in the past that nobody really understands how these systems work, and many people are working on explainability. This may be a step backward on that front.
While DPO isn’t proof positive that big tech’s dominance is finally being threatened, it demonstrates that researchers are still directing their brilliance toward making these systems go faster and further, with a lower barrier to entry. And that is so important when it comes to diversifying power.
As we’re seeing more clearly every day, we need ways to spur more innovation early and often, and not get too far down the road of only having a few big AI models that rule the world. One path protects and serves our interests. The other path is one in which we, our data and our attention feed big commercial incentives. Every single path to getting a diverse range of developers and innovators into this field is a good thing, and DPO may be one way in.
I’d love to know what you think! Please leave a comment, or email me at us@technicallyoptimistic.com.
Worth the Read
Europe just enacted the Digital Markets Act, which lets Google users choose which of its services can share their data. Privacy International did a great explainer on X. Read all the way through!
Meanwhile, at Davos, Sam Altman told Bloomberg why OpenAI is banning the use of ChatGPT in political campaigns, as AI and democracy were major topics of the summit. (More from Altman and Bill Gates on AI and elections here.)
OpenAI also funded 10 independent teams from around the world to come up with ideas and tools for AI governance. Some really interesting ideas to read through, including generative social choice, engaging underserved populations and collective dialogues for democratic policy development.
Nathan Grayson reads the fine print in SAG-AFTRA’s video game AI announcement. It’s…pretty much what you’d expect.