A step towards self-improving LLMs
Here, I outline a research agenda towards making LLMs self-improve, a key problem standing between current technology and AGI.
If I look at GPTs/LLMs, three of the biggest problems I see with existing techniques are:
1. We need our models to be able to generate data by themselves, i.e. we need a recursive self-improvement loop. AlphaZero is the shining example of what’s possible here.
2. We need our models to be able to operate in new domains without requiring massive amounts of existing data. CLIP provides an option here, as does Internet Explorer (the paper, not the browser).
3. Autoregressive sampling. It’s slow and suboptimal.
I have better ideas for how to tackle #1, so I’ll focus on that. #2 & #3 will come later.
There are other issues facing LLMs, such as:
Increasing the length of the context window
Figuring out how to train larger models
Figuring out how to train more efficient models (fewer parameters, less data, less energy)
Factual accuracy
Mitigating attacks that convince LLMs to exhibit harmful behaviour ("red-teaming"), e.g. prompt injection
I think these are fundamentally engineering problems that we’ll be able to figure out iteratively. For instance, context length has seen a lot of progress from subtle algorithmic improvements; if we combine those changes with the many arcane engineering optimizations that are out there, I think we’ll get to a point where context goes to 64k tokens or more, at which point we’ll be deep into the saturating part of the sigmoid. Or take factual accuracy: I think retrieval will largely solve that once it’s incorporated into most models.
However, I’m probably wrong, and could very well end up writing a version of this post in 2034 talking about how the biggest problem facing AGI is prompt injections.
A path towards recursive self-improvement
GPTs work very well in one specific context: they are very, very good at finding text that is likely to follow other text in a way that appears natural to humans.
What they don’t do is come up with text that they haven’t seen before. Kinda. What they’re doing when we sample from them now is predicting what they’ve seen during training. Sometimes these predictions produce text that hasn’t been written before (this can happen often, due to the combinatorial nature of token sampling). When it does, it’s a happy accident: the model isn’t trying to select text that is novel, or that accomplishes any goal other than following the preceding 2048 tokens (or whatever the context length is).
The obvious exception is models finetuned with RLHF, where the model is explicitly trained to optimize a reward signal. That signal comes from a model trained to predict human feedback: humans are asked to choose between two samples of text, and a reward model learns to predict which one is preferred.
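For concreteness, here's a minimal sketch of the kind of pairwise preference loss such a reward model is typically trained with (Bradley-Terry style); the reward_model callable and its signature are hypothetical stand-ins, not any particular lab's implementation:

```python
# Sketch of a Bradley-Terry-style pairwise preference loss for a reward model.
# reward_model, prompts, chosen, rejected are hypothetical; scores have shape (batch,).
import torch.nn.functional as F

def preference_loss(reward_model, prompts, chosen, rejected):
    r_chosen = reward_model(prompts, chosen)      # scalar score per preferred sample
    r_rejected = reward_model(prompts, rejected)  # scalar score per rejected sample
    # Push the preferred completion to score higher than the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```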
Why does this matter? Predicting the next token works pretty well! And maybe we’re all just stochastic parrots? It matters because the biggest impediment to improving our models right now is the lack of data. The scaling law papers (Chinchilla, OpenAI) consistently point to the fact that we need to scale up the datasets we train LLMs on.
For instance, Chinchilla predicts that we’ll need 11 trillion tokens to optimally train a model the size of PaLM (i.e. 540B parameters). If we want to push past PaLM to a model with 1 trillion parameters, we’ll need 20T tokens!
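As a sanity check on these numbers, here's the back-of-envelope version using the roughly 20-tokens-per-parameter rule of thumb implied by the Chinchilla results:

```python
# Back-of-envelope: Chinchilla-optimal training uses roughly 20 tokens per parameter.
TOKENS_PER_PARAM = 20  # approximate rule of thumb, not an exact fit

for params in (540e9, 1e12):  # a PaLM-sized model, then a 1T-parameter model
    tokens = TOKENS_PER_PARAM * params
    print(f"{params / 1e9:.0f}B params -> ~{tokens / 1e12:.0f}T tokens")
# 540B params -> ~11T tokens
# 1000B params -> ~20T tokens
```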
That’s a lot of data. That’s so much data that it’s not clear we can get it from existing sources. nostalgebraist argues that 1) we’ve basically exhausted the available data in structured domains like coding and 2) it’s starting to look like we’re running out of general-domain data. I find the argument compelling; the only counterargument I can see is that private data sources might be a rich vein of tokens, but I don’t see a clear path to getting access to them.
This lack of data is unfortunate because, according to Chinchilla’s scaling laws, we could see another ~8% reduction in training loss (1.93 → 1.77, a delta of 0.16) if we had infinite data while changing nothing else about Chinchilla. That’s a pretty substantial improvement when you consider that the improvement from Gopher to Chinchilla was only 2.9% (1.99 → 1.93, a delta of 0.06), not to mention the fact that our models are already quite good: able to trick Google SWEs into believing they’re sentient, and to scare the Yud.
More data
The clear implication is that we need way more data! Our models are desperate for data. They’re lying on the beach gasping for more data to quench their ever-growing thirst.
But where will the data come from?
If we can scrape it, we should. It’s not clear how much there is left to scrape, though: at the largest research institutions like OpenAI, Google Brain, and DeepMind, I’m certain there are teams of engineers working on scraping all possible data. There is some possibility of automating this process; the excellently named Internet Explorer paper presented a model that crawls the web to get additional data to augment its dataset. Although letting a nascent AI loose on the internet would make Eliezer cry, it could be an excellent source of data, especially if one incorporates some sort of reinforcement-learning-style feedback loop to continually improve the way the model searches the web.
The data problem is compounded by the fact that high quality data really matters. Experiments consistently show that deduplicating data increases performance substantially (https://arxiv.org/abs/2205.10487, https://arxiv.org/abs/2107.06499). Basically, I’m not convinced there is a lot more high quality data. Two exceptions might be commercial data (e.g. internal corporate documents), and all copyrighted text. But it would be extremely difficult to get access to either of these corpora.
Generate data
The solution seems self-evident to me: we should generate our own data. There has been some work on training LLMs on data they have generated (https://arxiv.org/abs/2210.11610, https://arxiv.org/abs/2212.08073). There are a few different techniques that seem promising here.
In Huang et al., they use Chain of Thought (CoT) reasoning to generate additional data. Given a dataset of questions, they sample N CoT answers per question, prompting with “The answer is” to extract a final answer; they then take the majority answer and keep every generation whose answer matches it, using those generations as additional training data. In principle there’s no reason to restrict this to question answering, although questions are a particularly straightforward setting to apply it to; one could imagine, say, embedding all of the generated answers, clustering them, and keeping the answers in the biggest cluster, or employing RLAIF as Anthropic did in the Constitutional AI paper to select which answers to keep.
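A rough sketch of that majority-vote filtering step, assuming a hypothetical sample_cot helper that returns a (rationale, final answer) pair per call:

```python
from collections import Counter

def self_consistency_filter(question, sample_cot, n=32):
    """Keep the chain-of-thought generations that agree with the majority answer.
    sample_cot(question) is a hypothetical helper returning (rationale, answer)."""
    samples = [sample_cot(question) for _ in range(n)]
    majority_answer, _ = Counter(answer for _, answer in samples).most_common(1)[0]
    # The agreeing (question, rationale, answer) triples become new training data.
    return [(question, rationale, answer)
            for rationale, answer in samples if answer == majority_answer]
```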
Anthropic employed a similar approach to this CoT-based self-generation in the Constitutional AI paper. Another option from that paper is to use RLAIF to generate data: the model critiques and revises its own outputs according to a set of written principles (the “constitution”), and a preference model trained on AI-generated comparisons, rather than human labels, provides the reward signal for RL.
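A simplified sketch of the supervised half of that pipeline (critique, then revise); the llm completion function and the prompt wording here are illustrative assumptions, not the paper’s exact prompts:

```python
def critique_and_revise(llm, prompt, response, principle):
    """One simplified round of constitutional self-critique. llm is a hypothetical
    text-completion function; principle is one rule from the constitution."""
    critique = llm(f"Prompt: {prompt}\nResponse: {response}\n"
                   f"Critique request: {principle}\nCritique:")
    revision = llm(f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
                   "Rewrite the response to address the critique.\nRevision:")
    # Revised responses become finetuning data; AI-labelled preferences over
    # pairs of responses then train the preference model used for RL (RLAIF).
    return revision
```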
Throw compute at the problem
Yet another line of research involves throwing compute at the problem. We know that we can use a variety of techniques to soak up compute and improve outcomes. For instance, ensembling is a classic ML technique that strictly improves model performance. Given that we are already at the extreme limit of what’s possible to compute with transformers, it is almost certainly not possible to naively ensemble LLMs.
However, what we can do is use compute to apply search on top of our existing model outputs. If we can find a policy improvement operator, i.e. a function T that takes an existing distribution over tokens, π, and returns a new distribution, T(π), which improves our loss, then we can use T to improve our model. Some candidates:
Best-of-n
Beam search
Policy-driven search
“Best-of-n” (https://openai.com/research/measuring-goodharts-law) is a technique similar to ensembling in which we sample from our model N times and use the sample with the highest score according to our objective function. This performs remarkably well (outperforming the RLHF model in the WebGPT paper (https://openai.com/research/webgpt)), is simple to implement, trivial to analyze mathematically, and trivially parallelizable, but makes inference N times more expensive. If I were OpenAI, I’d be caching the results of queries to my models and doing this for repeated queries.
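A minimal sketch of best-of-n, with generate and score standing in (hypothetically) for the sampler and the objective/reward model:

```python
def best_of_n(generate, score, prompt, n=16):
    """Sample n completions and return the highest-scoring one. generate and
    score are hypothetical stand-ins for the sampler and the objective/reward
    model; the n generations are embarrassingly parallel, but inference cost
    scales linearly with n."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda completion: score(prompt, completion))
```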
In the WebGPT paper, the authors found that best-of-16 resulted in an improvement in human preference rate of 5 points (60% → 65%), while going from 13B to 175B parameters resulted in an improvement of 10 points (~47% → 57%).
As both charts show roughly linear improvements in performance, and cost grows roughly linearly in both cases, this seems to imply that a best-of-64 13B model would be better than a best-of-4 175B model while having roughly the same compute cost. Given that the 13B model fits on a single GPU, this would substantially lower the overall compute needs of the system.
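The back-of-envelope arithmetic, treating inference cost as simply parameters times samples (a simplification that ignores batching, memory, and latency effects):

```python
# Crude cost model: inference cost ~ parameters * number of samples drawn.
cost_13b_best_of_64 = 13e9 * 64   # ~8.3e11
cost_175b_best_of_4 = 175e9 * 4   # ~7.0e11
print(cost_13b_best_of_64 / cost_175b_best_of_4)  # ~1.19, i.e. within ~20%
```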
Another improvement operator is an NLP classic: beam search! In beam search, one performs a breadth-first search over the model outputs with finite depth and width (e.g. keeping only N successors at each level of the tree and searching to a depth of M levels), with the final result being the sequence with the maximum objective score (typically log-likelihood). While a number of LLMs do use beam search, they don’t appear to report performance numbers, so I’m unable to include a comparison of how much it matters. A concern is that beam search lowers diversity, as it constricts the differences between candidate sequences; this is especially problematic for byte-level tokenizers, like BPE, where individual tokens might vary significantly. Sander Dieleman wrote about how strategies like beam search are “the culprit behind many of the pathologies that neural machine translation systems exhibit”.
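For reference, a toy version of beam search over next-token distributions; log_probs is a hypothetical function returning per-token log-probabilities for a given prefix:

```python
import heapq

def beam_search(log_probs, prefix, width=4, depth=8):
    """Toy beam search. log_probs(seq) is a hypothetical function returning a
    dict of {token: log_prob} for the next token given the sequence so far.
    Keeps the `width` best partial sequences at each step and returns the
    highest-scoring sequence after `depth` steps."""
    beams = [(0.0, list(prefix))]  # (cumulative log-likelihood, token sequence)
    for _ in range(depth):
        candidates = []
        for score, seq in beams:
            for token, lp in log_probs(seq).items():
                candidates.append((score + lp, seq + [token]))
        beams = heapq.nlargest(width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]
```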
The final candidate (or family of candidates) for the improvement operator is an option that I find very exciting: learning an algorithm to search the token tree. The idea is that we could do something like AlphaZero, which would learn a policy + value function.
This would also allow us to change the reward function if we wanted to, rather than just using the standard log-likelihood. We could, for instance, directly train the reward function on human data. If you’re serving millions of users per day, you could run RL directly on that feedback, which is the case for the myriad of chatbots on the market today (Bing, ChatGPT, Claude, etc.).
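To make the shape of this concrete, here's a very rough sketch of value-guided decoding; policy_top_k and value are hypothetical callables, and a real AlphaZero-style system would run a full tree search with visit counts rather than this greedy one-step lookahead:

```python
def guided_decode(policy_top_k, value, prefix, steps=32, k=8):
    """Sketch of value-guided decoding: the policy proposes k candidate next
    tokens, a learned value function scores each extended sequence, and we keep
    the best one. policy_top_k and value are hypothetical callables; an
    AlphaZero-style system would expand this into a proper tree search with
    visit counts and backed-up value estimates."""
    seq = list(prefix)
    for _ in range(steps):
        candidates = [seq + [token] for token in policy_top_k(seq, k)]
        seq = max(candidates, key=value)  # value(seq) estimates eventual reward
    return seq
```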
Next steps
Now, I am not employed by a lab studying AGI, so I do not have the resources (GPUs) to apply any of these strategies. If you’re inspired by any of these ideas and want to implement them, please do so. I’d love to hear from you.
I’d particularly love to hear from you if you disagree with me. What am I wrong about?
Results from eyeballing graph, not precise.
Examples include GPT-{2,3}, which use it during decoding for text generation; BERT, for language-understanding tasks; T5; and XLNet.
Sander’s post is great. I was struggling to understand why beam search isn’t used more in practice, and his post did a great job helping me understand why.
Perhaps using MuZero with a smaller model as the recurrent function to save on compute.
I wonder if the model could suffer from initial errors in the training data and then the feedback loop would make it become "super-wrong". I wonder how things such as dataset poisoning and malicious attacks might play out, either in the data or in the RL by people.
Great post, this is in line with a lot of my thinking about how models can be made more capable in this time of pushing up against the ceilings of easily scrapable internet data.
I like your partitioning of LLM issues into core (exploration, self-improvement, data efficiency issues) and others which I agree are likely to be solved in the near future --- from what I have heard I believe context length, more efficient models, and factual accuracy at least are on course to be greatly advanced this year.
On data availability: my impression is that there is likely at least 10T tokens on the internet of high quality data, the issue is that much of it requires increasingly more effort to scrape, plus only a small fraction of it has permissive licenses if you are concerned about copyright lawsuits. There is likely significant data available from converting from other domains, as well. "OpenAI is using Whisper on all of YouTube to scrape data" is a bit of a meme but actually plausibly a way to get more decent quality tokens at the cost of roughly a few million $.
I agree though that in the long term, models which explore and generate their own data are the way to go. This seems easiest in domains which are highly structured, like code, where you can also generate unit tests (https://arxiv.org/abs/2106.05784). Over the past few months there has also been progress in evaluation dataset generation too, with Anthropic's work moving towards LLM benchmark generation along many axes (albeit more targeted at RLHF models) (https://arxiv.org/abs/2212.09251), and I am hoping to work on an open-source pipeline to generate questions/queries with more complexity for arbitrary evaluation axes and augmentation of current datasets over the next few months.
One of the main issues with doing this naively is that your generations will often lack sufficient diversity to really explore the domain well, particularly in RLHF tuned models which *currently* suffer from collapse to lower entropy distributions; a consistent complaint I've seen in papers on data generation is some mention of trying to maximise generated data diversity or that their method suffers from a lack of diversity using naive LM sampling techniques.
As a result, my preferred direction is in ways of sampling from LLM outputs in a way that incentivises increased diversity (an exploration prior is another way of viewing it). In contrast to a naive approach of generate -> finetune -> repeat (https://arxiv.org/abs/2207.14502), we can achieve potentially greater diversity and exploration of the search space by combining evolutionary algorithms with LLMs (https://arxiv.org/abs/2206.08896, https://arxiv.org/abs/2302.12170), and there is potential to combine this with RL (https://arxiv.org/abs/2302.06692) (and RLHF) in interesting ways to get around some of its weaknesses. I currently work with CarperAI, helping lead the open-endedness team on projects in this area: https://github.com/CarperAI/OpenELM
In general, I think there's a huge amount of potential in intelligently-guided open-ended search here. The Internet Explorer paper is a good direction, but I think you can do well in many domains (e.g. code) without the internet, particularly by exploiting the ability to co-evolve the agent and the evaluation environment, as in https://arxiv.org/abs/1901.01753, such that the difficulty of the environment scales with the agent's explore -> finetune loop.
We might also explore and generate data in the space of programs, as in this recent work https://arxiv.org/abs/2302.14838. With language models as an intelligent variation operator, the search procedure can "stay on the manifold of functionality" and guide the search in novel and high reward directions. In full generality, this is one option for a feedback loop to improve the capability of our most capable AI systems. See Minqi Jiang's work https://arxiv.org/abs/2211.07819 https://blog.minch.co/2022/11/15/software-squared.html for some thoughts about that.
Regarding your final paragraph about searching the token tree, a recent promising option is Speculative Sampling (https://arxiv.org/abs/2302.01318, https://arxiv.org/abs/2211.17192 (very funny that the DeepMind team was scooped by the Brain team here)), which uses a small model to generate a draft sequence and then queries the larger model's logprobs for each token to decide whether to accept/reject it. This seems promising as a way to provide significant speedups, and it could be generalised towards the direction you suggest with a tree-search kind of setup.
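Roughly, the accept/reject step looks like this (a simplified sketch: the full method resamples from an adjusted distribution on rejection, which is omitted here, and draft_next / target_prob are hypothetical stand-ins):

```python
import random

def speculative_accept(draft_next, target_prob, prefix, k=4):
    """Simplified accept/reject loop from speculative sampling. draft_next(seq)
    returns (token, draft_prob) from the small model; target_prob(seq, token)
    returns the large model's probability for that token. Both are hypothetical
    stand-ins, and the corrective resampling step on rejection is omitted."""
    seq = list(prefix)
    for _ in range(k):
        token, q = draft_next(seq)
        p = target_prob(seq, token)
        if random.random() < min(1.0, p / q):
            seq.append(token)  # accept the draft token
        else:
            break              # reject: fall back to sampling from the target model
    return seq
```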
P.S. I enjoyed this post a lot and look forward to seeing more of your thoughts on this blog!