6 Comments

I'm really interested in the user interface design of these human feedback systems. Reading 64 outputs to choose the best? How do people do that?


I wonder if the model could suffer from initial errors in the training data, with the feedback loop then amplifying them until it becomes "super-wrong". I also wonder how things such as dataset poisoning and malicious attacks might play out, either in the data or in the RL feedback provided by people.


Great post, this is in line with a lot of my thinking about how models can be made more capable in this time of pushing up against the ceilings of easily scrapable internet data.

I like your partitioning of LLM issues into the core ones (exploration, self-improvement, data efficiency) and the others, which I agree are likely to be solved in the near future; from what I have heard, I believe context length, model efficiency, and factual accuracy at least are on course to be greatly advanced this year.

On data availability: my impression is that there are likely at least 10T tokens of high quality data on the internet; the issue is that much of it requires increasing effort to scrape, and only a small fraction of it has permissive licenses if you are concerned about copyright lawsuits. There is also likely significant data available from converting other domains. "OpenAI is using Whisper on all of YouTube to scrape data" is a bit of a meme, but it is actually a plausible way to get more decent-quality tokens at a cost of roughly a few million dollars.

I agree though that in the long term, models which explore and generate their own data are the way to go. This seems easiest in domains which are highly structured, like code, where you can also generate unit tests (https://arxiv.org/abs/2106.05784). Over the past few months there has also been progress in evaluation dataset generation, with Anthropic's work moving towards LLM benchmark generation along many axes (albeit more targeted at RLHF models) (https://arxiv.org/abs/2212.09251), and I am hoping to work over the next few months on an open-source pipeline to generate questions/queries with more complexity for arbitrary evaluation axes, and to augment current datasets.
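To make the unit-test-filtered generation idea concrete, here is a minimal sketch of what that loop might look like. `llm_generate` is a hypothetical stand-in for sampling from whatever code model you are using, not a real API; only candidates that pass the generated tests would become training data.

```python
# Sketch: generate candidate solutions and assert-based tests for a task,
# keep only the (task, solution) pairs whose solution passes the tests.
import subprocess
import tempfile

def llm_generate(prompt: str, n: int = 8) -> list[str]:
    """Placeholder for sampling n completions from a code LLM (hypothetical)."""
    raise NotImplementedError

def passes_tests(solution: str, tests: str) -> bool:
    """Run candidate solution plus assert-based tests in a throwaway subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=10)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def make_training_examples(task: str) -> list[tuple[str, str]]:
    tests = llm_generate(f"Write assert-based tests for: {task}", n=1)[0]
    candidates = llm_generate(f"Write a Python function that solves: {task}")
    return [(task, c) for c in candidates if passes_tests(c, tests)]
```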

One of the main issues with doing this naively is that your generations will often lack sufficient diversity to really explore the domain well, particularly with RLHF-tuned models, which *currently* suffer from collapse to lower-entropy distributions; a consistent complaint I've seen in data generation papers is some mention of trying to maximise generated data diversity, or that their method suffers from a lack of diversity under naive LM sampling techniques.
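(One cheap way to put a number on that complaint is a distinct-n style metric over a batch of generations; the sketch below is illustrative, not any particular paper's metric.)

```python
# Fraction of n-grams that are unique across a batch of samples; values near
# zero are the "collapse to lower-entropy distributions" failure mode above.
def distinct_n(samples: list[str], n: int = 3) -> float:
    ngrams = []
    for text in samples:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)
```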

As a result, my preferred direction is sampling from LLM outputs in a way that incentivises increased diversity (an exploration prior is another way of viewing it). In contrast to a naive generate -> finetune -> repeat approach (https://arxiv.org/abs/2207.14502), we can achieve potentially greater diversity and exploration of the search space by combining evolutionary algorithms with LLMs (https://arxiv.org/abs/2206.08896, https://arxiv.org/abs/2302.12170), and there is potential to combine this with RL (https://arxiv.org/abs/2302.06692) (and RLHF) in interesting ways to get around some of its weaknesses. I currently work with CarperAI, helping lead the open-endedness team on projects in this area: https://github.com/CarperAI/OpenELM
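Roughly, the evolutionary-algorithms-with-LLMs framing is the loop below, a simplified MAP-Elites-style archive with the LLM as the variation operator. `llm_mutate`, `fitness`, and `behavior` are stand-ins for a mutation model, a domain evaluator, and a discretised behaviour descriptor; none of these names come from a real library.

```python
# Simplified quality-diversity loop in the spirit of ELM/OpenELM.
import random

def map_elites(seed_programs, llm_mutate, fitness, behavior, iterations=1000):
    archive = {}  # behaviour niche -> (fitness, program)
    for p in seed_programs:
        archive[behavior(p)] = (fitness(p), p)
    for _ in range(iterations):
        _, parent = random.choice(list(archive.values()))
        child = llm_mutate(parent)            # LLM proposes a variation
        score, niche = fitness(child), behavior(child)
        # Keep the child if its niche is empty or it beats the incumbent,
        # so the archive grows both more diverse and more fit over time.
        if niche not in archive or score > archive[niche][0]:
            archive[niche] = (score, child)
    return archive
```

The archive is the point: instead of one population collapsing onto whatever the model finds easiest, each behavioural niche keeps its own best solution, which directly targets the diversity problem above.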

In general, I think there's a huge amount of potential in intelligently-guided open-ended search here. The Internet Explorer paper is a good direction, but I think you can do well in many domains (e.g. code) without the internet, particularly by exploiting the ability to co-evolve the agent and the evaluation environment, as in https://arxiv.org/abs/1901.01753, such that the difficulty of the environment scales with the agent's explore -> finetune loop.

We might also explore and generate data in the space of programs, as in this recent work https://arxiv.org/abs/2302.14838. With language models as an intelligent variation operator, the search procedure can "stay on the manifold of functionality" and guide the search in novel and high reward directions. In full generality, this is one option for a feedback loop to improve the capability of our most capable AI systems. See Minqi Jiang's work https://arxiv.org/abs/2211.07819 https://blog.minch.co/2022/11/15/software-squared.html for some thoughts about that.

Regarding your final paragraph about searching the token tree, a recent promising option is Speculative Sampling (https://arxiv.org/abs/2302.01318, https://arxiv.org/abs/2211.17192 (very funny that the DeepMind team was scooped by the Brain team here)), which uses a small model to generate a draft sequence, then queries the larger model's logprobs for each token to decide whether to accept or reject it. This provides significant speedups and could be generalised towards the direction you suggest with a tree-search kind of setup.
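For anyone who hasn't read the papers, the accept/reject rule is roughly the following (a toy paraphrase: `draft_probs`/`target_probs` are per-position probability vectors from the small and large model, and the KV-cache bookkeeping and the bonus token sampled when every draft token is accepted are omitted).

```python
import numpy as np

def speculative_step(draft_probs, target_probs, drafted_tokens, rng):
    """Accept/reject one block of drafted tokens against the large model."""
    accepted = []
    for i, tok in enumerate(drafted_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)  # the large model agrees often enough
        else:
            # On rejection, resample from the residual max(0, p - q) so the
            # output is distributed exactly as the large model's own samples.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    return accepted
```

Run in a loop with K drafted tokens per step, you get up to K+1 tokens per large-model forward pass while keeping the large model's output distribution.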

P.S. I enjoyed this post a lot and look forward to seeing more of your thoughts on this blog!
