A discussion of 4 seminal image generation papers
> In particular, it’s not clear to me why we need to go from CLIP to CLIP
I also found this confusing when I read the DALLE2 paper (see https://www.lesswrong.com/posts/XCtFBWoMeFwG8myYh/dalle2-comments).
Not long after that paper, Google came out with Imagen (https://arxiv.org/abs/2205.11487), which is reportedly better than DALLE2 (in a head-to-head comparison) despite using a much more obvious approach:
- conditioning with cross-attention to a text encoder, as in GLIDE
- but, using a powerful pretrained text encoder rather than training one end-to-end from scratch
The Imagen paper focuses on the case where the text encoder is T5, but they also tried using CLIP's text encoder, and got very similar (if slightly worse) results.
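To make the "GLIDE-style conditioning" concrete, here is a minimal PyTorch sketch of the core mechanism: image features in the diffusion U-Net attend over token embeddings produced by a frozen, pretrained text encoder. All class names, dimensions, and the random stand-in for the text encoder are illustrative assumptions, not details from either paper (in Imagen the frozen encoder would be T5, or CLIP's text tower).

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Sketch of GLIDE/Imagen-style conditioning: image tokens (queries)
    attend over text-encoder outputs (keys/values). Illustrative only."""
    def __init__(self, img_dim=64, txt_dim=32, n_heads=4):
        super().__init__()
        # Project text-encoder outputs into the image feature space.
        self.txt_proj = nn.Linear(txt_dim, img_dim)
        self.attn = nn.MultiheadAttention(img_dim, n_heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        ctx = self.txt_proj(txt_tokens)           # (B, T_txt, img_dim)
        out, _ = self.attn(img_tokens, ctx, ctx)  # queries = image tokens
        return img_tokens + out                   # residual conditioning

# Stand-in for a frozen pretrained text encoder: in practice you would
# run the prompt through T5/CLIP with requires_grad_(False); here we
# just use random features of the right shape.
B, T_img, T_txt = 2, 16, 8
img = torch.randn(B, T_img, 64)   # spatial features flattened to tokens
txt = torch.randn(B, T_txt, 32)   # "text encoder" token embeddings
block = TextCrossAttention()
y = block(img, txt)
print(y.shape)  # torch.Size([2, 16, 64])
```

The point of the sketch is how little is specific to the text encoder: the diffusion model only sees a sequence of embedding vectors, which is why swapping T5 for CLIP's text encoder (as Imagen's ablations did) is a drop-in change.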
Before I saw Imagen, I thought "this unCLIP idea does not make sense theoretically, but apparently it helps in practice." After I saw Imagen, I wasn't even sure anymore that it helped in practice. (It is superior to GLIDE, which doesn't have the benefit of access to CLIP, but it can be beaten by a more relevant baseline that does have access to CLIP and uses it in the obvious way.)
The Imagen approach, of GLIDE-style conditioning with a pretrained CLIP text encoder, was also used independently in
- Katherine Crowson's v-diffusion (https://github.com/crowsonkb/v-diffusion-pytorch), developed before DALLE2 or even GLIDE existed
- Stable Diffusion, which Crowson also worked on