Day 3 — Wed, Apr 1
Building CLIP from scratch: reading the paper properly
- Building CLIP from scratch. I'll try to do as much of the detailed work as possible. I have no idea how much training it will require, but this is the way to actually learn it
- Starting by reading the paper. I understand what it's about from a first read, but I still have questions
- In the past, image models were trained on fixed sets of supervised labels, but this is honestly pointless since we never get zero-shot capability that way. This is why I have to understand this paper properly
How CLIP actually works
Two encoders, one shared space, and why language makes image understanding robust
- The model contains two encoders, one for text and one for images. Both encoders produce representations in the same shared embedding space, which is what makes the image understanding robust
- The reason for training the image model against language is that the ambiguity of language makes the supervision richer. For example, CLIP trains on "cat sitting on sofa" as a full sentence, whereas a normal image model would just get "cat", "sofa", "sitting" as separate labels
- By pairing the picture with the full sentence rather than with isolated labels, we let the model learn a feature space that captures this ambiguity; the sentence carries structure that isolated labels throw away (see the sketch after this list)
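- To make this concrete for myself, here is a minimal sketch of the two-encoder layout as I understand it after a first read. The encoder modules, the dimensions, and the 0.07 temperature init are placeholders / my assumptions, not the authors' exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPSketch(nn.Module):
    """Two encoders projected into one shared embedding space (rough sketch)."""

    def __init__(self, image_encoder, text_encoder, image_dim, text_dim, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder   # e.g. a ResNet or ViT backbone
        self.text_encoder = text_encoder     # e.g. a Transformer over tokens
        # Linear projections map each encoder's output into the shared space
        self.image_proj = nn.Linear(image_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_dim, embed_dim, bias=False)
        # Learnable temperature, stored as a log so the scale stays positive
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1 / 0.07)))

    def forward(self, images, texts):
        img = self.image_proj(self.image_encoder(images))   # (N, embed_dim)
        txt = self.text_proj(self.text_encoder(texts))      # (N, embed_dim)
        # L2-normalise so dot products between them are cosine similarities
        img = F.normalize(img, dim=-1)
        txt = F.normalize(txt, dim=-1)
        return img, txt, self.logit_scale.exp()
```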
Highlights from first read
Scaling, passive learning, and training efficiency
- The authors state this model is easier to scale than normal image models since the supervision doesn't have to be squeezed into a fixed label mapping. There is no one-to-one mapping required, or rather, no one-to-N mapping to a fixed label set
- The language side can learn passively from supervision contained in the vast amount of text on the internet. You don't need hand-labeled data
- Training efficiency was key since we are essentially training two models. My question: does that mean training everything from scratch? The authors obviously trained every encoder from scratch. But what happens when I, a non-researcher, try to build this system on top of pre-trained models? (a rough sketch of that route is below)
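- Here is a rough sketch of the pre-trained-encoders route I'm wondering about. This is explicitly not what the paper does (they train from scratch); the ResNet-50 backbone, the frozen weights, and the 512-d projection are my own assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

# Pre-trained image backbone; drop its classification head.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()            # now outputs 2048-d pooled features

# Optionally freeze the backbone and only train the projection into the
# shared space -- much cheaper than training everything from scratch.
for p in backbone.parameters():
    p.requires_grad = False

image_proj = nn.Linear(2048, 512, bias=False)    # trainable projection head

with torch.no_grad():
    feats = backbone(torch.randn(4, 3, 224, 224))    # fake batch of 4 images
img_embed = image_proj(feats)                        # (4, 512) shared-space vectors
```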
The key insight of the paper
One paragraph that describes the entire paper
- "Given a batch of N (image, text) pairs, CLIP is trained to predict which of the NxN possible (image, text) pairings across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximise the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N²-N incorrect images."
- I want to know how this batch construction technique works. And how multi-class N-pair loss works. I want to learn about the actual architecture of this loss function
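- Before digging into the paper's pseudocode, here is my attempt at translating that paragraph into code: an N × N similarity matrix whose diagonal holds the true pairs, trained with a symmetric cross-entropy. A sketch from my reading, not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_embeds, text_embeds, logit_scale):
    # image_embeds, text_embeds: (N, d), already L2-normalised
    logits = logit_scale * image_embeds @ text_embeds.t()          # (N, N) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)   # diagonal = real pairs
    loss_i = F.cross_entropy(logits, targets)       # each image should pick its own text
    loss_t = F.cross_entropy(logits.t(), targets)   # each text should pick its own image
    return (loss_i + loss_t) / 2
```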
Things I don't understand yet
Specific gaps to fill from the paper
- Image representation via manifold learning (page 1)
- Contrastive representation learning (page 4)
- Difference between non-linear projection and linear projection maps, specifically for the purpose of this model (page 4)
- Scaling of the model. What do they mean by increasing only in one dimension (length or width)? How does it affect performance? (page 5)