Day 3 — Wed, Apr 1
Building CLIP from scratch: reading the paper properly
- Building CLIP from scratch. I'll try to do as much of the detailed work as possible. I have no idea how much training it will require, but this is the way to actually learn it
- Starting by reading the paper. I understand what it's about from a first read, but I still have questions
- In the past, image models were trained on fixed sets of supervised labels, but this is honestly pointless since we never get zero-shot capability that way. This is why I have to understand this paper properly
How CLIP actually works
Two encoders, one shared space, and why language makes image understanding robust
- The model contains two encoders, one for text and one for images. Both encoders produce representations in the same shared embedding space, which is what makes the image understanding robust
- The reason for training the image model against language is that the ambiguity of language makes the supervision richer. For example, CLIP trains on "cat sitting on sofa" as a full sentence, whereas a normal image model would just get "cat", "sofa", "sitting" as separate labels
- By pairing the picture with the full sentence rather than with isolated labels, we let the model learn a feature space that captures this ambiguity; the sentence carries structure that isolated labels throw away (see the sketch after this list)
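- To make this concrete for myself, here is a minimal sketch of the two-encoder layout as I understand it after a first read. The encoder modules, the dimensions, and the 0.07 temperature init are placeholders / my assumptions, not the authors' exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPSketch(nn.Module):
    """Two encoders projected into one shared embedding space (rough sketch)."""

    def __init__(self, image_encoder, text_encoder, image_dim, text_dim, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder   # e.g. a ResNet or ViT backbone
        self.text_encoder = text_encoder     # e.g. a Transformer over tokens
        # Linear projections map each encoder's output into the shared space
        self.image_proj = nn.Linear(image_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_dim, embed_dim, bias=False)
        # Learnable temperature, stored as a log so the scale stays positive
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1 / 0.07)))

    def forward(self, images, texts):
        img = self.image_proj(self.image_encoder(images))   # (N, embed_dim)
        txt = self.text_proj(self.text_encoder(texts))      # (N, embed_dim)
        # L2-normalise so dot products between them are cosine similarities
        img = F.normalize(img, dim=-1)
        txt = F.normalize(txt, dim=-1)
        return img, txt, self.logit_scale.exp()
```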
Highlights from first read
Scaling, passive learning, and training efficiency
- The authors state this model is easier to scale than normal image models since the supervision doesn't have to be squeezed into a fixed label mapping. There is no one-to-one mapping required, or rather, no one-to-N mapping to a fixed label set
- The language side can learn passively from supervision contained in the vast amount of text on the internet. You don't need hand-labeled data
- Training efficiency was key since we are essentially training two models. My question: does that mean training everything from scratch? The authors obviously trained every encoder from scratch. But what happens when I, a non-researcher, try to build this system on top of pre-trained models? (a rough sketch of that route is below)
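- Here is a rough sketch of the pre-trained-encoders route I'm wondering about. This is explicitly not what the paper does (they train from scratch); the ResNet-50 backbone, the frozen weights, and the 512-d projection are my own assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

# Pre-trained image backbone; drop its classification head.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()            # now outputs 2048-d pooled features

# Optionally freeze the backbone and only train the projection into the
# shared space -- much cheaper than training everything from scratch.
for p in backbone.parameters():
    p.requires_grad = False

image_proj = nn.Linear(2048, 512, bias=False)    # trainable projection head

with torch.no_grad():
    feats = backbone(torch.randn(4, 3, 224, 224))    # fake batch of 4 images
img_embed = image_proj(feats)                        # (4, 512) shared-space vectors
```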
The key insight of the paper
One paragraph that describes the entire paper
- "Given a batch of N (image, text) pairs, CLIP is trained to predict which of the NxN possible (image, text) pairings across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximise the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N²-N incorrect images."
- I want to know how this batch construction technique works. And how multi-class N-pair loss works. I want to learn about the actual architecture of this loss function
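- Before digging into the paper's pseudocode, here is my attempt at translating that paragraph into code: an N × N similarity matrix whose diagonal holds the true pairs, trained with a symmetric cross-entropy. A sketch from my reading, not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_embeds, text_embeds, logit_scale):
    # image_embeds, text_embeds: (N, d), already L2-normalised
    logits = logit_scale * image_embeds @ text_embeds.t()          # (N, N) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)   # diagonal = real pairs
    loss_i = F.cross_entropy(logits, targets)       # each image should pick its own text
    loss_t = F.cross_entropy(logits.t(), targets)   # each text should pick its own image
    return (loss_i + loss_t) / 2
```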
Things I don't understand yet
Specific gaps to fill from the paper
- Image representation via manifold learning (page 1)
- Contrastive representation learning (page 4)
- Difference between non-linear projection and linear projection maps, specifically for the purpose of this model (page 4)
- Scaling of the model. What do they mean by increasing only in one dimension (length or width)? How does it affect performance? (page 5)