Day 2 — Tue, Mar 31
CLIP's contrastive loss: how two encoders learn one space
- Two questions I wanted to answer today: how do both models get trained, and how does the loss function make them share the same space
- Stuck with intuitive understanding for now. Low sleep today so kept it light. The maths goes deeper but I'll come back to it
- Next step: understand the training procedure itself, then go deeper into VLMs intuitively
How the contrastive loss works
Two encoders, two embeddings, one shared space
- There are two encoders and each one creates its own embedding. Different architectures, different modalities, different embedding spaces by default
- The goal is to pull both into one shared geometric space, where an image and its matching caption end up close together. The loss function is what bridges the gap
- Cosine similarity measures how aligned two embeddings are. If the "correct" text embedding is less similar to its image embedding than an "incorrect" one, we get high loss. That part is simple
- But the loss isn't computed one pair at a time. In a batch of $N$ pairs, each text vector $t_n$ is compared against all $N$ image vectors, and only $I_n$ is its match; the other $N-1$ images act as negatives. This is where the concept of negatives comes in (see the sketch below)
- The handling of negatives is slightly more complicated. Didn't go deep into it today, will look into it later
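- To make the negatives idea concrete, here's a tiny NumPy sketch of the similarity matrix as I currently picture it. The array names and sizes are my own placeholders, not CLIP's actual code

```python
import numpy as np

N, d = 4, 8                               # batch of N image-text pairs, d-dim embeddings
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(N, d))         # stand-in for image encoder outputs
txt_emb = rng.normal(size=(N, d))         # stand-in for text encoder outputs

# L2-normalise so dot products become cosine similarities
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)
txt_emb /= np.linalg.norm(txt_emb, axis=1, keepdims=True)

# N x N matrix: entry [n, m] = similarity between text n and image m.
# The diagonal holds the "correct" pairs; the other N-1 entries per row
# are the negatives the loss pushes down.
sim = txt_emb @ img_emb.T
print(np.diag(sim))                       # positives: should end up high after training
```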
Two losses, one average
The final CLIP loss is symmetric
- There are two separate losses: image-to-text ($\ell_{i \to t}$) and text-to-image ($\ell_{t \to i}$). The same similarities are scored in both directions (rough sketch below)
- The final loss is just the average of these two: $L = \frac{\ell_{i \to t} + \ell_{t \to i}}{2}$
- This symmetry makes sense. The model needs to be good at matching in both directions, not just one
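- A rough sketch of how the two directional losses average out, assuming a similarity matrix like the one above. The helper name and fixed temperature are my own assumptions; the real model learns the temperature as a parameter

```python
import numpy as np

def ce_diagonal(logits):
    # Cross-entropy where row n's correct class is column n (its matching pair)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
sim = rng.normal(size=(4, 4))             # stand-in for the N x N similarity matrix
logits = sim / 0.07                       # divide by a temperature before softmax

loss_t2i = ce_diagonal(logits)            # text -> image: each text picks its image
loss_i2t = ce_diagonal(logits.T)          # image -> text: each image picks its text
loss = (loss_i2t + loss_t2i) / 2          # the symmetric average
print(loss)
```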