Day 2 — Tue, Mar 31
CLIP's contrastive loss: how two encoders learn one space
- Two questions I wanted to answer today: how do both models get trained, and how does the loss function make them share the same space
- Stuck with intuitive understanding for now. Low sleep today so kept it light. The maths goes deeper but I'll come back to it
- Next step: understand the training procedure itself, then go deeper into VLMs intuitively
How the contrastive loss works
Two encoders, two embeddings, one shared space
- There are two encoders and each one creates its own embedding. Different architectures, different modalities, different embedding spaces by default
- The goal is to pull both into one shared geometric space, where an image and its matching caption end up close together. The loss function is what bridges the gap
- Cosine similarity measures how aligned two embeddings are. If the "correct" text embedding is less similar to its image embedding than an "incorrect" one, we get high loss. That part is simple
- But the loss isn't computed one pair at a time. In a batch of $N$ pairs, each text vector $t_n$ is compared against all $N$ image vectors, and only $I_n$ is its match; the other $N-1$ images act as negatives. This is where the concept of negatives comes in (see the sketch below)
- The handling of negatives is slightly more complicated. Didn't go deep into it today, will look into it later
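- To make the negatives idea concrete, here's a tiny NumPy sketch of the similarity matrix as I currently picture it. The array names and sizes are my own placeholders, not CLIP's actual code

```python
import numpy as np

N, d = 4, 8                               # batch of N image-text pairs, d-dim embeddings
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(N, d))         # stand-in for image encoder outputs
txt_emb = rng.normal(size=(N, d))         # stand-in for text encoder outputs

# L2-normalise so dot products become cosine similarities
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)
txt_emb /= np.linalg.norm(txt_emb, axis=1, keepdims=True)

# N x N matrix: entry [n, m] = similarity between text n and image m.
# The diagonal holds the "correct" pairs; the other N-1 entries per row
# are the negatives the loss pushes down.
sim = txt_emb @ img_emb.T
print(np.diag(sim))                       # positives: should end up high after training
```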
Two losses, one average
The final CLIP loss is symmetric
- There are two separate losses: image-to-text ($\ell_{i \to t}$) and text-to-image ($\ell_{t \to i}$). The same similarities are scored in both directions (rough sketch below)
- The final loss is just the average of these two: $L = \frac{\ell_{i \to t} + \ell_{t \to i}}{2}$
- This symmetry makes sense. The model needs to be good at matching in both directions, not just one
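- A rough sketch of how the two directional losses average out, assuming a similarity matrix like the one above. The helper name and fixed temperature are my own assumptions; the real model learns the temperature as a parameter

```python
import numpy as np

def ce_diagonal(logits):
    # Cross-entropy where row n's correct class is column n (its matching pair)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
sim = rng.normal(size=(4, 4))             # stand-in for the N x N similarity matrix
logits = sim / 0.07                       # divide by a temperature before softmax

loss_t2i = ce_diagonal(logits)            # text -> image: each text picks its image
loss_i2t = ce_diagonal(logits.T)          # image -> text: each image picks its text
loss = (loss_i2t + loss_t2i) / 2          # the symmetric average
print(loss)
```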