Day 1 — Mon, Mar 30
CLIP: the paper that married text and images
- New project: text-to-image search. Should be straightforward to get working, but I want to actually understand the models, not just use them
- Goal is to compare CLIP-style models and modern VLMs on the task: how reliable they are, how well they perform, and how their approach differs from the way the human brain does it
- Starting with the most foundational paper for this: CLIP (Radford et al., 2021). It pairs a text encoder with an image encoder, and what comes out is a shared representation that can be reused downstream
How CLIP works
Contrastive Language-Image Pre-training: a dual-encoder that learns to place text and images in the same vector space
- Two separate encoders that don't share weights: a Transformer for text, and either a ResNet or Vision Transformer for images
- Trained on 400 million (image, text) pairs from the internet. The training signal is contrastive: given a batch of N pairs, maximize cosine similarity for the N correct pairings and minimize it for the N²-N incorrect ones (loss sketched after this list)
- The result is a shared embedding space. "A photo of a dog" lands near actual dog images. This is what makes text-to-image search possible
- Zero-shot classification falls out for free: encode class labels as text, encode the image, pick the nearest. No fine-tuning needed (second sketch below)
- Key insight: natural language supervision scales better than fixed label taxonomies. You don't need ImageNet categories. You just need the internet
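To make the contrastive signal concrete, here is a minimal sketch of the symmetric loss as I understand it from the paper. The function name and the fixed temperature are mine; CLIP actually learns the temperature as a parameter:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N (image, text) pairs.

    image_emb, text_emb: (N, d) outputs of the two encoders. The diagonal of
    the N x N similarity matrix holds the N correct pairings; the off-diagonal
    entries are the N^2 - N incorrect ones.
    """
    image_emb = F.normalize(image_emb, dim=-1)        # unit vectors, so dot product = cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (N, N) scaled similarities
    targets = torch.arange(logits.size(0))            # correct pair for row i is column i
    loss_img = F.cross_entropy(logits, targets)       # image -> text direction
    loss_txt = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_img + loss_txt) / 2
```

Seeing it as an N x N matrix also explains why batch size matters: every extra pair in the batch adds more in-batch negatives per step.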
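And the zero-shot classification trick, sketched with the Hugging Face transformers implementation of CLIP. The label prompts and the cat.jpg file are placeholders I picked for illustration:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]  # class labels as text
image = Image.open("cat.jpg")                                          # placeholder image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-text similarities; the closest label wins
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```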
What I read
- Roboflow CLIP intro — decent overview but surface level. Good for first exposure, not enough for understanding the mechanism
- Towards Data Science explainer — more detailed but still not rigorous enough. Skips over the contrastive loss details and the ablation studies
- Neither blog was as rigorous as I wanted, so the paper itself is next. It reads straightforwardly enough, but there is material I need to sit with
Next steps
- How do two separate encoders end up conveying the same thing? Completely different architectures, completely different modalities. The contrastive loss bridges them, but what does the shared space actually look like?
- Get something running. Pinecone for the vector store, then build the image understanding pipeline on top (rough sketch after this list)
- Zero-shot transfer: how does it work across domains? How does backbone choice (ResNet vs ViT) affect it? The paper has ablations on this
- Other applications beyond search (generation, segmentation) to explore later. Text-to-image search is a solid starting point
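A rough sketch of the search pipeline I have in mind, assuming the v3+ pinecone Python client, an existing 512-dimension cosine index (the name "clip-images" and the image paths are made up), and the same Hugging Face CLIP model as above:

```python
import torch
from PIL import Image
from pinecone import Pinecone
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

pc = Pinecone(api_key="YOUR_API_KEY")   # placeholder key
index = pc.Index("clip-images")         # hypothetical pre-created index, dim=512, cosine metric

def embed_images(paths):
    """Encode a list of image files into normalized CLIP image embeddings."""
    images = [Image.open(p) for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def embed_text(query):
    """Encode a natural-language query into a normalized CLIP text embedding."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)[0]

# Index a few images, then search them with natural language
paths = ["dog.jpg", "beach.jpg", "kitchen.jpg"]
vectors = [{"id": p, "values": v.tolist()} for p, v in zip(paths, embed_images(paths))]
index.upsert(vectors=vectors)

print(index.query(vector=embed_text("a photo of a dog").tolist(), top_k=3))
```

Because both encoders land in the same space, the query side is just the text encoder; no per-image captions or labels are needed at search time.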