Day 1 — Mon, Mar 30
CLIP: the paper that married text and images
- New project: text-to-image search. Should be straightforward to get working, but I want to actually understand the models, not just use them
- Goal is to compare CLIP-style models and modern VLMs on the task: how reliable they are, how well they perform, and how their approach differs from the way the human brain does it
- Starting with the most foundational paper for this: CLIP (Radford et al., 2021). It pairs a text encoder with an image encoder, and what comes out is a shared representation that can be reused downstream
How CLIP works
Contrastive Language-Image Pre-training: a dual-encoder that learns to place text and images in the same vector space
- Two separate encoders that don't share weights: a Transformer for text, and either a ResNet or Vision Transformer for images
- Trained on 400 million (image, text) pairs from the internet. The training signal is contrastive: given a batch of N pairs, maximize cosine similarity for the N correct pairings and minimize it for the N²-N incorrect ones (loss sketched after this list)
- The result is a shared embedding space. "A photo of a dog" lands near actual dog images. This is what makes text-to-image search possible
- Zero-shot classification falls out for free: encode class labels as text, encode the image, pick the nearest. No fine-tuning needed (second sketch below)
- Key insight: natural language supervision scales better than fixed label taxonomies. You don't need ImageNet categories. You just need the internet
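To make the contrastive signal concrete, here is a minimal sketch of the symmetric loss as I understand it from the paper. The function name and the fixed temperature are mine; CLIP actually learns the temperature as a parameter:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N (image, text) pairs.

    image_emb, text_emb: (N, d) outputs of the two encoders. The diagonal of
    the N x N similarity matrix holds the N correct pairings; the off-diagonal
    entries are the N^2 - N incorrect ones.
    """
    image_emb = F.normalize(image_emb, dim=-1)        # unit vectors, so dot product = cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (N, N) scaled similarities
    targets = torch.arange(logits.size(0))            # correct pair for row i is column i
    loss_img = F.cross_entropy(logits, targets)       # image -> text direction
    loss_txt = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_img + loss_txt) / 2
```

Seeing it as an N x N matrix also explains why batch size matters: every extra pair in the batch adds more in-batch negatives per step.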
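And the zero-shot classification trick, sketched with the Hugging Face transformers implementation of CLIP. The label prompts and the cat.jpg file are placeholders I picked for illustration:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]  # class labels as text
image = Image.open("cat.jpg")                                          # placeholder image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-text similarities; the closest label wins
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```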
What I read
- Roboflow CLIP intro — decent overview but surface level. Good for first exposure, not enough for understanding the mechanism
- Towards Data Science explainer — more detailed but still not rigorous enough. Skips over the contrastive loss details and the ablation studies
- Neither blog was as rigorous as I wanted, so the paper itself is next. It reads straightforwardly enough, but there is material I need to sit with
Next steps
- How do two separate encoders end up conveying the same thing? Completely different architectures, completely different modalities. The contrastive loss bridges them, but what does the shared space actually look like?
- Get something running. Pinecone for the vector store, then build the image understanding pipeline on top (rough sketch after this list)
- Zero-shot transfer: how does it work across domains? How does backbone choice (ResNet vs ViT) affect it? The paper has ablations on this
- Other applications beyond search (generation, segmentation) to explore later. Text-to-image search is a solid starting point
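A rough sketch of the search pipeline I have in mind, assuming the v3+ pinecone Python client, an existing 512-dimension cosine index (the name "clip-images" and the image paths are made up), and the same Hugging Face CLIP model as above:

```python
import torch
from PIL import Image
from pinecone import Pinecone
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

pc = Pinecone(api_key="YOUR_API_KEY")   # placeholder key
index = pc.Index("clip-images")         # hypothetical pre-created index, dim=512, cosine metric

def embed_images(paths):
    """Encode a list of image files into normalized CLIP image embeddings."""
    images = [Image.open(p) for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def embed_text(query):
    """Encode a natural-language query into a normalized CLIP text embedding."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)[0]

# Index a few images, then search them with natural language
paths = ["dog.jpg", "beach.jpg", "kitchen.jpg"]
vectors = [{"id": p, "values": v.tolist()} for p, v in zip(paths, embed_images(paths))]
index.upsert(vectors=vectors)

print(index.query(vector=embed_text("a photo of a dog").tolist(), top_k=3))
```

Because both encoders land in the same space, the query side is just the text encoder; no per-image captions or labels are needed at search time.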