Day 4 — Tue, Apr 7

Zero-shot classification: from embeddings to probabilities

  • Goal today was to understand the experiment section properly so I have a clear picture of zero-shot learning and how things are done in practice. This will give me something concrete to start working on the actual code
  • Built my own CLIP model from scratch and trained it on the Flickr30k dataset, running the training on my M4 Pro MacBook Pro
  • The zero-shot pipeline is clearer now. Understanding it is enough to build a text-to-image search system

Zero-shot image classification pipeline

How CLIP goes from raw inputs to a prediction without ever seeing labeled training data for the task

  • Start with the image side: a computer vision model acts as the image encoder and produces an image embedding
  • Then match the image embedding against text embeddings using cosine similarity search. This gives raw similarity scores between the image and each candidate text label
  • Divide the similarity scores by a learned temperature parameter (CLIP implements this as multiplying by a learned logit scale). Temperature controls how peaked or flat the distribution is. Higher temperature means softer probabilities, lower means sharper
  • Normalize the scaled scores into a probability distribution via softmax. Now you have actual probabilities over the candidate labels. Pick the highest one
  • The whole thing works because both encoders were trained to place matching pairs close together in the shared embedding space. At inference time you just measure distance; a code sketch of these steps follows this list
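
A minimal PyTorch sketch of those four steps. The embeddings are random placeholders standing in for the encoder outputs, and the labels, shapes, and logit-scale value are illustrative, not taken from my model.

```python
import torch
import torch.nn.functional as F

# Placeholder encoder outputs (illustrative shapes)
image_emb = torch.randn(1, 512)            # one image embedding from the image encoder
text_embs = torch.randn(3, 512)            # one embedding per candidate label text
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# 1. L2-normalize so the dot product equals cosine similarity
image_emb = F.normalize(image_emb, dim=-1)
text_embs = F.normalize(text_embs, dim=-1)

# 2. Cosine similarity between the image and every candidate text
sims = image_emb @ text_embs.T             # shape (1, 3)

# 3. Scale by the learned logit scale (= 1 / temperature, ~100 for trained CLIP)
logit_scale = torch.tensor(100.0)
logits = logit_scale * sims

# 4. Softmax turns the scaled similarities into a probability distribution
probs = logits.softmax(dim=-1)
print(labels[probs.argmax().item()], probs)
```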

Hypernetworks and on-the-fly representations

Understanding how text encodings work without pre-shared vectors

  • Wanted to understand the concept of a hypernetwork. It is another way of thinking about what happens at inference
  • The key insight: we haven't shared any vectors anywhere ahead of time. The text encoder produces vectors on the fly from whatever text you give it. There is no fixed label set
  • These text vectors then get matched against image embeddings produced by an encoder that was already trained against the ambiguous, rich feature space that language supervision provides
  • This is what makes zero-shot transfer possible. You can write any text query and get meaningful similarity scores against images the model has never seen paired with that specific text; see the sketch after this list
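
A sketch of what "on the fly" means in practice: a free-form query is embedded at inference time and ranked against a precomputed image index. It uses the public openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers as a stand-in for my own encoders, and the image index is a random placeholder that would normally hold embeddings computed once, offline.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image index: in a real system these are image embeddings computed offline
image_index = F.normalize(torch.randn(1000, 512), dim=-1)   # 1000 images, 512-d

query = "two dogs playing in the snow"     # any text, no fixed label set
tokens = tokenizer([query], padding=True, return_tensors="pt")
with torch.no_grad():
    text_emb = F.normalize(model.get_text_features(**tokens), dim=-1)

# Rank every indexed image by cosine similarity to the fresh query vector
scores = (text_emb @ image_index.T).squeeze(0)
top5 = scores.topk(5).indices.tolist()
print("best matching image ids:", top5)
```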

Building CLIP on Flickr30k

From-scratch implementation trained on a real dataset

  • Chose Flickr30k as the training dataset. It has 31,000 images with five captions each. Small enough to train locally, rich enough to learn real associations
  • Completed the initial coding and trained it on my M4 Pro MacBook Pro. Apple Silicon unified memory handled the dual encoder training fine with the MPS backend
  • This is the next concrete step from Day 3. I said I wanted to build CLIP from scratch and now I have a working model; a sketch of the core training step follows below
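
A stripped-down sketch of the core pieces: the dual projection heads, the learned temperature, the symmetric contrastive loss, and MPS device selection. The encoders are placeholder linear layers over fake backbone features, so the dimensions and hyperparameters are illustrative rather than the exact ones I used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "mps" if torch.backends.mps.is_available() else "cpu"

class TinyCLIP(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        # Stand-ins for the image and text encoders (projection heads only)
        self.image_proj = nn.Linear(img_dim, embed_dim)
        self.text_proj = nn.Linear(txt_dim, embed_dim)
        # Learned temperature, stored as a log of the logit scale, ln(1/0.07)
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))

    def forward(self, image_feats, text_feats):
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.T            # (B, B)
        targets = torch.arange(len(logits), device=logits.device)
        # Symmetric cross-entropy: match image i to caption i and vice versa
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2

model = TinyCLIP().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One fake training step: random features standing in for a batch of backbone outputs
image_feats = torch.randn(32, 2048, device=device)
text_feats = torch.randn(32, 768, device=device)
opt.zero_grad()
loss = model(image_feats, text_feats)
loss.backward()
opt.step()
print(f"loss: {loss.item():.4f}")
```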

Training results

Recall metrics on Flickr30k after training

Metric                     Score
Image-to-Text Recall@1     58.70%
Image-to-Text Recall@5     85.80%
Text-to-Image Recall@1     54.50%
Text-to-Image Recall@5     84.80%
  • Text-to-image is slightly harder than image-to-text. Makes sense, since multiple images can plausibly match a text description, while each image has more unique visual detail to anchor on. A sketch of how Recall@K is computed follows below
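
For completeness, this is roughly how the Recall@K numbers above can be computed in a Flickr30k-style evaluation with five captions per image. The embeddings here are random placeholders, and the function names and shapes are illustrative, not the exact evaluation code I ran.

```python
import torch
import torch.nn.functional as F

n_images = 1000
image_embs = F.normalize(torch.randn(n_images, 512), dim=-1)
text_embs = F.normalize(torch.randn(n_images * 5, 512), dim=-1)   # 5 captions per image
caption_to_image = torch.arange(n_images * 5) // 5                # caption j belongs to image j // 5

def image_to_text_recall(k):
    sims = image_embs @ text_embs.T                     # (n_images, n_captions)
    topk = sims.topk(k, dim=-1).indices                 # top-k caption ids per image
    # A hit if any of the image's own 5 captions shows up in its top-k
    hits = (caption_to_image[topk] == torch.arange(n_images).unsqueeze(1)).any(dim=-1)
    return hits.float().mean().item()

def text_to_image_recall(k):
    sims = text_embs @ image_embs.T                     # (n_captions, n_images)
    topk = sims.topk(k, dim=-1).indices                 # top-k image ids per caption
    hits = (topk == caption_to_image.unsqueeze(1)).any(dim=-1)
    return hits.float().mean().item()

print("I->T R@1:", image_to_text_recall(1), " T->I R@5:", text_to_image_recall(5))
```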