Day 4 — Tue, Apr 7

Zero-shot classification: from embeddings to probabilities

  • Goal today was to understand the experiment section properly so I have a clear picture of zero-shot learning and how things are done in practice. This will give me something concrete to start working on the actual code
  • Built my own CLIP model from scratch and trained it on the Flickr30k dataset, running the training on my M4 Pro MacBook Pro
  • The zero-shot pipeline is clearer now. Understanding it is enough to build a text-to-image search system

Zero-shot image classification pipeline

How CLIP goes from raw inputs to a prediction without ever seeing labeled training data for the task

  • Start with the image side: a computer vision model acts as the image encoder and produces an image embedding
  • Then match the image embedding against text embeddings using cosine similarity search. This gives raw similarity scores between the image and each candidate text label
  • Divide the similarity scores by a learned temperature parameter (CLIP implements this as multiplying by a learned logit scale). Temperature controls how peaked or flat the distribution is. Higher temperature means softer probabilities, lower means sharper
  • Normalize the scaled scores into a probability distribution via softmax. Now you have actual probabilities over the candidate labels. Pick the highest one
  • The whole thing works because both encoders were trained to place matching pairs close together in the shared embedding space. At inference time you just measure distance; a code sketch of these steps follows this list
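
A minimal PyTorch sketch of those four steps. The embeddings are random placeholders standing in for the encoder outputs, and the labels, shapes, and logit-scale value are illustrative, not taken from my model.

```python
import torch
import torch.nn.functional as F

# Placeholder encoder outputs (illustrative shapes)
image_emb = torch.randn(1, 512)            # one image embedding from the image encoder
text_embs = torch.randn(3, 512)            # one embedding per candidate label text
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# 1. L2-normalize so the dot product equals cosine similarity
image_emb = F.normalize(image_emb, dim=-1)
text_embs = F.normalize(text_embs, dim=-1)

# 2. Cosine similarity between the image and every candidate text
sims = image_emb @ text_embs.T             # shape (1, 3)

# 3. Scale by the learned logit scale (= 1 / temperature, ~100 for trained CLIP)
logit_scale = torch.tensor(100.0)
logits = logit_scale * sims

# 4. Softmax turns the scaled similarities into a probability distribution
probs = logits.softmax(dim=-1)
print(labels[probs.argmax().item()], probs)
```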

Hypernetworks and on-the-fly representations

Understanding how text encodings work without pre-shared vectors

  • Wanted to understand the concept of a hypernetwork. It is another way of thinking about what happens at inference
  • The key insight: we haven't shared any vectors anywhere ahead of time. The text encoder produces vectors on the fly from whatever text you give it. There is no fixed label set
  • These text vectors then get matched against image embeddings produced by an encoder that was already trained against the ambiguous, rich feature space that language supervision provides
  • This is what makes zero-shot transfer possible. You can write any text query and get meaningful similarity scores against images the model has never seen paired with that specific text; see the sketch after this list
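
A sketch of what "on the fly" means in practice: a free-form query is embedded at inference time and ranked against a precomputed image index. It uses the public openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers as a stand-in for my own encoders, and the image index is a random placeholder that would normally hold embeddings computed once, offline.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image index: in a real system these are image embeddings computed offline
image_index = F.normalize(torch.randn(1000, 512), dim=-1)   # 1000 images, 512-d

query = "two dogs playing in the snow"     # any text, no fixed label set
tokens = tokenizer([query], padding=True, return_tensors="pt")
with torch.no_grad():
    text_emb = F.normalize(model.get_text_features(**tokens), dim=-1)

# Rank every indexed image by cosine similarity to the fresh query vector
scores = (text_emb @ image_index.T).squeeze(0)
top5 = scores.topk(5).indices.tolist()
print("best matching image ids:", top5)
```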

Building CLIP on Flickr30k

From-scratch implementation trained on a real dataset

  • Chose Flickr30k as the training dataset. It has 31,000 images with five captions each. Small enough to train locally, rich enough to learn real associations
  • Completed the initial coding and trained it on my M4 Pro MacBook Pro. Apple Silicon unified memory handled the dual encoder training fine with the MPS backend
  • This is the next concrete step from Day 3. I said I wanted to build CLIP from scratch and now I have a working model; a sketch of the core training step follows below
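
A stripped-down sketch of the core pieces: the dual projection heads, the learned temperature, the symmetric contrastive loss, and MPS device selection. The encoders are placeholder linear layers over fake backbone features, so the dimensions and hyperparameters are illustrative rather than the exact ones I used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "mps" if torch.backends.mps.is_available() else "cpu"

class TinyCLIP(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        # Stand-ins for the image and text encoders (projection heads only)
        self.image_proj = nn.Linear(img_dim, embed_dim)
        self.text_proj = nn.Linear(txt_dim, embed_dim)
        # Learned temperature, stored as a log of the logit scale, ln(1/0.07)
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))

    def forward(self, image_feats, text_feats):
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.T            # (B, B)
        targets = torch.arange(len(logits), device=logits.device)
        # Symmetric cross-entropy: match image i to caption i and vice versa
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2

model = TinyCLIP().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One fake training step: random features standing in for a batch of backbone outputs
image_feats = torch.randn(32, 2048, device=device)
text_feats = torch.randn(32, 768, device=device)
opt.zero_grad()
loss = model(image_feats, text_feats)
loss.backward()
opt.step()
print(f"loss: {loss.item():.4f}")
```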

Training results

Recall metrics on Flickr30k after training

Metric                     Score
Image-to-Text Recall@1     58.70%
Image-to-Text Recall@5     85.80%
Text-to-Image Recall@1     54.50%
Text-to-Image Recall@5     84.80%
  • Text-to-image is slightly harder than image-to-text. Makes sense, since multiple images can plausibly match a text description, while each image has more unique visual detail to anchor on. A sketch of how Recall@K is computed follows below
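
For completeness, this is roughly how the Recall@K numbers above can be computed in a Flickr30k-style evaluation with five captions per image. The embeddings here are random placeholders, and the function names and shapes are illustrative, not the exact evaluation code I ran.

```python
import torch
import torch.nn.functional as F

n_images = 1000
image_embs = F.normalize(torch.randn(n_images, 512), dim=-1)
text_embs = F.normalize(torch.randn(n_images * 5, 512), dim=-1)   # 5 captions per image
caption_to_image = torch.arange(n_images * 5) // 5                # caption j belongs to image j // 5

def image_to_text_recall(k):
    sims = image_embs @ text_embs.T                     # (n_images, n_captions)
    topk = sims.topk(k, dim=-1).indices                 # top-k caption ids per image
    # A hit if any of the image's own 5 captions shows up in its top-k
    hits = (caption_to_image[topk] == torch.arange(n_images).unsqueeze(1)).any(dim=-1)
    return hits.float().mean().item()

def text_to_image_recall(k):
    sims = text_embs @ image_embs.T                     # (n_captions, n_images)
    topk = sims.topk(k, dim=-1).indices                 # top-k image ids per caption
    hits = (topk == caption_to_image.unsqueeze(1)).any(dim=-1)
    return hits.float().mean().item()

print("I->T R@1:", image_to_text_recall(1), " T->I R@5:", text_to_image_recall(5))
```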