Search Images with Words
Status: In Progress
The Spark
You look at a dog and you just know it's a dog. You don't think about it. Light hits your retina, something fires in your brain, and the word is just there. That's insane if you actually stop and think about it. Pixels became a word. How? Does the brain think in pictures first and then assign language to them? Or does language shape what we see? There's no clean answer, but the question keeps pulling me in.
And then there's the machine side. When a VLM processes an image and a sentence together, what's the shared representation that both modalities are actually living in? A vision encoder produces one kind of feature. A language model produces another. Somewhere these have to meet in a space that's useful for both. How does that shared space form? Is it learned end-to-end or forced through a projection? And once you have it, can you fine-tune it for a specific domain and actually trust the output? What breaks, and how do people deal with that in production?
I'm comfortable on the vision side. Detection, segmentation, tracking, that stuff makes sense to me. But I've never properly understood the moment where vision meets language. And now I'm stuck on a more practical question too: if I actually want to solve a vision-language problem, what do I reach for? Do I need a VLM? Do I need a segmentation model like SAM? A foundation model like DINO? Some combination? How do I even decide?
I opened my phone, typed "birthday cake" into Photos, and it found every birthday cake I've ever photographed. I never tagged any of them. My phone did this offline. Google does the same thing but needs a data centre. Same problem, completely different engineering. I want to understand all of it. The neuroscience, the models, the engineering, the tradeoffs. And I want to build something with it.
The Rules
- 2 hours/day, weekdays only
- Open the laptop, think about how vision and language interact, write down what I find
- No fixed syllabus. Follow whatever pulls me in
Questions driving this experiment
How does thinking work?
- Does the brain think in images first and then assign words? Or does language shape perception? What's the actual pathway from seeing to naming?
- When a VLM processes an image and text together, what is the shared representation? How does a space form that's useful for both vision and language, and is it learned end-to-end or forced through projection?
What models exist and when do you use what?
- What exactly is a VLM and how is it different from a vision model with a text head bolted on? What makes it actually multimodal?
- How did the field move from contrastive models (CLIP-style) to generative VLMs (feeding images into language models)? Why?
- When do I need a VLM vs a segmentation model like SAM vs a foundation model like DINO? How do I decide?
Can you actually trust them?
- Are VLMs reliable enough for production? What are the known failure modes and how do people deal with them in industry?
- Where does the alignment break? Compositionality, negation, spatial reasoning. Why, and what can you do about it?
- Can you fine-tune these models? How? Is it practical for a real use case or do you need massive resources?
The engineering
- Why does Apple run image search on-device while Google needs a data centre? What are the actual constraints?
- Can I build a working text-to-image search system from scratch and understand every layer of it?
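That last question is concrete enough to sketch already. A minimal text-to-image search loop, assuming a CLIP-style model with `encode_image` and `encode_text` helpers (placeholder names, not any particular library's API) that return L2-normalised embeddings:

```python
# Minimal text-to-image search sketch. `model.encode_image` / `model.encode_text`
# are assumed helpers that return L2-normalised (1, d) embeddings.
import torch

@torch.no_grad()
def build_index(model, images):
    # Offline step: embed every photo once and keep the matrix around.
    return torch.cat([model.encode_image(img.unsqueeze(0)) for img in images], dim=0)  # (N, d)

@torch.no_grad()
def search(model, index, query, tokenize, top_k=5):
    # Online step: embed the query text, rank photos by cosine similarity.
    q = model.encode_text(tokenize([query]))      # (1, d)
    scores = (q @ index.T).squeeze(0)             # (N,)
    return scores.topk(top_k).indices             # indices of the best-matching photos
```

Apple has to do the offline embedding step on-device; Google can throw a data centre at it. Same shape of problem, different constraints.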
CLIP: the paper that married text and images
Starting with the foundational CLIP paper. Two separate encoders (text + image) trained with contrastive learning to produce a shared embedding space. Read introductory blogs but they weren't rigorous enough, so going straight to the paper next.
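A minimal sketch of that two-tower idea, just to fix the shapes in my head. The encoders here are placeholders, not the paper's actual ResNet/ViT and Transformer backbones:

```python
# Sketch of CLIP's two-encoder setup: separate backbones, linear projections
# into one shared d-dimensional space, L2-normalised so cosine similarity
# is a plain dot product. The backbones are placeholders.
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    def __init__(self, image_encoder, text_encoder, image_dim, text_dim, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder   # images (B, 3, H, W) -> features (B, image_dim)
        self.text_encoder = text_encoder     # token ids (B, L)    -> features (B, text_dim)
        self.image_proj = nn.Linear(image_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_dim, embed_dim, bias=False)

    def forward(self, images, token_ids):
        img = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_encoder(token_ids)), dim=-1)
        return img, txt   # both (B, embed_dim), living in the same space
```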
CLIP's contrastive loss: how two encoders learn one space
Focused on the loss function side of CLIP. Two encoders produce two separate embeddings, and the contrastive loss is what forces them into a shared geometric space. Intuitive understanding only today; low sleep, so I didn't go deep into the maths.
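Still, the shape of the loss is simple enough to write down. My current reading as code: for a batch of N matched pairs, the diagonal of the similarity matrix holds the positives and every other cell is a negative. The paper actually learns the temperature as a parameter; here it's fixed to keep the sketch simple:

```python
# Symmetric contrastive loss sketch: each image must pick out its own caption
# (rows) and each caption its own image (columns). Embeddings are assumed
# L2-normalised; CLIP learns the temperature, fixed here for clarity.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    logits = image_emb @ text_emb.T / temperature            # (N, N) cosine similarities
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_img_to_text = F.cross_entropy(logits, targets)      # row-wise softmax
    loss_text_to_img = F.cross_entropy(logits.T, targets)    # column-wise softmax
    return (loss_img_to_text + loss_text_to_img) / 2
```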
Building CLIP from scratch: reading the paper properly
Decided to build CLIP from scratch. No idea how much training it will require, but building it myself seems like the only way to really understand it. Spent the day reading the paper front to back. I understand the high level but have specific gaps around batch construction, the contrastive loss setup, and model scaling.
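Of those gaps, batch construction at least has an obvious starting point: the dataset just has to hand back matched (image, caption) pairs so that row i of the image batch and row i of the text batch belong together. A rough sketch, with `records`, `load_image`, and `tokenize` as assumed helpers rather than anything from the paper's code:

```python
# Sketch of batch construction for the contrastive setup. `records` is a list
# of (image_path, caption) tuples; `load_image` and `tokenize` are assumed
# helpers (path -> (3, H, W) tensor, string -> (context_len,) token ids).
from torch.utils.data import Dataset, DataLoader

class ImageTextPairs(Dataset):
    def __init__(self, records, load_image, tokenize):
        self.records = records
        self.load_image = load_image
        self.tokenize = tokenize

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        path, caption = self.records[idx]
        return self.load_image(path), self.tokenize(caption)

# Default collation stacks these into (N, 3, H, W) images and (N, context_len)
# tokens; the diagonal-positive loss relies on row i being a true pair.
# loader = DataLoader(ImageTextPairs(records, load_image, tokenize),
#                     batch_size=256, shuffle=True)
```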
Zero-shot classification: from embeddings to probabilities
Focused on the experiments section of the CLIP paper to get a clear picture of how zero-shot classification is actually done. That gives me enough to start on the actual code. Also started building CLIP from scratch on the Flickr30k dataset, planning to train on my M4 Pro MacBook Pro.
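The part that clicked is how embeddings turn into class probabilities. A sketch of the zero-shot recipe, reusing the attribute names from the two-tower sketch above and an assumed `tokenize` helper; the prompt template is the paper's "a photo of a {label}" idea:

```python
# Zero-shot classification sketch: one caption per class via a prompt template,
# cosine similarity against the image embedding, softmax over classes.
# `tokenize` is an assumed helper returning a (num_classes, context_len) batch.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, image, class_names, tokenize, temperature=0.01):
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(model.text_proj(model.text_encoder(tokenize(prompts))), dim=-1)
    img_emb = F.normalize(model.image_proj(model.image_encoder(image.unsqueeze(0))), dim=-1)
    # Scaled cosine similarities act as logits; softmax turns them into probabilities.
    logits = img_emb @ text_emb.T / temperature      # (1, num_classes)
    return logits.softmax(dim=-1).squeeze(0)         # (num_classes,) probabilities
```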