Search Images with Words

In Progress

10 days (2 weeks) · 2 hrs / day, weekdays only · Started March 30, 2026

The Spark

You look at a dog and you just know it's a dog. You don't think about it. Light hits your retina, something fires in your brain, and the word is just there. That's insane if you actually stop and think about it. Pixels became a word. How? Does the brain think in pictures first and then assign language to them? Or does language shape what we see? There's no clean answer, but the question keeps pulling me in.

And then there's the machine side. When a VLM processes an image and a sentence together, what's the shared representation that both modalities are actually living in? A vision encoder produces one kind of feature. A language model produces another. Somewhere these have to meet in a space that's useful for both. How does that shared space form? Is it learned end-to-end or forced through a projection? And once you have it, can you fine-tune it for a specific domain and actually trust the output? What breaks, and how do people deal with that in production?
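The "forced through a projection" option can at least be made concrete. The sketch below is purely illustrative, with assumptions everywhere: random weights standing in for learned ones, made-up dimensions, no real encoder or language model. But it shows the shape of the idea some generative VLMs use: a single learned linear map carries vision features into the language model's token-embedding space, after which image patches are treated like extra tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not any real model's sizes)
D_VISION = 768   # vision encoder output dim
D_LANG = 1024    # language model embedding dim
N_PATCHES = 16   # patch features produced for one image

# Stand-in for a frozen vision encoder's per-patch features
patch_features = rng.standard_normal((N_PATCHES, D_VISION))

# The projection: one linear map from vision space to language space.
# In training this would be learned end-to-end; here it's random.
W = rng.standard_normal((D_VISION, D_LANG)) / np.sqrt(D_VISION)
b = np.zeros(D_LANG)

projected = patch_features @ W + b   # shape (N_PATCHES, D_LANG)

# Once projected, image patches live in the same space as text token
# embeddings, so they can be concatenated into one input sequence.
text_embeddings = rng.standard_normal((5, D_LANG))  # 5 toy text tokens
sequence = np.concatenate([projected, text_embeddings], axis=0)
print(sequence.shape)  # (21, 1024)
```

Whether a plain linear layer like this is enough, or whether the shared space has to be shaped more deeply during training, is exactly the open question.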

I'm comfortable on the vision side. Detection, segmentation, tracking: that stuff makes sense to me. But I've never properly understood the moment where vision meets language. And now I'm stuck on a more practical question too: if I actually want to solve a vision-language problem, what do I reach for? Do I need a VLM? Do I need a segmentation model like SAM? A foundation model like DINO? Some combination? How do I even decide?

I opened my phone, typed "birthday cake" into Photos, and it found every birthday cake I've ever photographed. I never tagged any of them. My phone did this offline. Google does the same thing but needs a data centre. Same problem, completely different engineering. I want to understand all of it. The neuroscience, the models, the engineering, the tradeoffs. And I want to build something with it.

The Rules

  • 2 hours/day, weekdays only
  • Open the laptop, think about how vision and language interact, write down what I find
  • No fixed syllabus. Follow whatever pulls me in

Questions driving this experiment

How does thinking work?

  • Does the brain think in images first and then assign words? Or does language shape perception? What's the actual pathway from seeing to naming?
  • When a VLM processes an image and text together, what is the shared representation? How does a space form that's useful for both vision and language, and is it learned end-to-end or forced through projection?

What models exist and when do you use what?

  • What exactly is a VLM and how is it different from a vision model with a text head bolted on? What makes it actually multimodal?
  • How did the field move from contrastive models (CLIP-style) to generative VLMs (feeding images into language models)? Why?
  • When do I need a VLM vs a segmentation model like SAM vs a foundation model like DINO? How do I decide?

Can you actually trust them?

  • Are VLMs reliable enough for production? What are the known failure modes and how do people deal with them in industry?
  • Where does the alignment break? Compositionality, negation, spatial reasoning. Why, and what can you do about it?
  • Can you fine-tune these models? How? Is it practical for a real use case or do you need massive resources?

The engineering

  • Why does Apple run image search on-device while Google needs a data centre? What are the actual constraints?
  • Can I build a working text-to-image search system from scratch and understand every layer of it?
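The retrieval core of that system is small enough to sketch already. This is a toy: random unit vectors stand in for real CLIP-style image and text embeddings (the encoders are the part that takes actual work), and photo 42 plays the birthday cake. All it demonstrates is the search step itself: embed everything into one space, then rank by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for CLIP-style embeddings. In a real system an image
# encoder and a text encoder, trained to share a space, produce these;
# here random unit vectors play that role.
DIM = 64
N_PHOTOS = 1000
photo_index = rng.standard_normal((N_PHOTOS, DIM))
photo_index /= np.linalg.norm(photo_index, axis=1, keepdims=True)

def search(query_vec, index, k=5):
    # Cosine similarity reduces to a dot product once both sides
    # are unit-normalised; argsort gives the top-k ranking.
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q
    return np.argsort(scores)[::-1][:k]

# Pretend photo 42 is a birthday cake: a "birthday cake" text query
# should embed close to it, so model the query as that photo's
# embedding plus a little noise.
query = photo_index[42] + 0.05 * rng.standard_normal(DIM)

top_k = search(query, photo_index)
print(top_k[0])  # 42: the query retrieves the photo it embeds nearest to
```

Scaling this from 1,000 toy vectors to every photo on a phone, with real encoders, quantisation, and an index that fits in memory, is presumably where the Apple-vs-Google engineering gap shows up.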
4 / 10 days