Day 7 — Sat, Mar 28

The complete architecture — how vision flows through the brain

  • Final day — consolidating everything from Days 1–6 into two comprehensive architectural diagrams, synthesising what we've learned about how the brain sees
  • The visual cortex is not a single pipeline — it's two parallel streams that diverge from V1, each with its own purpose, speed, and computational strategy. The split begins at the retina itself: M-cells feed the dorsal stream, P-cells feed the ventral stream
  • The dorsal stream is fast (~30–50 ms), memoryless, and action-oriented — it evolved to keep us alive in real-time. It never "sees" objects, only motion, space, and prediction errors
  • The ventral stream is slow (~100–150 ms), hierarchical, and meaning-oriented — it evolved to help us understand and remember what we see. It actively constructs perception at every stage
  • Both streams are interconnected — the dorsal stream tells the ventral stream where to look (triggering saccades), while the ventral stream tells the dorsal stream what to expect (top-down predictions). They form a closed loop, not independent pipelines
  • Three recurring themes define visual processing: (1) the brain uses relative perception, never absolute — colour, brightness, depth are all computed by comparison; (2) feedback outnumbers feedforward — higher regions constantly predict what lower regions should see; (3) perception is a controlled hallucination — the brain fills gaps, constructs illusory edges, and resolves ambiguity at every stage

Dorsal Stream — Complete Architecture

The "Where / How" pathway. Uses M-cells (fast, transient). Operates in real-time with no memory. Purpose: spatial awareness, motion tracking, and guiding action.

Dorsal Stream — "Where / How" Pathway M-cells (fast, transient, change-detecting) · real-time · no memory · action-oriented Latency: ~30–50 ms Memory: None Neurons need change to fire Retina (M-cells / Magnocellular) Large receptive fields · fast transient response · low spatial freq Detects changes in the scene — does not sustain signal V1 — Primary Visual Cortex Orientation-selective neurons: simple cells → complex cells Motion direction selectivity · broad spatial tuning Saccades: 3/sec involuntary micro-movements keep image shifting on retina V2 / V3 Global disparity · coarse depth maps · motion boundary extraction Separates object motion from background motion MT / V5 — Middle Temporal Area Mechanism: Motion Energy Bank of directional filters tracking movement vectors Neurons perceive motion vectors, not objects themselves Damage → akinetopsia: world appears as series of frozen stills MST — Medial Superior Temporal Mechanism: Optical Flow Computes self-motion from visual field changes Expansion/contraction patterns → heading direction · speed estimation PPC — Posterior Parietal Cortex Mechanism: Predictive Coding Tracks prediction error: where object should be vs where it is 3D spatial model: distance, speed, relative position of everything The brain's GPS — builds and updates real-time world map Motor Cortex / Pre-Motor "How" → reach, grasp, dodge, catch Frontal Eye Fields (FEF) "Where" → gaze direction, saccades efference copy — cancels self-generated motion
  • The dorsal stream is remarkably fast because it uses M-cells — large neurons that respond to changes, not sustained input. They sacrifice detail for speed
  • Each stage adds a layer of spatial understanding: raw motion → flow fields → spatial maps → motor commands
  • The efference copy mechanism is crucial — without it, every eye movement would make the world appear to spin. The brain pre-cancels self-generated motion
  • PPC doesn't just track objects — it builds a full 3D spatial model of the environment, updated in real-time
  • The dorsal stream's output splits into two: the "how" pathway (grasping, reaching — via motor cortex) and the "where" pathway (gaze direction — via frontal eye fields)
  • Key insight: the dorsal stream never "sees" objects. A ball flying toward your face triggers a dodge reflex without you ever consciously identifying it as a ball
  • M-cells vs P-cells is a fundamental design decision made at the retina itself — the split between the two streams doesn't begin at the cortex, it begins at the eye. M-cells have large receptive fields and respond only to transient changes (onset/offset), which is why the dorsal stream is blind to static detail but lightning-fast at detecting motion
  • Visual cortex neurons go silent if the image is perfectly stable on the retina — they need change to fire. So our eyes make tiny involuntary saccades (~3 per second) to keep the image shifting. Without these micro-movements, vision would literally fade to nothing. The dorsal stream depends on this constant micro-motion
  • MT/V5 neurons don't track objects — they track motion vectors. They function as a bank of directional filters, each tuned to a specific direction and speed. Damage to MT causes akinetopsia: the world appears as a series of frozen snapshots rather than continuous motion. Pouring coffee becomes impossible because you can't see the liquid rising (a toy directional filter bank is sketched in code after this list)
  • MST takes MT's local motion signals and computes global optical flow — the pattern of visual motion across the entire field. Expansion patterns mean you're moving forward, contraction means backward. This is how you sense your heading direction while moving, without needing to track any single object (see the focus-of-expansion sketch after this list)
  • Predictive coding in PPC is remarkably efficient: PPC doesn't encode where every object is — it only encodes the prediction error, the difference between where things should be and where they actually are. If an object moves exactly as expected, PPC barely fires. Only surprises get encoded. This is why we notice when something moves unexpectedly but can ignore predictable motion (see the prediction-error sketch after this list)
  • The dorsal stream also feeds into the ventral stream — it tells the ventral stream where to direct attention. When the dorsal stream detects something moving in the periphery, it triggers a saccade to that location, and the ventral stream then identifies what it is. The two streams are not independent pipelines — they form a closed loop
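
A minimal sketch of the "bank of directional filters" idea from the MT/V5 bullet above, in NumPy. Each hypothetical unit prefers one direction; its response is simply how well the previous frame, shifted along that direction, matches the next frame — a crude stand-in for true spatiotemporal motion-energy filtering, not a model of real MT.

```python
import numpy as np

def direction_bank_response(frame_prev, frame_next, directions, speed=1):
    """Toy MT-style filter bank: each 'unit' is tuned to one motion direction.

    A unit responds in proportion to how well the previous frame, shifted
    along its preferred direction, overlaps the next frame.
    """
    responses = {}
    for name, (dy, dx) in directions.items():
        shifted = np.roll(frame_prev, (dy * speed, dx * speed), axis=(0, 1))
        responses[name] = float((shifted * frame_next).sum())
    return responses

# A bright vertical bar moving one pixel to the right between two frames.
frame_prev = np.zeros((32, 32)); frame_prev[:, 10] = 1.0
frame_next = np.zeros((32, 32)); frame_next[:, 11] = 1.0

directions = {"right": (0, 1), "left": (0, -1), "up": (-1, 0), "down": (1, 0)}
resp = direction_bank_response(frame_prev, frame_next, directions)
print(max(resp, key=resp.get))   # -> 'right': the rightward-tuned unit wins
```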
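
And a geometric sketch of the MST optical-flow idea: during forward self-motion the flow field radiates from a focus of expansion, so every flow vector gives one linear constraint on the heading point. The `focus_of_expansion` helper and the simulated flow field are purely illustrative.

```python
import numpy as np

def focus_of_expansion(points, flows):
    """Recover the heading point (focus of expansion) from an optic flow field.

    Each flow vector must point radially away from the FOE, i.e. be parallel
    to (point - FOE); stacking those constraints gives a small least-squares
    problem for the FOE's coordinates.
    """
    A = np.stack([-flows[:, 1], flows[:, 0]], axis=1)            # [-vy, vx]
    b = flows[:, 0] * points[:, 1] - flows[:, 1] * points[:, 0]  # vx*py - vy*px
    foe, *_ = np.linalg.lstsq(A, b, rcond=None)
    return foe

# Simulate pure expansion around a true heading of (12, 5): flow = k * (p - FOE).
rng = np.random.default_rng(0)
pts = rng.uniform(-50, 50, size=(200, 2))
flow = 0.1 * (pts - np.array([12.0, 5.0]))
print(focus_of_expansion(pts, flow))   # ~ [12.  5.]
```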
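
Finally, a toy version of the PPC prediction-error idea: predict each position by constant-velocity extrapolation and "fire" only in proportion to the surprise. The trajectory and the extrapolation rule are assumptions chosen to make the point visible.

```python
import numpy as np

def prediction_error_trace(positions):
    """Toy predictive-coding readout: respond only to violations of expectation.

    The predicted position is a constant-velocity extrapolation of the last
    two samples; the 'neural response' is the size of the prediction error.
    """
    errors = []
    for t in range(2, len(positions)):
        predicted = positions[t - 1] + (positions[t - 1] - positions[t - 2])
        errors.append(abs(positions[t] - predicted))
    return np.array(errors)

# An object drifting predictably at +1 per step, then jumping unexpectedly.
trajectory = np.array([0, 1, 2, 3, 4, 5, 12, 13, 14], dtype=float)
print(prediction_error_trace(trajectory))
# -> [0. 0. 0. 0. 6. 6. 0.]  only the surprise gets encoded
```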

Ventral Stream — Complete Architecture

The "What" pathway. Uses P-cells (slow, sustained, high-resolution). Hierarchical assembly from edges to meaning. Purpose: identification, recognition, and memory.

Ventral Stream — "What" Pathway
P-cells (slow, sustained, high-resolution) · hierarchical assembly · builds meaning
Latency: ~100–150 ms · Strategy: hierarchical · Relative perception only

  • Retina → LGN (Parvocellular) — P-cells: small receptive fields · colour-sensitive · fine detail; slow but sustained signal, unlike M-cells, which fire only at changes
  • V1 — Edge Detection (low-level) — simple cells: ON/OFF responses to oriented edges (like CNN kernels); complex cells combine simple cells → position-invariant detection; contextual modulation: horizontal connections between same-orientation neurons; end-stopping: short lines = object edges, long lines = background; evolution: mice detect edges pre-cortex → cats use simple cells → primates skip to complex cells
  • V2 — Contour Integration, Depth & Segmentation — assembles V1 edges into continuous contour boundaries; illusory contours: perceives edges that don't exist (Kanizsa triangles); border ownership: determines figure vs ground for each edge; global disparity (vs V1's local) · disparity-capture propagation; Da Vinci stereopsis: absence of information = depth signal; "perception is a controlled hallucination" — V2 actively constructs reality
  • V4 — Shape, Colour Constancy & Invariance — the bridge between features and objects: (1) colour constancy — factors out illumination, computes relative colour; (2) shape extraction — geometric properties of curves, contours, angles; (3) the beginning of invariance — tolerates small position, rotation, and size changes; heavy top-down input from IT; damage → achromatopsia: total colour loss, everything else intact
  • Posterior IT — Object Perception — where visual features become objects; complete object representations — not features, but things; cortical columns (~400 μm wide) that overlap and share knowledge; population coding: the firing pattern across neurons = a vector embedding; full invariance: the same neurons fire regardless of size, position, rotation; face-selective regions · long-range horizontal connections · ~30,000 objects from shared columns
  • Anterior IT — Semantic Binding — where seeing becomes knowing; connects percepts to stored semantic knowledge and memory; "I see a face" → "this is my mother" — perception meets meaning; category-specific regions: living vs non-living, faces vs tools vs places; damage → associative agnosia (can't name) · prosopagnosia (face identity)
  • Amygdala — emotional significance: threat? reward? "does what I'm seeing matter to me?"
  • Hippocampus — contextual memory: when, where, with whom; stores the full episode, not just the percept
  • Top-down feedback: more fibres run downward than upward · full pipeline: ~100–150 ms from photon to contextualised memory
  • Like a CNN: edges (V1) → contours (V2) → shapes (V4) → objects (IT) → meaning (anterior IT) — from concrete edges to abstract meaning
  • The ventral stream is the brain's deep neural network — each stage builds increasingly abstract representations, from edges to objects to meaning
  • V1 and V2 handle intermediate processing: edges, contours, depth, border ownership, figure-ground segmentation — the brain's feature extraction layers
  • V4 is the critical bridge — it computes colour constancy, extracts shapes, and begins transformation invariance. Damage here causes achromatopsia (total colour blindness)
  • IT cortex is where perception happens. Posterior IT builds complete object representations using overlapping columns (~400 μm wide) and population coding. No single neuron says "face" — the pattern across thousands of neurons encodes it
  • Anterior IT connects perception to meaning — this is where "seeing" becomes "knowing." Damage here causes associative agnosia: you can draw an object perfectly but can't name it
  • The final stage connects to the amygdala (emotional significance) and hippocampus (contextual memory) — vision becomes fully integrated cognition
  • Feedback connections outnumber feedforward ones — higher regions constantly predict what lower regions should see, making recognition faster through top-down priors
  • The entire journey from photon hitting the retina to a contextualised, emotionally-tagged memory takes roughly 100–150 ms for the ventral stream
  • V1's simple and complex cells are the foundation — simple cells have ON/OFF receptive fields tuned to specific orientations (like 45° edge detectors), while complex cells combine multiple simple cells to achieve position invariance. This is the first abstraction. The evolution of this is striking: mice detect edges before the cortex, cats use simple cells, but primates skip straight to complex cells — as edge detection moved closer to the cortex, it gained plasticity and connectivity (a minimal simple/complex-cell sketch appears in code after this list)
  • Contextual modulation is a key V1 mechanism that resolves ambiguity: a neuron doesn't work alone — it communicates with neighbouring neurons detecting the same orientation via horizontal connections. If neighbours are quiet (noise), the neuron fires stronger (real edge). If neighbours are also active, it suppresses (likely texture/background). This is how we see the outer leaves of a tree sharply while the inner leaves remain blurry — we infer their edges but don't actually perceive them (sketched in code after this list)
  • End-stopping (inhibitory surround) is another V1 mechanism: inhibitory zones sit at both ends of a receptive field. Short lines stay within the excitatory zone → strong fire (object edge). Long lines extend into inhibitory zones → weak fire (likely background, which tends to have uniform longer lines). The brain is actively distinguishing figure from ground at the very first cortical stage (see the end-stopping sketch after this list)
  • V2 is where the brain begins actively constructing reality rather than passively recording it. It perceives illusory contours (Kanizsa triangles — you see a triangle where no edges exist), determines border ownership (which side of an edge is figure vs ground), and handles global disparity for depth. V2 also performs amodal completion — when a cup blocks part of a book, V2 fills in the hidden portion. Each eye sees a different hidden part, and the brain combines them into one complete object
  • Depth perception uses three types of specialised neurons: tuned excitatory (fires at one specific depth), tuned inhibitory (silent at fixation plane, fires everywhere else), and near/far cells (coarsely categorise as closer or farther). The brain also uses disparity capture — starting from high-confidence depth points (clear edges, strong texture), it propagates depth estimates outward to ambiguous regions. Da Vinci stereopsis goes further: features visible to only one eye (the occluded region) become depth signals. Absence of information is itself information (the three tuning profiles are sketched after this list)
  • V4 computes colour constancy by comparing colour relative to surroundings, never in isolation — a red apple looks red in sunlight and under fluorescent light because V4 factors out illumination across the whole scene. The brain never uses absolute "pixel values" — a grey box on a white background looks darker, but the same grey on a black background looks lighter. All perception is relative, contextual, comparative (see the colour-constancy sketch after this list)
  • The overlapping column architecture in IT is fundamentally different from V1's discrete columns. V1 has hard borders between orientation columns (45° neurons and 90° neurons are clearly separated). IT columns blend into each other — they share representations so that the cortex, with limited neurons, can recognise ~30,000 different objects. Recognition is distributed across many columns via long-range horizontal connections, not localised to one. This is why face recognition is robust to damage — no single column holds the entire representation
  • Population coding in IT works like vector embeddings: no single neuron says "cat" — instead, thousands of neurons fire at different rates, and the pattern across the population encodes the object. The firing pattern for a cat is more similar to a dog than to a chair (vector similarity). Ambiguous or partially occluded objects can still be recognised because the distributed code is robust to noise (see the population-coding sketch after this list)
  • The agnosias reveal the architecture through what breaks: apperceptive agnosia (posterior IT damaged) — sees edges and colours but can't assemble them into objects, can't draw from memory. Associative agnosia (anterior IT damaged) — can see and draw objects perfectly but can't name them or say what they're for, the link to meaning is severed. Prosopagnosia — can recognise that something is a face, read expressions, but can't assign it to a person. Category-specific agnosias show that IT organises objects by category — some patients lose living things but keep tools, others lose fruits specifically. Living things share traits (eyes, limbs, organic texture) and are grouped together; non-living things (rigid geometry, manufactured surfaces) are processed separately
  • The amygdala and hippocampus represent the final transition from perception to cognition. The amygdala assigns emotional valence — is this threatening? rewarding? The hippocampus stores the episode — not just what you saw, but when, where, and who you were with. This is where visual perception becomes a fully contextualised, emotionally-tagged memory. The entire ventral stream exists to serve this endpoint: from photon to meaning
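
A minimal NumPy sketch of the simple-cell → complex-cell step described above: oriented kernels play the role of simple cells, and a max over positions plays the role of a complex cell. The kernel shapes and the pooling rule are illustrative choices, not a model of real V1 tuning.

```python
import numpy as np

# "Simple cells" as small oriented kernels (vertical and horizontal edges),
# "complex cells" as a max over nearby positions — the pooling that buys
# position invariance.
VERTICAL = np.array([[-1, 0, 1]] * 3, dtype=float)
HORIZONTAL = VERTICAL.T

def simple_cell_map(image, kernel):
    """Slide an oriented kernel over the image (valid positions only)."""
    h, w = kernel.shape
    out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return np.abs(out)

def complex_cell(image, kernel):
    """Pool simple-cell responses over position: fires if the edge is anywhere."""
    return simple_cell_map(image, kernel).max()

img = np.zeros((10, 10)); img[:, 6] = 1.0      # a vertical edge at column 6
print(complex_cell(img, VERTICAL), complex_cell(img, HORIZONTAL))  # strong vs ~0
```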
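
A toy version of the contextual-modulation bullet: each unit's raw response is divided down by the activity of its same-orientation neighbours, so an isolated edge survives while an edge buried in texture is suppressed. The divisive form and constants are assumptions made for illustration.

```python
import numpy as np

def contextual_modulation(responses, suppression=0.5):
    """Divide each unit's response by the activity of its immediate neighbours.

    Quiet neighbours -> the response passes through; active neighbours
    (texture/background) -> the response is suppressed.
    """
    responses = np.asarray(responses, dtype=float)
    out = np.empty_like(responses)
    for i in range(len(responses)):
        neighbours = np.concatenate([responses[max(0, i - 1):i],
                                     responses[i + 1:i + 2]])
        out[i] = responses[i] / (1.0 + suppression * neighbours.mean())
    return out

isolated_edge = [0.0, 1.0, 0.0]   # quiet neighbours -> barely suppressed
texture       = [1.0, 1.0, 1.0]   # active neighbours -> divided down
print(contextual_modulation(isolated_edge)[1], contextual_modulation(texture)[1])
# -> 1.0 vs ~0.67
```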
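
End-stopping sketched in one dimension: an excitatory centre flanked by inhibitory end zones, so a short bar drives the unit strongly while a long bar that spills into the end zones silences it. The zone sizes and weights are invented for the example.

```python
import numpy as np

def end_stopped_response(bar_length, centre=5, ends=3):
    """1D end-stopped unit: excitatory centre, inhibitory end zones.

    A bar is centred on the receptive field; whatever extends past the
    centre lands in the end zones and subtracts from the response.
    """
    field = np.concatenate([-np.ones(ends), np.ones(centre), -np.ones(ends)])
    bar = np.zeros_like(field)
    start = (len(field) - bar_length) // 2
    bar[start:start + bar_length] = 1.0
    return float(np.maximum(field @ bar, 0.0))   # rectified firing rate

print(end_stopped_response(5))    # short bar fits the centre   -> strong (5.0)
print(end_stopped_response(11))   # long bar hits the end zones -> silent (0.0)
```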
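
The three depth-cell classes above as toy tuning curves over disparity (taking negative disparity as "near" for this sketch): tuned excitatory peaks at the fixation plane, tuned inhibitory is silent there, and near/far cells prefer one side. The curve shapes and constants are illustrative only.

```python
import numpy as np

def tuned_excitatory(d):  return np.exp(-d**2 / 0.5)          # peaks at fixation depth
def tuned_inhibitory(d):  return 1.0 - np.exp(-d**2 / 0.5)    # silent at fixation, fires elsewhere
def near_cell(d):         return 1.0 / (1.0 + np.exp(5 * d))  # prefers near (negative) disparity
def far_cell(d):          return 1.0 / (1.0 + np.exp(-5 * d)) # prefers far (positive) disparity

# Responses at a near point, at the fixation plane, and at a far point.
for d in (-1.0, 0.0, 1.0):
    print(d, round(tuned_excitatory(d), 2), round(tuned_inhibitory(d), 2),
          round(near_cell(d), 2), round(far_cell(d), 2))
```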
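
A classical stand-in for the colour-constancy bullet: grey-world normalisation, which expresses every pixel relative to the scene average as a crude estimate of the illuminant. V4 is far more sophisticated than this; the sketch only demonstrates the "relative, never absolute" principle.

```python
import numpy as np

def relative_colour(image):
    """Grey-world-style normalisation: divide by the scene-average RGB,
    a crude illuminant estimate, so colour is expressed relative to the scene.
    """
    illuminant = image.reshape(-1, 3).mean(axis=0)
    return image / illuminant

rng = np.random.default_rng(1)
scene = rng.uniform(0.2, 0.8, size=(8, 8, 3))     # a scene under white light
warm  = scene * np.array([1.3, 1.0, 0.6])         # the same scene under warm light
pixel_white = relative_colour(scene)[0, 0]
pixel_warm  = relative_colour(warm)[0, 0]
print(np.allclose(pixel_white, pixel_warm))       # True: relative colour is stable
```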
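
Population coding as vector similarity, per the bullet above: made-up firing-rate vectors for three objects, cosine similarity as the comparison, and a noisy ("occluded") view that still decodes correctly. All rates here are invented for illustration.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical firing rates of six "IT neurons" for three objects. No single
# neuron means "cat"; the identity lives in the whole pattern, like an embedding.
codes = {
    "cat":   np.array([0.9, 0.7, 0.1, 0.8, 0.2, 0.1]),
    "dog":   np.array([0.7, 0.8, 0.2, 0.6, 0.4, 0.3]),
    "chair": np.array([0.1, 0.2, 0.9, 0.1, 0.8, 0.9]),
}
print(cosine(codes["cat"], codes["dog"]), cosine(codes["cat"], codes["chair"]))
# cat is far closer to dog than to chair

rng = np.random.default_rng(2)
occluded_cat = codes["cat"] + rng.normal(0, 0.05, size=6)   # noisy, partial view
best = max(codes, key=lambda k: cosine(codes[k], occluded_cat))
print(best)   # still decoded as 'cat': the distributed code tolerates noise
```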

Reflections — What the Brain Teaches Us About Computer Vision

  • CNNs are directly inspired by the ventral stream's hierarchical assembly: edges → contours → shapes → objects. Hubel & Wiesel's simple/complex cells from the 1960s are the intellectual ancestors of convolutional filters (a toy layer-by-layer stack is sketched at the end of these notes)
  • The brain's attention mechanism (top-down modulation reaching all the way to V1) predates transformer attention by millions of years — but the principle is the same: not all information is equally important
  • Population coding in IT cortex is vector similarity — the same idea behind embedding spaces in modern AI. Similar objects have similar firing patterns, just as similar concepts have similar vectors
  • The brain uses far more feedback than feedforward connections — modern AI is only beginning to explore this with iterative refinement and diffusion models
  • Contextual modulation (horizontal connections between same-level neurons) is something most feedforward vision models still lack — in the cortex, neurons talk laterally to their neighbours within a layer, not just to the layers above and below
  • The brain's efficiency is remarkable: it encodes only edges and fills in surfaces, uses relative (not absolute) measurements, and reuses feature detectors across categories. ~30,000 objects recognised with overlapping columns sharing knowledge
  • Perhaps the most profound insight: perception is a controlled hallucination. The brain doesn't passively record reality — it actively constructs it, filling gaps, predicting patterns, and resolving ambiguity at every stage
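
To make the CNN analogy from the first reflection concrete, here is a deliberately tiny PyTorch stack with each stage labelled by its rough cortical counterpart. The layer sizes, depths, and 10-way output are arbitrary choices (and assume torch is installed); only the edges → contours → shapes → objects → meaning progression is the point.

```python
import torch
import torch.nn as nn

# A miniature CNN whose stages are labelled with the cortical analogues used
# in these notes. Sizes are illustrative, not a model of the ventral stream.
ventral_like = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # V1: oriented edges
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # V2: contours, borders
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # V4: shapes, invariance
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),                                    # posterior IT: pooled object code
    nn.Linear(64, 10),                                                        # anterior IT: category / meaning
)

image = torch.randn(1, 3, 224, 224)      # one "retinal" input
print(ventral_like(image).shape)         # torch.Size([1, 10])
```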