Day 7 — Sat, Mar 28

The complete architecture — how vision flows through the brain

  • Final day — consolidating everything from Days 1–6 into two comprehensive architectural diagrams, synthesising what we've learned about how the brain sees
  • The visual cortex is not a single pipeline — it's two parallel streams that diverge from V1, each with its own purpose, speed, and computational strategy. The split begins at the retina itself: M-cells feed the dorsal stream, P-cells feed the ventral stream
  • The dorsal stream is fast (~30–50 ms), memoryless, and action-oriented — it evolved to keep us alive in real-time. It never "sees" objects, only motion, space, and prediction errors
  • The ventral stream is slow (~100–150 ms), hierarchical, and meaning-oriented — it evolved to help us understand and remember what we see. It actively constructs perception at every stage
  • Both streams are interconnected — the dorsal stream tells the ventral stream where to look (triggering saccades), while the ventral stream tells the dorsal stream what to expect (top-down predictions). They form a closed loop, not independent pipelines
  • Three recurring themes define visual processing: (1) the brain uses relative perception, never absolute — colour, brightness, depth are all computed by comparison; (2) feedback outnumbers feedforward — higher regions constantly predict what lower regions should see; (3) perception is a controlled hallucination — the brain fills gaps, constructs illusory edges, and resolves ambiguity at every stage

Dorsal Stream — Complete Architecture

The "Where / How" pathway. Uses M-cells (fast, transient). Operates in real-time with no memory. Purpose: spatial awareness, motion tracking, and guiding action.

Dorsal Stream — "Where / How" Pathway M-cells (fast, transient, change-detecting) · real-time · no memory · action-oriented Latency: ~30–50 ms Memory: None Neurons need change to fire Retina (M-cells / Magnocellular) Large receptive fields · fast transient response · low spatial freq Detects changes in the scene — does not sustain signal V1 — Primary Visual Cortex Orientation-selective neurons: simple cells → complex cells Motion direction selectivity · broad spatial tuning Saccades: 3/sec involuntary micro-movements keep image shifting on retina V2 / V3 Global disparity · coarse depth maps · motion boundary extraction Separates object motion from background motion MT / V5 — Middle Temporal Area Mechanism: Motion Energy Bank of directional filters tracking movement vectors Neurons perceive motion vectors, not objects themselves Damage → akinetopsia: world appears as series of frozen stills MST — Medial Superior Temporal Mechanism: Optical Flow Computes self-motion from visual field changes Expansion/contraction patterns → heading direction · speed estimation PPC — Posterior Parietal Cortex Mechanism: Predictive Coding Tracks prediction error: where object should be vs where it is 3D spatial model: distance, speed, relative position of everything The brain's GPS — builds and updates real-time world map Motor Cortex / Pre-Motor "How" → reach, grasp, dodge, catch Frontal Eye Fields (FEF) "Where" → gaze direction, saccades efference copy — cancels self-generated motion
  • The dorsal stream is remarkably fast because it uses M-cells — large neurons that respond to changes, not sustained input. They sacrifice detail for speed
  • Each stage adds a layer of spatial understanding: raw motion → flow fields → spatial maps → motor commands
  • The efference copy mechanism is crucial — without it, every eye movement would make the world appear to spin. The brain pre-cancels self-generated motion
  • PPC doesn't just track objects — it builds a full 3D spatial model of the environment, updated in real-time
  • The dorsal stream's output splits into two: the "how" pathway (grasping, reaching — via motor cortex) and the "where" pathway (gaze direction — via frontal eye fields)
  • Key insight: the dorsal stream never "sees" objects. A ball flying toward your face triggers a dodge reflex without you ever consciously identifying it as a ball
  • M-cells vs P-cells is a fundamental design decision made at the retina itself — the split between the two streams doesn't begin at the cortex, it begins at the eye. M-cells have large receptive fields and respond only to transient changes (onset/offset), which is why the dorsal stream is blind to static detail but lightning-fast at detecting motion
  • Visual cortex neurons go silent if the image is perfectly stable on the retina — they need change to fire. So our eyes make tiny involuntary saccades (~3 per second) to keep the image shifting. Without these micro-movements, vision would literally fade to nothing. The dorsal stream depends on this constant micro-motion
  • MT/V5 neurons don't track objects — they track motion vectors. They function as a bank of directional filters, each tuned to a specific direction and speed. Damage to MT causes akinetopsia: the world appears as a series of frozen snapshots rather than continuous motion. Pouring coffee becomes impossible because you can't see the liquid rising (a toy directional filter bank is sketched in code after this list)
  • MST takes MT's local motion signals and computes global optical flow — the pattern of visual motion across the entire field. Expansion patterns mean you're moving forward, contraction means backward. This is how you sense your heading direction while moving, without needing to track any single object (see the focus-of-expansion sketch after this list)
  • Predictive coding in PPC is remarkably efficient: PPC doesn't encode where every object is — it only encodes the prediction error, the difference between where things should be and where they actually are. If an object moves exactly as expected, PPC barely fires. Only surprises get encoded. This is why we notice when something moves unexpectedly but can ignore predictable motion (see the prediction-error sketch after this list)
  • The dorsal stream also feeds into the ventral stream — it tells the ventral stream where to direct attention. When the dorsal stream detects something moving in the periphery, it triggers a saccade to that location, and the ventral stream then identifies what it is. The two streams are not independent pipelines — they form a closed loop
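
A minimal sketch of the "bank of directional filters" idea from the MT/V5 bullet above, in NumPy. Each hypothetical unit prefers one direction; its response is simply how well the previous frame, shifted along that direction, matches the next frame — a crude stand-in for true spatiotemporal motion-energy filtering, not a model of real MT.

```python
import numpy as np

def direction_bank_response(frame_prev, frame_next, directions, speed=1):
    """Toy MT-style filter bank: each 'unit' is tuned to one motion direction.

    A unit responds in proportion to how well the previous frame, shifted
    along its preferred direction, overlaps the next frame.
    """
    responses = {}
    for name, (dy, dx) in directions.items():
        shifted = np.roll(frame_prev, (dy * speed, dx * speed), axis=(0, 1))
        responses[name] = float((shifted * frame_next).sum())
    return responses

# A bright vertical bar moving one pixel to the right between two frames.
frame_prev = np.zeros((32, 32)); frame_prev[:, 10] = 1.0
frame_next = np.zeros((32, 32)); frame_next[:, 11] = 1.0

directions = {"right": (0, 1), "left": (0, -1), "up": (-1, 0), "down": (1, 0)}
resp = direction_bank_response(frame_prev, frame_next, directions)
print(max(resp, key=resp.get))   # -> 'right': the rightward-tuned unit wins
```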
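
And a geometric sketch of the MST optical-flow idea: during forward self-motion the flow field radiates from a focus of expansion, so every flow vector gives one linear constraint on the heading point. The `focus_of_expansion` helper and the simulated flow field are purely illustrative.

```python
import numpy as np

def focus_of_expansion(points, flows):
    """Recover the heading point (focus of expansion) from an optic flow field.

    Each flow vector must point radially away from the FOE, i.e. be parallel
    to (point - FOE); stacking those constraints gives a small least-squares
    problem for the FOE's coordinates.
    """
    A = np.stack([-flows[:, 1], flows[:, 0]], axis=1)            # [-vy, vx]
    b = flows[:, 0] * points[:, 1] - flows[:, 1] * points[:, 0]  # vx*py - vy*px
    foe, *_ = np.linalg.lstsq(A, b, rcond=None)
    return foe

# Simulate pure expansion around a true heading of (12, 5): flow = k * (p - FOE).
rng = np.random.default_rng(0)
pts = rng.uniform(-50, 50, size=(200, 2))
flow = 0.1 * (pts - np.array([12.0, 5.0]))
print(focus_of_expansion(pts, flow))   # ~ [12.  5.]
```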
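
Finally, a toy version of the PPC prediction-error idea: predict each position by constant-velocity extrapolation and "fire" only in proportion to the surprise. The trajectory and the extrapolation rule are assumptions chosen to make the point visible.

```python
import numpy as np

def prediction_error_trace(positions):
    """Toy predictive-coding readout: respond only to violations of expectation.

    The predicted position is a constant-velocity extrapolation of the last
    two samples; the 'neural response' is the size of the prediction error.
    """
    errors = []
    for t in range(2, len(positions)):
        predicted = positions[t - 1] + (positions[t - 1] - positions[t - 2])
        errors.append(abs(positions[t] - predicted))
    return np.array(errors)

# An object drifting predictably at +1 per step, then jumping unexpectedly.
trajectory = np.array([0, 1, 2, 3, 4, 5, 12, 13, 14], dtype=float)
print(prediction_error_trace(trajectory))
# -> [0. 0. 0. 0. 6. 6. 0.]  only the surprise gets encoded
```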

Ventral Stream — Complete Architecture

The "What" pathway. Uses P-cells (slow, sustained, high-resolution). Hierarchical assembly from edges to meaning. Purpose: identification, recognition, and memory.

Ventral Stream — "What" Pathway
P-cells (slow, sustained, high-resolution) · hierarchical assembly · builds meaning
Latency: ~100–150 ms · Strategy: hierarchical · Relative perception only

  • Retina → LGN (Parvocellular) — P-cells: small receptive fields · colour-sensitive · fine detail; slow but sustained signal, unlike M-cells, which fire only at changes
  • V1 — Edge Detection (low-level) — simple cells: ON/OFF responses to oriented edges (like CNN kernels); complex cells combine simple cells → position-invariant detection; contextual modulation: horizontal connections between same-orientation neurons; end-stopping: short lines = object edges, long lines = background; evolution: mice detect edges pre-cortex → cats use simple cells → primates skip to complex cells
  • V2 — Contour Integration, Depth & Segmentation — assembles V1 edges into continuous contour boundaries; illusory contours: perceives edges that don't exist (Kanizsa triangles); border ownership: determines figure vs ground for each edge; global disparity (vs V1's local) · disparity-capture propagation; Da Vinci stereopsis: absence of information = depth signal; "perception is a controlled hallucination" — V2 actively constructs reality
  • V4 — Shape, Colour Constancy & Invariance — the bridge between features and objects: (1) colour constancy — factors out illumination, computes relative colour; (2) shape extraction — geometric properties of curves, contours, angles; (3) the beginning of invariance — tolerates small position, rotation, and size changes; heavy top-down input from IT; damage → achromatopsia: total colour loss, everything else intact
  • Posterior IT — Object Perception — where visual features become objects; complete object representations — not features, but things; cortical columns (~400 μm wide) that overlap and share knowledge; population coding: the firing pattern across neurons = a vector embedding; full invariance: the same neurons fire regardless of size, position, rotation; face-selective regions · long-range horizontal connections · ~30,000 objects from shared columns
  • Anterior IT — Semantic Binding — where seeing becomes knowing; connects percepts to stored semantic knowledge and memory; "I see a face" → "this is my mother" — perception meets meaning; category-specific regions: living vs non-living, faces vs tools vs places; damage → associative agnosia (can't name) · prosopagnosia (face identity)
  • Amygdala — emotional significance: threat? reward? "does what I'm seeing matter to me?"
  • Hippocampus — contextual memory: when, where, with whom; stores the full episode, not just the percept
  • Top-down feedback: more fibres run downward than upward · full pipeline: ~100–150 ms from photon to contextualised memory
  • Like a CNN: edges (V1) → contours (V2) → shapes (V4) → objects (IT) → meaning (anterior IT) — from concrete edges to abstract meaning
  • The ventral stream is the brain's deep neural network — each stage builds increasingly abstract representations, from edges to objects to meaning
  • V1 and V2 handle intermediate processing: edges, contours, depth, border ownership, figure-ground segmentation — the brain's feature extraction layers
  • V4 is the critical bridge — it computes colour constancy, extracts shapes, and begins transformation invariance. Damage here causes achromatopsia (total colour blindness)
  • IT cortex is where perception happens. Posterior IT builds complete object representations using overlapping columns (~400 μm wide) and population coding. No single neuron says "face" — the pattern across thousands of neurons encodes it
  • Anterior IT connects perception to meaning — this is where "seeing" becomes "knowing." Damage here causes associative agnosia: you can draw an object perfectly but can't name it
  • The final stage connects to the amygdala (emotional significance) and hippocampus (contextual memory) — vision becomes fully integrated cognition
  • Feedback connections outnumber feedforward ones — higher regions constantly predict what lower regions should see, making recognition faster through top-down priors
  • The entire journey from photon hitting the retina to a contextualised, emotionally-tagged memory takes roughly 100–150 ms for the ventral stream
  • V1's simple and complex cells are the foundation — simple cells have ON/OFF receptive fields tuned to specific orientations (like 45° edge detectors), while complex cells combine multiple simple cells to achieve position invariance. This is the first abstraction. The evolution of this is striking: mice detect edges before the cortex, cats use simple cells, but primates skip straight to complex cells — as edge detection moved closer to the cortex, it gained plasticity and connectivity (a minimal simple/complex-cell sketch appears in code after this list)
  • Contextual modulation is a key V1 mechanism that resolves ambiguity: a neuron doesn't work alone — it communicates with neighbouring neurons detecting the same orientation via horizontal connections. If neighbours are quiet (noise), the neuron fires stronger (real edge). If neighbours are also active, it suppresses (likely texture/background). This is how we see the outer leaves of a tree sharply while the inner leaves remain blurry — we infer their edges but don't actually perceive them (sketched in code after this list)
  • End-stopping (inhibitory surround) is another V1 mechanism: inhibitory zones sit at both ends of a receptive field. Short lines stay within the excitatory zone → strong fire (object edge). Long lines extend into inhibitory zones → weak fire (likely background, which tends to have uniform longer lines). The brain is actively distinguishing figure from ground at the very first cortical stage (see the end-stopping sketch after this list)
  • V2 is where the brain begins actively constructing reality rather than passively recording it. It perceives illusory contours (Kanizsa triangles — you see a triangle where no edges exist), determines border ownership (which side of an edge is figure vs ground), and handles global disparity for depth. V2 also performs amodal completion — when a cup blocks part of a book, V2 fills in the hidden portion. Each eye sees a different hidden part, and the brain combines them into one complete object
  • Depth perception uses three types of specialised neurons: tuned excitatory (fires at one specific depth), tuned inhibitory (silent at fixation plane, fires everywhere else), and near/far cells (coarsely categorise as closer or farther). The brain also uses disparity capture — starting from high-confidence depth points (clear edges, strong texture), it propagates depth estimates outward to ambiguous regions. Da Vinci stereopsis goes further: features visible to only one eye (the occluded region) become depth signals. Absence of information is itself information (the three tuning profiles are sketched after this list)
  • V4 computes colour constancy by comparing colour relative to surroundings, never in isolation — a red apple looks red in sunlight and under fluorescent light because V4 factors out illumination across the whole scene. The brain never uses absolute "pixel values" — a grey box on a white background looks darker, but the same grey on a black background looks lighter. All perception is relative, contextual, comparative (see the colour-constancy sketch after this list)
  • The overlapping column architecture in IT is fundamentally different from V1's discrete columns. V1 has hard borders between orientation columns (45° neurons and 90° neurons are clearly separated). IT columns blend into each other — they share representations so that the cortex, with limited neurons, can recognise ~30,000 different objects. Recognition is distributed across many columns via long-range horizontal connections, not localised to one. This is why face recognition is robust to damage — no single column holds the entire representation
  • Population coding in IT works like vector embeddings: no single neuron says "cat" — instead, thousands of neurons fire at different rates, and the pattern across the population encodes the object. The firing pattern for a cat is more similar to a dog than to a chair (vector similarity). Ambiguous or partially occluded objects can still be recognised because the distributed code is robust to noise (see the population-coding sketch after this list)
  • The agnosias reveal the architecture through what breaks: apperceptive agnosia (posterior IT damaged) — sees edges and colours but can't assemble them into objects, can't draw from memory. Associative agnosia (anterior IT damaged) — can see and draw objects perfectly but can't name them or say what they're for, the link to meaning is severed. Prosopagnosia — can recognise that something is a face, read expressions, but can't assign it to a person. Category-specific agnosias show that IT organises objects by category — some patients lose living things but keep tools, others lose fruits specifically. Living things share traits (eyes, limbs, organic texture) and are grouped together; non-living things (rigid geometry, manufactured surfaces) are processed separately
  • The amygdala and hippocampus represent the final transition from perception to cognition. The amygdala assigns emotional valence — is this threatening? rewarding? The hippocampus stores the episode — not just what you saw, but when, where, and who you were with. This is where visual perception becomes a fully contextualised, emotionally-tagged memory. The entire ventral stream exists to serve this endpoint: from photon to meaning
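
A minimal NumPy sketch of the simple-cell → complex-cell step described above: oriented kernels play the role of simple cells, and a max over positions plays the role of a complex cell. The kernel shapes and the pooling rule are illustrative choices, not a model of real V1 tuning.

```python
import numpy as np

# "Simple cells" as small oriented kernels (vertical and horizontal edges),
# "complex cells" as a max over nearby positions — the pooling that buys
# position invariance.
VERTICAL = np.array([[-1, 0, 1]] * 3, dtype=float)
HORIZONTAL = VERTICAL.T

def simple_cell_map(image, kernel):
    """Slide an oriented kernel over the image (valid positions only)."""
    h, w = kernel.shape
    out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return np.abs(out)

def complex_cell(image, kernel):
    """Pool simple-cell responses over position: fires if the edge is anywhere."""
    return simple_cell_map(image, kernel).max()

img = np.zeros((10, 10)); img[:, 6] = 1.0      # a vertical edge at column 6
print(complex_cell(img, VERTICAL), complex_cell(img, HORIZONTAL))  # strong vs ~0
```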
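
A toy version of the contextual-modulation bullet: each unit's raw response is divided down by the activity of its same-orientation neighbours, so an isolated edge survives while an edge buried in texture is suppressed. The divisive form and constants are assumptions made for illustration.

```python
import numpy as np

def contextual_modulation(responses, suppression=0.5):
    """Divide each unit's response by the activity of its immediate neighbours.

    Quiet neighbours -> the response passes through; active neighbours
    (texture/background) -> the response is suppressed.
    """
    responses = np.asarray(responses, dtype=float)
    out = np.empty_like(responses)
    for i in range(len(responses)):
        neighbours = np.concatenate([responses[max(0, i - 1):i],
                                     responses[i + 1:i + 2]])
        out[i] = responses[i] / (1.0 + suppression * neighbours.mean())
    return out

isolated_edge = [0.0, 1.0, 0.0]   # quiet neighbours -> barely suppressed
texture       = [1.0, 1.0, 1.0]   # active neighbours -> divided down
print(contextual_modulation(isolated_edge)[1], contextual_modulation(texture)[1])
# -> 1.0 vs ~0.67
```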
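
End-stopping sketched in one dimension: an excitatory centre flanked by inhibitory end zones, so a short bar drives the unit strongly while a long bar that spills into the end zones silences it. The zone sizes and weights are invented for the example.

```python
import numpy as np

def end_stopped_response(bar_length, centre=5, ends=3):
    """1D end-stopped unit: excitatory centre, inhibitory end zones.

    A bar is centred on the receptive field; whatever extends past the
    centre lands in the end zones and subtracts from the response.
    """
    field = np.concatenate([-np.ones(ends), np.ones(centre), -np.ones(ends)])
    bar = np.zeros_like(field)
    start = (len(field) - bar_length) // 2
    bar[start:start + bar_length] = 1.0
    return float(np.maximum(field @ bar, 0.0))   # rectified firing rate

print(end_stopped_response(5))    # short bar fits the centre   -> strong (5.0)
print(end_stopped_response(11))   # long bar hits the end zones -> silent (0.0)
```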
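
The three depth-cell classes above as toy tuning curves over disparity (taking negative disparity as "near" for this sketch): tuned excitatory peaks at the fixation plane, tuned inhibitory is silent there, and near/far cells prefer one side. The curve shapes and constants are illustrative only.

```python
import numpy as np

def tuned_excitatory(d):  return np.exp(-d**2 / 0.5)          # peaks at fixation depth
def tuned_inhibitory(d):  return 1.0 - np.exp(-d**2 / 0.5)    # silent at fixation, fires elsewhere
def near_cell(d):         return 1.0 / (1.0 + np.exp(5 * d))  # prefers near (negative) disparity
def far_cell(d):          return 1.0 / (1.0 + np.exp(-5 * d)) # prefers far (positive) disparity

# Responses at a near point, at the fixation plane, and at a far point.
for d in (-1.0, 0.0, 1.0):
    print(d, round(tuned_excitatory(d), 2), round(tuned_inhibitory(d), 2),
          round(near_cell(d), 2), round(far_cell(d), 2))
```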
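
A classical stand-in for the colour-constancy bullet: grey-world normalisation, which expresses every pixel relative to the scene average as a crude estimate of the illuminant. V4 is far more sophisticated than this; the sketch only demonstrates the "relative, never absolute" principle.

```python
import numpy as np

def relative_colour(image):
    """Grey-world-style normalisation: divide by the scene-average RGB,
    a crude illuminant estimate, so colour is expressed relative to the scene.
    """
    illuminant = image.reshape(-1, 3).mean(axis=0)
    return image / illuminant

rng = np.random.default_rng(1)
scene = rng.uniform(0.2, 0.8, size=(8, 8, 3))     # a scene under white light
warm  = scene * np.array([1.3, 1.0, 0.6])         # the same scene under warm light
pixel_white = relative_colour(scene)[0, 0]
pixel_warm  = relative_colour(warm)[0, 0]
print(np.allclose(pixel_white, pixel_warm))       # True: relative colour is stable
```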
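
Population coding as vector similarity, per the bullet above: made-up firing-rate vectors for three objects, cosine similarity as the comparison, and a noisy ("occluded") view that still decodes correctly. All rates here are invented for illustration.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical firing rates of six "IT neurons" for three objects. No single
# neuron means "cat"; the identity lives in the whole pattern, like an embedding.
codes = {
    "cat":   np.array([0.9, 0.7, 0.1, 0.8, 0.2, 0.1]),
    "dog":   np.array([0.7, 0.8, 0.2, 0.6, 0.4, 0.3]),
    "chair": np.array([0.1, 0.2, 0.9, 0.1, 0.8, 0.9]),
}
print(cosine(codes["cat"], codes["dog"]), cosine(codes["cat"], codes["chair"]))
# cat is far closer to dog than to chair

rng = np.random.default_rng(2)
occluded_cat = codes["cat"] + rng.normal(0, 0.05, size=6)   # noisy, partial view
best = max(codes, key=lambda k: cosine(codes[k], occluded_cat))
print(best)   # still decoded as 'cat': the distributed code tolerates noise
```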

Reflections — What the Brain Teaches Us About Computer Vision

  • CNNs are directly inspired by the ventral stream's hierarchical assembly: edges → contours → shapes → objects. Hubel & Wiesel's simple/complex cells from the 1960s are the intellectual ancestors of convolutional filters (a toy layer-by-layer stack is sketched at the end of these notes)
  • The brain's attention mechanism (top-down modulation reaching all the way to V1) predates transformer attention by millions of years — but the principle is the same: not all information is equally important
  • Population coding in IT cortex is vector similarity — the same idea behind embedding spaces in modern AI. Similar objects have similar firing patterns, just as similar concepts have similar vectors
  • The brain uses far more feedback than feedforward connections — modern AI is only beginning to explore this with iterative refinement and diffusion models
  • Contextual modulation (horizontal connections between same-level neurons) is something most feedforward vision models still lack — in the cortex, neurons talk laterally to their neighbours within a layer, not just to the layers above and below
  • The brain's efficiency is remarkable: it encodes only edges and fills in surfaces, uses relative (not absolute) measurements, and reuses feature detectors across categories. ~30,000 objects recognised with overlapping columns sharing knowledge
  • Perhaps the most profound insight: perception is a controlled hallucination. The brain doesn't passively record reality — it actively constructs it, filling gaps, predicting patterns, and resolving ambiguity at every stage
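
To make the CNN analogy from the first reflection concrete, here is a deliberately tiny PyTorch stack with each stage labelled by its rough cortical counterpart. The layer sizes, depths, and 10-way output are arbitrary choices (and assume torch is installed); only the edges → contours → shapes → objects → meaning progression is the point.

```python
import torch
import torch.nn as nn

# A miniature CNN whose stages are labelled with the cortical analogues used
# in these notes. Sizes are illustrative, not a model of the ventral stream.
ventral_like = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # V1: oriented edges
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # V2: contours, borders
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # V4: shapes, invariance
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),                                    # posterior IT: pooled object code
    nn.Linear(64, 10),                                                        # anterior IT: category / meaning
)

image = torch.randn(1, 3, 224, 224)      # one "retinal" input
print(ventral_like(image).shape)         # torch.Size([1, 10])
```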