The Visual Cortex as a Computer Vision Architecture: Dorsal and Ventral Streams

A week spent studying, as a computer vision engineer, how the brain processes vision. Two comprehensive architectural diagrams synthesising what the brain actually does across the dorsal and ventral visual streams, and everything rewritten for engineers who build vision systems.

The Split You Did Not Know Was in Your Design

The visual cortex is not a single pipeline. It is two parallel streams that diverge from V1, each with its own purpose, speed, and computational strategy. The split begins at the retina: M-cells (magnocellular) feed the dorsal stream, P-cells (parvocellular) feed the ventral stream. This is not a cortical decision. It is a hardware decision made at the sensor level.

Computer vision engineers typically treat their input as a unified tensor. The brain ships two separate signal types from the first layer. This design choice propagates all the way to final outputs: one stream for where and how, one for what.

Dorsal Stream — "Where / How"

Retina → V1 → V2 / V3 → MT / V5 (motion energy) → MST (optical flow) → PPC (predictive coding) → motor cortex / frontal eye fields (efference copy)

Ventral Stream — "What"

Retina → LGN → V1 (fine detail, colour, high spatial frequency) → V2 (contours, figure-ground, illusory edges) → V4 (shapes, colour constancy, curves) → posterior IT (complex shapes) → anterior IT (recognition) → amygdala / hippocampus (emotional relevance and contextual memory)

Three principles recur throughout the entire system:

The brain uses relative perception, never absolute. Colour, brightness, and depth are computed by comparison, never by raw value.

(Figure: the same grey patch looks darker on a light background and lighter on a dark one; identical value, different perception depending on the background.)

Feedback outnumbers feedforward. Higher regions constantly predict what lower regions should see. The ratio of descending to ascending fibres in visual cortex is roughly 10:1.

Perception is a controlled hallucination. The brain fills gaps, constructs illusory edges, and resolves ambiguity at every processing stage. It does not record. It infers.


The Dorsal Stream: Real-Time Spatial Inference

Latency: 30 to 50 ms. Memory: none. Purpose: keep the organism alive.

The dorsal ("where/how") stream is an online system. It never stores state. It processes motion vectors, spatial maps, and prediction errors, then outputs motor commands. It does not recognise objects. A ball flying at your face triggers a dodge reflex before conscious identification occurs.

(Diagram: the dorsal "where / how" pathway. M-cells: fast, transient, change-detecting; real-time; no memory; action-oriented. Latency ~30–50 ms; neurons need change to fire.
Retina (M-cells, magnocellular): large receptive fields, fast transient response, low spatial frequency; detects changes in the scene and does not sustain a signal.
V1 (primary visual cortex): orientation-selective neurons (simple cells → complex cells), motion direction selectivity, broad spatial tuning; involuntary saccades at roughly 3 per second keep the image shifting on the retina.
V2 / V3: global disparity, coarse depth maps, motion boundary extraction; separates object motion from background motion.
MT / V5 (middle temporal area), mechanism: motion energy. A bank of directional filters tracking movement vectors; neurons perceive motion vectors, not objects; damage → akinetopsia, the world appears as a series of frozen stills.
MST (medial superior temporal), mechanism: optical flow. Computes self-motion from visual field changes; expansion and contraction patterns give heading direction and speed.
PPC (posterior parietal cortex), mechanism: predictive coding. Tracks prediction error, where an object should be versus where it is; maintains a 3D spatial model of distance, speed, and relative position; the brain's GPS, building and updating a real-time world map.
Motor cortex / pre-motor: the "how" (reach, grasp, dodge, catch). Frontal eye fields (FEF): the "where" (gaze direction, saccades). Efference copy cancels self-generated motion.)

Retina to V1

M-cells have large receptive fields and respond only to transient changes (onset/offset of signal). They fire on change and go silent on sustained input. This means if the image on the retina were perfectly stable, all M-cell-driven neurons would stop firing. Vision would fade. To prevent this, the visual system generates involuntary micro-saccades at approximately 3 per second, continuously shifting the image on the retina. The dorsal stream depends on constant motion in the input signal by design.
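In code, the closest minimal analogue of that transient response is temporal differencing: a unit that responds to frame-to-frame change and falls silent on a static input. A rough numpy sketch, where the threshold is an illustrative parameter rather than a biological constant:

```python
import numpy as np

def m_cell_response(frames, threshold=0.05):
    """Transient, M-cell-like response: fires on change, silent on a static input.

    frames: array of shape (T, H, W) with values in [0, 1].
    Returns shape (T-1, H, W): rectified frame-to-frame change.
    """
    diff = np.abs(np.diff(frames, axis=0))        # onset/offset energy only
    return np.where(diff > threshold, diff, 0.0)  # no change, no response

# A perfectly stable retinal image produces zero output everywhere,
# which is why micro-saccades keep the image moving.
static = np.full((10, 32, 32), 0.5)
assert m_cell_response(static).sum() == 0.0
```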

V1 adds orientation selectivity (simple cells to complex cells) and motion direction selectivity, with broad spatial frequency tuning appropriate for a system that cares about movement rather than fine detail.

MT/V5: Motion Energy, Not Object Tracking

MT contains a bank of directional filters. Each neuron is tuned to a specific direction and speed. These neurons track motion vectors, not objects. The output of MT is a field of movement vectors, not a scene description.
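As a loose engineering analogue, a directional filter bank can be sketched as a set of candidate motion vectors, each scored by how well it explains the change between two frames; the output is a vector field, not an object map. This toy sketch is not the full Adelson-Bergen motion energy model, and the direction set and scoring rule are illustrative choices:

```python
import numpy as np

# A tiny "filter bank": candidate motion vectors in pixels per frame.
DIRECTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (1, 1), (-1, -1), (1, -1), (-1, 1)]

def motion_energy(prev, curr):
    """Score each preferred direction by how well it explains the change between frames.

    prev, curr: greyscale frames of shape (H, W).
    Returns a (n_directions, H, W) stack; higher values mean the local image
    content moved in that direction between the two frames.
    """
    maps = []
    for dy, dx in DIRECTIONS:
        shifted = np.roll(prev, shift=(dy, dx), axis=(0, 1))
        maps.append(-np.square(curr - shifted))   # good match, high energy
    return np.stack(maps)

def motion_vector_field(prev, curr):
    """Per-pixel winner across the bank: a field of motion vectors, not objects."""
    winners = np.argmax(motion_energy(prev, curr), axis=0)   # (H, W) index of best direction
    return np.array(DIRECTIONS)[winners]                      # (H, W, 2) vector field
```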

The clinical evidence is unambiguous. Damage to MT causes akinetopsia: the world appears as a series of frozen snapshots. Pouring liquid into a cup becomes impossible because the rising level is invisible. Continuous motion is a computed product of this area, not a given.

For CV engineers: optical flow computation and motion estimation are the MT equivalent. Most deep models treat motion as an afterthought, concatenating frames or using temporal convolutions. MT suggests motion deserves its own dedicated pathway with its own feature representation, operating at lower spatial resolution and higher temporal resolution than the object recognition pathway.

MST: Global Optical Flow and Heading Estimation

MST takes local motion vectors from MT and integrates them into global optical flow. Expansion patterns signal forward movement. Contraction patterns signal backward movement. Rotation patterns signal turning. This is how heading direction is computed from visual information alone, without any inertial signal.

This is ego-motion estimation. The brain has a dedicated cortical area that does nothing else.
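To make the flow-to-heading step concrete: under pure forward translation, flow vectors radiate from the focus of expansion, so the point that best explains the radial pattern is a least-squares estimate of heading. A sketch under that idealised assumption (no rotation, no noise handling):

```python
import numpy as np

def focus_of_expansion(points, flow):
    """Estimate heading (focus of expansion) from an optical-flow field.

    points: (N, 2) pixel coordinates (x, y).
    flow:   (N, 2) flow vectors (vx, vy) at those points.
    Assumes pure forward translation: every flow vector points away from the FOE,
    so the cross product (p - foe) x v = 0 gives one linear equation per point.
    """
    x, y = points[:, 0], points[:, 1]
    vx, vy = flow[:, 0], flow[:, 1]
    A = np.stack([vy, -vx], axis=1)      # (N, 2)
    b = x * vy - y * vx                  # (N,)
    foe, *_ = np.linalg.lstsq(A, b, rcond=None)
    return foe                           # (foe_x, foe_y): heading point in the image

# Synthetic expansion pattern centred at (40, 30): the estimate recovers the heading.
pts = np.random.rand(200, 2) * 100
flw = 0.1 * (pts - np.array([40.0, 30.0]))
print(focus_of_expansion(pts, flw))      # close to [40, 30]
```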

PPC: Predictive Coding and the 3D World Model

The Posterior Parietal Cortex does not encode where every object is. It encodes prediction error, the difference between where objects should be and where they actually are. If an object moves exactly as predicted, PPC barely fires. Only surprises are encoded. This is remarkably efficient: the 3D world model is maintained as a residual rather than as a full state.
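A minimal sketch of residual encoding, assuming a constant-velocity internal model: the tracker predicts where the object should be and only the prediction error is kept. The gain and the model are illustrative, not a claim about PPC's actual dynamics:

```python
import numpy as np

def track_with_prediction_error(observations, gain=0.5):
    """Encode a trajectory as residuals against a constant-velocity prediction.

    observations: (T, 2) observed positions.
    Returns the prediction errors; predictable motion yields residuals that decay
    toward zero, so only surprises carry significant signal.
    """
    position = observations[0].astype(float)
    velocity = np.zeros(2)
    errors = []
    for obs in observations[1:]:
        predicted = position + velocity      # where the object *should* be
        error = obs - predicted              # the only thing worth encoding
        errors.append(error)
        # Update the internal model from the residual (simplified Kalman-style correction).
        position = predicted + gain * error
        velocity = velocity + gain * error
    return np.array(errors)

# An object moving at constant velocity becomes almost free to represent.
t = np.arange(50)[:, None]
trajectory = np.hstack([2.0 * t, 1.0 * t])   # straight-line motion
print(np.abs(track_with_prediction_error(trajectory))[-1])   # close to [0, 0]
```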

PPC also handles efference copy: a copy of outgoing motor commands is fed back to cancel the expected visual consequence of self-generated movement. Every eye movement should make the world appear to spin. The efference copy mechanism subtracts the predicted visual shift before it reaches higher areas. Without it, stable visual perception during saccades would be impossible.
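The cancellation itself is simple to state in code: a copy of the outgoing eye-movement command predicts the retinal shift the movement will cause, and that prediction is subtracted from what arrives. A toy sketch in made-up units of visual angle:

```python
import numpy as np

def perceived_motion(observed_retinal_shift, eye_movement_command):
    """Cancel self-generated motion using a copy of the outgoing motor command.

    A saccade of +5 degrees shifts the retinal image by -5 degrees; subtracting
    the predicted shift leaves only motion that originated in the world.
    """
    predicted_shift = -np.asarray(eye_movement_command, float)   # efference copy
    return np.asarray(observed_retinal_shift, float) - predicted_shift

# Eye moves 5 degrees right, image shifts 5 degrees left on the retina:
print(perceived_motion([-5.0], [5.0]))   # [0.]  the world is perceived as stable
# Same saccade, but an object also moved 2 degrees in the world:
print(perceived_motion([-3.0], [5.0]))   # [2.]  only the external motion remains
```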

PPC output splits into two pathways: the "how" pathway (reach, grasp, dodge, via motor cortex) and the "where" pathway (gaze direction, via frontal eye fields).


The Ventral Stream: Hierarchical Object Recognition

Latency: 100 to 150 ms. Strategy: hierarchical feature composition. Purpose: build meaning.

The ventral ("what") stream is the architecture that Hubel and Wiesel described in the 1960s, which directly inspired convolutional neural networks. It builds representations hierarchically from edges to contours to shapes to objects to semantic categories. Every stage is an abstraction over the previous one.

(Diagram: the ventral "what" pathway. P-cells: slow, sustained, high-resolution; hierarchical assembly; builds meaning. Latency ~100–150 ms; strategy: hierarchical; relative perception only.
Retina → LGN (parvocellular): P-cells with small receptive fields, colour-sensitive, fine detail; slow but sustained signal, unlike M-cells, which fire only at changes.
V1 (edge detection, low level): simple cells with ON/OFF responses to oriented edges (like CNN kernels); complex cells combine simple cells for position-invariant detection; contextual modulation via horizontal connections between same-orientation neurons; end-stopping (short lines signal object edges, long lines signal background). Evolution: mice handle it pre-cortex, cats with V1 simple cells, primates skip to complex cells.
V2 (contour integration, depth, segmentation): assembles V1 edges into continuous contour boundaries; illusory contours (Kanizsa triangles); border ownership, deciding figure versus ground for each edge; global disparity (versus V1's local); disparity capture propagation; Da Vinci stereopsis, where absence of information is a depth signal. "Perception is a controlled hallucination": V2 actively constructs reality.
V4 (shape, colour constancy, invariance): the bridge between features and objects. 1. Colour constancy, factoring out illumination and computing relative colour. 2. Shape extraction, the geometric properties of curves, contours, and angles. 3. The beginning of invariance, tolerating small position, rotation, and size changes. Heavy top-down input from IT; damage → achromatopsia, total colour loss with everything else intact.
Posterior IT (object perception): where visual features become objects; complete object representations, things rather than features; cortical columns roughly 400 μm wide that overlap and share knowledge; population coding, the firing pattern across neurons acting as a vector embedding; full invariance, with the same neuron firing regardless of size, position, or rotation; face-selective regions; long-range horizontal connections; roughly 30,000 objects from shared columns.
Anterior IT (semantic binding): where seeing becomes knowing; connects percepts to stored semantic knowledge and memory ("I see a face" → "this is my mother"); category-specific regions for living versus non-living, faces versus tools versus places; damage → associative agnosia (cannot name) or prosopagnosia (cannot identify faces).
Amygdala: emotional significance; threat or reward, does what I am seeing matter to me? Hippocampus: contextual memory; when, where, with whom; stores the full episode, not just the percept.
Top-down feedback: more fibres run downward than upward. Full pipeline: ~100–150 ms from photon to contextualised memory. Like a CNN: edges (V1) → contours (V2) → shapes (V4) → objects (IT) → meaning (anterior IT).)

V1: Edge Detection, Contextual Modulation, and End-Stopping

V1 simple cells have oriented ON/OFF receptive fields. They are edge detectors with a specific orientation preference. Complex cells combine multiple simple cells to achieve position invariance for a given orientation. This is the first abstraction: from local contrast to orientation-invariant edge.
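The standard computational rendering of this pair is a Gabor filter bank for simple cells and an energy-plus-pooling stage for complex cells. A compact sketch using numpy and scipy, with filter parameters chosen purely for illustration:

```python
import numpy as np
from scipy.signal import convolve2d
from scipy.ndimage import maximum_filter

def gabor(theta, size=15, sigma=3.0, wavelength=6.0, phase=0.0):
    """Oriented Gabor kernel, the standard model of a V1 simple-cell receptive field."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)            # coordinate along the preferred orientation
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength + phase)

def simple_cell(image, theta, phase=0.0):
    """Linear filter plus rectification: sensitive to exact position and phase."""
    return np.maximum(convolve2d(image, gabor(theta, phase=phase), mode="same"), 0.0)

def complex_cell(image, theta, pool=5):
    """Energy over a quadrature pair, max-pooled locally: position-tolerant for one orientation."""
    energy = simple_cell(image, theta, 0.0) ** 2 + simple_cell(image, theta, np.pi / 2) ** 2
    return maximum_filter(energy, size=pool)
```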

Two V1 mechanisms are worth particular attention for engineers.

Contextual modulation via horizontal connections: a V1 neuron communicates with neighbouring neurons detecting the same orientation. If neighbours are quiet (suggesting noise or isolated edge), the neuron fires more strongly (likely a real edge). If neighbours are also active (suggesting uniform texture), suppression occurs. This is spatial context used to disambiguate signal from noise at the lowest processing level. Deep learning models using local receptive fields lack this within-layer lateral communication.

(Figure: three neighbouring 45° detectors linked by horizontal connections. When neighbours are quiet, the neuron fires more strongly: likely a real edge. When neighbours are equally active, the response is suppressed: likely uniform texture. The classical receptive field responds only to its own field; the non-classical surround is the influence arriving through horizontal connections.)
End-stopping: inhibitory zones sit at both ends of each receptive field. Short line segments that fall within the excitatory zone produce strong responses (object edges). Long lines that extend into the inhibitory zones produce weak responses (background texture, which tends to consist of longer, more uniform lines). The network is doing figure-ground disambiguation at V1, the very first cortical stage.

(Figure: a receptive field with an excitatory centre flanked by inhibitory end zones. A short line produces strong firing, signalling an object edge; a long line extending into the end zones produces weak firing, signalling likely background.)
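End-stopping can be sketched as a one-dimensional receptive field with an excitatory centre and inhibitory end zones: the response grows with bar length until the bar reaches the end zones, then collapses. The weights and sizes below are made up for illustration:

```python
import numpy as np

def end_stopped_response(bar_length, field_length=21, excitatory_span=9):
    """Response of an end-stopped unit to a centred bar of a given length.

    The receptive field has +1 weights in the central excitatory span and
    -1 weights in the inhibitory zones at both ends.
    """
    weights = -np.ones(field_length)
    centre = field_length // 2
    half_exc = excitatory_span // 2
    weights[centre - half_exc:centre + half_exc + 1] = 1.0

    bar = np.zeros(field_length)
    half_bar = min(bar_length // 2, centre)
    bar[centre - half_bar:centre + half_bar + 1] = 1.0
    return max(float(weights @ bar), 0.0)       # rectified response

print(end_stopped_response(7))    # short bar inside the excitatory zone: strong response
print(end_stopped_response(21))   # long bar reaching the inhibitory ends: weak response
```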

The evolutionary trajectory here is interesting: in mice, edge detection occurs precortically. In cats, simple cells in V1 handle it. In primates, the system skips straight to complex cells. As edge detection moved into the cortex, it gained plasticity and connectivity to top-down feedback. The tradeoff was computational cost for adaptability.

V2: Contour Integration and Active Construction

V2 assembles V1 edge responses into continuous contour boundaries. It also perceives illusory contours: the Kanizsa triangle, where three Pac-Man shapes induce the perception of a triangle with clear edges despite no physical edges existing. V2 constructs those edges from surrounding context.

(Figure: the Kanizsa triangle. No edges are drawn, yet you see a triangle.)

V2 also handles border ownership: for any given edge, V2 neurons encode which side of the edge belongs to the figure and which to the background. This is not a local property of the edge. It requires integration of information across the scene.

(Figure: for each edge, one side is the figure, which owns the border, and the other is the ground, which continues behind it.)

For depth, V2 implements several mechanisms. Tuned excitatory neurons fire at one specific disparity depth. Tuned inhibitory neurons fire everywhere except at the fixation plane. Near/far cells coarsely categorise relative depth.

(Figure: disparity tuning profiles. Tuned excitatory cells fire at one specific depth only; tuned inhibitory cells are silent at the fixation plane and fire everywhere else; near and far cells coarsely signal whether a point lies nearer or farther than fixation.)

V2 also performs disparity capture: starting from high-confidence depth anchors (sharp edges, strong texture), depth estimates propagate outward to ambiguous regions.

(Figure: depth estimates spread outward from high-confidence anchor points into the surrounding ambiguous regions.)
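One generic way to sketch this propagation is confidence-weighted diffusion: anchor pixels keep their values while low-confidence pixels repeatedly adopt their neighbourhood average, so depth spreads outward from the anchors. This is a stand-in scheme, not a claim about V2's actual algorithm:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def propagate_disparity(disparity, confidence, iterations=50):
    """Spread depth outward from high-confidence anchors into ambiguous regions.

    disparity:  (H, W) initial estimates (unreliable where confidence is low).
    confidence: (H, W) values in [0, 1]; 1.0 marks an anchor such as a sharp edge
                or strongly textured region.
    """
    estimate = disparity.astype(float).copy()
    for _ in range(iterations):
        smoothed = uniform_filter(estimate, size=3)          # neighbourhood average
        # Anchors keep their own value; ambiguous pixels adopt their surroundings.
        estimate = confidence * disparity + (1.0 - confidence) * smoothed
    return estimate
```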

Da Vinci stereopsis is a particularly elegant mechanism: features visible to only one eye (the region occluded from the other eye's perspective) become depth signals. The absence of binocular information is itself information. Regions visible to only the left eye must be to the left of an occluding surface. The brain uses this systematically.

(Figure: a background wall seen past an occluder roughly 6 cm in front of it; a sliver of background on one side is visible only to the right eye and a sliver on the other side only to the left eye. Key insight: each eye sees a strip of background the other cannot, so the occluded region itself becomes a depth signal. No feature matching is needed; the absence of information is the depth cue.)

Amodal completion operates here too. When an object is partially occluded, V2 fills in the hidden contour. Each eye sees a slightly different hidden region, and these are combined into one completed object representation.

(Figure: amodal completion. The left eye sees the right side of a book, its left portion hidden by a cup; the right eye sees the left side, its right portion hidden; the brain perceives one complete book with the gap filled in.)

The phrase "perception is a controlled hallucination" applies most precisely here. V2 is not passively recording edges. It is actively constructing a scene model.

V4: Colour Constancy, Shape Extraction, and the Start of Invariance

V4 computes colour by comparing wavelength information relative to the surrounding scene, not by measuring absolute values. A red apple looks red under tungsten light and under daylight because V4 factors out the illuminant across the scene. The computational unit is a ratio, not an absolute measurement.

(Figure: V4's three jobs. 1. Colour constancy: factors out illumination and computes colour relative to the surroundings, so a surface matches under sunlight and fluorescent light. 2. Shape and curves: extracts geometric properties of contours, curves, and angles. 3. Invariance begins: small changes in position, rotation, and size are tolerated. V4 damaged → achromatopsia: total loss of colour, everything else survives.)

This is a direct demonstration of the relative-perception principle: the brain never uses raw pixel values. A mid-grey patch on a white background looks darker than the same patch on a black background. All chromatic and luminance perception is contextual.
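The simplest way to express ratios rather than raw values in code is the classical grey-world baseline: divide each channel by an estimate of the illuminant, here the per-channel scene mean. This is a textbook colour constancy baseline, not a model of V4 itself:

```python
import numpy as np

def grey_world_constancy(image, eps=1e-8):
    """Express every pixel relative to a scene-wide illuminant estimate.

    image: (H, W, 3). Dividing each channel by its scene mean removes a global
    colour cast, so a surface gets a similar descriptor under different lights.
    """
    illuminant = image.reshape(-1, 3).mean(axis=0)   # crude estimate of the light source
    return image / (illuminant + eps)

# The same surface under a warm and a cool light maps to the same relative values.
surface = np.array([0.8, 0.3, 0.2])
warm = np.ones((64, 64, 3)) * surface * np.array([1.2, 1.0, 0.7])   # tungsten-like cast
cool = np.ones((64, 64, 3)) * surface * np.array([0.8, 1.0, 1.2])   # daylight-like cast
print(grey_world_constancy(warm)[0, 0], grey_world_constancy(cool)[0, 0])   # both near [1, 1, 1]
```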

V4 also extracts shape properties (curvature, angle, geometric structure) and begins building transformation invariance, tolerating small changes in position, rotation, and scale. Damage to V4 causes achromatopsia: complete colour blindness while all other visual function remains intact. The colour computation is modular and localised.

IT Cortex: Population Codes and Distributed Object Representation

(Diagram: V4 (shape, colour constancy, invariance begins) → posterior IT (complete object representation: perception) → anterior IT (percepts connect to meaning) → amygdala / hippocampus (vision becomes cognition); from features to meaning.)

Posterior IT is where visual features become objects. Cortical columns approximately 400 micrometres wide contain neurons tuned to complete object representations. Critically, these columns overlap and share representations. The architecture is fundamentally different from V1, where orientation columns have hard borders. IT columns blend. Long-range horizontal connections allow a population of overlapping columns to represent approximately 30,000 distinct objects with a finite number of neurons.

(Figure: V1 columns are distinct (45°, 90°, 135°) with hard borders between them; IT columns (face, hand, body) overlap and blend, sharing knowledge. Each IT column is a vertical stack roughly 400 μm wide spanning the full depth of cortex; together they support recognition of roughly 30,000 objects.)

No single neuron encodes "face." The object is encoded as a firing pattern across thousands of neurons, a population code. This is directly analogous to vector embeddings: the firing pattern for a cat is more similar to the pattern for a dog than the pattern for a chair. Partially occluded or degraded inputs still produce recognisable patterns because the code is distributed and robust to noise.

(Figure: population codes shown as per-neuron firing rates. The pattern for "cat" is highly similar to the pattern for "dog" and dissimilar to the pattern for "chair". Similar objects have similar firing patterns: vector similarity in the brain.)
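The analogy to embeddings is direct enough to write down: treat each object's population response as a vector of per-neuron firing rates and compare with cosine similarity. The firing rates below are invented for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity between two population codes (vectors of per-neuron firing rates)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical firing rates across the same six IT neurons (spikes/sec, invented).
cat = np.array([42.0, 8.0, 55.0, 3.0, 30.0, 12.0])
dog = np.array([38.0, 11.0, 48.0, 6.0, 27.0, 15.0])
chair = np.array([4.0, 50.0, 2.0, 45.0, 6.0, 40.0])

print(cosine_similarity(cat, dog))     # high: similar objects, similar patterns
print(cosine_similarity(cat, chair))   # low: dissimilar objects, dissimilar patterns
```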

Posterior IT neurons are fully invariant: the same neuron fires for an object regardless of size, position, or rotation. Full transformation invariance is achieved at this stage, not earlier.

Anterior IT connects the percept to semantic knowledge. This is where "I see a face" becomes "this is my mother." It contains category-specific regions: living things versus non-living things, faces versus tools versus places.

The agnosias reveal the architecture through what breaks:

Apperceptive agnosia (posterior IT damaged): edges, colours, and motion are still seen, and V1 and V2 are intact, but the patient cannot form objects or draw them from memory.

Associative agnosia (anterior IT damaged): the patient can see and draw objects, so perception is intact, but cannot name an object or say what it is for.

Prosopagnosia (face-selective regions damaged): the patient sees faces and reads expressions, but cannot assign identity ("whose face is this?").

Category-specific agnosia: some patients lose living things but retain tools. Others lose fruits specifically. Living things cluster together in IT because they share perceptual properties (eyes, limbs, organic texture). Non-living things (rigid geometry, manufactured surfaces) are processed separately. The categorical organisation of IT is not arbitrary.

Living things share eyes, limbs, organic texture, and bilateral symmetry: animals, faces, plants, fruits and vegetables. Non-living things share rigid geometry, manufactured surfaces, and functional parts: tools, vehicles, buildings, instruments. They occupy separate IT regions, and damage to one category can leave the others intact.

Final Stage: Amygdala and Hippocampus

The amygdala assigns emotional valence: is this threatening or rewarding? The hippocampus stores the episode: not just what was seen, but when, where, and the surrounding context. The entire 100 to 150 ms journey from photon to cortex ends here, in a fully contextualised, emotionally tagged memory.


Implications for Computer Vision Engineering

Feedforward is not enough. The ratio of feedback to feedforward connections in biological visual cortex is approximately 10:1. Higher areas continuously predict what lower areas should see. This is top-down modulation, and it reaches all the way back to V1. Transformers introduced attention mechanisms that partially replicate this. Diffusion models use iterative refinement. But most standard CNN architectures remain entirely feedforward, which is the opposite of the biological design.

Separate your motion pathway. The two-stream architecture is not arbitrary. Motion information (fast, low spatial frequency, transient) and object information (slow, high spatial frequency, sustained) have different computational requirements. Forcing both through the same backbone with the same temporal stride is a design constraint, not a design choice.

Relative features generalise better than absolute ones. V4's colour constancy computation uses ratios. V1's contextual modulation uses neighbourhood comparisons. Batch normalisation and layer normalisation in deep networks partially capture this. Instance normalisation captures it more directly. The biological evidence suggests that relative measurement is not just a normalisation trick but a core representational principle.
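For reference, the instance normalisation mentioned above is just a per-sample, per-channel standardisation; a bare numpy version, without the learned affine parameters, looks like this:

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalisation: each sample and channel is standardised on its own statistics.

    x: (N, C, H, W). Every feature becomes relative to its own image, echoing the
    ratios-not-raw-values principle; no batch-level statistics are involved.
    """
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```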

Population codes are vector embeddings. IT cortex has been doing nearest-neighbour retrieval over distributed firing patterns since long before the embedding space literature existed. The geometry of representation matters: similar objects cluster together, not because of an explicit loss function applied at a single layer, but because of how distributed codes develop across overlapping columns with shared connectivity.

Contextual modulation between same-level units is largely absent in deep learning. V1 neurons communicate laterally with neighbours detecting the same orientation. This within-layer communication resolves ambiguity (edge versus texture) without going up and down the hierarchy. Transformer self-attention across spatial positions is the closest modern equivalent, but it operates globally rather than in a structured local neighbourhood.

(Figure: visual pop-out versus serial search. Some feature differences are found instantly; finding a 9 among 6s requires serial search.)

Prediction error is more efficient than full-state encoding. PPC encodes residuals, not positions. The brain's world model updates by storing only what was unexpected. This is the core insight behind predictive coding frameworks in neuroscience, and it maps directly onto residual connections in ResNets and the update mechanisms in Kalman filters. Predictable motion is cheap to represent; surprises are expensive.

The architecture tells you what breaks independently. The clean dissociations in agnosia, achromatopsia, and akinetopsia reveal that colour, motion, form, and semantic identity are computed in genuinely separate circuits. This modularity is not accidental. Building monolithic end-to-end models that conflate these computations makes interpretability and targeted failure analysis harder. Modular architectures with dedicated pathways for different visual properties are not just biologically motivated. They are interpretable by construction.

You can find all of my research on this topic here: https://arshad221b.github.io/tiny-experiments/neuroscience/

Thanks for reading!