The Visual Cortex as a Computer Vision Architecture: Dorsal and Ventral Streams

A week spent studying, as a computer vision engineer, how the brain processes vision. Two comprehensive architectural diagrams synthesising what the brain actually does across the dorsal and ventral visual streams, and everything rewritten for engineers who build vision systems.

The Split You Did Not Know Was in Your Design

The visual cortex is not a single pipeline. It is two parallel streams that diverge from V1, each with its own purpose, speed, and computational strategy. The split begins at the retina: M-cells (magnocellular) feed the dorsal stream, P-cells (parvocellular) feed the ventral stream. This is not a cortical decision. It is a hardware decision made at the sensor level.

Computer vision engineers typically treat their input as a unified tensor. The brain ships two separate signal types from the first layer. This design choice propagates all the way to final outputs: one stream for where and how, one for what.

Dorsal Stream — "Where / How"

Retina → V1 → V2 / V3 → MT / V5 (motion energy) → MST (optical flow) → PPC (predictive coding) → motor cortex / frontal eye fields (efference copy)

Ventral Stream — "What"

Retina → LGN → V1 (fine detail, colour, high spatial frequency) → V2 (contours, figure-ground, illusory edges) → V4 (shapes, colour constancy, curves) → posterior IT (complex shapes) → anterior IT (recognition) → amygdala / hippocampus (emotional relevance and contextual memory)

Three principles recur throughout the entire system:

The brain uses relative perception, never absolute. Colour, brightness, and depth are computed by comparison, never by raw value.

(Figure: the same grey patch looks darker on a light background and lighter on a dark one; identical value, different perception depending on the background.)

Feedback outnumbers feedforward. Higher regions constantly predict what lower regions should see. The ratio of descending to ascending fibres in visual cortex is roughly 10:1.

Perception is a controlled hallucination. The brain fills gaps, constructs illusory edges, and resolves ambiguity at every processing stage. It does not record. It infers.


The Dorsal Stream: Real-Time Spatial Inference

Latency: 30 to 50 ms. Memory: none. Purpose: keep the organism alive.

The dorsal ("where/how") stream is an online system. It never stores state. It processes motion vectors, spatial maps, and prediction errors, then outputs motor commands. It does not recognise objects. A ball flying at your face triggers a dodge reflex before conscious identification occurs.

(Diagram: the dorsal "where / how" pathway. M-cells: fast, transient, change-detecting; real-time; no memory; action-oriented. Latency ~30–50 ms; neurons need change to fire.
Retina (M-cells, magnocellular): large receptive fields, fast transient response, low spatial frequency; detects changes in the scene and does not sustain a signal.
V1 (primary visual cortex): orientation-selective neurons (simple cells → complex cells), motion direction selectivity, broad spatial tuning; involuntary saccades at roughly 3 per second keep the image shifting on the retina.
V2 / V3: global disparity, coarse depth maps, motion boundary extraction; separates object motion from background motion.
MT / V5 (middle temporal area), mechanism: motion energy. A bank of directional filters tracking movement vectors; neurons perceive motion vectors, not objects; damage → akinetopsia, the world appears as a series of frozen stills.
MST (medial superior temporal), mechanism: optical flow. Computes self-motion from visual field changes; expansion and contraction patterns give heading direction and speed.
PPC (posterior parietal cortex), mechanism: predictive coding. Tracks prediction error, where an object should be versus where it is; maintains a 3D spatial model of distance, speed, and relative position; the brain's GPS, building and updating a real-time world map.
Motor cortex / pre-motor: the "how" (reach, grasp, dodge, catch). Frontal eye fields (FEF): the "where" (gaze direction, saccades). Efference copy cancels self-generated motion.)

Retina to V1

M-cells have large receptive fields and respond only to transient changes (onset/offset of signal). They fire on change and go silent on sustained input. This means if the image on the retina were perfectly stable, all M-cell-driven neurons would stop firing. Vision would fade. To prevent this, the visual system generates involuntary micro-saccades at approximately 3 per second, continuously shifting the image on the retina. The dorsal stream depends on constant motion in the input signal by design.
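In code, the closest minimal analogue of that transient response is temporal differencing: a unit that responds to frame-to-frame change and falls silent on a static input. A rough numpy sketch, where the threshold is an illustrative parameter rather than a biological constant:

```python
import numpy as np

def m_cell_response(frames, threshold=0.05):
    """Transient, M-cell-like response: fires on change, silent on a static input.

    frames: array of shape (T, H, W) with values in [0, 1].
    Returns shape (T-1, H, W): rectified frame-to-frame change.
    """
    diff = np.abs(np.diff(frames, axis=0))        # onset/offset energy only
    return np.where(diff > threshold, diff, 0.0)  # no change, no response

# A perfectly stable retinal image produces zero output everywhere,
# which is why micro-saccades keep the image moving.
static = np.full((10, 32, 32), 0.5)
assert m_cell_response(static).sum() == 0.0
```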

V1 adds orientation selectivity (simple cells to complex cells) and motion direction selectivity, with broad spatial frequency tuning appropriate for a system that cares about movement rather than fine detail.

MT/V5: Motion Energy, Not Object Tracking

MT contains a bank of directional filters. Each neuron is tuned to a specific direction and speed. These neurons track motion vectors, not objects. The output of MT is a field of movement vectors, not a scene description.
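As a loose engineering analogue, a directional filter bank can be sketched as a set of candidate motion vectors, each scored by how well it explains the change between two frames; the output is a vector field, not an object map. This toy sketch is not the full Adelson-Bergen motion energy model, and the direction set and scoring rule are illustrative choices:

```python
import numpy as np

# A tiny "filter bank": candidate motion vectors in pixels per frame.
DIRECTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (1, 1), (-1, -1), (1, -1), (-1, 1)]

def motion_energy(prev, curr):
    """Score each preferred direction by how well it explains the change between frames.

    prev, curr: greyscale frames of shape (H, W).
    Returns a (n_directions, H, W) stack; higher values mean the local image
    content moved in that direction between the two frames.
    """
    maps = []
    for dy, dx in DIRECTIONS:
        shifted = np.roll(prev, shift=(dy, dx), axis=(0, 1))
        maps.append(-np.square(curr - shifted))   # good match, high energy
    return np.stack(maps)

def motion_vector_field(prev, curr):
    """Per-pixel winner across the bank: a field of motion vectors, not objects."""
    winners = np.argmax(motion_energy(prev, curr), axis=0)   # (H, W) index of best direction
    return np.array(DIRECTIONS)[winners]                      # (H, W, 2) vector field
```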

The clinical evidence is unambiguous. Damage to MT causes akinetopsia: the world appears as a series of frozen snapshots. Pouring liquid into a cup becomes impossible because the rising level is invisible. Continuous motion is a computed product of this area, not a given.

For CV engineers: optical flow computation and motion estimation are the MT equivalent. Most deep models treat motion as an afterthought, concatenating frames or using temporal convolutions. MT suggests motion deserves its own dedicated pathway with its own feature representation, operating at lower spatial resolution and higher temporal resolution than the object recognition pathway.

MST: Global Optical Flow and Heading Estimation

MST takes local motion vectors from MT and integrates them into global optical flow. Expansion patterns signal forward movement. Contraction patterns signal backward movement. Rotation patterns signal turning. This is how heading direction is computed from visual information alone, without any inertial signal.

This is ego-motion estimation. The brain has a dedicated cortical area that does nothing else.
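To make the flow-to-heading step concrete: under pure forward translation, flow vectors radiate from the focus of expansion, so the point that best explains the radial pattern is a least-squares estimate of heading. A sketch under that idealised assumption (no rotation, no noise handling):

```python
import numpy as np

def focus_of_expansion(points, flow):
    """Estimate heading (focus of expansion) from an optical-flow field.

    points: (N, 2) pixel coordinates (x, y).
    flow:   (N, 2) flow vectors (vx, vy) at those points.
    Assumes pure forward translation: every flow vector points away from the FOE,
    so the cross product (p - foe) x v = 0 gives one linear equation per point.
    """
    x, y = points[:, 0], points[:, 1]
    vx, vy = flow[:, 0], flow[:, 1]
    A = np.stack([vy, -vx], axis=1)      # (N, 2)
    b = x * vy - y * vx                  # (N,)
    foe, *_ = np.linalg.lstsq(A, b, rcond=None)
    return foe                           # (foe_x, foe_y): heading point in the image

# Synthetic expansion pattern centred at (40, 30): the estimate recovers the heading.
pts = np.random.rand(200, 2) * 100
flw = 0.1 * (pts - np.array([40.0, 30.0]))
print(focus_of_expansion(pts, flw))      # close to [40, 30]
```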

PPC: Predictive Coding and the 3D World Model

The Posterior Parietal Cortex does not encode where every object is. It encodes prediction error, the difference between where objects should be and where they actually are. If an object moves exactly as predicted, PPC barely fires. Only surprises are encoded. This is remarkably efficient: the 3D world model is maintained as a residual rather than as a full state.
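A minimal sketch of residual encoding, assuming a constant-velocity internal model: the tracker predicts where the object should be and only the prediction error is kept. The gain and the model are illustrative, not a claim about PPC's actual dynamics:

```python
import numpy as np

def track_with_prediction_error(observations, gain=0.5):
    """Encode a trajectory as residuals against a constant-velocity prediction.

    observations: (T, 2) observed positions.
    Returns the prediction errors; predictable motion yields residuals that decay
    toward zero, so only surprises carry significant signal.
    """
    position = observations[0].astype(float)
    velocity = np.zeros(2)
    errors = []
    for obs in observations[1:]:
        predicted = position + velocity      # where the object *should* be
        error = obs - predicted              # the only thing worth encoding
        errors.append(error)
        # Update the internal model from the residual (simplified Kalman-style correction).
        position = predicted + gain * error
        velocity = velocity + gain * error
    return np.array(errors)

# An object moving at constant velocity becomes almost free to represent.
t = np.arange(50)[:, None]
trajectory = np.hstack([2.0 * t, 1.0 * t])   # straight-line motion
print(np.abs(track_with_prediction_error(trajectory))[-1])   # close to [0, 0]
```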

PPC also handles efference copy: a copy of outgoing motor commands is fed back to cancel the expected visual consequence of self-generated movement. Every eye movement should make the world appear to spin. The efference copy mechanism subtracts the predicted visual shift before it reaches higher areas. Without it, stable visual perception during saccades would be impossible.
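The cancellation itself is simple to state in code: a copy of the outgoing eye-movement command predicts the retinal shift the movement will cause, and that prediction is subtracted from what arrives. A toy sketch in made-up units of visual angle:

```python
import numpy as np

def perceived_motion(observed_retinal_shift, eye_movement_command):
    """Cancel self-generated motion using a copy of the outgoing motor command.

    A saccade of +5 degrees shifts the retinal image by -5 degrees; subtracting
    the predicted shift leaves only motion that originated in the world.
    """
    predicted_shift = -np.asarray(eye_movement_command, float)   # efference copy
    return np.asarray(observed_retinal_shift, float) - predicted_shift

# Eye moves 5 degrees right, image shifts 5 degrees left on the retina:
print(perceived_motion([-5.0], [5.0]))   # [0.]  the world is perceived as stable
# Same saccade, but an object also moved 2 degrees in the world:
print(perceived_motion([-3.0], [5.0]))   # [2.]  only the external motion remains
```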

PPC output splits into two pathways: the "how" pathway (reach, grasp, dodge, via motor cortex) and the "where" pathway (gaze direction, via frontal eye fields).


The Ventral Stream: Hierarchical Object Recognition

Latency: 100 to 150 ms. Strategy: hierarchical feature composition. Purpose: build meaning.

The ventral ("what") stream is the architecture that Hubel and Wiesel described in the 1960s, which directly inspired convolutional neural networks. It builds representations hierarchically from edges to contours to shapes to objects to semantic categories. Every stage is an abstraction over the previous one.

(Diagram: the ventral "what" pathway. P-cells: slow, sustained, high-resolution; hierarchical assembly; builds meaning. Latency ~100–150 ms; strategy: hierarchical; relative perception only.
Retina → LGN (parvocellular): P-cells with small receptive fields, colour-sensitive, fine detail; slow but sustained signal, unlike M-cells, which fire only at changes.
V1 (edge detection, low level): simple cells with ON/OFF responses to oriented edges (like CNN kernels); complex cells combine simple cells for position-invariant detection; contextual modulation via horizontal connections between same-orientation neurons; end-stopping (short lines signal object edges, long lines signal background). Evolution: mice handle it pre-cortex, cats with V1 simple cells, primates skip to complex cells.
V2 (contour integration, depth, segmentation): assembles V1 edges into continuous contour boundaries; illusory contours (Kanizsa triangles); border ownership, deciding figure versus ground for each edge; global disparity (versus V1's local); disparity capture propagation; Da Vinci stereopsis, where absence of information is a depth signal. "Perception is a controlled hallucination": V2 actively constructs reality.
V4 (shape, colour constancy, invariance): the bridge between features and objects. 1. Colour constancy, factoring out illumination and computing relative colour. 2. Shape extraction, the geometric properties of curves, contours, and angles. 3. The beginning of invariance, tolerating small position, rotation, and size changes. Heavy top-down input from IT; damage → achromatopsia, total colour loss with everything else intact.
Posterior IT (object perception): where visual features become objects; complete object representations, things rather than features; cortical columns roughly 400 μm wide that overlap and share knowledge; population coding, the firing pattern across neurons acting as a vector embedding; full invariance, with the same neuron firing regardless of size, position, or rotation; face-selective regions; long-range horizontal connections; roughly 30,000 objects from shared columns.
Anterior IT (semantic binding): where seeing becomes knowing; connects percepts to stored semantic knowledge and memory ("I see a face" → "this is my mother"); category-specific regions for living versus non-living, faces versus tools versus places; damage → associative agnosia (cannot name) or prosopagnosia (cannot identify faces).
Amygdala: emotional significance; threat or reward, does what I am seeing matter to me? Hippocampus: contextual memory; when, where, with whom; stores the full episode, not just the percept.
Top-down feedback: more fibres run downward than upward. Full pipeline: ~100–150 ms from photon to contextualised memory. Like a CNN: edges (V1) → contours (V2) → shapes (V4) → objects (IT) → meaning (anterior IT).)

V1: Edge Detection, Contextual Modulation, and End-Stopping

V1 simple cells have oriented ON/OFF receptive fields. They are edge detectors with a specific orientation preference. Complex cells combine multiple simple cells to achieve position invariance for a given orientation. This is the first abstraction: from local contrast to orientation-invariant edge.
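The standard computational rendering of this pair is a Gabor filter bank for simple cells and an energy-plus-pooling stage for complex cells. A compact sketch using numpy and scipy, with filter parameters chosen purely for illustration:

```python
import numpy as np
from scipy.signal import convolve2d
from scipy.ndimage import maximum_filter

def gabor(theta, size=15, sigma=3.0, wavelength=6.0, phase=0.0):
    """Oriented Gabor kernel, the standard model of a V1 simple-cell receptive field."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)            # coordinate along the preferred orientation
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength + phase)

def simple_cell(image, theta, phase=0.0):
    """Linear filter plus rectification: sensitive to exact position and phase."""
    return np.maximum(convolve2d(image, gabor(theta, phase=phase), mode="same"), 0.0)

def complex_cell(image, theta, pool=5):
    """Energy over a quadrature pair, max-pooled locally: position-tolerant for one orientation."""
    energy = simple_cell(image, theta, 0.0) ** 2 + simple_cell(image, theta, np.pi / 2) ** 2
    return maximum_filter(energy, size=pool)
```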

Two V1 mechanisms are worth particular attention for engineers.

Contextual modulation via horizontal connections: a V1 neuron communicates with neighbouring neurons detecting the same orientation. If neighbours are quiet (suggesting noise or isolated edge), the neuron fires more strongly (likely a real edge). If neighbours are also active (suggesting uniform texture), suppression occurs. This is spatial context used to disambiguate signal from noise at the lowest processing level. Deep learning models using local receptive fields lack this within-layer lateral communication.

(Figure: three neighbouring 45° detectors linked by horizontal connections. When neighbours are quiet, the neuron fires more strongly: likely a real edge. When neighbours are equally active, the response is suppressed: likely uniform texture. The classical receptive field responds only to its own field; the non-classical surround is the influence arriving through horizontal connections.)
End-stopping: inhibitory zones sit at both ends of each receptive field. Short line segments that fall within the excitatory zone produce strong responses (object edges). Long lines that extend into the inhibitory zones produce weak responses (background texture, which tends to consist of longer, more uniform lines). The network is doing figure-ground disambiguation at V1, the very first cortical stage.

(Figure: a receptive field with an excitatory centre flanked by inhibitory end zones. A short line produces strong firing, signalling an object edge; a long line extending into the end zones produces weak firing, signalling likely background.)
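End-stopping can be sketched as a one-dimensional receptive field with an excitatory centre and inhibitory end zones: the response grows with bar length until the bar reaches the end zones, then collapses. The weights and sizes below are made up for illustration:

```python
import numpy as np

def end_stopped_response(bar_length, field_length=21, excitatory_span=9):
    """Response of an end-stopped unit to a centred bar of a given length.

    The receptive field has +1 weights in the central excitatory span and
    -1 weights in the inhibitory zones at both ends.
    """
    weights = -np.ones(field_length)
    centre = field_length // 2
    half_exc = excitatory_span // 2
    weights[centre - half_exc:centre + half_exc + 1] = 1.0

    bar = np.zeros(field_length)
    half_bar = min(bar_length // 2, centre)
    bar[centre - half_bar:centre + half_bar + 1] = 1.0
    return max(float(weights @ bar), 0.0)       # rectified response

print(end_stopped_response(7))    # short bar inside the excitatory zone: strong response
print(end_stopped_response(21))   # long bar reaching the inhibitory ends: weak response
```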

The evolutionary trajectory here is interesting: in mice, edge detection occurs precortically. In cats, simple cells in V1 handle it. In primates, the system skips straight to complex cells. As edge detection moved into the cortex, it gained plasticity and connectivity to top-down feedback. The tradeoff was computational cost for adaptability.

V2: Contour Integration and Active Construction

V2 assembles V1 edge responses into continuous contour boundaries. It also perceives illusory contours: the Kanizsa triangle, where three Pac-Man shapes induce the perception of a triangle with clear edges despite no physical edges existing. V2 constructs those edges from surrounding context.

(Figure: the Kanizsa triangle. No edges are drawn, yet you see a triangle.)

V2 also handles border ownership: for any given edge, V2 neurons encode which side of the edge belongs to the figure and which to the background. This is not a local property of the edge. It requires integration of information across the scene.

(Figure: for each edge, one side is the figure, which owns the border, and the other is the ground, which continues behind it.)

For depth, V2 implements several mechanisms. Tuned excitatory neurons fire at one specific disparity depth. Tuned inhibitory neurons fire everywhere except at the fixation plane. Near/far cells coarsely categorise relative depth.

(Figure: disparity tuning profiles. Tuned excitatory cells fire at one specific depth only; tuned inhibitory cells are silent at the fixation plane and fire everywhere else; near and far cells coarsely signal whether a point lies nearer or farther than fixation.)

V2 also performs disparity capture: starting from high-confidence depth anchors (sharp edges, strong texture), depth estimates propagate outward to ambiguous regions.

(Figure: depth estimates spread outward from high-confidence anchor points into the surrounding ambiguous regions.)
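One generic way to sketch this propagation is confidence-weighted diffusion: anchor pixels keep their values while low-confidence pixels repeatedly adopt their neighbourhood average, so depth spreads outward from the anchors. This is a stand-in scheme, not a claim about V2's actual algorithm:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def propagate_disparity(disparity, confidence, iterations=50):
    """Spread depth outward from high-confidence anchors into ambiguous regions.

    disparity:  (H, W) initial estimates (unreliable where confidence is low).
    confidence: (H, W) values in [0, 1]; 1.0 marks an anchor such as a sharp edge
                or strongly textured region.
    """
    estimate = disparity.astype(float).copy()
    for _ in range(iterations):
        smoothed = uniform_filter(estimate, size=3)          # neighbourhood average
        # Anchors keep their own value; ambiguous pixels adopt their surroundings.
        estimate = confidence * disparity + (1.0 - confidence) * smoothed
    return estimate
```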

Da Vinci stereopsis is a particularly elegant mechanism: features visible to only one eye (the region occluded from the other eye's perspective) become depth signals. The absence of binocular information is itself information. Regions visible to only the left eye must be to the left of an occluding surface. The brain uses this systematically.

(Figure: a background wall seen past an occluder roughly 6 cm in front of it; a sliver of background on one side is visible only to the right eye and a sliver on the other side only to the left eye. Key insight: each eye sees a strip of background the other cannot, so the occluded region itself becomes a depth signal. No feature matching is needed; the absence of information is the depth cue.)

Amodal completion operates here too. When an object is partially occluded, V2 fills in the hidden contour. Each eye sees a slightly different hidden region, and these are combined into one completed object representation.

(Figure: amodal completion. The left eye sees the right side of a book, its left portion hidden by a cup; the right eye sees the left side, its right portion hidden; the brain perceives one complete book with the gap filled in.)

The phrase "perception is a controlled hallucination" applies most precisely here. V2 is not passively recording edges. It is actively constructing a scene model.

V4: Colour Constancy, Shape Extraction, and the Start of Invariance

V4 computes colour by comparing wavelength information relative to the surrounding scene, not by measuring absolute values. A red apple looks red under tungsten light and under daylight because V4 factors out the illuminant across the scene. The computational unit is a ratio, not an absolute measurement.

(Figure: V4's three jobs. 1. Colour constancy: factors out illumination and computes colour relative to the surroundings, so a surface matches under sunlight and fluorescent light. 2. Shape and curves: extracts geometric properties of contours, curves, and angles. 3. Invariance begins: small changes in position, rotation, and size are tolerated. V4 damaged → achromatopsia: total loss of colour, everything else survives.)

This is a direct demonstration of the relative-perception principle: the brain never uses raw pixel values. A mid-grey patch on a white background looks darker than the same patch on a black background. All chromatic and luminance perception is contextual.
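The simplest way to express ratios rather than raw values in code is the classical grey-world baseline: divide each channel by an estimate of the illuminant, here the per-channel scene mean. This is a textbook colour constancy baseline, not a model of V4 itself:

```python
import numpy as np

def grey_world_constancy(image, eps=1e-8):
    """Express every pixel relative to a scene-wide illuminant estimate.

    image: (H, W, 3). Dividing each channel by its scene mean removes a global
    colour cast, so a surface gets a similar descriptor under different lights.
    """
    illuminant = image.reshape(-1, 3).mean(axis=0)   # crude estimate of the light source
    return image / (illuminant + eps)

# The same surface under a warm and a cool light maps to the same relative values.
surface = np.array([0.8, 0.3, 0.2])
warm = np.ones((64, 64, 3)) * surface * np.array([1.2, 1.0, 0.7])   # tungsten-like cast
cool = np.ones((64, 64, 3)) * surface * np.array([0.8, 1.0, 1.2])   # daylight-like cast
print(grey_world_constancy(warm)[0, 0], grey_world_constancy(cool)[0, 0])   # both near [1, 1, 1]
```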

V4 also extracts shape properties (curvature, angle, geometric structure) and begins building transformation invariance, tolerating small changes in position, rotation, and scale. Damage to V4 causes achromatopsia: complete colour blindness while all other visual function remains intact. The colour computation is modular and localised.

IT Cortex: Population Codes and Distributed Object Representation

(Diagram: V4 (shape, colour constancy, invariance begins) → posterior IT (complete object representation: perception) → anterior IT (percepts connect to meaning) → amygdala / hippocampus (vision becomes cognition); from features to meaning.)

Posterior IT is where visual features become objects. Cortical columns approximately 400 micrometres wide contain neurons tuned to complete object representations. Critically, these columns overlap and share representations. The architecture is fundamentally different from V1, where orientation columns have hard borders. IT columns blend. Long-range horizontal connections allow a population of overlapping columns to represent approximately 30,000 distinct objects with a finite number of neurons.

(Figure: V1 columns are distinct (45°, 90°, 135°) with hard borders between them; IT columns (face, hand, body) overlap and blend, sharing knowledge. Each IT column is a vertical stack roughly 400 μm wide spanning the full depth of cortex; together they support recognition of roughly 30,000 objects.)

No single neuron encodes "face." The object is encoded as a firing pattern across thousands of neurons, a population code. This is directly analogous to vector embeddings: the firing pattern for a cat is more similar to the pattern for a dog than the pattern for a chair. Partially occluded or degraded inputs still produce recognisable patterns because the code is distributed and robust to noise.

(Figure: population codes shown as per-neuron firing rates. The pattern for "cat" is highly similar to the pattern for "dog" and dissimilar to the pattern for "chair". Similar objects have similar firing patterns: vector similarity in the brain.)
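The analogy to embeddings is direct enough to write down: treat each object's population response as a vector of per-neuron firing rates and compare with cosine similarity. The firing rates below are invented for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity between two population codes (vectors of per-neuron firing rates)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical firing rates across the same six IT neurons (spikes/sec, invented).
cat = np.array([42.0, 8.0, 55.0, 3.0, 30.0, 12.0])
dog = np.array([38.0, 11.0, 48.0, 6.0, 27.0, 15.0])
chair = np.array([4.0, 50.0, 2.0, 45.0, 6.0, 40.0])

print(cosine_similarity(cat, dog))     # high: similar objects, similar patterns
print(cosine_similarity(cat, chair))   # low: dissimilar objects, dissimilar patterns
```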

Posterior IT neurons are fully invariant: the same neuron fires for an object regardless of size, position, or rotation. Full transformation invariance is achieved at this stage, not earlier.

Anterior IT connects the percept to semantic knowledge. This is where "I see a face" becomes "this is my mother." It contains category-specific regions: living things versus non-living things, faces versus tools versus places.

The agnosias reveal the architecture through what breaks:

Apperceptive agnosia (posterior IT damaged): edges, colours, and motion are still seen, and V1 and V2 are intact, but the patient cannot form objects or draw them from memory.

Associative agnosia (anterior IT damaged): the patient can see and draw objects, so perception is intact, but cannot name an object or say what it is for.

Prosopagnosia (face-selective regions damaged): the patient sees faces and reads expressions, but cannot assign identity ("whose face is this?").

Category-specific agnosia: some patients lose living things but retain tools. Others lose fruits specifically. Living things cluster together in IT because they share perceptual properties (eyes, limbs, organic texture). Non-living things (rigid geometry, manufactured surfaces) are processed separately. The categorical organisation of IT is not arbitrary.

Living things share eyes, limbs, organic texture, and bilateral symmetry: animals, faces, plants, fruits and vegetables. Non-living things share rigid geometry, manufactured surfaces, and functional parts: tools, vehicles, buildings, instruments. They occupy separate IT regions, and damage to one category can leave the others intact.

Final Stage: Amygdala and Hippocampus

The amygdala assigns emotional valence: is this threatening or rewarding? The hippocampus stores the episode: not just what was seen, but when, where, and the surrounding context. The entire 100 to 150 ms journey from photon to cortex ends here, in a fully contextualised, emotionally tagged memory.


Implications for Computer Vision Engineering

Feedforward is not enough. The ratio of feedback to feedforward connections in biological visual cortex is approximately 10:1. Higher areas continuously predict what lower areas should see. This is top-down modulation, and it reaches all the way back to V1. Transformers introduced attention mechanisms that partially replicate this. Diffusion models use iterative refinement. But most standard CNN architectures remain entirely feedforward, which is the opposite of the biological design.

Separate your motion pathway. The two-stream architecture is not arbitrary. Motion information (fast, low spatial frequency, transient) and object information (slow, high spatial frequency, sustained) have different computational requirements. Forcing both through the same backbone with the same temporal stride is a design constraint, not a design choice.

Relative features generalise better than absolute ones. V4's colour constancy computation uses ratios. V1's contextual modulation uses neighbourhood comparisons. Batch normalisation and layer normalisation in deep networks partially capture this. Instance normalisation captures it more directly. The biological evidence suggests that relative measurement is not just a normalisation trick but a core representational principle.
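For reference, the instance normalisation mentioned above is just a per-sample, per-channel standardisation; a bare numpy version, without the learned affine parameters, looks like this:

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalisation: each sample and channel is standardised on its own statistics.

    x: (N, C, H, W). Every feature becomes relative to its own image, echoing the
    ratios-not-raw-values principle; no batch-level statistics are involved.
    """
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```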

Population codes are vector embeddings. IT cortex has been doing nearest-neighbour retrieval over distributed firing patterns since long before the embedding space literature existed. The geometry of representation matters: similar objects cluster together, not because of an explicit loss function applied at a single layer, but because of how distributed codes develop across overlapping columns with shared connectivity.

Contextual modulation between same-level units is largely absent in deep learning. V1 neurons communicate laterally with neighbours detecting the same orientation. This within-layer communication resolves ambiguity (edge versus texture) without going up and down the hierarchy. Transformer self-attention across spatial positions is the closest modern equivalent, but it operates globally rather than in a structured local neighbourhood.

(Figure: visual pop-out versus serial search. Some feature differences are found instantly; finding a 9 among 6s requires serial search.)

Prediction error is more efficient than full-state encoding. PPC encodes residuals, not positions. The brain's world model updates by storing only what was unexpected. This is the core insight behind predictive coding frameworks in neuroscience, and it maps directly onto residual connections in ResNets and the update mechanisms in Kalman filters. Predictable motion is cheap to represent; surprises are expensive.

The architecture tells you what breaks independently. The clean dissociations in agnosia, achromatopsia, and akinetopsia reveal that colour, motion, form, and semantic identity are computed in genuinely separate circuits. This modularity is not accidental. Building monolithic end-to-end models that conflate these computations makes interpretability and targeted failure analysis harder. Modular architectures with dedicated pathways for different visual properties are not just biologically motivated. They are interpretable by construction.

You can find all of my research on this topic here: https://arshad221b.github.io/tiny-experiments/neuroscience/

Thanks for reading!