Day 5 — Fri, May 1

Pretraining, feedforward, residual connections, and layer norm

  • I need to understand RAG and other things but I'll do that sequentially. Start with GPT. It's all about the decoder model (next token prediction), and the place to start understanding it is pretraining
  • Pretraining consumes almost 90% of the compute. So that's where we start

Study roadmap

The path from pretraining to modern reasoning models

What is pretraining? → GPT evolution → Reasoning models (thinking models; gemini · claude) → Diff techniques (MoE & others)

Pretraining

The thing that eats 90% of compute

  • Pre-training works on next token prediction. All we need to learn is the next token in the sequence. This practically gives us enormous data because there's so much text available in books, Reddit, and the web
  • Pre-training uses Causal Language Modeling (CLM)
  • It uses cross-entropy loss. Simple but I want to go deeper into loss functions later
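To make the CLM objective concrete, here's a minimal numpy sketch of the loss at one position: the model emits logits over the vocabulary, softmax turns them into probabilities, and cross-entropy is just the negative log-probability assigned to the true next token. The logits and vocab size are made-up toy values.

```python
import numpy as np

# Toy next-token prediction: vocab of 5 tokens, model emits raw scores
# (logits) for the next position; the label is the actual next token.
logits = np.array([2.0, 0.5, 0.1, -1.0, 0.3])  # hypothetical model output
target = 0  # index of the true next token

# Softmax turns logits into a probability distribution
# (subtracting the max is the standard numerical-stability trick)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Cross-entropy loss = -log(probability assigned to the true token)
loss = -np.log(probs[target])
```

During pretraining this loss is computed at every position in parallel and averaged, which is why every token of raw text is a training signal.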

GPT 1 to 3 and Chinchilla

  • As we progressed from GPT-1 to GPT-3, the objective remained the same. Only data quality and parameter count increased. Also they added 100x more compute on each generation
  • Chinchilla insight (DeepMind 2022): showed that, for a fixed compute budget, we should train smaller models on more data rather than larger models on less data. Chinchilla (70B) outperformed GPT-3 (175B) by training on far more tokens (~1.4T vs ~300B). Since then we've been following these scaling laws

What's left in the core transformer

Four blocks I haven't covered yet

Feed forward
Residual conn.
Layer norm
Cross attention

Feedforward network

  • The main thing feedforward offers is non-linearity. The attention mechanism computes a weighted sum of values — a linear combination. Without feedforward, the model can't do complex feature transformations
  • When we pass through the feedforward block, the information that attention mixed across tokens gets transformed per-token into richer, more contextualised features (and backpropagation shapes these transformations during training)
  • It's as simple as two linear layers with a non-linearity (ReLU or GELU) in between. The hidden layer is 4x wider than the embedding (e.g. 512 → 2048) to give the model much more space to play in
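The whole block fits in a few lines. A minimal numpy sketch with the 512 → 2048 → 512 shape from the notes (the tanh GELU approximation and the 0.02 init are GPT-2-style conventions, assumed here for illustration):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (the variant GPT-2 uses)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

d_model, d_ff = 512, 2048  # 4x expansion
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.02, (d_model, d_ff))  # expand: 512 -> 2048
b1 = np.zeros(d_ff)
W2 = rng.normal(0, 0.02, (d_ff, d_model))  # project back: 2048 -> 512
b2 = np.zeros(d_model)

def feedforward(x):
    # linear -> non-linearity -> linear, applied to each token independently
    return gelu(x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(10, d_model))  # 10 tokens, 512-dim embeddings
out = feedforward(x)                # same shape out: (10, 512)
```

Note the block operates on each token row independently; all the cross-token mixing happens in attention.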

Residual connections

  • Lets gradients flow all the way back to the early layers during backpropagation. Without them gradients vanish as the network gets deep. Pretty steady and stupidly simple — just add the input back to the output of each sublayer
  • Since GPT-2, we apply layer norm before each sublayer (pre-layernorm) instead of after (post-layernorm). This makes training much stabler
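Both orderings are one line each. A numpy sketch of the two block variants, with a dummy lambda standing in for the attention or feedforward sublayer (the helper names here are my own, not from any library):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalise each token's feature vector (the last axis)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # original Transformer ordering: residual add first, then normalise
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # GPT-2-style ordering: normalise first, then add the input back
    return x + sublayer(layer_norm(x))

x = np.random.default_rng(1).normal(size=(4, 8))  # 4 tokens, 8 dims
out = pre_ln_block(x, lambda h: 2 * h)  # dummy sublayer for illustration
```

The key property of pre-LN: if a sublayer outputs nothing useful (near zero), the block passes `x` through untouched, so there is always a clean identity path for gradients.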

Layer norm vs batch norm

Different axes, different stability

  • We have many layers and each layer depends on the layers before it. When training updates the weights of layers 1–4, the inputs arriving at layer 5 shift, so layer 5 keeps chasing a moving target. As we go deeper this makes training slower and less stable. This is called internal covariate shift — each layer's input distribution keeps changing because the layers before it keep updating
  • Layer norm solves this. Instead of normalising across the batch (columns, like batch norm), it normalises across each token's features (the row)
        dim 1   dim 2   dim 3   dim 4
cat     0.28    0.38    0.5     0.4
bat     0.32    0.5     0.85    0.9
sat     0.9     0.2     0.3     0.8

Layer norm normalises a row (each token's features, independently).
Batch norm normalises a column (each dim across the batch).

Batch norm → makes the data stable. Layer norm → makes the learning stable.
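The two norms differ only in which axis you normalise over. A numpy sketch using the cat/bat/sat numbers from the table above:

```python
import numpy as np

# Rows are tokens (cat, bat, sat), columns are embedding dims 1-4
X = np.array([
    [0.28, 0.38, 0.5,  0.4 ],   # cat
    [0.32, 0.5,  0.85, 0.9 ],   # bat
    [0.9,  0.2,  0.3,  0.8 ],   # sat
])

# Layer norm: normalise each ROW (one token's features) independently
ln = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Batch norm: normalise each COLUMN (one dim across the batch)
bn = (X - X.mean(axis=0, keepdims=True)) / X.std(axis=0, keepdims=True)
```

After layer norm every row has mean 0; after batch norm every column does. Layer norm never looks across the batch, which is why it works with any batch size and at inference time without running statistics.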