Day 5 — Fri, May 1
Pretraining, feedforward, residual connections, and layer norm
- I need to understand RAG and other things, but I'll do that sequentially. Start with GPT: it's all about the decoder model (next-token prediction), and to understand that we have to understand pretraining first
- Pretraining consumes almost 90% of the compute. So that's where we start
Study roadmap
The path from pretraining to modern reasoning models:
What is pretraining? → GPT evolution → Reasoning models (thinking models: Gemini, Claude) → Different techniques (MoE & others)
Pretraining
The thing that eats 90% of compute
- Pre-training works on next-token prediction: all the model has to learn is the next token in the sequence. That gives us practically unlimited training data, because there's so much text available in books, Reddit, and the web
- Pre-training uses Causal Language Modeling (CLM)
- It uses cross-entropy loss on the predicted next token. Simple, but I want to go deeper into loss functions later. A minimal sketch of the objective is below
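To make the objective concrete, here's a minimal sketch of causal language modeling with cross-entropy loss in PyTorch. The model itself is stubbed out with random logits, and `next_token_loss` is just an illustrative name, not anyone's official API.

```python
# Minimal sketch of the causal-LM objective, assuming a model that returns
# logits of shape (batch, seq_len, vocab_size). Names are illustrative.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    # Predict token t+1 from everything up to token t:
    # drop the last position's logits and the first position's targets.
    pred = logits[:, :-1, :]      # (batch, seq_len - 1, vocab)
    target = token_ids[:, 1:]     # (batch, seq_len - 1)
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),   # flatten to (N, vocab)
        target.reshape(-1),                # flatten to (N,)
    )

# Toy usage with random "logits" standing in for a model's output.
batch, seq_len, vocab = 2, 8, 100
logits = torch.randn(batch, seq_len, vocab)
tokens = torch.randint(0, vocab, (batch, seq_len))
print(next_token_loss(logits, tokens))  # scalar, roughly ln(vocab) for random logits
```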
GPT 1 to 3 and Chinchilla
- As we progressed from GPT-1 to GPT-3, the objective stayed the same; only data quality and parameter count increased. They also threw roughly 100x more compute at each generation
- Chinchilla insight (DeepMind, 2022): for a fixed compute budget, it's better to train a smaller model on more data than a bigger model on less, with a rule of thumb of roughly 20 training tokens per parameter. Chinchilla (70B) outperformed GPT-3 (175B) while being trained on far more tokens. Since then we've been following these scaling laws (rough numbers below)
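A rough back-of-envelope of what that means, assuming the commonly used approximation of ~6·N·D FLOPs for training and the ~20 tokens-per-parameter Chinchilla rule of thumb; the function name is mine.

```python
# Back-of-envelope Chinchilla-style sizing, using two common approximations:
# compute C ≈ 6 * N * D and compute-optimal tokens D ≈ 20 * N.
def chinchilla_optimal(params: float) -> tuple[float, float]:
    tokens = 20 * params            # rough compute-optimal token count
    flops = 6 * params * tokens     # rough training compute
    return tokens, flops

for n in (70e9, 175e9):
    tokens, flops = chinchilla_optimal(n)
    print(f"{n/1e9:.0f}B params -> ~{tokens/1e12:.1f}T tokens, ~{flops:.1e} FLOPs")

# 70B params -> ~1.4T tokens, which matches what Chinchilla was actually trained on;
# GPT-3's 175B, by contrast, saw only about 0.3T tokens.
```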
What's left in the core transformer
Four blocks I haven't covered yet
Feed forward → Residual connections → Layer norm → Cross attention
Feedforward network
- The main thing feedforward offers is non-linearity. The attention mechanism computes a weighted sum of values — a linear combination. Without feedforward, the model can't do complex feature transformations
- After the attention block mixes information across tokens, the feedforward block transforms each token's representation independently, turning that mixed-up information into richer, more contextualised features
- It's as simple as two linear layers with a non-linearity (ReLU or GELU) in between. The hidden layer is 4x the embedding size (e.g. 512 → 2048), giving the model much more space to play in before projecting back down; see the sketch below
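A minimal sketch of that block in PyTorch, using the 512 → 2048 numbers from the note (GELU picked arbitrarily over ReLU):

```python
# Position-wise feedforward block: expand, apply non-linearity, project back.
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)     # expand: 512 -> 2048
        self.act = nn.GELU()                       # the non-linearity (could be ReLU)
        self.down = nn.Linear(d_hidden, d_model)   # project back: 2048 -> 512

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Applied to every token position independently.
        return self.down(self.act(self.up(x)))

x = torch.randn(2, 10, 512)       # (batch, seq_len, d_model)
print(FeedForward()(x).shape)     # torch.Size([2, 10, 512])
```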
Residual connections
- They let gradients flow all the way back to the early layers during backpropagation. Without them, gradients vanish as the network gets deep. Pretty steady and stupidly simple: just add the input back to the output of each sublayer
- From GPT-2 onwards, layer norm goes before each sublayer (pre-layernorm) instead of after (post-layernorm). This makes training much more stable; both orderings are sketched below
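A sketch of the two orderings, with a plain linear layer standing in for the attention or feedforward sublayer (names are illustrative):

```python
# Pre-LN vs post-LN residual blocks. `sublayer` is a stand-in for
# attention or the feedforward network.
import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)

def pre_ln_block(x):
    # Pre-layernorm (GPT-2 onwards): normalise first, then add the input back.
    return x + sublayer(norm(x))

def post_ln_block(x):
    # Post-layernorm (original Transformer): add the input back, then normalise.
    return norm(x + sublayer(x))

x = torch.randn(2, 10, d_model)
print(pre_ln_block(x).shape, post_ln_block(x).shape)
```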
Layer norm vs batch norm
Different axes, different stability
- We have many layers and each layer's input depends on the outputs of all the layers before it. When layer 5's weights update, the input distribution that layer 6 sees shifts, and so on down the stack, which makes training slow and unstable as the network gets deeper. This is called internal covariate shift: each layer's input distribution keeps changing because the layers before it keep updating
- Layer norm solves this. Instead of normalising across the batch (columns, like batch norm), it normalises across each token's features (the row)
        dim 1   dim 2   dim 3   dim 4
cat     0.28    0.38    0.50    0.40
bat     0.32    0.50    0.85    0.90
sat     0.90    0.20    0.30    0.80

→ Layer norm normalises a row (one token's features, independent of the batch)
↓ Batch norm normalises a column (one feature dimension across the whole batch)
Batch norm → makes the data stable
Layer norm → makes the learning stable
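A quick check of which axis each norm averages over, using the same rows and columns as the table above (the cat/bat/sat values are copied straight from it):

```python
# Rows = tokens, columns = feature dims, exactly like the table above.
import torch
import torch.nn as nn

x = torch.tensor([[0.28, 0.38, 0.50, 0.40],   # cat
                  [0.32, 0.50, 0.85, 0.90],   # bat
                  [0.90, 0.20, 0.30, 0.80]])  # sat

layer_norm = nn.LayerNorm(4)     # normalises each row (one token's features)
batch_norm = nn.BatchNorm1d(4)   # normalises each column (one feature across the batch)

ln_out = layer_norm(x)
bn_out = batch_norm(x)

print(ln_out.mean(dim=1))   # ~0 per row: each token normalised independently
print(bn_out.mean(dim=0))   # ~0 per column: each feature normalised across the batch
```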