Day 5 — Fri, May 1

Pretraining, feedforward, residual connections, and layer norm

  • I need to understand RAG and other things but I'll do that sequentially. Start with GPT. It's all about the decoder model (next token prediction), and the place to start understanding it is pretraining
  • Pretraining consumes almost 90% of the compute. So that's where we start

Study roadmap

The path from pretraining to modern reasoning models

What is pretraining? → GPT evolution → Reasoning models (thinking models; gemini · claude) → Diff techniques (MoE & others)

Pretraining

The thing that eats 90% of compute

  • Pre-training works on next token prediction. All we need to learn is the next token in the sequence. This practically gives us enormous data because there's so much text available in books, Reddit, and the web
  • Pre-training uses Causal Language Modeling (CLM)
  • It uses cross-entropy loss. Simple but I want to go deeper into loss functions later
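To make the CLM objective concrete, here's a minimal numpy sketch of the loss at one position: the model emits logits over the vocabulary, softmax turns them into probabilities, and cross-entropy is just the negative log-probability assigned to the true next token. The logits and vocab size are made-up toy values.

```python
import numpy as np

# Toy next-token prediction: vocab of 5 tokens, model emits raw scores
# (logits) for the next position; the label is the actual next token.
logits = np.array([2.0, 0.5, 0.1, -1.0, 0.3])  # hypothetical model output
target = 0  # index of the true next token

# Softmax turns logits into a probability distribution
# (subtracting the max is the standard numerical-stability trick)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Cross-entropy loss = -log(probability assigned to the true token)
loss = -np.log(probs[target])
```

During pretraining this loss is computed at every position in parallel and averaged, which is why every token of raw text is a training signal.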

GPT 1 to 3 and Chinchilla

  • As we progressed from GPT-1 to GPT-3, the objective remained the same. Only data quality and parameter count increased. Also they added 100x more compute on each generation
  • Chinchilla insight (DeepMind 2022): showed that, for a fixed compute budget, we should train smaller models on more data rather than larger models on less data. Chinchilla (70B) outperformed GPT-3 (175B) by training on far more tokens (~1.4T vs ~300B). Since then we've been following these scaling laws

What's left in the core transformer

Four blocks I haven't covered yet

Feed forward
Residual conn.
Layer norm
Cross attention

Feedforward network

  • The main thing feedforward offers is non-linearity. The attention mechanism computes a weighted sum of values — a linear combination. Without feedforward, the model can't do complex feature transformations
  • When we pass through the feedforward block, the information that attention mixed across tokens gets transformed per-token into richer, more contextualised features (and backpropagation shapes these transformations during training)
  • It's as simple as two linear layers with a non-linearity (ReLU or GELU) in between. The hidden layer is 4x wider than the embedding (e.g. 512 → 2048) to give the model much more space to play in
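The whole block fits in a few lines. A minimal numpy sketch with the 512 → 2048 → 512 shape from the notes (the tanh GELU approximation and the 0.02 init are GPT-2-style conventions, assumed here for illustration):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (the variant GPT-2 uses)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

d_model, d_ff = 512, 2048  # 4x expansion
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.02, (d_model, d_ff))  # expand: 512 -> 2048
b1 = np.zeros(d_ff)
W2 = rng.normal(0, 0.02, (d_ff, d_model))  # project back: 2048 -> 512
b2 = np.zeros(d_model)

def feedforward(x):
    # linear -> non-linearity -> linear, applied to each token independently
    return gelu(x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(10, d_model))  # 10 tokens, 512-dim embeddings
out = feedforward(x)                # same shape out: (10, 512)
```

Note the block operates on each token row independently; all the cross-token mixing happens in attention.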

Residual connections

  • Lets gradients flow all the way back to the early layers during backpropagation. Without them gradients vanish as the network gets deep. Pretty steady and stupidly simple — just add the input back to the output of each sublayer
  • Since GPT-2, we apply layer norm before each sublayer (pre-layernorm) instead of after (post-layernorm). This makes training much stabler
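Both orderings are one line each. A numpy sketch of the two block variants, with a dummy lambda standing in for the attention or feedforward sublayer (the helper names here are my own, not from any library):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalise each token's feature vector (the last axis)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # original Transformer ordering: residual add first, then normalise
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # GPT-2-style ordering: normalise first, then add the input back
    return x + sublayer(layer_norm(x))

x = np.random.default_rng(1).normal(size=(4, 8))  # 4 tokens, 8 dims
out = pre_ln_block(x, lambda h: 2 * h)  # dummy sublayer for illustration
```

The key property of pre-LN: if a sublayer outputs nothing useful (near zero), the block passes `x` through untouched, so there is always a clean identity path for gradients.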

Layer norm vs batch norm

Different axes, different stability

  • We have many layers and each layer depends on the layers before it. When training updates the weights of layers 1–4, the inputs arriving at layer 5 shift, so layer 5 keeps chasing a moving target. As we go deeper this makes training slower and less stable. This is called internal covariate shift — each layer's input distribution keeps changing because the layers before it keep updating
  • Layer norm solves this. Instead of normalising across the batch (columns, like batch norm), it normalises across each token's features (the row)
        dim 1   dim 2   dim 3   dim 4
cat     0.28    0.38    0.5     0.4
bat     0.32    0.5     0.85    0.9
sat     0.9     0.2     0.3     0.8

Layer norm normalises a row (each token's features, independently).
Batch norm normalises a column (each dim across the batch).

Batch norm → makes the data stable. Layer norm → makes the learning stable.
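The two norms differ only in which axis you normalise over. A numpy sketch using the cat/bat/sat numbers from the table above:

```python
import numpy as np

# Rows are tokens (cat, bat, sat), columns are embedding dims 1-4
X = np.array([
    [0.28, 0.38, 0.5,  0.4 ],   # cat
    [0.32, 0.5,  0.85, 0.9 ],   # bat
    [0.9,  0.2,  0.3,  0.8 ],   # sat
])

# Layer norm: normalise each ROW (one token's features) independently
ln = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Batch norm: normalise each COLUMN (one dim across the batch)
bn = (X - X.mean(axis=0, keepdims=True)) / X.std(axis=0, keepdims=True)
```

After layer norm every row has mean 0; after batch norm every column does. Layer norm never looks across the batch, which is why it works with any batch size and at inference time without running statistics.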