Day 1 — Mon, Apr 27
The encoder block, BPE, and how tokens are made
- Goal: learn the transformer from scratch again for interviews. Not surface level. Intuitive understanding of every piece
- Drew my study graph: embedding → positional encoding → multi-head attention → feedforward layers, with branches into the scaled attention score and the QKV projections
- Starting with the encoder block. 6 layers in the original transformer. Residual connections skip each sublayer. Layer normalization wraps each sublayer: LayerNorm(x + sublayer(x)). I know layer norm is important but I need to understand why
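To make the wiring concrete, here's a minimal PyTorch sketch of one encoder layer in the post-norm style described above, LayerNorm(x + sublayer(x)). The class name, dimensions, and the use of nn.MultiheadAttention are my own choices for illustration, not the paper's code:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer, post-norm style: LayerNorm(x + sublayer(x))."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Sublayer 1: multi-head self-attention, then residual + LayerNorm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.drop(attn_out))
        # Sublayer 2: position-wise feedforward, then residual + LayerNorm
        x = self.norm2(x + self.drop(self.ff(x)))
        return x

# The original encoder stacks 6 of these layers
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
```

The residual path means each sublayer only has to learn a correction on top of x, which is part of why deep stacks of these train at all.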
Study graph: embedding → positional encoding → multi-head attention (branches: scaled attention score · QKV method) → feedforward layers
BPE (Byte Pair Encoding)
A smart tokenizer that works much better than a plain word-level vocabulary
- Pretty simple. Break each word into characters and merge the most frequent adjacent pair. Keep merging until you hit the target vocabulary size
- For example: newest → ['n','e','w','e','s','t'], oldest → ['o','l','d','e','s','t'], creative → ['c','r','e','a','t','i','v','e']
- Combine these, and the pair ('e','s') is the most frequent, so it gets merged first; then ('es','t') becomes 'est'. We keep merging until we reach the number of merges (and hence vocabulary size) we decided on. A toy version of the merge loop is sketched after this list
- Frequent words become single tokens: 'the' → ['the']. Rare words get split into meaningful pieces: 'unhappiness' → ['un','happiness']. Unknown words don't crash the system. Morphology emerges naturally: un-, en-, pre-, -ing, -tion, -est
- GPT-2 onwards don't use plain BPE but a version called byte-level BPE. The text is converted to UTF-8 bytes first (256 possible byte values, 0-255) and the merges run on bytes. Since every string is some sequence of bytes, there are never unknown symbols, so Chinese and other scripts work without needing a huge base character vocabulary
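Here's a toy version of the merge loop on the three-word corpus from the notes. This is the simple character-level variant, not GPT-2's byte-level one, and the function names are mine:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words (vocab maps word-tuples to frequency)."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

# Toy corpus from the notes, each word split into characters
vocab = {tuple("newest"): 1, tuple("oldest"): 1, tuple("creative"): 1}

num_merges = 5  # we decide how many merges (i.e. how big the final vocab is)
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]  # most frequent adjacent pair, e.g. ('e', 's')
    vocab = merge_pair(best, vocab)
    print("merged", best)
```

Running it merges ('e','s') first and then ('es','t') into 'est', which is exactly the -est suffix showing up as learned morphology.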
Positional embedding
- In the original transformer they used sine and cosine waves to encode position, one wave per embedding dimension. So if we have 20 dimensions we get 10 sine and 10 cosine waves at different frequencies (sketched in code after this list)
- This is clever, but it's rarely used anymore because the waves are fixed; modern models mostly want learnable positional parameters
- In ViT we have learned positional embeddings instead of the fixed sinusoidal ones. I still don't fully understand how learned positional embeddings work yet
- I want to learn what RoPE is. First, why sine-cosine fails (RoPE is the fix for these):
- 1) Sinusoidal encodings are absolute: a token's positional signal comes purely from its index in the sequence, regardless of content. In 'John meeting Merry' vs 'We saw John greet Merry', what matters is the relative distance between John and Merry, but the fixed waves only encode where each token sits in its own sentence
- 2) On bigger context windows it breaks down — the fixed waves weren't designed to generalise that far
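A small sketch to make the "10 sine + 10 cosine waves" point concrete, following the standard formulation PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). The helper name is mine, and the learned alternative at the end is just an illustration of the ViT-style idea:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed sine/cosine table of shape (max_len, d_model); assumes d_model is even."""
    position = torch.arange(max_len).unsqueeze(1)                  # (max_len, 1)
    # 10000^(-2i/d) for each pair of dimensions
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dims: sine waves
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dims: cosine waves
    return pe  # added to the token embeddings, never trained

# d_model = 20 -> 10 sine and 10 cosine waves, one frequency per pair of dims
pe = sinusoidal_positional_encoding(max_len=128, d_model=20)

# The learned alternative (ViT-style): a trainable tensor added to the embeddings,
# one row per position, updated by backprop like any other weight
learned_pe = nn.Parameter(torch.zeros(128, 20))
```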
Input embeddings and what I need to understand
- I need to understand the input embeddings as well, and go deeper on BPE (used in the original transformer)
- The layer normalization is important but I don't fully understand why yet