Day 1 — Mon, Apr 27
The encoder block, BPE, and how tokens are made
- Goal: learn the transformer from scratch again for interviews. Not surface level. Intuitive understanding of every piece
- Drew my study graph: embedding → positional encoding → multi-head attention → feedforward layers, with branches into the scaled attention score and the QKV projections
- Starting with the encoder block. 6 layers in the original transformer. Residual connections skip each sublayer. Layer normalization wraps each sublayer: LayerNorm(x + sublayer(x)). I know layer norm is important but I need to understand why
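To make the wiring concrete, here's a minimal PyTorch sketch of one encoder layer in the post-norm style described above, LayerNorm(x + sublayer(x)). The class name, dimensions, and the use of nn.MultiheadAttention are my own choices for illustration, not the paper's code:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer, post-norm style: LayerNorm(x + sublayer(x))."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Sublayer 1: multi-head self-attention, then residual + LayerNorm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.drop(attn_out))
        # Sublayer 2: position-wise feedforward, then residual + LayerNorm
        x = self.norm2(x + self.drop(self.ff(x)))
        return x

# The original encoder stacks 6 of these layers
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
```

The residual path means each sublayer only has to learn a correction on top of x, which is part of why deep stacks of these train at all.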
Study graph: embedding → positional encoding → multi-head attention (branches: scaled attention score · QKV method) → feedforward layers
BPE (Byte Pair Encoding)
A smart tokenizer that works much better than a plain word-level vocabulary
- Pretty simple. Break each word into characters and merge the most frequent adjacent pair. Keep merging until you hit the target vocabulary size
- For example: newest → ['n','e','w','e','s','t'], oldest → ['o','l','d','e','s','t'], creative → ['c','r','e','a','t','i','v','e']
- Combine these, and the pair ('e','s') is the most frequent, so it gets merged first; then ('es','t') becomes 'est'. We keep merging until we reach the number of merges (and hence vocabulary size) we decided on. A toy version of the merge loop is sketched after this list
- Frequent words become single tokens: 'the' → ['the']. Rare words get split into meaningful pieces: 'unhappiness' → ['un','happiness']. Unknown words don't crash the system. Morphology emerges naturally: un-, en-, pre-, -ing, -tion, -est
- GPT-2 onwards don't use plain BPE but a version called byte-level BPE. The text is converted to UTF-8 bytes first (256 possible byte values, 0-255) and the merges run on bytes. Since every string is some sequence of bytes, there are never unknown symbols, so Chinese and other scripts work without needing a huge base character vocabulary
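Here's a toy version of the merge loop on the three-word corpus from the notes. This is the simple character-level variant, not GPT-2's byte-level one, and the function names are mine:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words (vocab maps word-tuples to frequency)."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

# Toy corpus from the notes, each word split into characters
vocab = {tuple("newest"): 1, tuple("oldest"): 1, tuple("creative"): 1}

num_merges = 5  # we decide how many merges (i.e. how big the final vocab is)
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]  # most frequent adjacent pair, e.g. ('e', 's')
    vocab = merge_pair(best, vocab)
    print("merged", best)
```

Running it merges ('e','s') first and then ('es','t') into 'est', which is exactly the -est suffix showing up as learned morphology.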
Positional embedding
- In the original transformer they used sine and cosine waves to encode position, one wave per embedding dimension. So if we have 20 dimensions we get 10 sine and 10 cosine waves at different frequencies (sketched in code after this list)
- This is clever, but it's rarely used anymore because the waves are fixed; modern models mostly want learnable positional parameters
- In ViT we have learned positional embeddings instead of the fixed sinusoidal ones. I still don't fully understand how learned positional embeddings work yet
- I want to learn what RoPE is. First, why sine-cosine fails (RoPE is the fix for these):
- 1) Sinusoidal encodings are absolute: a token's positional signal comes purely from its index in the sequence, regardless of content. In 'John meeting Merry' vs 'We saw John greet Merry', what matters is the relative distance between John and Merry, but the fixed waves only encode where each token sits in its own sentence
- 2) On bigger context windows it breaks down — the fixed waves weren't designed to generalise that far
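A small sketch to make the "10 sine + 10 cosine waves" point concrete, following the standard formulation PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). The helper name is mine, and the learned alternative at the end is just an illustration of the ViT-style idea:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed sine/cosine table of shape (max_len, d_model); assumes d_model is even."""
    position = torch.arange(max_len).unsqueeze(1)                  # (max_len, 1)
    # 10000^(-2i/d) for each pair of dimensions
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dims: sine waves
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dims: cosine waves
    return pe  # added to the token embeddings, never trained

# d_model = 20 -> 10 sine and 10 cosine waves, one frequency per pair of dims
pe = sinusoidal_positional_encoding(max_len=128, d_model=20)

# The learned alternative (ViT-style): a trainable tensor added to the embeddings,
# one row per position, updated by backprop like any other weight
learned_pe = nn.Parameter(torch.zeros(128, 20))
```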
Input embeddings and what I need to understand
- I need to understand the input embeddings as well, and go deeper on BPE (used in the original transformer)
- The layer normalization is important but I don't fully understand why yet