
Overview of Modern Nets

In Progress

10 days (2 weeks) · 1 hr / day · Started April 27, 2026

The Spark

I know transformers. I've used them. I've built things with them. But if you sat me down in an interview and asked me to explain the full architecture from tokenizer to output, I'd stumble on specifics. Not because I don't understand the concepts, but because I never sat with each piece long enough to explain it simply. That gap between "I get it" and "I can teach it" is exactly what bothers me.

So this is a revision. Not a first pass. I'm going back through the original transformer, BPE, positional encodings, attention, masking, BERT vs GPT, the GPT family, and whatever else comes up. The goal is intuitive understanding. No heavy maths for now. Just making sure every block in the architecture makes sense to me at the level where I can draw it on a whiteboard and explain why each piece is there.

6 / 10 days
Day 1 Mon, Apr 27

The encoder block, BPE, and how tokens are made

Starting from the encoder side of the original transformer. Residual connections, layer norm, BPE tokenization, and why byte-level BPE replaced the original character-level scheme. Drew the full study graph of what I need to cover.
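To make the merge loop concrete, here's a minimal sketch of classic BPE training in the spirit of Sennrich et al. (2016): count adjacent symbol pairs, merge the most frequent, repeat. The corpus and merge count are toy values; byte-level BPE (as in GPT-2) runs the same loop but starts from the 256 byte values instead of characters, so no input can ever be out-of-vocabulary.

```python
import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def apply_merge(pair, vocab):
    """Fuse every standalone occurrence of `pair` into a single new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(5):
    counts = pair_counts(vocab)
    best = max(counts, key=counts.get)   # most frequent adjacent pair
    vocab = apply_merge(best, vocab)
    print(f"merge {step + 1}: {best}")
```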

Read more →
Day 2 Tue, Apr 28

Multi-head attention and why we use dot products

Focused on Q, K, V and the attention mechanism: why dot products measure alignment, why we scale by √d_k before the softmax, and how multiple heads let the model attend to different things at once.
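A toy NumPy sketch of the mechanism as I understand it from the paper: the dot product of a query with each key scores their alignment, the √d_k divisor keeps those scores from saturating the softmax, and multiple heads are just independent attention runs over split subspaces, concatenated at the end. The shapes and head count here are illustrative values, not any real model's.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # dot products score Q-K alignment
    return softmax(scores) @ V                      # softmax over keys, then mix values

# Multi-head: split d_model into h subspaces, attend in each, concatenate.
rng = np.random.default_rng(0)
seq, d_model, h = 4, 8, 2
x = rng.normal(size=(seq, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def split_heads(t):
    return t.reshape(seq, h, d_model // h).transpose(1, 0, 2)  # (h, seq, d_head)

heads = attention(split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v))
out = heads.transpose(1, 0, 2).reshape(seq, d_model)           # concat heads
print(out.shape)  # (4, 8)
```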

Read more →
Day 3 Wed, Apr 29

Masking, padding, and BERT vs GPT

Two types of masking in transformers: padding masks hide pad tokens wherever they appear, causal masks stop the decoder from attending to future positions. Then compared BERT (encoder-only, bidirectional) and GPT (decoder-only, autoregressive). Now I understand the fundamental split.
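Both masks reduce to the same trick: set disallowed attention scores to -inf before the softmax, so their weight comes out exactly zero. A small sketch with toy shapes:

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular: position i may only attend to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def padding_mask(lengths, n):
    """True where the key position holds a real token, False where it's padding."""
    return np.arange(n)[None, :] < np.asarray(lengths)[:, None]

scores = np.zeros((4, 4))                          # stand-in for QK^T / sqrt(d_k)
print(np.where(causal_mask(4), scores, -np.inf))   # upper triangle -> -inf -> weight 0
print(padding_mask([2, 4], 4))                     # batch of sequences, lengths 2 and 4
```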

Read more →
Day 4 Thu, Apr 30

The transformer architecture, drawn from scratch

Drew the full transformer architecture diagram. Encoder layer: input embedding → positional encoding → multi-head attention → add & norm → feedforward → add & norm. Decoder adds cross-attention. Revisited the checklist of what I still need to cover.
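The same flow in code: a minimal post-norm encoder layer (the original paper's "add & norm" ordering) built on PyTorch's stock attention module. d_model=512, 8 heads, and d_ff=2048 are the base-model defaults from the paper; positional encoding is assumed to already be added to x before the call. A sketch, not a full implementation.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One post-norm transformer encoder layer: attention -> add & norm -> FFN -> add & norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, key_padding_mask=None):
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + attn_out)     # add & norm after self-attention
        x = self.norm2(x + self.ff(x))   # add & norm after feedforward
        return x

x = torch.randn(2, 10, 512)              # (batch, seq, d_model), embeddings + positions
print(EncoderLayer()(x).shape)            # torch.Size([2, 10, 512])
```

A decoder layer would add a causally masked self-attention step and a cross-attention step over the encoder output, with the same add & norm wrapping.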

Read more →
Day 5 Fri, May 1

Pretraining, feedforward, residual connections, and layer norm

Mapped out the full learning path from pretraining to reasoning models. Covered what pretraining actually does, Chinchilla scaling laws, feedforward as non-linearity, residual connections for gradient flow, and why layer norm beats batch norm for transformers.
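The layer-norm vs batch-norm point is easy to show numerically: layer norm computes statistics per token over the feature dimension, so it behaves identically at any batch size, while batch norm's per-feature statistics depend on whatever else is in the batch and collapse at batch size 1. A sketch with toy shapes:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-token statistics over features: independent of batch size and padding."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def batch_norm(x, eps=1e-5):
    """Per-feature statistics over the batch: depends on batch contents."""
    return (x - x.mean(0, keepdims=True)) / np.sqrt(x.var(0, keepdims=True) + eps)

x = np.random.default_rng(0).normal(size=(3, 6))   # (tokens, features)
print(layer_norm(x[:1]))                           # same result at any batch size
print(batch_norm(x[:1]))                           # all zeros: single-sample stats collapse
```

And the residual connection is just y = x + sublayer(x): the identity path gives gradients a direct route through every layer, which is what lets these stacks go deep.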

Read more →
Day 6 Sun, May 3

LLaMA, Mistral, and the road to reasoning models

Mapped the modern model landscape. Reasoning models think in intermediate steps; multimodal models just map other modalities into the same embedding space. LLaMA stayed small but beat GPT-3 via Chinchilla-style training + RMSNorm + RoPE + SwiGLU. Mistral doubled the data and added sliding window attention and GQA. Going straight to the chain-of-thought paper next.
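Two of the LLaMA-era swaps are small enough to sketch directly: RMSNorm drops layer norm's mean-centering and just rescales by the root mean square of the features, and SwiGLU replaces the plain ReLU feedforward with a gated one. The shapes below are toy values, not the real model's; RoPE, sliding window attention, and GQA would need longer sketches.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square of features; no mean subtraction."""
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps) * gain

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feedforward: SiLU(x W_gate) gates x W_up elementwise, then project down."""
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d, d_ff = 8, 16                                    # toy sizes
x = rng.normal(size=(4, d))
out = swiglu_ffn(rms_norm(x, np.ones(d)),
                 rng.normal(size=(d, d_ff)), rng.normal(size=(d, d_ff)),
                 rng.normal(size=(d_ff, d)))
print(out.shape)                                   # (4, 8)
```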

Read more →