Day 4 — Thu, Apr 30

The transformer architecture, drawn from scratch

  • My goal for today is a deep dive into transformer models. Nothing else.
  • Checklist: what encoder models are, what decoder models are, what the training strategies are, and what LLM objectives look like in computer vision

Transformer architecture (full diagram)

Drew the complete architecture from the original paper, "Attention Is All You Need" (Vaswani et al., 2017)

  • I want to learn the transformer from scratch again for interviews. Starting with: 1) tokenizer and BPE (see the sketch below)
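Since BPE is first on the list, here is a minimal sketch of the merge-learning loop on a toy corpus. The word frequencies and the 10-merge budget are made up for illustration; real tokenizers (e.g. the original Sennrich et al. implementation) add escaping and efficiency tricks on top of the same idea.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace each whole-symbol occurrence of the pair with its merged symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges = []
for _ in range(10):  # learn 10 merges
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    merges.append(best)
print(merges)  # the learned merge rules, most frequent first
```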
Encoder layer (bottom to top):

  • Inputs → Input embedding → ⊕ positional encoding
  • Multi-head attention → Add & norm
  • Feed forward → Add & norm
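To make the encoder half concrete, here is a minimal PyTorch sketch of one encoder layer plus the sinusoidal positional encoding. The sizes (d_model=512, 8 heads, d_ff=2048) are the paper's defaults; the names `EncoderLayer` and `positional_encoding` are mine, not a library API, so treat this as a sketch rather than a reference implementation.

```python
import torch
import torch.nn as nn

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding from the paper: sin on even dims, cos on odd dims.
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

class EncoderLayer(nn.Module):
    """Self-attention -> add & norm -> feed-forward -> add & norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention sublayer with residual connection ("add & norm").
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.drop(attn_out))
        # Position-wise feed-forward sublayer with its own residual.
        return self.norm2(x + self.drop(self.ff(x)))

x = torch.randn(2, 10, 512)           # (batch, seq_len, d_model) token embeddings
x = x + positional_encoding(10, 512)  # the ⊕ in the diagram
print(EncoderLayer()(x).shape)        # torch.Size([2, 10, 512])
```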
Decoder block (bottom to top):

  • Outputs (shifted right) → Output embedding → ⊕ positional encoding
  • Multi-head attention (masked) → Add & norm
  • Multi-head attention (cross, over the encoder output) → Add & norm
  • Feed forward → Add & norm
  • Linear → Softmax → Output
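The decoder layer differs in two ways the diagram shows: the first attention block is masked so position i cannot see positions after i, and a second (cross) attention block reads the encoder output. A hedged sketch on the same assumed sizes, reusing PyTorch's `nn.MultiheadAttention` (a boolean `attn_mask` with True meaning "do not attend"):

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Masked self-attention -> cross-attention over encoder output -> feed-forward,
    each followed by add & norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, memory):
        # Causal mask: True above the diagonal marks future positions as forbidden.
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        out, _ = self.self_attn(x, x, x, attn_mask=causal)
        x = self.norm1(x + self.drop(out))
        # Cross-attention: queries from the decoder, keys/values from the encoder output.
        out, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + self.drop(out))
        return self.norm3(x + self.drop(self.ff(x)))

memory = torch.randn(2, 10, 512)  # encoder stack output
tgt = torch.randn(2, 7, 512)      # shifted-right target embeddings (+ positional encoding)
print(DecoderLayer()(tgt, memory).shape)  # torch.Size([2, 7, 512])
```

In the full model, the decoder stack's output then goes through the final Linear → Softmax to produce next-token probabilities over the vocabulary.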