Day 3 — Wed, Apr 29

Masking, padding, and BERT vs GPT

  • I now have a good enough understanding of multi-head attention. Today's flow: understanding masking → GPT and BERT → modern variations
  • Two types of masking in transformers

Encoder mask: padding mask

  • We usually train in batches, so shorter sequences are padded to a common length. A batch looks like this:
  • [The] [cat] [sat] [PAD] [PAD]
  • [I] [love] [cats] [PAD] [PAD]
  • The [PAD] tokens should contribute nothing, so we set their attention scores to negative infinity; after softmax those positions get exactly zero weight and don't contribute to the output
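The padding mask above can be sketched in NumPy (toy random scores, illustrative only — real models mask per-example inside the attention layer):

```python
import numpy as np

np.random.seed(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy sequence of 5 positions; the last two are [PAD].
# scores[i, j] = raw attention score of query i toward key j.
scores = np.random.randn(5, 5)
pad_mask = np.array([False, False, False, True, True])  # True at [PAD]

# Set scores toward PAD keys to -inf so softmax gives them weight 0.
masked = np.where(pad_mask[None, :], -np.inf, scores)
weights = softmax(masked, axis=-1)

print(weights[:, 3:])  # columns for the PAD keys are all zero
```

Each row still sums to 1 — the probability mass is simply redistributed over the real tokens.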

Decoder mask: causal mask

  • Blocks the model from attending to future tokens (it only ever predicts the next token), so the attention map becomes lower triangular
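The lower-triangular structure can be shown the same way (toy scores, a minimal sketch):

```python
import numpy as np

np.random.seed(0)
T = 4
scores = np.random.randn(T, T)

# Causal mask: position i may attend only to positions j <= i.
causal = np.tril(np.ones((T, T), dtype=bool))
masked = np.where(causal, scores, -np.inf)

# Row-wise softmax; the -inf entries become exactly zero weight.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# The strict upper triangle (future positions) is all zeros.
print(np.triu(weights, 1).sum())  # 0.0
```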

BERT vs GPT

  • BERT is an encoder-only model; GPT is a decoder-only model
  • BERT uses a bidirectional encoder: during pre-training, random tokens are masked and the model predicts them using context from both sides (random masking, i.e. masked language modelling). This makes it good for understanding context — classification, NER
  • GPT does next-token prediction: it always predicts only the next token, which is why generation proceeds one token at a time
  • Now I want to dive into GPT-2, GPT-3, and beyond, and understand how these models are made to work
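The two training objectives can be contrasted with a toy example (plain Python; the 15% masking rate matches BERT's paper, the rest is illustrative):

```python
import random

random.seed(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# BERT-style: randomly replace ~15% of tokens with [MASK]; the model
# predicts the originals using context from BOTH sides.
bert_input = [t if random.random() > 0.15 else "[MASK]" for t in tokens]

# GPT-style: the target is the same sequence shifted left by one —
# each position always predicts the NEXT token.
gpt_input, gpt_target = tokens[:-1], tokens[1:]

print(bert_input)
print(list(zip(gpt_input, gpt_target)))
```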

The GPT family study plan

This should be much easier to understand now

  • GPT is the Generative Pre-trained Transformer, a decoder-only model. It builds on large-scale pre-training, and there are many methods for doing this
  • Plan: what GPT means → differences between GPT, GPT-2 & GPT-3+ (attention, masking, data, parallelisation) → scaling & nuances