Day 3 — Wed, Apr 29
Masking, padding, and BERT vs GPT
- Good enough understanding of multi-head attention now. Today's flow: understanding masking → GPT and BERT → modern variations
- Two types of masking in transformers
Encoder mask: padding mask
- When processing text we usually train in batches, so shorter sequences are padded up to a common length. A batch looks like this:
- [The] [cat] [sat] [PAD] [PAD]
- [I] [love] [cats] [PAD] [PAD]
- The [PAD] tokens contribute nothing, so we set their attention scores to negative infinity. Softmax turns those scores into exactly zero, so the padded positions don't contribute to the output (sketch below)
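A minimal sketch of this in PyTorch (the token ids, shapes, and PAD_ID are made up for illustration):

```python
import torch
import torch.nn.functional as F

PAD_ID = 0
batch = torch.tensor([
    [5, 9, 7, PAD_ID, PAD_ID],   # [The] [cat] [sat] [PAD] [PAD]
    [3, 8, 6, PAD_ID, PAD_ID],   # [I] [love] [cats] [PAD] [PAD]
])

# Attention scores for one head: (batch, query_len, key_len); random for the demo.
scores = torch.randn(2, 5, 5)

# Set the scores of [PAD] key positions to -inf so softmax gives them weight 0.
key_is_pad = (batch == PAD_ID).unsqueeze(1)        # (batch, 1, key_len)
scores = scores.masked_fill(key_is_pad, float("-inf"))

weights = F.softmax(scores, dim=-1)
print(weights[0, 0])   # the last two entries (the [PAD] keys) are exactly 0
```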
Decoder masking: causal mask
- Here we block the transformer from seeing future tokens (it only does next-token prediction), so the attention map becomes lower triangular (sketch below)
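A matching sketch of the causal mask (again, just illustrative shapes):

```python
import torch
import torch.nn.functional as F

seq_len = 5
scores = torch.randn(1, seq_len, seq_len)          # (batch, query_len, key_len)

# Keep the lower triangle (past and present); set the upper triangle (future) to -inf.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~causal, float("-inf"))

weights = F.softmax(scores, dim=-1)
print(weights[0])   # lower triangular: row i has zero weight on every key j > i
```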
BERT vs GPT
- BERT is an encoder-only model; GPT is a decoder-only model
- BERT uses a bidirectional encoder. It is pre-trained with random masking: some tokens are hidden and predicted from context on both sides. That makes it good at understanding context, so it suits classification and NER (a small sketch of this masking is after this list)
- GPT is trained for next-token prediction: given everything so far, it only ever predicts the next token, which is why GPT generates one token at a time
- Now I want to dive into GPT-2 and GPT-3 and other things. I want to understand how to make these models work
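A rough sketch of BERT-style random masking. This is simplified (real BERT sometimes keeps or swaps the chosen tokens instead of always using [MASK]), and the ids below are made up:

```python
import torch

MASK_ID = 103                      # made-up id for [MASK]
tokens = torch.tensor([[5, 9, 7, 12, 6, 2, 11, 4]])

# Hide ~15% of positions; the model must predict the originals
# using context from BOTH sides of each masked position.
mask_positions = torch.rand(tokens.shape) < 0.15
inputs = tokens.clone()
inputs[mask_positions] = MASK_ID

labels = tokens.clone()
labels[~mask_positions] = -100     # convention: ignore unmasked positions in the loss
print(inputs, labels, sep="\n")
```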
The GPT family study plan
These should be much easier to understand now
- GPT stands for Generative Pre-trained Transformer. It is a decoder-only model built around heavy pre-training, and there are a lot of methods for making that work
Study plan: GPT meaning → differences between GPT, GPT-2 & GPT-3+ (attention · masking · data · parallelisation) → scaling & nuances