Day 3 — Wed, Apr 29

Masking, padding, and BERT vs GPT

  • I now have a good enough understanding of multi-head attention. Today's flow: understanding masking → GPT and BERT → modern variations
  • Two types of masking in transformers

Encoder mask: padding mask

  • We usually train in batches, so shorter sequences are padded to a common length. A batch looks like this:
  • [The] [cat] [sat] [PAD] [PAD]
  • [I] [love] [cats] [PAD] [PAD]
  • The [PAD] tokens should contribute nothing, so we set their attention scores to negative infinity; after softmax those positions get exactly zero weight and don't contribute to the output
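The padding mask above can be sketched in NumPy (toy random scores, illustrative only — real models mask per-example inside the attention layer):

```python
import numpy as np

np.random.seed(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy sequence of 5 positions; the last two are [PAD].
# scores[i, j] = raw attention score of query i toward key j.
scores = np.random.randn(5, 5)
pad_mask = np.array([False, False, False, True, True])  # True at [PAD]

# Set scores toward PAD keys to -inf so softmax gives them weight 0.
masked = np.where(pad_mask[None, :], -np.inf, scores)
weights = softmax(masked, axis=-1)

print(weights[:, 3:])  # columns for the PAD keys are all zero
```

Each row still sums to 1 — the probability mass is simply redistributed over the real tokens.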

Decoder mask: causal mask

  • Blocks the model from attending to future tokens (it only ever predicts the next token), so the attention map becomes lower triangular
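The lower-triangular structure can be shown the same way (toy scores, a minimal sketch):

```python
import numpy as np

np.random.seed(0)
T = 4
scores = np.random.randn(T, T)

# Causal mask: position i may attend only to positions j <= i.
causal = np.tril(np.ones((T, T), dtype=bool))
masked = np.where(causal, scores, -np.inf)

# Row-wise softmax; the -inf entries become exactly zero weight.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# The strict upper triangle (future positions) is all zeros.
print(np.triu(weights, 1).sum())  # 0.0
```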

BERT vs GPT

  • BERT is an encoder-only model; GPT is a decoder-only model
  • BERT uses a bidirectional encoder: during pre-training, random tokens are masked and the model predicts them using context from both sides (random masking, i.e. masked language modelling). This makes it good for understanding context — classification, NER
  • GPT does next-token prediction: it always predicts only the next token, which is why generation proceeds one token at a time
  • Now I want to dive into GPT-2, GPT-3, and beyond, and understand how these models are made to work
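The two training objectives can be contrasted with a toy example (plain Python; the 15% masking rate matches BERT's paper, the rest is illustrative):

```python
import random

random.seed(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# BERT-style: randomly replace ~15% of tokens with [MASK]; the model
# predicts the originals using context from BOTH sides.
bert_input = [t if random.random() > 0.15 else "[MASK]" for t in tokens]

# GPT-style: the target is the same sequence shifted left by one —
# each position always predicts the NEXT token.
gpt_input, gpt_target = tokens[:-1], tokens[1:]

print(bert_input)
print(list(zip(gpt_input, gpt_target)))
```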

The GPT family study plan

This should be much easier to understand now

  • GPT is the Generative Pre-trained Transformer, a decoder-only model. It builds on large-scale pre-training, and there are many methods for doing this
  • Plan: what GPT means → differences between GPT, GPT-2 & GPT-3+ (attention, masking, data, parallelisation) → scaling & nuances