Day 2 — Tue, Apr 28

Multi-head attention and why we use dot products

  • Multi-head attention is about each token using the embeddings to figure out which other tokens it should attend to. Three things here: Q, K, and V
  • Query: what am I looking for? Key: what do I contain? Value: what do I actually share if someone attends to me? Each of these has its own vector embedding, and all of them are learnable (rough sketch below)
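
A rough sketch of those three projections in NumPy. Everything here is made up for illustration: the sizes are toy, and the W matrices are random stand-ins for weights a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 4, 8, 8          # toy sizes, chosen arbitrarily

# Stand-in token embeddings for a 4-token sentence
X = rng.normal(size=(seq_len, d_model))

# The three learnable projection matrices (random here, learned in a real model)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # "what am I looking for?"
K = X @ W_k   # "what do I contain?"
V = X @ W_v   # "what do I actually share if someone attends to me?"

print(Q.shape, K.shape, V.shape)         # (4, 8) (4, 8) (4, 8)
```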

The dot product and why it matters

How Q, K, V come together to form attention

  • After going through the 3B1B videos I want to put down a few things. A query would be a question (say, in a sentence): 'hey, how am I connected to the nouns?' A key would be the surrounding nouns answering 'I'm one of them'. A value would be how much that token actually moves the embedding of the token whose query its key matched
  • I know all this, but it's too basic; I still don't know how the scaling works
  • The equation: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V. The softmax output, before multiplying by V, is called the attention matrix
  • We have a dot product here. The dot product represents how much alignment there is between Q and K. Before softmax, we divide by √d_k to scale the scores down: higher dimensions produce larger dot products, which push softmax into tiny-gradient regions, so dividing by √d_k keeps things stable (there's a quick numeric check of this after the list). Softmax then converts those scaled scores into probabilities
  • Then we multiply the attention matrix by V. V is what each token actually shares when someone attends to it. And this is only a single attention head; we have multiple attention heads (a small sketch of single-head and multi-head attention follows after this list)
  • After the attention head I need to understand masked attention and current tricks
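
Quick numeric check of the scaling claim (toy numbers only, nothing from a real model): for random vectors with unit-variance entries, the standard deviation of q·k grows like √d_k, and dividing by √d_k brings it back to roughly 1, which keeps softmax away from its near-one-hot, tiny-gradient regime.

```python
import numpy as np

rng = np.random.default_rng(0)

for d_k in (4, 64, 1024):
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=1)            # raw dot products
    scaled = dots / np.sqrt(d_k)          # what actually goes into softmax
    print(f"d_k={d_k:5d}  std(q·k)={dots.std():6.1f}  after /sqrt(d_k)={scaled.std():4.2f}")
```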
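
And a minimal sketch of the full equation, a single head plus a multi-head loop, under the same caveat: names and sizes are illustrative, the "weights" are random, and a real implementation would also add masking and a final output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """One head: softmax(Q Kᵀ / √d_k) · V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # alignment between every query and every key, scaled down
    A = softmax(scores, axis=-1)      # the attention matrix; each row sums to 1
    return A @ V                      # each output row is a weighted mix of the value vectors

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_k = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))   # stand-in token embeddings

# One set of Q/K/V projections per head (random here, learned in a real model)
heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

# Concatenate the heads; a real model applies one more learned projection (W_O) here
multi_head = np.concatenate(heads, axis=-1)
print(multi_head.shape)   # (4, 8)
```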