Day 2 — Tue, Apr 28

Multi-head attention and why we use dot products

  • Multi-head attention is about each token using the embeddings to figure out which other tokens it should attend to. Three things here: Q, K, and V
  • Query: what am I looking for? Key: what do I contain? Value: what do I actually share if someone attends to me? Each of these has its own vector embedding, and all of them are learnable (rough sketch below)
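
A rough sketch of those three projections in NumPy. Everything here is made up for illustration: the sizes are toy, and the W matrices are random stand-ins for weights a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 4, 8, 8          # toy sizes, chosen arbitrarily

# Stand-in token embeddings for a 4-token sentence
X = rng.normal(size=(seq_len, d_model))

# The three learnable projection matrices (random here, learned in a real model)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # "what am I looking for?"
K = X @ W_k   # "what do I contain?"
V = X @ W_v   # "what do I actually share if someone attends to me?"

print(Q.shape, K.shape, V.shape)         # (4, 8) (4, 8) (4, 8)
```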

The dot product and why it matters

How Q, K, V come together to form attention

  • After going through the 3B1B videos I want to put down a few things. A query would be a question (say, in a sentence): 'hey, how am I connected to the nouns?' A key would be the surrounding nouns answering 'I'm one of them'. A value would be how much that token actually moves the embedding of the token whose query its key matched
  • I know all this, but it's too basic; I still don't know how the scaling works
  • The equation: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V. The softmax output, before multiplying by V, is called the attention matrix
  • We have a dot product here. The dot product represents how much alignment there is between Q and K. Before softmax, we divide by √d_k to scale the scores down: higher dimensions produce larger dot products, which push softmax into tiny-gradient regions, so dividing by √d_k keeps things stable (there's a quick numeric check of this after the list). Softmax then converts those scaled scores into probabilities
  • Then we multiply the attention matrix by V. V is what each token actually shares when someone attends to it. And this is only a single attention head; we have multiple attention heads (a small sketch of single-head and multi-head attention follows after this list)
  • After the attention head I need to understand masked attention and current tricks
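
Quick numeric check of the scaling claim (toy numbers only, nothing from a real model): for random vectors with unit-variance entries, the standard deviation of q·k grows like √d_k, and dividing by √d_k brings it back to roughly 1, which keeps softmax away from its near-one-hot, tiny-gradient regime.

```python
import numpy as np

rng = np.random.default_rng(0)

for d_k in (4, 64, 1024):
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=1)            # raw dot products
    scaled = dots / np.sqrt(d_k)          # what actually goes into softmax
    print(f"d_k={d_k:5d}  std(q·k)={dots.std():6.1f}  after /sqrt(d_k)={scaled.std():4.2f}")
```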
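
And a minimal sketch of the full equation, a single head plus a multi-head loop, under the same caveat: names and sizes are illustrative, the "weights" are random, and a real implementation would also add masking and a final output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """One head: softmax(Q Kᵀ / √d_k) · V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # alignment between every query and every key, scaled down
    A = softmax(scores, axis=-1)      # the attention matrix; each row sums to 1
    return A @ V                      # each output row is a weighted mix of the value vectors

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_k = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))   # stand-in token embeddings

# One set of Q/K/V projections per head (random here, learned in a real model)
heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

# Concatenate the heads; a real model applies one more learned projection (W_O) here
multi_head = np.concatenate(heads, axis=-1)
print(multi_head.shape)   # (4, 8)
```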