Day 6 — Sun, May 3

LLaMA, Mistral, and the road to reasoning models

  • Newer models follow the same base transformer. The question is which pathway to take: VLMs, reasoning models, MoE. I want to understand how it all connects at a high level first
  • Drew the model tree: encoder and decoder both come from the base transformer. Decoder branch → LLaMA/Mistral → then reasoning models, multimodals, MoE

Reasoning models

  • They don't just spit out a final answer. They do the thinking part first — generating the basic intermediate steps before answering
  • There are many techniques that make this happen. I should read the chain-of-thought paper (Wei et al., 2022) to understand this. Going there directly since that's where I want to go deeper
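To make the idea concrete before reading the paper: the minimal sketch below contrasts a direct prompt with a zero-shot chain-of-thought prompt. The question and trigger phrase are illustrative stand-ins, and the actual model call is omitted — this only shows the prompting difference.

```python
# Sketch of the chain-of-thought prompting idea: instead of asking for the
# answer directly, the prompt elicits intermediate reasoning steps.
# The model/API call is omitted; only the two prompt styles are built here.

question = ("A pen costs $2 and a notebook costs $3. "
            "How much do 2 pens and 3 notebooks cost?")

# Direct prompting: ask for the answer immediately.
direct_prompt = f"Q: {question}\nA:"

# Zero-shot CoT: append a trigger phrase that elicits step-by-step reasoning,
# so the model writes out intermediate steps before the final answer.
cot_prompt = f"Q: {question}\nA: Let's think step by step."

print(direct_prompt)
print(cot_prompt)
```

In few-shot CoT, the same effect is achieved by including worked examples (question, reasoning steps, answer) in the prompt instead of a trigger phrase.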

Multimodals and image generation

  • Multimodals are easier to understand. They convert embeddings from one modality into the language model's embedding space and feed them into the rest of the model. Similar to what I've done with CLIP
  • Image generation is a totally different thing. Not getting into diffusion models yet
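The multimodal note above can be sketched in a few lines: a vision encoder (e.g. CLIP-style) produces image embeddings in its own space, and a small projector maps them into the LLM's token-embedding space so the decoder sees one sequence of image tokens and text tokens. All dimensions and the random weights below are made-up stand-ins, not any real model's values.

```python
import numpy as np

rng = np.random.default_rng(0)

d_vision = 512    # vision-encoder embedding size (assumed)
d_model = 4096    # LLM hidden size (assumed)
n_patches = 16    # image patches coming out of the vision encoder

# Image embeddings in the vision encoder's space.
image_embeddings = rng.standard_normal((n_patches, d_vision))

# In a real model this projection is learned; here it's random, just to show
# the shape conversion from vision space to LLM embedding space.
W_proj = rng.standard_normal((d_vision, d_model)) * 0.02
image_tokens = image_embeddings @ W_proj

# Pretend these are 8 text-token embeddings from the LLM's embedding table.
text_tokens = rng.standard_normal((8, d_model))

# The decoder then processes one combined sequence: [image tokens, text tokens].
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(sequence.shape)  # (24, 4096)
```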

What LLaMA did

Smaller model that beat GPT-3 by training on ~1 trillion tokens (the Chinchilla insight)

  • Model size was smaller yet it outperformed GPT-3. Followed the Chinchilla paper's insight: train smaller models on more data
  • RMSNorm instead of LayerNorm
  • RoPE instead of learned positional embeddings
  • SwiGLU instead of standard FFN
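Two of these swaps are easy to sketch in numpy from their published formulas. RMSNorm drops LayerNorm's mean subtraction and bias, rescaling by the root-mean-square only; SwiGLU replaces the standard FFN's single activation with a SiLU-gated product. Shapes and weights below are toy stand-ins, not real model code.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square of each token's features.
    # Unlike LayerNorm there is no mean subtraction and no bias term.
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x).
    return x * (1.0 / (1.0 + np.exp(-x)))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU FFN: SiLU(x @ W_gate) gates (x @ W_up) elementwise,
    # replacing the standard FFN's single nonlinearity.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d, d_ff = 8, 16
x = rng.standard_normal((4, d))   # 4 tokens, hidden size 8 (toy values)

y = rms_norm(x, np.ones(d))
print(np.mean(y**2, axis=-1))     # per-token mean square is ~1 after RMSNorm

out = swiglu_ffn(y,
                 rng.standard_normal((d, d_ff)),
                 rng.standard_normal((d, d_ff)),
                 rng.standard_normal((d_ff, d)))
print(out.shape)                  # (4, 8) — same shape in and out
```

RoPE is the more involved of the three changes (it rotates query/key pairs by position-dependent angles), which is why it sits in its own sub-track below.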

What Mistral did differently

  • Doubled training data
  • Sliding window attention to extend context length
  • Added GQA (Grouped Query Attention) to shrink the KV cache — the technique LLaMA 2 70B used
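Both tricks above can be made concrete with a few lines of arithmetic. The KV-cache numbers below are illustrative (fp16, a roughly 7B-shaped model); only the ratio of query heads to KV heads matters for the savings. The sliding-window part just builds the attention mask to show which positions each token can see.

```python
import numpy as np

# --- GQA: why fewer KV heads shrink the KV cache ---
n_layers = 32      # illustrative model shape (assumed)
head_dim = 128
seq_len = 4096
bytes_per_val = 2  # fp16

def kv_cache_bytes(n_kv_heads):
    # 2x for K and V, stored per layer, per position, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val

mha_bytes = kv_cache_bytes(n_kv_heads=32)  # standard MHA: one KV head per query head
gqa_bytes = kv_cache_bytes(n_kv_heads=8)   # GQA: groups of query heads share a KV head
print(mha_bytes / gqa_bytes)               # 4.0 — KV cache cut by n_heads / n_kv_heads

# --- Sliding-window attention mask ---
# Each position attends only to itself and the previous W-1 positions,
# instead of the whole causal prefix; effective context grows across layers.
W, T = 4, 8
i = np.arange(T)[:, None]  # query positions
j = np.arange(T)[None, :]  # key positions
mask = (j <= i) & (j > i - W)  # causal AND within the window
print(mask.astype(int))
```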

Study map forward

  • Understand thinking (LLaMA, Mistral, paper overview) → Read chain of thought → Go deep into multihead + vision → Overview of diffusion
  • Sub-track: Understanding GQA & KV cache → Understanding RoPE & SwiGLU