Day 6 — Sun, May 3

LLaMA, Mistral, and the road to reasoning models

  • Newer models follow the same base transformer. The question is which pathway to take: VLMs, reasoning models, MoE. I want to understand how it all connects at a high level first
  • Drew the model tree: encoder and decoder both come from the base transformer. Decoder branch → LLaMA/Mistral → then reasoning models, multimodals, MoE

Reasoning models

  • They don't just spit out a final answer. They do the thinking part first — generating the basic intermediate steps before answering
  • There are many techniques that make this happen. I should read the chain-of-thought paper (Wei et al., 2022) to understand this. Going there directly since that's where I want to go deeper
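To make the idea concrete before reading the paper: the minimal sketch below contrasts a direct prompt with a zero-shot chain-of-thought prompt. The question and trigger phrase are illustrative stand-ins, and the actual model call is omitted — this only shows the prompting difference.

```python
# Sketch of the chain-of-thought prompting idea: instead of asking for the
# answer directly, the prompt elicits intermediate reasoning steps.
# The model/API call is omitted; only the two prompt styles are built here.

question = ("A pen costs $2 and a notebook costs $3. "
            "How much do 2 pens and 3 notebooks cost?")

# Direct prompting: ask for the answer immediately.
direct_prompt = f"Q: {question}\nA:"

# Zero-shot CoT: append a trigger phrase that elicits step-by-step reasoning,
# so the model writes out intermediate steps before the final answer.
cot_prompt = f"Q: {question}\nA: Let's think step by step."

print(direct_prompt)
print(cot_prompt)
```

In few-shot CoT, the same effect is achieved by including worked examples (question, reasoning steps, answer) in the prompt instead of a trigger phrase.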

Multimodals and image generation

  • Multimodals are easier to understand. They convert embeddings from one modality into the language model's embedding space and feed them into the rest of the model. Similar to what I've done with CLIP
  • Image generation is a totally different thing. Not getting into diffusion models yet
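The multimodal note above can be sketched in a few lines: a vision encoder (e.g. CLIP-style) produces image embeddings in its own space, and a small projector maps them into the LLM's token-embedding space so the decoder sees one sequence of image tokens and text tokens. All dimensions and the random weights below are made-up stand-ins, not any real model's values.

```python
import numpy as np

rng = np.random.default_rng(0)

d_vision = 512    # vision-encoder embedding size (assumed)
d_model = 4096    # LLM hidden size (assumed)
n_patches = 16    # image patches coming out of the vision encoder

# Image embeddings in the vision encoder's space.
image_embeddings = rng.standard_normal((n_patches, d_vision))

# In a real model this projection is learned; here it's random, just to show
# the shape conversion from vision space to LLM embedding space.
W_proj = rng.standard_normal((d_vision, d_model)) * 0.02
image_tokens = image_embeddings @ W_proj

# Pretend these are 8 text-token embeddings from the LLM's embedding table.
text_tokens = rng.standard_normal((8, d_model))

# The decoder then processes one combined sequence: [image tokens, text tokens].
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(sequence.shape)  # (24, 4096)
```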

What LLaMA did

Smaller model that beat GPT-3 by training on ~1 trillion tokens (the Chinchilla insight)

  • Model size was smaller yet it outperformed GPT-3. Followed the Chinchilla paper's insight: train smaller models on more data
  • RMSNorm instead of LayerNorm
  • RoPE instead of learned positional embeddings
  • SwiGLU instead of standard FFN
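Two of these swaps are easy to sketch in numpy from their published formulas. RMSNorm drops LayerNorm's mean subtraction and bias, rescaling by the root-mean-square only; SwiGLU replaces the standard FFN's single activation with a SiLU-gated product. Shapes and weights below are toy stand-ins, not real model code.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square of each token's features.
    # Unlike LayerNorm there is no mean subtraction and no bias term.
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x).
    return x * (1.0 / (1.0 + np.exp(-x)))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU FFN: SiLU(x @ W_gate) gates (x @ W_up) elementwise,
    # replacing the standard FFN's single nonlinearity.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d, d_ff = 8, 16
x = rng.standard_normal((4, d))   # 4 tokens, hidden size 8 (toy values)

y = rms_norm(x, np.ones(d))
print(np.mean(y**2, axis=-1))     # per-token mean square is ~1 after RMSNorm

out = swiglu_ffn(y,
                 rng.standard_normal((d, d_ff)),
                 rng.standard_normal((d, d_ff)),
                 rng.standard_normal((d_ff, d)))
print(out.shape)                  # (4, 8) — same shape in and out
```

RoPE is the more involved of the three changes (it rotates query/key pairs by position-dependent angles), which is why it sits in its own sub-track below.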

What Mistral did differently

  • Doubled training data
  • Sliding window attention to extend context length
  • Added GQA (Grouped Query Attention) to shrink the KV cache — the technique LLaMA 2 70B used
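Both tricks above can be made concrete with a few lines of arithmetic. The KV-cache numbers below are illustrative (fp16, a roughly 7B-shaped model); only the ratio of query heads to KV heads matters for the savings. The sliding-window part just builds the attention mask to show which positions each token can see.

```python
import numpy as np

# --- GQA: why fewer KV heads shrink the KV cache ---
n_layers = 32      # illustrative model shape (assumed)
head_dim = 128
seq_len = 4096
bytes_per_val = 2  # fp16

def kv_cache_bytes(n_kv_heads):
    # 2x for K and V, stored per layer, per position, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val

mha_bytes = kv_cache_bytes(n_kv_heads=32)  # standard MHA: one KV head per query head
gqa_bytes = kv_cache_bytes(n_kv_heads=8)   # GQA: groups of query heads share a KV head
print(mha_bytes / gqa_bytes)               # 4.0 — KV cache cut by n_heads / n_kv_heads

# --- Sliding-window attention mask ---
# Each position attends only to itself and the previous W-1 positions,
# instead of the whole causal prefix; effective context grows across layers.
W, T = 4, 8
i = np.arange(T)[:, None]  # query positions
j = np.arange(T)[None, :]  # key positions
mask = (j <= i) & (j > i - W)  # causal AND within the window
print(mask.astype(int))
```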

Study map forward

  • Understand thinking (LLaMA, Mistral, paper overview) → Read chain of thought → Go deep into multihead + vision → Overview of diffusion
  • Sub-track: Understanding GQA & KV cache → Understanding RoPE & SwiGLU