DK-009 — From Pretraining to MoE and Agentic Systems
🧩 MoE, Finetuning, RAG, and Training: How LLMs Are Actually Built
Modern LLM development is full of buzzwords.
Mixture-of-Experts (MoE).
Finetuning.
RAG.
Pretraining.
Most confusion comes from mixing what these techniques do with what people hope they do.
This chapter explains:
- what MoE really is
- what each LLM improvement method actually changes
- when each method works
- when it fundamentally cannot work
1️⃣ What Is Mixture-of-Experts (MoE)?
MoE is conditional computation.
Instead of activating the entire model for every token, MoE activates only a subset.
1.1 Dense Models (Baseline)
In a dense transformer:
$$ \mathbf{y} = f(\mathbf{x}; \theta) $$
All parameters are used every time.
Pros:
- simple
- stable
Cons:
- expensive
- hard to scale beyond hardware limits
1.2 MoE Core Idea
MoE decomposes the model:
- many experts: E_1, E_2, …, E_N
- a router (gating network)
For each token:
$$ \mathbf{y} = \sum_{i \in \mathcal{S}} g_i(\mathbf{x}) E_i(\mathbf{x}) $$
Where:
- S = selected experts (e.g. 8 out of 384)
- g_i = routing weights from the router
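To make the routing formula concrete, here is a minimal NumPy sketch. The sizes (16-dim tokens, 8 experts, top-2 routing) are toy values chosen for readability, not the "8 of 384" in the text above, and each "expert" is just a single matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 16, 8, 2           # toy sizes, not real-model values

# Each "expert" is a tiny linear layer; the router is another linear layer.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route one token vector x through its top-k experts."""
    logits = x @ router_w                       # router scores, shape (n_experts,)
    top = np.argsort(logits)[-top_k:]           # indices of the selected experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()   # softmax over selected
    # Weighted sum of the selected experts' outputs: y = Σ_{i ∈ S} g_i · E_i(x)
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)                 # (16,) — same shape, but only 2 of 8 experts ran
```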
1.3 Why MoE Exists
MoE lets us build:
- 1T parameter models
- while computing only ~30B parameters per token
This is how:
- GPT-style models scale
- training cost stays (barely) manageable
1.4 MoE Is Not Free
MoE introduces:
- routing instability
- load imbalance
- expert collapse
- communication overhead
MoE improves capacity, not intelligence.
2️⃣ Pretraining: The Foundation
Pretraining teaches a model language itself.
What it learns:
- grammar
- facts
- patterns
- implicit reasoning heuristics
What it does NOT learn:
- your private data
- your business rules
- real-time information
Pretraining is expensive and irreversible.
3️⃣ Finetuning: Changing Behavior, Not Knowledge
Finetuning continues training on new data:
$$ \theta' = \theta - \alpha \nabla_\theta \mathcal{L}_{\text{task}} $$
3.1 What Finetuning Is Good At
- style adaptation
- instruction following
- tone control
- domain biasing
Example:
- legal writing
- medical summarization
- customer support tone
3.2 What Finetuning Is Bad At
- memorizing large knowledge bases
- updating fast-changing facts
- precise retrieval
Finetuning compresses information into weights.
Compression causes forgetting.
3.3 LoRA: Efficient Finetuning
LoRA freezes base weights and adds low-rank adapters:
$$ W' = W + BA $$
Pros:
- cheap
- reversible
- modular
Cons:
- limited expressivity
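A minimal NumPy sketch of the idea. The shapes are toy values, and one factor is zero-initialized so the adapter starts as a no-op (this matches the usual LoRA initialization).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                                    # hidden size and LoRA rank (toy values)

W = rng.standard_normal((d, d)) * 0.02          # frozen base weight
A = rng.standard_normal((d, r)) * 0.02          # trainable low-rank factor
B = np.zeros((r, d))                            # zero-initialized: adapter starts as a no-op

def linear_with_lora(x):
    # Effectively W' = W + (low-rank update), computed without materializing it
    return x @ W + (x @ A) @ B

x = rng.standard_normal(d)
print(np.allclose(linear_with_lora(x), x @ W))  # True: training starts from base behavior
```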
4️⃣ Retrieval-Augmented Generation (RAG)
RAG separates:
- knowledge storage
- language generation
4.1 RAG Architecture
- Embed documents
- Store in vector database
- Retrieve relevant chunks
- Inject into prompt
- Generate answer
4.2 Embedding Step
Each chunk is mapped to an embedding vector.
Similarity:
$$ \text{sim}(\mathbf{q}, \mathbf{e}_i) = \cos(\mathbf{q}, \mathbf{e}_i) $$
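A minimal retrieval sketch. Random vectors stand in for a real embedding model here, so the similarity scores are only illustrative; in practice the query and the chunks would be embedded by the same model.

```python
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
chunks = ["Messi won the 2022 World Cup.",
          "Photosynthesis converts light into energy.",
          "Barcelona is a football club in Spain."]
chunk_vecs = rng.standard_normal((len(chunks), 8))           # stand-in for real embeddings
query_vec = chunk_vecs[0] + 0.1 * rng.standard_normal(8)     # a query "near" chunk 0

# Retrieve the top-k most similar chunks and inject them into the prompt.
scores = [cosine_sim(query_vec, v) for v in chunk_vecs]
top = np.argsort(scores)[::-1][:2]
context = "\n".join(chunks[i] for i in top)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: Who won the 2022 World Cup?"
print(prompt)
```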
4.3 Why RAG Works
- avoids catastrophic forgetting
- updates instantly
- preserves factual accuracy
- auditable sources
4.4 Why RAG Fails
RAG does NOT:
- reason across many documents
- plan multi-step solutions
- resolve contradictions
RAG retrieves text. The model still guesses.
5️⃣ Training vs Finetuning vs RAG (Truth Table)
| Method | Changes Weights | Adds Knowledge | Real-Time Update |
|---|---|---|---|
| Pretraining | ✅ | ✅ | ❌ |
| Finetuning | ✅ | ⚠️ | ❌ |
| LoRA | ⚠️ | ⚠️ | ❌ |
| RAG | ❌ | ✅ | ✅ |
No method replaces the others.
6️⃣ Which Method Is “Best”?
Wrong question.
The correct question:
What part of the system is broken?
6.1 Use Pretraining When
- building a foundation model
- massive compute available
- general capability needed
6.2 Use Finetuning When
- behavior is wrong
- format is wrong
- style is wrong
6.3 Use RAG When
- knowledge changes
- traceability matters
- data is large and sparse
6.4 Use Agents When
- tasks span time
- decisions affect future states
- planning is required
7️⃣ The Real LLM Stack (Systems View)
A real system looks like:
$$ \text{LLM} + \text{Memory} + \text{Retrieval} + \text{Tools} + \text{Control Loop} $$
No single technique is sufficient.
8️⃣ Final Reality Check
- MoE scales capacity, not understanding
- Finetuning shapes behavior, not truth
- RAG supplies facts, not reasoning
- Training teaches language, not agency
🧠 Closing Insight
Intelligence is not in the model.
Intelligence emerges from the system.
🧩 Vocabulary, Tokens, and Embeddings (The Real Basics)
Before talking about LLMs, agents, or reasoning,
we must understand how text becomes numbers.
Everything starts here.
1️⃣ Is One Vocabulary Entry One Word?
Not necessarily.
A vocabulary (vocab) is a list of tokens, not words.
A token can be:
- a full word
- part of a word
- punctuation
- a space
- a symbol
Example: “Lionel Messi”
Depending on the tokenizer, this can become:
"Lionel"+" Messi""Lion"+"el"+" Mess"+"i""Li"+"on"+"el"+" Mess"+"i"
So:
❌ 1 vocabulary entry ≠ 1 word
✅ 1 vocabulary entry = 1 token unit
2️⃣ What Is a Token?
A token is the smallest unit the model processes.
Formally:
$$ \text{text} \rightarrow \text{tokens} = \{t_1, t_2, \dots, t_n\} $$
Example sentence:
“Lionel Messi is the best footballer”
Might tokenize as:
["Lionel", " Messi", " is", " the", " best", " football", "er"]
The model never sees letters or words —
it sees token IDs.
Example:
"Lionel" → 18372
" Messi" → 91827
3️⃣ Why Tokens Instead of Words?
Because:
- languages differ
- words are ambiguous
- new words appear
Tokens allow:
- multilingual support
- efficient compression
- shared subwords across languages
Example:
- English: football
- Spanish: fútbol
- Name: Messi
The tokenizer learns reusable pieces.
4️⃣ From Token to Vector
Once we have token IDs, we convert them to vectors.
This is done using an embedding matrix.
Embedding Lookup
Each token ID maps to a vector:
$$ \text{Embedding}: \text{token ID} \rightarrow \mathbf{v} \in \mathbb{R}^d $$
Example (simplified):
"Lionel" → [0.12, -0.88, 0.34, ..., 0.05]
"Messi" → [0.91, 0.11, 0.77, ..., -0.42]
Typical dimensions:
- 768
- 1024
- 4096+
- even 8192+
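A minimal sketch of the lookup: the embedding matrix is just a table, and converting IDs to vectors is row indexing. Sizes and IDs below are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 8                  # toy sizes (real models: 100K+ vocab, 4096+ dims)

E = rng.standard_normal((vocab_size, d_model)) * 0.02   # embedding matrix, shape (|V|, d)

token_ids = [42, 7, 901]                       # output of a tokenizer (illustrative IDs)
vectors = E[token_ids]                         # embedding lookup is just row indexing

print(vectors.shape)                           # (3, 8): one d-dimensional vector per token
```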
5️⃣ What Is an Embedding?
An embedding is a vector that represents meaning.
It encodes:
- semantics
- relationships
- context potential
Important:
Embeddings do NOT store definitions
They store positions in meaning space
Lionel Messi in Embedding Space
Conceptually, “Messi” is close to:
- “football”
- “Argentina”
- “Barcelona”
- “GOAT”
and far from:
- “quantum mechanics”
- “cooking recipe”
- “neural network”
Distance matters.
6️⃣ Vector, Embedding, Latent Space (Same Family)
- Vector: a list of numbers
- Embedding: a vector with learned meaning
- Latent space: the space where embeddings live
Formally:
$$ \mathbf{v}_{\text{Messi}} \in \mathbb{R}^d $$
Where closeness implies semantic similarity.
7️⃣ What Does an Embedding Look Like?
It looks like nothing human-readable.
Example (tiny fake embedding):
Messi →
[ 0.83, -1.12, 0.44, 0.09, -0.31 ]
8️⃣ Key Takeaways (Do Not Skip)
- Vocabulary ≠ words
- Tokens are subword units
- Tokens become vectors
- Vectors live in embedding space
- Meaning = geometry, not text
🧠 Final Intuition
Humans read words.
Models navigate vector spaces.
🔍 Attention: How Vectors Talk to Each Other
After tokenization and embeddings,
we have vectors — but isolated vectors mean nothing.
Attention is the mechanism that lets vectors interact.
This chapter explains attention numerically and intuitively.
1️⃣ The Core Problem
We start with embeddings:
$$ \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n $$
Each vector represents a token.
Problem:
- each vector knows only itself
- language meaning depends on relationships
Example:
“Lionel Messi scored a goal”
The meaning of “scored” depends on “Messi”.
2️⃣ Attention Is Weighted Interaction
Attention answers one question:
Which other tokens matter for this token?
Mathematically:
$$ \text{Attention}(\mathbf{q}, \mathbf{k}, \mathbf{v}) $$
Where:
- Query (Q) = what I am looking for
- Key (K) = what I offer
- Value (V) = what I contribute
3️⃣ Creating Q, K, V
Each embedding is linearly projected:
$$ \mathbf{q}_i = W_Q \mathbf{x}_i $$
$$ \mathbf{k}_i = W_K \mathbf{x}_i $$
$$ \mathbf{v}_i = W_V \mathbf{x}_i $$
Same vector → three different roles.
4️⃣ Attention Scores (Who Listens to Whom)
For token ( i ), compute similarity to token ( j ):
$$ \alpha_{ij} = \frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_k}} $$
This measures relevance.
Example:
- “scored” attends strongly to “Messi”
- weakly to “a”
5️⃣ Softmax: Turning Scores into Weights
Normalize scores:
$$ w_{ij} = \frac{\exp(\alpha_{ij})}{\sum_{j'} \exp(\alpha_{ij'})} $$
Now:
- weights sum to 1
- attention becomes probabilistic focus
6️⃣ Mixing Values (Talking Happens Here)
Each token's new representation is a weighted sum of the value vectors:
$$ \mathbf{z}_i = \sum_j w_{ij} \mathbf{v}_j $$
Interpretation:
“Token i becomes a mixture of other tokens.”
This is contextualization.
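Putting sections 3–6 together, here is a minimal NumPy sketch of single-head self-attention with toy sizes. The weights are random, so the attention pattern is meaningless, but the shapes and steps match the formulas above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 8              # toy sizes

X = rng.standard_normal((n_tokens, d_model))   # token embeddings
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # project each token into three roles
scores = Q @ K.T / np.sqrt(d_k)                # α_ij: relevance of token j to token i
weights = softmax(scores, axis=-1)             # each row sums to 1
Z = weights @ V                                # z_i = Σ_j w_ij · v_j (contextualized tokens)

print(weights[1].round(2))                     # how token 1 distributes its attention
print(Z.shape)                                 # (5, 8)
```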
7️⃣ Lionel Messi Example
Sentence:
“Lionel Messi won the Ballon d’Or”
Token “won”:
- attends strongly to “Messi”
- moderately to “Ballon”
- weakly to “the”
Result:
- “won” now encodes who won what
8️⃣ Self-Attention vs Cross-Attention
- Self-attention: tokens attend to each other
- Cross-attention: tokens attend to external memory (e.g. RAG)
Same math. Different source.
9️⃣ Multi-Head Attention
One attention head is limited.
So we use many:
$$ \text{MultiHead} = \text{Concat}(\text{head}_1, \dots, \text{head}_h) $$
Each head learns:
- syntax
- coreference
- semantics
- long-range relations
🧠 Key Insight
Attention does not store meaning.
It routes meaning.
🧠 From Embeddings to Thought
Embeddings are static.
Thought is dynamic.
This chapter explains how repeated attention + transformation turns vectors into reasoning.
1️⃣ One Layer Is Not Thought
After one attention layer:
$$ \mathbf{H}^{(1)} = \text{Attention}(\mathbf{X}) $$
We get:
- contextualized vectors
- shallow understanding
But no reasoning yet.
2️⃣ Stacking Layers = Iterative Refinement
Transformers stack layers:
$$ \mathbf{H}^{(l+1)} = \text{FFN}(\text{Attention}(\mathbf{H}^{(l)})) $$
Each layer:
- refines representations
- integrates broader context
- abstracts meaning
3️⃣ Feedforward Networks (Nonlinearity)
FFN:
$$ \text{FFN}(\mathbf{x}) = W_2 \sigma(W_1 \mathbf{x}) $$
Purpose:
- mix features
- create nonlinear concepts
- enable abstraction
4️⃣ Thought as Trajectory in Latent Space
A “thought” is not a symbol.
It is a path:
$$ \mathbf{x}^{(0)} \rightarrow \mathbf{x}^{(1)} \rightarrow \dots \rightarrow \mathbf{x}^{(L)} $$
Each layer moves the vector through latent space.
5️⃣ Example: Simple Reasoning
Prompt:
“Messi is older than Neymar. Who is younger?”
Early layers:
- identify entities
Middle layers:
- encode comparison
Later layers:
- resolve answer
No rule engine.
Just geometry evolving.
6️⃣ Why This Feels Like Thinking
Because:
- representations become more abstract
- irrelevant details are suppressed
- relationships dominate
This mirrors human cognition functionally, not biologically.
7️⃣ Why It Sometimes Fails
Because:
- reasoning is approximate
- errors compound
- no explicit truth-checking
Thought is simulated, not guaranteed.
8️⃣ Chain-of-Thought Is Externalized Latent Process
CoT exposes:
$$ \text{hidden transformations} \rightarrow \text{text} $$
Open models allow this because:
- transparency
- debuggability
- research value
🧠 Final Insight
LLMs do not “think” symbolically.
They evolve representations.
🤖 Foundations of Agentic AI and Large Language Models
Artificial Intelligence today is no longer about single predictions.
Modern AI systems:
- reason
- plan
- act
- observe
- adapt
At the center of this shift is the combination of:
- Large Language Models (LLMs)
- Agentic workflows
- System-level design
This chapter is a true foundation, written for nerds who want to understand, not just use.
1️⃣ What Is an AI Agent?
An AI agent is not a model.
An agent is a system that uses a model to interact with an environment over time.
Formally, an agent:
- receives observations
- maintains internal state
- selects actions
- receives feedback
- updates itself
A minimal agent loop:
Observation → Reasoning → Action → Environment → Observation
Agent vs Model
| Model | Agent |
|---|---|
| Stateless | Stateful |
| Single output | Continuous loop |
| No tools | Tool-using |
| No memory | Memory-aware |
An LLM becomes agentic only when embedded in this loop.
2️⃣ What Does an Agent Actually Do?
An agent typically performs tasks like:
- searching information
- writing code
- running programs
- analyzing data
- making decisions
- coordinating tools
Example:
User: "Analyze my dataset and plot trends"
Agent:
1. Understand task
2. Plan steps
3. Load data
4. Run code
5. Inspect output
6. Adjust
7. Respond
This is goal-directed behavior.
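A minimal sketch of such a loop in Python. `call_llm` is a placeholder for whatever chat-completion API you use, the two tools are fake, and the plain-text tool-call format is an assumption made for brevity.

```python
def call_llm(messages):
    """Placeholder: in a real system this calls an LLM and returns its reply text."""
    raise NotImplementedError

TOOLS = {
    "load_data": lambda path: f"loaded {path}",      # fake tools for illustration
    "run_code":  lambda code: f"executed: {code}",
}

def run_agent(goal, max_steps=5):
    state = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = call_llm(state)                      # Reasoning
        state.append({"role": "assistant", "content": reply})
        if reply.startswith("FINAL:"):               # the agent decides it is done
            return reply
        tool, _, arg = reply.partition(" ")          # e.g. "run_code print(df.head())"
        observation = TOOLS.get(tool, lambda a: "unknown tool")(arg)           # Action
        state.append({"role": "user", "content": f"Observation: {observation}"})  # Observation
    return "stopped: step limit reached"
```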
3️⃣ What Is a Token?
LLMs do not see words.
They see tokens.
A token is a discrete unit produced by a tokenizer.
Example:
"Artificial intelligence is powerful"
→ ["Artificial", " intelligence", " is", " powerful"]
Sometimes:
- one word = one token
- one word = many tokens
- symbols, code, spaces are tokens
Are More Tokens Better?
No.
Token count affects:
- cost
- latency
- memory usage
What matters is information density, not raw token count.
4️⃣ Token → Vector → Embedding
Each token is mapped to a vector.
This mapping is called an embedding.
Formally:
$$ \text{Embedding}: \mathcal{V} \rightarrow \mathbb{R}^d $$
Where:
- V = vocabulary
- d = embedding dimension
What Is a Vector?
A vector is just a list of numbers:
$$ \mathbf{v} = [v_1, v_2, \dots, v_d] $$
Each dimension encodes latent features:
- syntax
- semantics
- style
- function
5️⃣ What Is Latent Space?
Latent space is the geometry of meaning.
In latent space:
- similar concepts are close
- different concepts are far apart
Distance is often measured by cosine similarity:
$$ \text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} $$
What Is “z”?
In ML papers, ( z ) usually denotes a latent variable.
In LLMs:
- embeddings
- hidden states
- attention outputs
are all forms of latent representations.
6️⃣ Probability in LLMs
LLMs model probability distributions over tokens.
At each step:
$$ P(x_t \mid x_1, x_2, \dots, x_{t-1}) $$
The model predicts:
“Given everything so far, what token is most likely next?”
Training minimizes cross-entropy loss:
$$ \mathcal{L} = -\sum_t \log P(x_t \mid x_{<t}) $$
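A minimal sketch of that loss for a single position, with a toy five-token vocabulary and made-up logits.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0, 0.1, 1.2])   # model's scores over a 5-token vocabulary
target_id = 0                                    # the token that actually came next

probs = softmax(logits)
loss = -np.log(probs[target_id])                 # cross-entropy for this position

print(probs.round(3))                            # predicted distribution over the vocab
print(round(loss, 3))                            # small when the true token gets high probability
```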
7️⃣ Why LLMs Feel “Smart”
Because reasoning emerges from:
- scale
- representation
- optimization
Not because the model “understands” like a human.
8️⃣ Architecture: Mixture-of-Experts (MoE)
MoE replaces one big network with many specialists.
Instead of using all parameters every time, the model selects a few experts per token.
Key Parameters Explained
Architecture: Mixture-of-Experts (MoE)
A sparse architecture with expert routing.
Total Parameters: 1T
$$ 1T = 1{,}000{,}000{,}000{,}000 $$
These include all experts combined.
Activated Parameters: 32B
Only a subset is used per token:
$$ \text{Compute} \propto 32B \ll 1T $$
This makes MoE scalable.
Number of Layers: 61
Total transformer layers.
Number of Dense Layers: 1
The first transformer layer uses a standard dense FFN instead of MoE routing.
Attention Hidden Dimension: 7168
Size of token representation inside attention.
MoE Hidden Dimension (per Expert): 2048
Each expert is smaller and specialized.
Number of Experts: 384
Total pool of experts.
Selected Experts per Token: 8
Router chooses 8 experts per token.
Number of Shared Experts: 1
Always-active expert for stability.
Vocabulary Size: 160K
Number of tokens the model understands.
Context Length: 256K
Maximum tokens in one forward pass.
Attention Mechanism: MLA
Multi-head Latent Attention, which compresses keys and values to keep long-context memory manageable.
Activation Function: SwiGLU
$$ \text{SwiGLU}(x) = (\text{Swish}(xW_1) \odot xW_2)W_3 $$
Smooth, stable, expressive.
Vision Encoder: MoonViT
Visual encoder for multimodal inputs.
Parameters of Vision Encoder: 400M
Separate vision model feeding into LLM.
9️⃣ What Tasks Are LLMs Trained For?
Primary objective:
- Next-token prediction
Emergent abilities:
- reasoning
- coding
- translation
- summarization
- planning
- tool use
These arise from generalization, not explicit programming.
🔟 How Are LLMs Evaluated?
Evaluation uses benchmarks:
- MMLU
- GSM8K (math)
- HumanEval (code)
- BIG-bench
- Agent task suites
Metrics include:
- accuracy
- pass@k
- reasoning depth
- tool success rate
1️⃣1️⃣ Hardware Required to Train Models
Training requires massive compute.
Example for 100B+ models:
- NVIDIA H100 GPUs
- Thousands of GPUs
- Millions of GPU-hours
Approximate relation:
$$ \text{Training Cost} \propto \text{Parameters} \times \text{Tokens} $$
1️⃣2️⃣ Why You Can Ask Anything and Get Answers
LLMs work because they learn:
- patterns
- abstractions
- relationships
They do not retrieve answers like a database.
They generate answers probabilistically.
1️⃣3️⃣ From LLMs to Agentic AI
An agent wraps the LLM with:
- memory
- tools
- control logic
- safety constraints
This transforms:
language modeling
into
decision-making
🧠 Final Mental Model
- Tokens are symbols
- Embeddings are meaning
- Latent space is geometry
- Probability drives generation
- MoE enables scale
- Agents enable action
🧭 Closing Thought
LLMs are not magic.
They are:
- mathematics
- optimization
- systems engineering
But when combined correctly, they form agentic AI systems.
And that is where modern AI truly begins.
🔤 Tokens, Embeddings, and How Language Becomes Numbers
Large Language Models do not understand words.
They do not understand sentences.
They do not understand languages.
They understand numbers.
This chapter explains—step by step—how:
- text becomes tokens
- tokens become vectors
- vectors become embeddings
- embeddings are trained
- multiple languages coexist in one model
This is the core mechanical foundation of LLMs.
1️⃣ What Is a Token?
A token is the smallest unit of text that a language model processes.
A token is not:
- necessarily a word
- necessarily a character
- necessarily a syllable
It is a unit defined by a tokenizer.
2️⃣ Examples: Words, Subwords, and Symbols
Consider the sentence:
"Language models are powerful."
A tokenizer might produce:
["Language", " models", " are", " powerful", "."]
Each item above is one token.
Subword Tokenization Example
Now consider:
"tokenization"
This may become:
["token", "ization"]
Or even:
["tok", "en", "ization"]
Why?
Because tokenizers optimize for frequency and efficiency, not linguistics.
3️⃣ Why Not Just Use Words?
Using full words causes problems:
- Vocabulary explodes
- Rare words are unseen
- New words cannot be handled
Instead, modern LLMs use subword tokenization.
4️⃣ Byte Pair Encoding (BPE)
Most LLMs use BPE or variants.
The idea is simple:
- Start with characters
- Merge frequent pairs
- Repeat until vocabulary size is reached
Informally, BPE greedily reduces the total number of tokens needed to encode the corpus:
$$ \text{Total Tokens} = \sum_{i} \text{Length}(x_i) $$
subject to a fixed vocabulary size.
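A minimal sketch of the greedy merge procedure on a classic toy word list; real BPE implementations work on word frequencies and byte-level symbols, which this omits.

```python
from collections import Counter

def merge_pair(word, a, b):
    """Replace every adjacent (a, b) pair in a symbol list with the merged symbol a+b."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(a + b); i += 2
        else:
            out.append(word[i]); i += 1
    return out

def bpe_merges(words, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = [list(w) for w in words]                 # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]           # most frequent adjacent pair
        merges.append(a + b)
        corpus = [merge_pair(w, a, b) for w in corpus]
    return merges, corpus

merges, corpus = bpe_merges(["lower", "lowest", "newer", "newest"], num_merges=6)
print(merges)    # learned subword units (order depends on pair frequencies)
print(corpus)    # each word re-expressed with the merged symbols
```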
5️⃣ Tokens Across Languages
LLMs do not have separate vocabularies per language.
They use one shared vocabulary.
Example:
| Language | Tokenization |
|---|---|
| English | Subwords |
| Thai | Character-like chunks |
| Chinese | Characters |
| Japanese | Mixed Kanji + Kana |
| Code | Keywords + symbols |
Example: Chinese
"人工智能"
→ ["人", "工", "智", "能"]
Each character is already meaningful, so tokenization is straightforward.
Example: Thai
Thai has no spaces:
"ภาษาไทยยากไหม"
Tokenizer output may look like:
["ภาษา", "ไทย", "ยาก", "ไหม"]
Learned statistically, not linguistically.
6️⃣ Vocabulary Size
Vocabulary size determines how many unique tokens exist.
Typical values:
- 32K
- 50K
- 100K
- 160K
- 200K+
A larger vocabulary means:
- fewer tokens per sentence
- larger embedding tables
- higher memory cost
7️⃣ From Token to ID
Each token is mapped to an integer ID.
Example:
"language" → 48321
This is just a lookup.
8️⃣ Token → Vector (Embedding)
Token IDs are mapped to vectors via an embedding matrix.
Formally:
$$ E \in \mathbb{R}^{|\mathcal{V}| \times d} $$
Where:
- |V| = vocabulary size
- d = embedding dimension
Embedding Lookup
Given token ID ( i ):
$$ \mathbf{e}_i = E[i] $$
This vector represents the token in continuous space.
9️⃣ What Is an Embedding?
An embedding is:
- a dense vector
- learned during training
- representing semantic and syntactic properties
🔟 What Is a Vector?
A vector is a list of real numbers:
$$ \mathbf{v} = [v_1, v_2, \dots, v_d] $$
Each dimension has no human-interpretable meaning alone.
Meaning emerges from relative geometry.
1️⃣1️⃣ What Is Latent Space?
Latent space is the space formed by embeddings.
In latent space:
- distance encodes similarity
- directions encode relationships
Distance is often measured by cosine similarity:
$$ \text{cosine}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} $$
1️⃣2️⃣ Tokens in Context: Position Matters
Embeddings alone ignore order.
Transformers add positional information:
$$ \mathbf{h}_t = \mathbf{e}_t + \mathbf{p}_t $$
Where:
- p_t = positional encoding for position t
This allows the model to distinguish:
"dog bites man"
vs
"man bites dog"
1️⃣3️⃣ How Tokens Are Used During Training
Training objective:
$$ P(x_t \mid x_1, \dots, x_{t-1}) $$
For a sequence:
"The cat sat"
Training pairs are:
"The" → "cat"
"The cat" → "sat"
Loss Function
Cross-entropy loss:
$$ \mathcal{L} = -\log P(x_t \mid x_{<t}) $$
The model is penalized if the correct next token has low probability.
1️⃣4️⃣ How Embeddings Are Learned
Embeddings are not pretrained separately.
They are learned end-to-end.
During backpropagation:
- gradients flow into embedding vectors
- frequent tokens update often
- rare tokens update less
This is why:
- common words are well-shaped
- rare words are noisier
1️⃣5️⃣ Multilingual Training
Training data mixes languages.
The model learns:
- shared structure (logic, syntax)
- language-specific patterns
This creates cross-lingual embeddings.
1️⃣6️⃣ Are More Tokens Better?
No.
More tokens means:
- more compute
- more memory
- slower inference
Better tokenization means:
- fewer tokens
- richer embeddings
Quality beats quantity.
1️⃣7️⃣ Summary Mental Model
- Text → tokens
- Tokens → IDs
- IDs → vectors
- Vectors → latent space
- Latent space → probability
- Probability → language generation
🧠 Final Intuition
Language models do not store sentences.
They store geometric relationships between tokens.
Meaning is not memorized.
It is emergent.
🧠 Modern Large Language Models (LLMs)
Large Language Models are no longer just “big neural networks that predict the next word”.
They are:
- Reasoning engines
- Tool-using agents
- Modular systems
- Open-weight infrastructures
This article is a deep but foundational recap of modern LLMs—written for people who already speak ML, but feel the ecosystem is moving too fast to track.
If you’ve ever asked:
“Wait… what exactly is MoE, LoRA, RAG, quantization, or agentic LLMs?”
This is for you.
1️⃣ What Does “Model Tree” Mean?
When browsing open models (e.g. on HuggingFace), you often see a structure like:
openai/gpt-oss-120b
├── Adapters
├── Finetunes
├── Merges
├── Quantizations
This is not chaos.
It is evolution.
Think of a model tree as a genetic family:
- Base model → the pretrained brain
- Adapters → plug-in skills
- Finetunes → retrained personalities
- Merges → hybrid offspring
- Quantizations → compressed forms
2️⃣ Base Model: gpt-oss-120b
Model size: 120B parameters
Tensor type: BF16 / U8
License: Apache 2.0
What is “120B parameters”?
A parameter is a learned scalar value inside the neural network.
120B = 120,000,000,000 parameters
Memory footprint (roughly):
$$ \text{Memory} \approx \text{Parameters} \times \text{Bytes per parameter} $$
- BF16 → ~2 bytes → ~240 GB
- FP32 → ~4 bytes → ~480 GB
This is why compression, sharding, and MoE exist.
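A quick back-of-envelope check of those numbers (weights only; activations, KV cache, and optimizer state are ignored):

```python
params = 120e9                      # 120B parameters

for name, bytes_per_param in [("FP32", 4), ("BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:,.0f} GB")  # FP32 ≈ 480 GB, BF16 ≈ 240 GB, INT8 ≈ 120 GB, INT4 ≈ 60 GB
```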
3️⃣ Tensor Types: BF16 vs U8
| Type | Meaning | Usage |
|---|---|---|
| BF16 | Brain Floating Point 16 | Training / high-quality inference |
| FP16 | IEEE Half Precision | Legacy GPU inference |
| U8 / INT8 | 8-bit Integer | Fast & cheap inference |
| INT4 | 4-bit Integer | Extreme compression |
Quantization trades precision for efficiency.
$$ \text{Smaller precision} \Rightarrow \text{Less memory} \Rightarrow \text{Faster inference} $$
4️⃣ Adapters and LoRA (Low-Rank Adaptation)
Adapters are modular fine-tuning layers.
LoRA works by injecting a low-rank update:
$$ W' = W + \Delta W $$
Where:
$$ \Delta W = A B \quad \text{with} \quad \operatorname{rank}(\Delta W) \le r \ll \operatorname{rank}(W) $$
Why LoRA matters
- Base model is frozen
- Only a few million parameters trained
- Easy to distribute
- Easy to swap
This is why open-source LLMs scale socially.
5️⃣ Finetuning vs Pretraining
Pretraining
Pretraining teaches the model language itself:
$$ \mathcal{L} = -\sum_{t} \log P(x_t \mid x_{<t}) $$
- Trillions of tokens
- Next-token prediction
- Costs millions of GPU-hours
- One-time process
Finetuning
Finetuning teaches the model how to behave.
- Instruction following
- Reasoning style
- Domain specialization
- Safety alignment
This is where “chat”, “assistant”, and “expert” personalities come from.
6️⃣ Quantized Models
Quantized models are the same brain, cheaper hardware.
Examples:
- 120B → INT8 → fits on multi-GPU servers
- 7B → INT4 → fits on a laptop GPU
Trade-off:
$$ \text{Compression} \uparrow \Rightarrow \text{Accuracy} \downarrow \ (\text{slightly}) $$
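A minimal sketch of symmetric per-tensor INT8 quantization; production quantizers use per-channel or per-group scales, calibration data, and lower-bit formats, none of which appear here.

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor INT8 quantization: store int8 weights plus one float scale."""
    scale = np.abs(W).max() / 127.0
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_q, scale

def dequantize(W_q, scale):
    return W_q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)

W_q, scale = quantize_int8(W)
error = np.abs(W - dequantize(W_q, scale)).mean()

print(W.nbytes, "->", W_q.nbytes, "bytes")          # 4x smaller storage
print("mean abs error:", round(float(error), 4))    # small, but not zero: precision is traded away
```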
7️⃣ What is gpt-oss-safeguard-120b?
This is not the main language model.
It is a safety enforcement model.
Responsibilities:
- Input filtering
- Output moderation
- Policy enforcement
- Risk classification
In proprietary APIs, this layer is invisible.
In open-weight systems, you must build it yourself.
8️⃣ Mixture-of-Experts (MoE) Architecture
MoE is the defining architecture of modern large-scale LLMs.
Dense model:
Every parameter is used every time.
MoE model:
Only some experts activate per token.
MoE at scale
| Property | Value |
|---|---|
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Experts | 384 |
| Experts per Token | 8 |
This means:
$$ \text{Compute cost} \propto 32B \quad \text{not} \quad 1T $$
Intuition
Instead of:
“Use the whole brain for everything”
We get:
“Call 8 specialists per word”
9️⃣ Architecture Details Explained
Attention Heads
64 heads
Each head attends to different relationships:
- Syntax
- Semantics
- Long-range dependencies
SwiGLU Activation
$$ \text{SwiGLU}(x) = (\text{Swish}(xW_1) \odot xW_2)W_3 $$
Why it’s used:
- Smooth gradients
- Better expressivity
- Stable training
Context Length: 256K
This enables:
- Whole-codebase reasoning
- Long legal / medical documents
- Multi-step agent planning
🔟 Tokenization (o200k_harmony)
Vocabulary size:
$$ |\mathcal{V}| \approx 200{,}000 $$
This includes:
- Natural language
- Code tokens
- Math symbols
- Tool-call syntax
- Agent control tokens
Tokenizer design is not trivial—it directly affects reasoning quality.
1️⃣1️⃣ RAG (Retrieval-Augmented Generation)
RAG adds external memory:
User → Retriever → Documents → LLM → Answer
Strengths:
- Fresh knowledge
- Enterprise data
- Auditable sources
Limitations:
- Weak for deep reasoning
- Brittle retrieval
- Latency overhead
RAG is evolving—not obsolete.
1️⃣2️⃣ Agentic LLMs
Modern LLMs are not single-shot generators.
They operate in loops:
Plan → Act → Observe → Reflect → Repeat
Tools include:
- Web search
- Python execution
- Databases
- APIs
This is where LLMs become systems, not models.
1️⃣3️⃣ Putting It All Together
A modern LLM stack looks like:
Pretrained MoE LLM
↓
Finetune / LoRA
↓
Quantization
↓
RAG + Tools
↓
Agent Loop
↓
Safety Guards
🧠 Final Thoughts
If LLMs feel overwhelming, that’s because:
They are no longer just models.
They are:
- Architectures
- Systems
- Ecosystems
- Infrastructures
Understanding them today means thinking like:
- A machine learning researcher
- A systems engineer
- A product architect
And yes — the pace is brutal.
But now, you’re back in control.
🤖 Agentic Large Language Models
Modern LLMs are no longer passive text generators.
They are agents.
They:
- Plan actions
- Call tools
- Observe results
- Reflect and revise
- Loop until completion
Understanding agentic workflows is now a core literacy for anyone working with LLMs.
1️⃣ What Does “Agentic” Actually Mean?
An agentic LLM is a model embedded inside a control loop.
At minimum, the loop looks like:
Thought → Action → Observation → Thought → ...
This is not metaphorical.
It is a programmatic execution cycle.
The canonical agent loop
1. Parse user intent
2. Plan intermediate steps
3. Decide which tool to call
4. Execute tool
5. Observe results
6. Update internal state
7. Continue or stop
This loop transforms LLMs from:
“Answer machines”
into:
“Task-completing systems”
2️⃣ ReAct, Plan-Act-Reflect, and Toolformers
Most modern agents descend from three ideas:
🔹 ReAct (Reason + Act)
The model alternates between reasoning and acting.
Thought: I need recent data.
Action: WebSearch(query="...")
Observation: ...
Reasoning grounds tool usage.
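A minimal sketch of a ReAct-style loop. `call_llm`, the `WebSearch` tool, and the plain-text `Action:` format are placeholders; real agents usually exchange structured tool-call messages rather than regex-parsed text.

```python
import re

def call_llm(transcript):
    """Placeholder for an LLM call that continues the ReAct transcript."""
    raise NotImplementedError

TOOLS = {"WebSearch": lambda q: f"(search results for {q!r})"}   # fake tool for illustration

def react(question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)                      # model emits Thought / Action / Answer
        transcript += step + "\n"
        if "Answer:" in step:
            return step.split("Answer:", 1)[1].strip()
        match = re.search(r'Action:\s*(\w+)\((.*)\)', step)   # e.g. Action: WebSearch("...")
        if match:
            tool, arg = match.group(1), match.group(2).strip('"')
            obs = TOOLS.get(tool, lambda a: "unknown tool")(arg)
            transcript += f"Observation: {obs}\n"        # fed back for the next Thought
    return None
```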
🔹 Plan–Act–Reflect
More structured agent loop:
Plan → Act → Observe → Reflect → Re-plan
Reflection is critical:
- Error correction
- Long-horizon reasoning
- Self-debugging
🔹 Toolformers
LLMs trained to decide when tools are useful, not just how to use them.
This is why modern models expose:
- Web search
- Python
- File systems
- APIs
3️⃣ Why Chain-of-Thought (CoT) Matters
Chain-of-Thought is not verbosity.
It is externalized intermediate computation.
Formally, CoT approximates:
$$ P(y \mid x) = \sum_{z} P(y \mid z, x) P(z \mid x) $$
Where:
- ( x ) = input
- ( z ) = latent reasoning steps
- ( y ) = output
4️⃣ Why Open Models Expose CoT
This is a philosophical and architectural difference.
Proprietary models
- CoT is hidden or summarized
- Exposed reasoning is filtered
- System-level alignment is enforced upstream
Open-weight models
- CoT is part of the artifact
- Debuggable
- Inspectable
- Modifiable
This is intentional.
Why the open community wants CoT
- Debugging
- Inspect failure modes
- Research
- Analyze reasoning depth
- Alignment
- Study safety trade-offs
- Education
- Teach reasoning, not just answers
Open models treat CoT as:
a feature, not a liability
5️⃣ Adjustable Reasoning Effort
Modern reasoning models expose a control variable:
- Think fast (cheap)
- Think slow (deep)
Conceptually:
$$ \text{Compute} \propto \text{Reasoning Depth} $$
This enables:
- Cost-aware deployment
- Adaptive intelligence
- Agent-level optimization
6️⃣ Why CoT Is Risky — and Still Open
CoT can leak:
- Sensitive heuristics
- Attack strategies
- Unsafe reasoning paths
Open models accept this risk because:
- They prioritize transparency
- Safety is enforced at the system level, not hidden logic
This shifts responsibility to the system designer.
7️⃣ Mapping the Modern LLM Ecosystem
Let’s zoom out.
🔵 OpenAI
Philosophy: System-level safety, agent-first APIs
- Strong reasoning models
- Deep tool integration
- Hidden CoT by default
- Heavy alignment layers
Strengths:
- Production-grade agents
- Robust safety
- Best-in-class reasoning
Trade-off:
- Limited transparency
- No weight access
🟣 Meta (LLaMA family)
Philosophy: Open weights, scalable infrastructure
- Dense + MoE research
- Strong multilingual support
- Community-driven fine-tuning
Strengths:
- Foundation for OSS ecosystem
- Research-friendly
- Broad adoption
Trade-off:
- Safety is DIY
- Tooling varies by implementation
🟢 Mistral
Philosophy: Efficiency + elegance
- MoE-first designs
- Strong small/medium models
- European regulatory awareness
Strengths:
- High performance per parameter
- Clean architecture
- Excellent for on-prem
Trade-off:
- Smaller ecosystem (for now)
⚫ Open-Source Community (OSS)
This is not one actor — it is an ecosystem.
Includes:
- Weight merges
- Custom LoRA adapters
- Experimental architectures
- Specialized agents
OSS prioritizes:
- Transparency
- Modularity
- Hackability
Risk:
- Inconsistent safety
- Fragmentation
8️⃣ Comparative Mental Model
| Axis | OpenAI | Meta | Mistral | OSS |
|---|---|---|---|---|
| Weights | Closed | Open | Open | Open |
| CoT | Hidden | Exposed | Exposed | Exposed |
| Safety | Centralized | Optional | Optional | DIY |
| Agents | Native | External | External | Experimental |
| Research | Controlled | Open | Focused | Chaotic |
9️⃣ Why This Ecosystem Exists
No single model can optimize for:
- Safety
- Transparency
- Performance
- Cost
- Control
Different actors choose different trade-offs.
This diversity is healthy.
🔟 The New Role of the Practitioner
Working with LLMs today means you are no longer just a user.
You are:
- A system designer
- A safety engineer
- A reasoning architect
Understanding agentic workflows and CoT exposure is mandatory.
🧠 Final Synthesis
LLMs are no longer:
“Models you query”
They are:
“Systems you design”
Agentic workflows provide agency.
Chain-of-Thought provides cognition.
Open ecosystems provide freedom.
And freedom always comes with responsibility.
🧠 Attention, Scaling Laws, and the Emergence of Reasoning
If embeddings explain what language means,
attention explains how meaning is composed.
This chapter answers five fundamental questions:
- What is attention—numerically?
- Why self-attention works so well
- How training dynamics and scaling laws shape intelligence
- When tokenization breaks models
- How embeddings turn into reasoning
This is where LLMs stop being “vector machines”
and start behaving like reasoning systems.
1️⃣ Attention Explained with Numbers
Attention is a weighted averaging mechanism.
Each token decides:
“Which other tokens matter to me right now?”
Step 1: From Embeddings to Q, K, V
For each token embedding:
$$ \mathbf{q} = \mathbf{x}W_Q,\quad \mathbf{k} = \mathbf{x}W_K,\quad \mathbf{v} = \mathbf{x}W_V $$
Where:
- q = query
- k = key
- v = value
All are vectors.
Step 2: Similarity Scores
For token ( i ) attending to token ( j ):
$$ s_{ij} = \frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_k}} $$
This measures relevance.
Step 3: Softmax = Probability Distribution
$$ \alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{j'} \exp(s_{ij'})} $$
Attention is probabilistic focus.
Step 4: Weighted Sum
$$ \mathbf{z}_i = \sum_j \alpha_{ij} \mathbf{v}_j $$
The output representation of token ( i ) is a mixture of the value vectors of the tokens it attends to.
2️⃣ Why Self-Attention Works
Self-attention allows every token to:
- access global context
- dynamically reweight importance
- adapt per task and per position
This solves three core problems at once:
- long-range dependency
- variable structure
- parallel computation
Key Insight
Self-attention is content-addressable memory.
Instead of indexing by position, tokens index by meaning.
3️⃣ Multi-Head Attention: Many Views of Meaning
In practice, attention is multi-headed.
$$ \text{Attention} = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W_O $$
Each head learns different relationships:
- syntax
- semantics
- coreference
- arithmetic
- code structure
This is distributed reasoning.
4️⃣ Why Attention Enables Reasoning
Reasoning requires:
- variable binding
- relational comparison
- composition
Attention enables:
$$ \text{Reasoning} \approx \text{Iterative Context Mixing} $$
Each layer refines representations by recontextualizing tokens.
5️⃣ Training Dynamics: How Models Actually Learn
LLMs are trained with gradient descent.
Each update minimizes:
$$ \mathcal{L} = -\sum_t \log P(x_t \mid x_{<t}) $$
Learning emerges from:
- many small updates
- massive data
- overparameterization
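A minimal sketch of one such update in PyTorch. The "model" here is just an embedding plus a linear head, not a transformer; only the training step itself (loss, backward, optimizer step) is the point.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model = 100, 32
model = nn.Sequential(nn.Embedding(vocab, d_model), nn.Linear(d_model, vocab))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

tokens = torch.randint(0, vocab, (1, 16))          # one toy sequence of 16 token IDs
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # predict each next token from its prefix

logits = model(inputs)                             # (1, 15, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))

loss.backward()                                    # gradients of -Σ log P(x_t | x_<t)
optimizer.step()                                   # one small update; repeat billions of times
optimizer.zero_grad()
```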
Optimization Intuition
Early training:
- learns token statistics
Mid training:
- learns syntax and patterns
Late training:
- learns abstractions and reasoning heuristics
6️⃣ Scaling Laws
Empirically, loss falls as a power law in each resource (when the others are not the bottleneck).
$$ \mathcal{L}(N) \propto N^{-\alpha_N}, \quad \mathcal{L}(D) \propto D^{-\alpha_D}, \quad \mathcal{L}(C) \propto C^{-\alpha_C} $$
Where:
- ( N ) = parameters
- ( D ) = training tokens
- ( C ) = compute
Consequence
- Bigger models → better reasoning
- More data → better generalization
- More compute → smoother optimization
There is no sharp intelligence threshold—only scale.
7️⃣ Why Bigger Models Reason Better
Large models:
- store more abstractions
- represent deeper hierarchies
- maintain longer dependencies
Reasoning is not programmed. It emerges when capacity is sufficient.
8️⃣ When Tokenization Goes Wrong
Tokenization is a silent failure mode.
Bad tokenization causes:
- excessive token counts
- broken morphemes
- semantic fragmentation
Example: Over-Fragmentation
"electromagnetism"
→ ["elec", "tro", "mag", "net", "ism"]
Meaning is diluted across tokens.
Multilingual Failure
Low-resource languages may:
- use many tokens per word
- receive fewer gradient updates
- have poorer embeddings
This directly harms performance.
9️⃣ Tokenization and Reasoning Errors
Reasoning depends on stable symbols.
If numbers, variables, or operators are split poorly:
- math fails
- code fails
- logic fails
This is why modern tokenizers:
- include digits
- include operators
- include code tokens
🔟 From Embeddings to Reasoning
Embeddings alone do not reason.
Reasoning emerges from:
- attention
- depth
- recurrence across layers
Each layer computes:
$$ \mathbf{H}^{(l+1)} = \text{TransformerBlock}(\mathbf{H}^{(l)}) $$
This is iterative refinement.
Reasoning as Trajectory in Latent Space
A reasoning chain is a path:
$$ \mathbf{z}_0 \rightarrow \mathbf{z}_1 \rightarrow \dots \rightarrow \mathbf{z}_T $$
Each step refines belief.
1️⃣1️⃣ Why Chain-of-Thought Helps
Explicit reasoning externalizes latent steps.
It:
- stabilizes trajectories
- reduces entropy
- improves correctness
But the real reasoning happens inside the vectors.
1️⃣2️⃣ Summary Mental Model
- Tokens are symbols
- Embeddings are points
- Attention is interaction
- Layers are refinement
- Scale is capacity
- Reasoning is emergence
🧠 Final Intuition
LLMs do not “think” like humans.
They:
- transform vectors
- mix context
- optimize probabilities
Yet from this process, reasoning emerges.
That is the core miracle of modern AI.
🔄 Why Transformers Replace RNNs Forever
Recurrent Neural Networks (RNNs) were once the backbone of sequence modeling.
Transformers ended that era.
This chapter explains why this replacement is permanent, not a trend.
1️⃣ What RNNs Were Trying to Solve
Language is sequential.
RNNs model sequences by recurrence:
$$ \mathbf{h}_t = f(\mathbf{h}_{t-1}, \mathbf{x}_t) $$
This looks elegant:
- memory through hidden state
- time-aware processing
But elegance does not scale.
2️⃣ The Fundamental Limits of RNNs
2.1 Vanishing and Exploding Gradients
Backpropagation through time multiplies Jacobians.
This product either:
- shrinks to zero
- explodes to infinity
No architecture tweak fully fixes this.
2.2 Sequential Bottleneck
RNNs must compute:
$$ \mathbf{h}_1 \rightarrow \mathbf{h}_2 \rightarrow \dots \rightarrow \mathbf{h}_T $$
This is inherently serial.
GPUs hate serial computation.
2.3 Memory is Compressed Too Early
RNNs force all past context into a fixed-size vector.
This causes:
- information loss
- interference
- forgetting long-range dependencies
3️⃣ Why Attention Is a Structural Upgrade
Transformers remove recurrence entirely.
$$ \mathbf{H}^{(l+1)} = \text{Attention}(\mathbf{H}^{(l)}) $$
Key properties:
- full context access
- parallel computation
- content-based memory
4️⃣ Attention vs Recurrence: A Direct Comparison
| Property | RNN | Transformer |
|---|---|---|
| Memory access | Compressed | Explicit |
| Parallelism | ❌ | ✅ |
| Long-range dependency | Weak | Strong |
| Training stability | Fragile | Stable |
| Scaling behavior | Poor | Excellent |
5️⃣ Why Transformers Scale and RNNs Do Not
Scaling requires:
- predictable gradients
- efficient hardware use
- stable optimization
Transformers satisfy all three.
RNNs satisfy none.
6️⃣ The Death of Inductive Bias
RNNs hard-code temporal order.
Transformers learn structure from data.
This flexibility allows:
- language
- code
- math
- vision
- multimodal reasoning
One architecture. Many domains.
7️⃣ Final Verdict
Transformers did not replace RNNs because they are newer.
They replaced RNNs because they are:
- structurally superior
- computationally aligned with modern hardware
- compatible with scale
This replacement is irreversible.
Transformers are not better RNNs.
They are a different species entirely.
⚠️ Failure Modes of Reasoning Models
LLMs can reason.
But they can also fail—quietly, confidently, and convincingly.
This chapter dissects why reasoning models fail, even at large scale.
1️⃣ Reasoning Is Approximate Inference
LLMs estimate:
$$ P(x_t \mid x_{<t}) $$
They do not verify truth. They maximize likelihood.
This creates systematic failure modes.
2️⃣ Hallucination as Probability Maximization
Hallucination occurs when:
$$ \arg\max_x P(x \mid \text{context}) \neq \text{truth} $$
If the model has seen similar patterns, it may confidently invent details.
3️⃣ Shortcut Reasoning
Models often learn:
- surface heuristics
- dataset biases
- shallow correlations
Instead of reasoning:
“This looks like problem type X, answer is usually Y.”
This works—until it doesn’t.
4️⃣ Chain-of-Thought Collapse
Long reasoning chains can drift.
Each step compounds error:
$$ \epsilon_{\text{total}} \approx \sum_t \epsilon_t $$
This leads to:
- incorrect conclusions
- internally consistent nonsense
5️⃣ Symbolic Fragility
LLMs struggle with:
- exact arithmetic
- variable binding
- stateful reasoning
Why?
Because symbols are distributed, not discrete.
6️⃣ Out-of-Distribution Reasoning
Reasoning degrades sharply when:
- assumptions shift
- constraints change
- rules are inverted
LLMs interpolate well. They extrapolate poorly.
7️⃣ Alignment vs Reasoning Tension
Safety training can:
- suppress exploration
- bias outputs
- reduce uncertainty expression
This can mask reasoning errors instead of fixing them.
8️⃣ Summary of Failure Modes
| Failure | Root Cause |
|---|---|
| Hallucination | Likelihood ≠ Truth |
| Logical error | Approximate inference |
| Overconfidence | Entropy minimization |
| Math failure | Symbolic mismatch |
| OOD collapse | Lack of world grounding |
🧠 Key Insight
LLMs reason statistically, not causally.
Understanding failure modes is not a weakness— it is a prerequisite for building better systems.
Reasoning models are powerful—but not infallible.