DK-009 — From Pretraining to MoE and Agentic Systems


🧩 MoE, Finetuning, RAG, and Training: How LLMs Are Actually Built

Modern LLM development is full of buzzwords.

Mixture-of-Experts (MoE).
Finetuning.
RAG.
Pretraining.

Most confusion comes from mixing what these techniques do with what people hope they do.

This chapter explains:

  • what MoE really is
  • what each LLM improvement method actually changes
  • when each method works
  • when it fundamentally cannot work

1️⃣ What Is Mixture-of-Experts (MoE)?

MoE is conditional computation.

Instead of activating the entire model for every token, MoE activates only a subset.


1.1 Dense Models (Baseline)

In a dense transformer:

$$ \mathbf{y} = f(\mathbf{x}; \theta) $$

All parameters are used every time.

Pros:

  • simple
  • stable

Cons:

  • expensive
  • hard to scale beyond hardware limits

1.2 MoE Core Idea

MoE decomposes the model:

  • many experts: E_1, E_2, …, E_N
  • a router (gating network)

For each token:

$$ \mathbf{y} = \sum_{i \in \mathcal{S}} g_i(\mathbf{x}) E_i(\mathbf{x}) $$

Where:

  • S = the set of selected experts (e.g. 8 out of 384)
  • g_i(x) = routing weights produced by the gating network
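
A minimal sketch of this routing step in NumPy. The sizes, weights, and top-k value below are illustrative toys, not taken from any real model:

import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 8, 4, 2                        # toy sizes; real models use hundreds of experts

expert_W = [rng.normal(size=(d, d)) for _ in range(n_experts)]   # each "expert" is a tiny linear map here
gate_W = rng.normal(size=(d, n_experts))                          # the router / gating network

def moe_layer(x):
    logits = x @ gate_W                          # one routing score per expert
    top = np.argsort(logits)[-k:]                # indices of the k highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                 # softmax over the selected experts only
    return sum(wi * (x @ expert_W[i]) for wi, i in zip(w, top))

print(moe_layer(rng.normal(size=d)).shape)       # (8,) -- only k of the n_experts did any work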

1.3 Why MoE Exists

MoE lets us build:

  • 1T parameter models
  • while computing only ~30B parameters per token

This is how:

  • GPT-style models scale
  • training cost stays (barely) manageable

1.4 MoE Is Not Free

MoE introduces:

  • routing instability
  • load imbalance
  • expert collapse
  • communication overhead

MoE improves capacity, not intelligence.


2️⃣ Pretraining: The Foundation

Pretraining teaches a model language itself.

What it learns:

  • grammar
  • facts
  • patterns
  • implicit reasoning heuristics

What it does NOT learn:

  • your private data
  • your business rules
  • real-time information

Pretraining is expensive and irreversible.


3️⃣ Finetuning: Changing Behavior, Not Knowledge

Finetuning continues training on new data:

$$ \theta' = \theta - \alpha \nabla_\theta \mathcal{L}_{\text{task}} $$


3.1 What Finetuning Is Good At

  • style adaptation
  • instruction following
  • tone control
  • domain biasing

Example:

  • legal writing
  • medical summarization
  • customer support tone

3.2 What Finetuning Is Bad At

  • memorizing large knowledge bases
  • updating fast-changing facts
  • precise retrieval

Finetuning compresses information into weights.

Compression causes forgetting.


3.3 LoRA: Efficient Finetuning

LoRA freezes base weights and adds low-rank adapters:

$$ W' = W + BA $$

Pros:

  • cheap
  • reversible
  • modular

Cons:

  • limited expressivity
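
A minimal NumPy sketch of the LoRA update W' = W + BA (the dimensions and rank below are illustrative):

import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 16, 2           # rank r is much smaller than the weight dimensions

W = rng.normal(size=(d_out, d_in))   # frozen base weight
B = np.zeros((d_out, r))             # adapter factor, initialised to zero
A = rng.normal(size=(r, d_in)) * 0.01

# Only A and B receive gradients during finetuning; W never changes.
W_adapted = W + B @ A

x = rng.normal(size=d_in)
y = W_adapted @ x                    # identical to W @ x until B is trained away from zero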

4️⃣ Retrieval-Augmented Generation (RAG)

RAG separates:

  • knowledge storage
  • language generation

4.1 RAG Architecture

  1. Embed documents
  2. Store in vector database
  3. Retrieve relevant chunks
  4. Inject into prompt
  5. Generate answer

4.2 Embedding Step

Each chunk is mapped to an embedding vector.

Similarity:

$$ \text{sim}(\mathbf{q}, \mathbf{e}_i) = \cos(\mathbf{q}, \mathbf{e}_i) $$
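
A minimal retrieval sketch over toy embeddings. The vectors and chunks here are made up; a real system would use an embedding model and a vector database:

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "vector store": chunk text -> embedding
store = {
    "Messi won the 2022 World Cup": np.array([0.9, 0.1, 0.0]),
    "Transformers use attention":   np.array([0.1, 0.9, 0.2]),
}

query_vec = np.array([0.8, 0.2, 0.1])      # embedding of the user question
ranked = sorted(store, key=lambda c: cosine(query_vec, store[c]), reverse=True)
top_chunks = ranked[:1]                    # retrieve the most similar chunk(s)

prompt = "Context:\n" + "\n".join(top_chunks) + "\n\nAnswer the question."
print(prompt)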


4.3 Why RAG Works

  • avoids catastrophic forgetting
  • updates instantly
  • preserves factual accuracy
  • auditable sources

4.4 Why RAG Fails

RAG does NOT:

  • reason across many documents
  • plan multi-step solutions
  • resolve contradictions

RAG retrieves text. The model still guesses.


5️⃣ Training vs Finetuning vs RAG (Truth Table)

| Method      | Changes Weights   | Adds Knowledge | Real-Time Update |
|-------------|-------------------|----------------|------------------|
| Pretraining | ✅                | ✅             | ❌               |
| Finetuning  | ✅                | ⚠️ limited     | ❌               |
| LoRA        | ⚠️ adapters only  | ⚠️ limited     | ❌               |
| RAG         | ❌                | ✅             | ✅               |

No method replaces the others.


6️⃣ Which Method Is “Best”?

Wrong question.

The correct question:

What part of the system is broken?


6.1 Use Pretraining When

  • building a foundation model
  • massive compute available
  • general capability needed

6.2 Use Finetuning When

  • behavior is wrong
  • format is wrong
  • style is wrong

6.3 Use RAG When

  • knowledge changes
  • traceability matters
  • data is large and sparse

6.4 Use Agents When

  • tasks span time
  • decisions affect future states
  • planning is required

7️⃣ The Real LLM Stack (Systems View)

A real system looks like:

$$ \text{LLM} + \text{Memory} + \text{Retrieval} + \text{Tools} + \text{Control Loop} $$

No single technique is sufficient.


8️⃣ Final Reality Check

  • MoE scales capacity, not understanding
  • Finetuning shapes behavior, not truth
  • RAG supplies facts, not reasoning
  • Training teaches language, not agency

🧠 Closing Insight

Intelligence is not in the model.
Intelligence emerges from the system.


🧩 Vocabulary, Tokens, and Embeddings (The Real Basics)

Before talking about LLMs, agents, or reasoning,
we must understand how text becomes numbers.

Everything starts here.


1️⃣ Is One Vocabulary Entry One Word?

Not necessarily.

A vocabulary (vocab) is a list of tokens, not words.

A token can be:

  • a full word
  • part of a word
  • punctuation
  • a space
  • a symbol

Example: “Lionel Messi”

Depending on the tokenizer, this can become:

  • "Lionel" + " Messi"
  • "Lion" + "el" + " Mess" + "i"
  • "Li" + "on" + "el" + " Mess" + "i"

So:

❌ 1 vocabulary entry ≠ 1 word
✅ 1 vocabulary entry = 1 token


2️⃣ What Is a Token?

A token is the smallest unit the model processes.

Formally:

$$ \text{text} \rightarrow \text{tokens} = \{t_1, t_2, \dots, t_n\} $$

Example sentence:

“Lionel Messi is the best footballer”

Might tokenize as:


["Lionel", " Messi", " is", " the", " best", " football", "er"]

The model never sees letters or words —
it sees token IDs.

Example:


"Lionel" → 18372
" Messi" → 91827


3️⃣ Why Tokens Instead of Words?

Because:

  • languages differ
  • words are ambiguous
  • new words appear

Tokens allow:

  • multilingual support
  • efficient compression
  • shared subwords across languages

Example:

  • English: football
  • Spanish: fútbol
  • Name: Messi

The tokenizer learns reusable pieces.


4️⃣ From Token to Vector

Once we have token IDs, we convert them to vectors.

This is done using an embedding matrix.


Embedding Lookup

Each token ID maps to a vector:

$$ \text{Embedding}: \text{token ID} \rightarrow \mathbf{v} \in \mathbb{R}^d $$

Example (simplified):


"Lionel" → [0.12, -0.88, 0.34, ..., 0.05]
"Messi"  → [0.91,  0.11, 0.77, ..., -0.42]

Typical dimensions:

  • 768
  • 1024
  • 4096+
  • even 8192+
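
A minimal embedding-lookup sketch in NumPy, matching the example above (the vocabulary size and dimension here are illustrative):

import numpy as np

vocab_size, d = 100_000, 8
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d))     # embedding matrix, learned during training

token_ids = [18372, 91827]               # "Lionel", " Messi"
vectors = E[token_ids]                   # one d-dimensional row per token
print(vectors.shape)                     # (2, 8)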

5️⃣ What Is an Embedding?

An embedding is a vector that represents meaning.

It encodes:

  • semantics
  • relationships
  • context potential

Important:

Embeddings do NOT store definitions
They store positions in meaning space


Lionel Messi in Embedding Space

Conceptually:

  • “Messi” is close to:

    • “football”
    • “Argentina”
    • “Barcelona”
    • “GOAT”
  • Far from:

    • “quantum mechanics”
    • “cooking recipe”
    • “neural network”

Distance matters.


6️⃣ Vector, Embedding, Latent Space (Same Family)

  • Vector: a list of numbers
  • Embedding: a vector with learned meaning
  • Latent space: the space where embeddings live

Formally:

$$ \mathbf{v}_{\text{Messi}} \in \mathbb{R}^d $$

Where closeness implies semantic similarity.


7️⃣ What Does an Embedding Look Like?

It looks like nothing human-readable.

Example (tiny fake embedding):


Messi →
[ 0.83, -1.12, 0.44, 0.09, -0.31 ]


8️⃣ Key Takeaways (Do Not Skip)

  • Vocabulary ≠ words
  • Tokens are subword units
  • Tokens become vectors
  • Vectors live in embedding space
  • Meaning = geometry, not text

🧠 Final Intuition

Humans read words.
Models navigate vector spaces.


🔍 Attention: How Vectors Talk to Each Other

After tokenization and embeddings,
we have vectors — but isolated vectors mean nothing.

Attention is the mechanism that lets vectors interact.

This chapter explains attention numerically and intuitively.


1️⃣ The Core Problem

We start with embeddings:

$$ \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n $$

Each vector represents a token.

Problem:

  • each vector knows only itself
  • language meaning depends on relationships

Example:

“Lionel Messi scored a goal”

The meaning of “scored” depends on “Messi”.


2️⃣ Attention Is Weighted Interaction

Attention answers one question:

Which other tokens matter for this token?

Mathematically:

$$ \text{Attention}(\mathbf{q}, \mathbf{k}, \mathbf{v}) $$

Where:

  • Query (Q) = what I am looking for
  • Key (K) = what I offer
  • Value (V) = what I contribute

3️⃣ Creating Q, K, V

Each embedding is linearly projected:

$$ \mathbf{q}_i = W_Q \mathbf{x}_i $$

$$ \mathbf{k}_i = W_K \mathbf{x}_i $$

$$ \mathbf{v}_i = W_V \mathbf{x}_i $$

Same vector → three different roles.


4️⃣ Attention Scores (Who Listens to Whom)

For token ( i ), compute similarity to token ( j ):

$$ \alpha_{ij} = \frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_k}} $$

This measures relevance.

Example:

  • “scored” attends strongly to “Messi”
  • weakly to “a”

5️⃣ Softmax: Turning Scores into Weights

Normalize scores:

$$ w_{ij} = \text{softmax}_j(\alpha_{ij}) $$

Now:

  • weights sum to 1
  • attention becomes probabilistic focus

6️⃣ Mixing Values (Talking Happens Here)

Each token's output is a weighted sum of the value vectors:

$$ \mathbf{z}_i = \sum_j w_{ij}\, \mathbf{v}_j $$

Interpretation:

“Token i becomes a mixture of other tokens.”

This is contextualization.


7️⃣ Lionel Messi Example

Sentence:

“Lionel Messi won the Ballon d’Or”

Token “won”:

  • attends strongly to “Messi”
  • moderately to “Ballon”
  • weakly to “the”

Result:

  • “won” now encodes who won what

8️⃣ Self-Attention vs Cross-Attention

  • Self-attention: tokens attend to each other
  • Cross-attention: tokens attend to external memory (e.g. RAG)

Same math. Different source.


9️⃣ Multi-Head Attention

One attention head is limited.

So we use many:

$$ \text{MultiHead} = \text{Concat}(\text{head}_1, \dots, \text{head}_h) $$

Each head learns:

  • syntax
  • coreference
  • semantics
  • long-range relations

🧠 Key Insight

Attention does not store meaning.
It routes meaning.


🧠 From Embeddings to Thought

Embeddings are static.

Thought is dynamic.

This chapter explains how repeated attention + transformation turns vectors into reasoning.


1️⃣ One Layer Is Not Thought

After one attention layer:

$$ \mathbf{H}^{(1)} = \text{Attention}(\mathbf{X}) $$

We get:

  • contextualized vectors
  • shallow understanding

But no reasoning yet.


2️⃣ Stacking Layers = Iterative Refinement

Transformers stack layers:

$$ \mathbf{H}^{(l+1)} = \text{FFN}(\text{Attention}(\mathbf{H}^{(l)})) $$

Each layer:

  • refines representations
  • integrates broader context
  • abstracts meaning

3️⃣ Feedforward Networks (Nonlinearity)

FFN:

$$ \text{FFN}(\mathbf{x}) = W_2 \sigma(W_1 \mathbf{x}) $$

Purpose:

  • mix features
  • create nonlinear concepts
  • enable abstraction
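
A minimal sketch of one transformer block, stacking the attention and FFN steps above with residual connections (layer norm is omitted, ReLU stands in for the usual activation, and the same toy weights are reused across layers only to keep the sketch short):

import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(H, W_Q, W_K, W_V, W_1, W_2):
    # 1) attention: tokens exchange information
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    H = H + A                                   # residual connection
    # 2) feedforward: per-token nonlinear transformation
    F = np.maximum(H @ W_1, 0) @ W_2
    return H + F                                # residual connection

rng = np.random.default_rng(0)
n, d, d_ff = 5, 8, 32
H = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
W_1, W_2 = rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))
for _ in range(4):                              # stacking layers = iterative refinement
    H = transformer_block(H, W_Q, W_K, W_V, W_1, W_2)
print(H.shape)                                  # (5, 8)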

4️⃣ Thought as Trajectory in Latent Space

A “thought” is not a symbol.

It is a path:

$$ \mathbf{x}^{(0)} \rightarrow \mathbf{x}^{(1)} \rightarrow \dots \rightarrow \mathbf{x}^{(L)} $$

Each layer moves the vector through latent space.


5️⃣ Example: Simple Reasoning

Prompt:

“Messi is older than Neymar. Who is younger?”

Early layers:

  • identify entities

Middle layers:

  • encode comparison

Later layers:

  • resolve answer

No rule engine.
Just geometry evolving.


6️⃣ Why This Feels Like Thinking

Because:

  • representations become more abstract
  • irrelevant details are suppressed
  • relationships dominate

This mirrors human cognition functionally, not biologically.


7️⃣ Why It Sometimes Fails

Because:

  • reasoning is approximate
  • errors compound
  • no explicit truth-checking

Thought is simulated, not guaranteed.


8️⃣ Chain-of-Thought Is Externalized Latent Process

CoT exposes:

$$ \text{hidden transformations} \rightarrow \text{text} $$

Open models allow this because:

  • transparency
  • debuggability
  • research value

🧠 Final Insight

LLMs do not “think” symbolically.
They evolve representations.


🤖 Foundations of Agentic AI and Large Language Models

Artificial Intelligence today is no longer about single predictions.

Modern AI systems:

  • reason
  • plan
  • act
  • observe
  • adapt

At the center of this shift is the combination of:

  • Large Language Models (LLMs)
  • Agentic workflows
  • System-level design

This chapter is a true basic foundation, written for nerds who want to understand, not just use.


1️⃣ What Is an AI Agent?

An AI agent is not a model.

An agent is a system that uses a model to interact with an environment over time.

Formally, an agent:

  • receives observations
  • maintains internal state
  • selects actions
  • receives feedback
  • updates itself

A minimal agent loop:


Observation → Reasoning → Action → Environment → Observation


Agent vs Model

| Model         | Agent           |
|---------------|-----------------|
| Stateless     | Stateful        |
| Single output | Continuous loop |
| No tools      | Tool-using      |
| No memory     | Memory-aware    |

An LLM becomes agentic only when embedded in this loop.


2️⃣ What Does an Agent Actually Do?

An agent typically performs tasks like:

  • searching information
  • writing code
  • running programs
  • analyzing data
  • making decisions
  • coordinating tools

Example:


User: "Analyze my dataset and plot trends"

Agent:

1. Understand task
2. Plan steps
3. Load data
4. Run code
5. Inspect output
6. Adjust
7. Respond

This is goal-directed behavior.


3️⃣ What Is a Token?

LLMs do not see words.

They see tokens.

A token is a discrete unit produced by a tokenizer.

Example:


"Artificial intelligence is powerful"
→ ["Artificial", " intelligence", " is", " powerful"]

Sometimes:

  • one word = one token
  • one word = many tokens
  • symbols, code, spaces are tokens

Are More Tokens Better?

No.

Token count affects:

  • cost
  • latency
  • memory usage

What matters is information density, not raw token count.


4️⃣ Token → Vector → Embedding

Each token is mapped to a vector.

This mapping is called an embedding.

Formally:

$$ \text{Embedding}: \mathcal{V} \rightarrow \mathbb{R}^d $$

Where:

  • V = vocabulary
  • d = embedding dimension

What Is a Vector?

A vector is just a list of numbers:

$$ \mathbf{v} = [v_1, v_2, \dots, v_d] $$

Each dimension encodes latent features:

  • syntax
  • semantics
  • style
  • function

5️⃣ What Is Latent Space?

Latent space is the geometry of meaning.

In latent space:

  • similar concepts are close
  • different concepts are far apart

Distance is often measured by cosine similarity:

$$ \text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|} $$


What Is “z”?

In ML papers, ( z ) usually denotes a latent variable.

In LLMs:

  • embeddings
  • hidden states
  • attention outputs

are all forms of latent representations.


6️⃣ Probability in LLMs

LLMs model probability distributions over tokens.

At each step:

$$ P(x_t \mid x_1, x_2, \dots, x_{t-1}) $$

The model predicts:

“Given everything so far, what token is most likely next?”

Training minimizes cross-entropy loss:

$$ \mathcal{L} = -\sum_t \log P(x_t \mid x_{<t}) $$


7️⃣ Why LLMs Feel “Smart”

Because reasoning emerges from:

  • scale
  • representation
  • optimization

Not because the model “understands” like a human.


8️⃣ Architecture: Mixture-of-Experts (MoE)

MoE replaces one big network with many specialists.

Instead of using all parameters every time, the model selects a few experts per token.


Key Parameters Explained

Architecture: Mixture-of-Experts (MoE)

A sparse architecture with expert routing.


Total Parameters: 1T

$$ 1T = 1{,}000{,}000{,}000{,}000 $$

These include all experts combined.


Activated Parameters: 32B

Only a subset is used per token:

$$ \text{Compute} \propto 32B \ll 1T $$

This makes MoE scalable.


Number of Layers: 61

Total transformer layers.


Number of Dense Layers: 1

One shared layer used by all tokens.


Attention Hidden Dimension: 7168

Size of token representation inside attention.


MoE Hidden Dimension (per Expert): 2048

Each expert is smaller and specialized.


Number of Experts: 384

Total pool of experts.


Selected Experts per Token: 8

Router chooses 8 experts per token.


Number of Shared Experts: 1

Always-active expert for stability.


Vocabulary Size: 160K

Number of tokens the model understands.


Context Length: 256K

Maximum tokens in one forward pass.


Attention Mechanism: MLA

Multi-head Latent Attention, an attention variant that compresses the key/value cache so long contexts remain affordable.


Activation Function: SwiGLU

$$ \text{SwiGLU}(x) = \big(\text{SiLU}(xW_1) \odot xW_2\big)\,W_3 $$

Smooth, stable, expressive.


Vision Encoder: MoonViT

Visual encoder for multimodal inputs.


Parameters of Vision Encoder: 400M

Separate vision model feeding into LLM.


9️⃣ What Tasks Are LLMs Trained For?

Primary objective:

  • Next-token prediction

Emergent abilities:

  • reasoning
  • coding
  • translation
  • summarization
  • planning
  • tool use

These arise from generalization, not explicit programming.


🔟 How Are LLMs Evaluated?

Evaluation uses benchmarks:

  • MMLU
  • GSM8K (math)
  • HumanEval (code)
  • BIG-bench
  • Agent task suites

Metrics include:

  • accuracy
  • pass@k
  • reasoning depth
  • tool success rate

1️⃣1️⃣ Hardware Required to Train Models

Training requires massive compute.

Example for 100B+ models:

  • NVIDIA H100 GPUs
  • Thousands of GPUs
  • Millions of GPU-hours

Approximate relation:

$$ \text{Training Cost} \propto \text{Parameters} \times \text{Tokens} $$


1️⃣2️⃣ Why You Can Ask Anything and Get Answers

LLMs work because they learn:

  • patterns
  • abstractions
  • relationships

They do not retrieve answers like a database.

They generate answers probabilistically.


1️⃣3️⃣ From LLMs to Agentic AI

An agent wraps the LLM with:

  • memory
  • tools
  • control logic
  • safety constraints

This transforms:

language modeling
into
decision-making


🧠 Final Mental Model

  • Tokens are symbols
  • Embeddings are meaning
  • Latent space is geometry
  • Probability drives generation
  • MoE enables scale
  • Agents enable action

🧭 Closing Thought

LLMs are not magic.

They are:

  • mathematics
  • optimization
  • systems engineering

But when combined correctly, they form agentic AI systems.

And that is where modern AI truly begins.


🔤 Tokens, Embeddings, and How Language Becomes Numbers

Large Language Models do not understand words.

They do not understand sentences.

They do not understand languages.

They understand numbers.

This chapter explains—step by step—how:

  • text becomes tokens
  • tokens become vectors
  • vectors become embeddings
  • embeddings are trained
  • multiple languages coexist in one model

This is the core mechanical foundation of LLMs.


1️⃣ What Is a Token?

A token is the smallest unit of text that a language model processes.

A token is not:

  • necessarily a word
  • necessarily a character
  • necessarily a syllable

It is a unit defined by a tokenizer.


2️⃣ Examples: Words, Subwords, and Symbols

Consider the sentence:


"Language models are powerful."

A tokenizer might produce:


["Language", " models", " are", " powerful", "."]

Each item above is one token.


Subword Tokenization Example

Now consider:


"tokenization"

This may become:


["token", "ization"]

Or even:


["tok", "en", "ization"]

Why?

Because tokenizers optimize for frequency and efficiency, not linguistics.


3️⃣ Why Not Just Use Words?

Using full words causes problems:

  • Vocabulary explodes
  • Rare words are unseen
  • New words cannot be handled

Instead, modern LLMs use subword tokenization.


4️⃣ Byte Pair Encoding (BPE)

Most LLMs use BPE or variants.

The idea is simple:

  1. Start with characters
  2. Merge frequent pairs
  3. Repeat until vocabulary size is reached

In effect, BPE greedily builds a vocabulary that keeps the total token count of the corpus small:

$$ \text{Total Tokens} = \sum_{i} \left|\text{tokenize}(x_i)\right| $$

subject to a fixed vocabulary size, where the sum runs over the training sequences ( x_i ).
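
A minimal BPE-style merge loop in plain Python. The corpus and merge count are toys; real tokenizers also handle bytes, pre-tokenization, and far larger corpora:

from collections import Counter

def bpe_train(words, num_merges):
    # represent each word as a tuple of symbols, starting from single characters
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                    # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])       # fuse the pair into one token
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        corpus = Counter(merged)
    return merges

print(bpe_train(["token", "tokens", "tokenization", "low", "lower"], num_merges=5))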


5️⃣ Tokens Across Languages

LLMs do not have separate vocabularies per language.

They use one shared vocabulary.

Example:

| Language | Tokenization          |
|----------|-----------------------|
| English  | Subwords              |
| Thai     | Character-like chunks |
| Chinese  | Characters            |
| Japanese | Mixed Kanji + Kana    |
| Code     | Keywords + symbols    |

Example: Chinese


"人工智能"
→ ["人", "工", "智", "能"]

Each character is already meaningful, so tokenization is straightforward.


Example: Thai

Thai has no spaces:


"ภาษาไทยยากไหม"

Tokenizer output may look like:


["ภาษา", "ไทย", "ยาก", "ไหม"]

Learned statistically, not linguistically.


6️⃣ Vocabulary Size

Vocabulary size determines how many unique tokens exist.

Typical values:

  • 32K
  • 50K
  • 100K
  • 160K
  • 200K+

A larger vocabulary means:

  • fewer tokens per sentence
  • larger embedding tables
  • higher memory cost

7️⃣ From Token to ID

Each token is mapped to an integer ID.

Example:


"language" → 48321

This is just a lookup.


8️⃣ Token → Vector (Embedding)

Token IDs are mapped to vectors via an embedding matrix.

Formally:

$$ E \in \mathbb{R}^{|\mathcal{V}| \times d} $$

Where:

  • V = vocabulary size
  • d = embedding dimension

Embedding Lookup

Given token ID ( i ):

$$ \mathbf{e}_i = E[i] $$

This vector represents the token in continuous space.


9️⃣ What Is an Embedding?

An embedding is:

  • a dense vector
  • learned during training
  • representing semantic and syntactic properties

🔟 What Is a Vector?

A vector is a list of real numbers:

$$ \mathbf{v} = [v_1, v_2, \dots, v_d] $$

Each dimension has no human-interpretable meaning alone.

Meaning emerges from relative geometry.


1️⃣1️⃣ What Is Latent Space?

Latent space is the space formed by embeddings.

In latent space:

  • distance encodes similarity
  • directions encode relationships

Distance is often measured by cosine similarity:

$$ \text{cosine}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|} $$


1️⃣2️⃣ Tokens in Context: Position Matters

Embeddings alone ignore order.

Transformers add positional information:

$$ \mathbf{h}_t = \mathbf{e}_t + \mathbf{p}_t $$

Where:

  • pt = positional encoding

This allows the model to distinguish:


"dog bites man"
vs
"man bites dog"


1️⃣3️⃣ How Tokens Are Used During Training

Training objective:

$$ P(x_t \mid x_1, \dots, x_{t-1}) $$

For a sequence:


"The cat sat"

Training pairs are:


"The" → "cat"
"The cat" → "sat"


Loss Function

Cross-entropy loss:

$$ \mathcal{L} = -\log P(x_t \mid x_{<t}) $$

The model is penalized if the correct next token has low probability.
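
A minimal sketch of building next-token training pairs and computing the cross-entropy loss from toy probabilities (the probabilities below are invented; a real model produces them over the full vocabulary):

import numpy as np

tokens = ["The", " cat", " sat"]                  # token IDs would be used in practice
pairs = [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]
# [(["The"], " cat"), (["The", " cat"], " sat")]

# Pretend the model assigned these probabilities to the correct next token:
p_correct = np.array([0.40, 0.05])
loss = -np.log(p_correct).sum()                   # cross-entropy over the sequence
print(pairs)
print(round(loss, 3))                             # low probability -> large penalty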


1️⃣4️⃣ How Embeddings Are Learned

Embeddings are not pretrained separately.

They are learned end-to-end.

During backpropagation:

  • gradients flow into embedding vectors
  • frequent tokens update often
  • rare tokens update less

This is why:

  • common words are well-shaped
  • rare words are noisier

1️⃣5️⃣ Multilingual Training

Training data mixes languages.

The model learns:

  • shared structure (logic, syntax)
  • language-specific patterns

This creates cross-lingual embeddings.


1️⃣6️⃣ Are More Tokens Better?

No.

More tokens means:

  • more compute
  • more memory
  • slower inference

Better tokenization means:

  • fewer tokens
  • richer embeddings

Quality beats quantity.


1️⃣7️⃣ Summary Mental Model

  • Text → tokens
  • Tokens → IDs
  • IDs → vectors
  • Vectors → latent space
  • Latent space → probability
  • Probability → language generation

🧠 Final Intuition

Language models do not store sentences.

They store geometric relationships between tokens.

Meaning is not memorized.

It is emergent.


🧠 Modern Large Language Models (LLMs)

Large Language Models are no longer just “big neural networks that predict the next word”.

They are:

  • Reasoning engines
  • Tool-using agents
  • Modular systems
  • Open-weight infrastructures

This article is a deep but foundational recap of modern LLMs—written for people who already speak ML, but feel the ecosystem is moving too fast to track.

If you’ve ever asked:

“Wait… what exactly is MoE, LoRA, RAG, quantization, or agentic LLMs?”

This is for you.


1️⃣ What Does “Model Tree” Mean?

When browsing open models (e.g. on HuggingFace), you often see a structure like:


openai/gpt-oss-120b
├── Adapters
├── Finetunes
├── Merges
├── Quantizations

This is not chaos.
It is evolution.

Think of a model tree as a genetic family:

  • Base model → the pretrained brain
  • Adapters → plug-in skills
  • Finetunes → retrained personalities
  • Merges → hybrid offspring
  • Quantizations → compressed forms

2️⃣ Base Model: gpt-oss-120b


Model size: 120B parameters
Tensor type: BF16 / U8
License: Apache 2.0

What is “120B parameters”?

A parameter is a learned scalar value inside the neural network.


120B = 120,000,000,000 parameters

Memory footprint (roughly):

$$ \text{Memory} \approx \text{Parameters} \times \text{Bytes per parameter} $$

  • BF16 → ~2 bytes → ~240 GB
  • FP32 → ~4 bytes → ~480 GB

This is why compression, sharding, and MoE exist.
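
A quick back-of-the-envelope sketch of the memory formula above (weights only; activations, KV cache, and optimizer states add substantially more):

def weight_memory_gb(params, bytes_per_param):
    return params * bytes_per_param / 1e9        # decimal GB, weights only

for name, b in [("BF16", 2), ("FP32", 4), ("INT8", 1), ("INT4", 0.5)]:
    print(name, round(weight_memory_gb(120e9, b)), "GB")
# BF16 ~240 GB, FP32 ~480 GB, INT8 ~120 GB, INT4 ~60 GB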


3️⃣ Tensor Types: BF16 vs U8

| Type      | Meaning                 | Usage                             |
|-----------|-------------------------|-----------------------------------|
| BF16      | Brain Floating Point 16 | Training / high-quality inference |
| FP16      | IEEE Half Precision     | Legacy GPU inference              |
| U8 / INT8 | 8-bit Integer           | Fast & cheap inference            |
| INT4      | 4-bit Integer           | Extreme compression               |

Quantization trades precision for efficiency.

$$ \text{Smaller precision} \Rightarrow \text{Less memory} \Rightarrow \text{Faster inference} $$


4️⃣ Adapters and LoRA (Low-Rank Adaptation)

Adapters are modular fine-tuning layers.

LoRA works by injecting a low-rank update:

$$ W' = W + \Delta W $$

Where:

$$ \Delta W = BA \quad \text{with} \quad \text{rank}(BA) \ll \text{rank}(W) $$

Why LoRA matters

  • Base model is frozen
  • Only a few million parameters trained
  • Easy to distribute
  • Easy to swap

This is why open-source LLMs scale socially.


5️⃣ Finetuning vs Pretraining

Pretraining

Pretraining teaches the model language itself:

$$ \mathcal{L} = -\sum_{t} \log P(x_t \mid x_{<t}) $$

  • Trillions of tokens
  • Next-token prediction
  • Costs millions of GPU-hours
  • One-time process

Finetuning

Finetuning teaches the model how to behave.

  • Instruction following
  • Reasoning style
  • Domain specialization
  • Safety alignment

This is where “chat”, “assistant”, and “expert” personalities come from.


6️⃣ Quantized Models

Quantized models are the same brain, cheaper hardware.

Examples:

  • 120B → INT8 → fits on multi-GPU servers
  • 7B → INT4 → fits on a laptop GPU

Trade-off:

$$ \text{Compression} \uparrow \Rightarrow \text{Accuracy} \downarrow \ (\text{slightly}) $$
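
A minimal sketch of symmetric per-tensor INT8 quantization, to make the precision-for-memory trade concrete (real schemes are per-channel or per-group and handle outliers more carefully):

import numpy as np

def quantize_int8(W):
    scale = np.abs(W).max() / 127.0              # map the largest weight to +/-127
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(W)
error = np.abs(W - dequantize(q, scale)).max()
print(q.dtype, round(float(error), 4))           # int8, small but nonzero reconstruction error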


7️⃣ What is gpt-oss-safeguard-120b?

This is not the main language model.

It is a safety enforcement model.

Responsibilities:

  • Input filtering
  • Output moderation
  • Policy enforcement
  • Risk classification

In proprietary APIs, this layer is invisible.
In open-weight systems, you must build it yourself.


8️⃣ Mixture-of-Experts (MoE) Architecture

MoE is the defining architecture of modern large-scale LLMs.

Dense model:

Every parameter is used every time.

MoE model:

Only some experts activate per token.


MoE at scale

| Property             | Value |
|----------------------|-------|
| Total Parameters     | 1T    |
| Activated Parameters | 32B   |
| Number of Experts    | 384   |
| Experts per Token    | 8     |

This means:

$$ \text{Compute cost} \propto 32B \quad \text{not} \quad 1T $$


Intuition

Instead of:

“Use the whole brain for everything”

We get:

“Call 8 specialists per word”


9️⃣ Architecture Details Explained

Attention Heads


64 heads

Each head attends to different relationships:

  • Syntax
  • Semantics
  • Long-range dependencies

SwiGLU Activation

$$ \text{SwiGLU}(x) = \big(\text{SiLU}(xW_1) \odot xW_2\big)\,W_3 $$

Why it’s used:

  • Smooth gradients
  • Better expressivity
  • Stable training

Context Length: 256K

This enables:

  • Whole-codebase reasoning
  • Long legal / medical documents
  • Multi-step agent planning

🔟 Tokenization (o200k_harmony)

Vocabulary size:

$$ |\mathcal{V}| \approx 200{,}000 $$

This includes:

  • Natural language
  • Code tokens
  • Math symbols
  • Tool-call syntax
  • Agent control tokens

Tokenizer design is not trivial—it directly affects reasoning quality.


1️⃣1️⃣ RAG (Retrieval-Augmented Generation)

RAG adds external memory:


User → Retriever → Documents → LLM → Answer

Strengths:

  • Fresh knowledge
  • Enterprise data
  • Auditable sources

Limitations:

  • Weak for deep reasoning
  • Brittle retrieval
  • Latency overhead

RAG is evolving—not obsolete.


1️⃣2️⃣ Agentic LLMs

Modern LLMs are not single-shot generators.

They operate in loops:


Plan → Act → Observe → Reflect → Repeat

Tools include:

  • Web search
  • Python execution
  • Databases
  • APIs

This is where LLMs become systems, not models.


1️⃣3️⃣ Putting It All Together

A modern LLM stack looks like:


Pretrained MoE LLM
↓
Finetune / LoRA
↓
Quantization
↓
RAG + Tools
↓
Agent Loop
↓
Safety Guards


🧠 Final Thoughts

If LLMs feel overwhelming, that’s because:

They are no longer just models.

They are:

  • Architectures
  • Systems
  • Ecosystems
  • Infrastructures

Understanding them today means thinking like:

  • A machine learning researcher
  • A systems engineer
  • A product architect

And yes — the pace is brutal.

But now, you’re back in control.


🤖 Agentic Large Language Models

Modern LLMs are no longer passive text generators.

They are agents.

They:

  • Plan actions
  • Call tools
  • Observe results
  • Reflect and revise
  • Loop until completion

Understanding agentic workflows is now a core literacy for anyone working with LLMs.


1️⃣ What Does “Agentic” Actually Mean?

An agentic LLM is a model embedded inside a control loop.

At minimum, the loop looks like:


Thought → Action → Observation → Thought → ...

This is not metaphorical.
It is a programmatic execution cycle.


The canonical agent loop


1. Parse user intent
2. Plan intermediate steps
3. Decide which tool to call
4. Execute tool
5. Observe results
6. Update internal state
7. Continue or stop

This loop transforms LLMs from:

“Answer machines”

into:

“Task-completing systems”
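
A minimal sketch of that loop in Python, with a stubbed model and one fake tool. The call_llm and web_search functions here are placeholders invented for illustration, not a real API:

def call_llm(state):
    """Placeholder for a real model call; returns (thought, action, argument)."""
    if "observation" not in state:
        return "I need data first.", "web_search", state["task"]
    return "I have enough information.", "finish", f"Answer based on: {state['observation']}"

def web_search(query):
    return f"[stub results for '{query}']"        # fake tool output

def run_agent(task, max_steps=5):
    state = {"task": task}
    for _ in range(max_steps):                    # Thought -> Action -> Observation loop
        thought, action, arg = call_llm(state)
        if action == "finish":
            return arg
        state["observation"] = web_search(arg)    # execute the tool, feed the result back
    return "Gave up after max_steps."

print(run_agent("latest Ballon d'Or winner"))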


2️⃣ ReAct, Plan-Act-Reflect, and Toolformers

Most modern agents descend from three ideas:

🔹 ReAct (Reason + Act)

The model alternates between reasoning and acting.


Thought: I need recent data.
Action: WebSearch(query="...")
Observation: ...

Reasoning grounds tool usage.


🔹 Plan–Act–Reflect

More structured agent loop:


Plan → Act → Observe → Reflect → Re-plan

Reflection is critical:

  • Error correction
  • Long-horizon reasoning
  • Self-debugging

🔹 Toolformers

LLMs trained to decide when tools are useful, not just how to use them.

This is why modern models expose:

  • Web search
  • Python
  • File systems
  • APIs

3️⃣ Why Chain-of-Thought (CoT) Matters

Chain-of-Thought is not verbosity.

It is externalized intermediate computation.

Formally, CoT approximates:

$$ P(y \mid x) = \sum_{z} P(y \mid z, x) P(z \mid x) $$

Where:

  • ( x ) = input
  • ( z ) = latent reasoning steps
  • ( y ) = output

4️⃣ Why Open Models Expose CoT

This is a philosophical and architectural difference.

Proprietary models

  • CoT is hidden or summarized
  • Exposed reasoning is filtered
  • System-level alignment is enforced upstream

Open-weight models

  • CoT is part of the artifact
  • Debuggable
  • Inspectable
  • Modifiable

This is intentional.


Why the open community wants CoT

  1. Debugging
    • Inspect failure modes
  2. Research
    • Analyze reasoning depth
  3. Alignment
    • Study safety trade-offs
  4. Education
    • Teach reasoning, not just answers

Open models treat CoT as:

a feature, not a liability


5️⃣ Adjustable Reasoning Effort

Modern reasoning models expose a control variable:

  • Think fast (cheap)
  • Think slow (deep)

Conceptually:

$$ \text{Compute} \propto \text{Reasoning Depth} $$

This enables:

  • Cost-aware deployment
  • Adaptive intelligence
  • Agent-level optimization

6️⃣ Why CoT Is Risky — and Still Open

CoT can leak:

  • Sensitive heuristics
  • Attack strategies
  • Unsafe reasoning paths

Open models accept this risk because:

  • They prioritize transparency
  • Safety is enforced at the system level, not hidden logic

This shifts responsibility to the system designer.


7️⃣ Mapping the Modern LLM Ecosystem

Let’s zoom out.


🔵 OpenAI

Philosophy: System-level safety, agent-first APIs

  • Strong reasoning models
  • Deep tool integration
  • Hidden CoT by default
  • Heavy alignment layers

Strengths:

  • Production-grade agents
  • Robust safety
  • Best-in-class reasoning

Trade-off:

  • Limited transparency
  • No weight access

🟣 Meta (LLaMA family)

Philosophy: Open weights, scalable infrastructure

  • Dense + MoE research
  • Strong multilingual support
  • Community-driven fine-tuning

Strengths:

  • Foundation for OSS ecosystem
  • Research-friendly
  • Broad adoption

Trade-off:

  • Safety is DIY
  • Tooling varies by implementation

🟢 Mistral

Philosophy: Efficiency + elegance

  • MoE-first designs
  • Strong small/medium models
  • European regulatory awareness

Strengths:

  • High performance per parameter
  • Clean architecture
  • Excellent for on-prem

Trade-off:

  • Smaller ecosystem (for now)

⚫ Open-Source Community (OSS)

This is not one actor — it is an ecosystem.

Includes:

  • Weight merges
  • Custom LoRA adapters
  • Experimental architectures
  • Specialized agents

OSS prioritizes:

  • Transparency
  • Modularity
  • Hackability

Risk:

  • Inconsistent safety
  • Fragmentation

8️⃣ Comparative Mental Model

| Axis     | OpenAI      | Meta     | Mistral  | OSS          |
|----------|-------------|----------|----------|--------------|
| Weights  | Closed      | Open     | Open     | Open         |
| CoT      | Hidden      | Exposed  | Exposed  | Exposed      |
| Safety   | Centralized | Optional | Optional | DIY          |
| Agents   | Native      | External | External | Experimental |
| Research | Controlled  | Open     | Focused  | Chaotic      |

9️⃣ Why This Ecosystem Exists

No single model can optimize for:

  • Safety
  • Transparency
  • Performance
  • Cost
  • Control

Different actors choose different trade-offs.

This diversity is healthy.


🔟 The New Role of the Practitioner

Working with LLMs today means you are no longer just a user.

You are:

  • A system designer
  • A safety engineer
  • A reasoning architect

Understanding agentic workflows and CoT exposure is mandatory.


🧠 Final Synthesis

LLMs are no longer:

“Models you query”

They are:

“Systems you design”

Agentic workflows provide agency.
Chain-of-Thought provides cognition.
Open ecosystems provide freedom.

And freedom always comes with responsibility.


🧠 Attention, Scaling Laws, and the Emergence of Reasoning

If embeddings explain what language means,
attention explains how meaning is composed.

This chapter answers five fundamental questions:

  1. What is attention—numerically?
  2. Why self-attention works so well
  3. How training dynamics and scaling laws shape intelligence
  4. When tokenization breaks models
  5. How embeddings turn into reasoning

This is where LLMs stop being “vector machines”
and start behaving like reasoning systems.


1️⃣ Attention Explained with Numbers

Attention is a weighted averaging mechanism.

Each token decides:

“Which other tokens matter to me right now?”


Step 1: From Embeddings to Q, K, V

For each token embedding:

$$ \mathbf{q} = \mathbf{x}W_Q,\quad \mathbf{k} = \mathbf{x}W_K,\quad \mathbf{v} = \mathbf{x}W_V $$

Where:

  • q = query
  • k = key
  • v = value

All are vectors.


Step 2: Similarity Scores

For token ( i ) attending to token ( j ):

$$ s_{ij} = \frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_k}} $$

This measures relevance.


Step 3: Softmax = Probability Distribution

$$ \alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{j'} \exp(s_{ij'})} $$

Attention is probabilistic focus.


Step 4: Weighted Sum

The output representation of token ( i ) is a weighted sum of the value vectors:

$$ \mathbf{z}_i = \sum_j \alpha_{ij}\, \mathbf{v}_j $$

Each token becomes a mixture of the tokens it attends to.


2️⃣ Why Self-Attention Works

Self-attention allows every token to:

  • access global context
  • dynamically reweight importance
  • adapt per task and per position

This solves three core problems at once:

  • long-range dependency
  • variable structure
  • parallel computation

Key Insight

Self-attention is content-addressable memory.

Instead of indexing by position, tokens index by meaning.


3️⃣ Multi-Head Attention: Many Views of Meaning

In practice, attention is multi-headed.

$$ \text{Attention} = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W_O $$

Each head learns different relationships:

  • syntax
  • semantics
  • coreference
  • arithmetic
  • code structure

This is distributed reasoning.
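
A minimal NumPy sketch of splitting one attention computation into several heads and projecting back with W_O (shapes, head count, and random weights are illustrative):

import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    n, d = X.shape
    d_h = d // n_heads
    def split(M):
        # project, then split the feature dimension into n_heads smaller heads
        return (X @ M).reshape(n, n_heads, d_h).transpose(1, 0, 2)   # (heads, n, d_h)
    Q, K, V = split(W_Q), split(W_K), split(W_V)
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)) @ V          # attention per head
    concat = A.transpose(1, 0, 2).reshape(n, d)                       # concatenate heads
    return concat @ W_O                                               # output projection

rng = np.random.default_rng(0)
n, d, h = 5, 16, 4
X = rng.normal(size=(n, d))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d, d)) for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)           # (5, 16)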


4️⃣ Why Attention Enables Reasoning

Reasoning requires:

  • variable binding
  • relational comparison
  • composition

Attention enables:

$$ \text{Reasoning} \approx \text{Iterative Context Mixing} $$

Each layer refines representations by recontextualizing tokens.


5️⃣ Training Dynamics: How Models Actually Learn

LLMs are trained with gradient descent.

Each update minimizes:

$$ \mathcal{L} = -\sum_t \log P(x_t \mid x_{<t}) $$

Learning emerges from:

  • many small updates
  • massive data
  • overparameterization

Optimization Intuition

Early training:

  • learns token statistics

Mid training:

  • learns syntax and patterns

Late training:

  • learns abstractions and reasoning heuristics

6️⃣ Scaling Laws

Empirically, performance follows power laws.

$$ \mathcal{L}(N, D, C) \propto N^{-\alpha} D^{-\beta} C^{-\gamma} $$

Where:

  • ( N ) = parameters
  • ( D ) = data
  • ( C ) = compute

Consequence

  • Bigger models → better reasoning
  • More data → better generalization
  • More compute → smoother optimization

There is no sharp intelligence threshold—only scale.


7️⃣ Why Bigger Models Reason Better

Large models:

  • store more abstractions
  • represent deeper hierarchies
  • maintain longer dependencies

Reasoning is not programmed. It emerges when capacity is sufficient.


8️⃣ When Tokenization Goes Wrong

Tokenization is a silent failure mode.

Bad tokenization causes:

  • excessive token counts
  • broken morphemes
  • semantic fragmentation

Example: Over-Fragmentation


"electromagnetism"
→ ["elec", "tro", "mag", "net", "ism"]

Meaning is diluted across tokens.


Multilingual Failure

Low-resource languages may:

  • use many tokens per word
  • receive fewer gradient updates
  • have poorer embeddings

This directly harms performance.


9️⃣ Tokenization and Reasoning Errors

Reasoning depends on stable symbols.

If numbers, variables, or operators are split poorly:

  • math fails
  • code fails
  • logic fails

This is why modern tokenizers:

  • include digits
  • include operators
  • include code tokens

🔟 From Embeddings to Reasoning

Embeddings alone do not reason.

Reasoning emerges from:

  • attention
  • depth
  • recurrence across layers

Each layer computes:

$$ \mathbf{H}^{(l+1)} = \text{TransformerBlock}(\mathbf{H}^{(l)}) $$

This is iterative refinement.


Reasoning as Trajectory in Latent Space

A reasoning chain is a path:

$$ \mathbf{z}_0 \rightarrow \mathbf{z}_1 \rightarrow \dots \rightarrow \mathbf{z}_T $$

Each step refines belief.


1️⃣1️⃣ Why Chain-of-Thought Helps

Explicit reasoning externalizes latent steps.

It:

  • stabilizes trajectories
  • reduces entropy
  • improves correctness

But the real reasoning happens inside the vectors.


1️⃣2️⃣ Summary Mental Model

  • Tokens are symbols
  • Embeddings are points
  • Attention is interaction
  • Layers are refinement
  • Scale is capacity
  • Reasoning is emergence

🧠 Final Intuition

LLMs do not “think” like humans.

They:

  • transform vectors
  • mix context
  • optimize probabilities

Yet from this process, reasoning emerges.

That is the core miracle of modern AI.


🔄 Why Transformers Replace RNNs Forever

Recurrent Neural Networks (RNNs) were once the backbone of sequence modeling.

Transformers ended that era.

This chapter explains why this replacement is permanent, not a trend.


1️⃣ What RNNs Were Trying to Solve

Language is sequential.

RNNs model sequences by recurrence.

This looks elegant:

  • memory through hidden state
  • time-aware processing

But elegance does not scale.


2️⃣ The Fundamental Limits of RNNs

2.1 Vanishing and Exploding Gradients

Backpropagation through time multiplies Jacobians.

This product either:

  • shrinks to zero
  • explodes to infinity

No architecture tweak fully fixes this.


2.2 Sequential Bottleneck

RNNs must compute:

$$ \mathbf{h}_1 \rightarrow \mathbf{h}_2 \rightarrow \dots \rightarrow \mathbf{h}_T $$

This is inherently serial.

GPUs hate serial computation.


2.3 Memory is Compressed Too Early

RNNs force all past context into a fixed-size vector.

This causes:

  • information loss
  • interference
  • forgetting long-range dependencies

3️⃣ Why Attention Is a Structural Upgrade

Transformers remove recurrence entirely.

$$ \mathbf{H}^{(l+1)} = \text{Attention}(\mathbf{H}^{(l)}) $$

Key properties:

  • full context access
  • parallel computation
  • content-based memory

4️⃣ Attention vs Recurrence: A Direct Comparison

| Property              | RNN        | Transformer |
|-----------------------|------------|-------------|
| Memory access         | Compressed | Explicit    |
| Parallelism           | Serial     | Parallel    |
| Long-range dependency | Weak       | Strong      |
| Training stability    | Fragile    | Stable      |
| Scaling behavior      | Poor       | Excellent   |

5️⃣ Why Transformers Scale and RNNs Do Not

Scaling requires:

  • predictable gradients
  • efficient hardware use
  • stable optimization

Transformers satisfy all three.

RNNs satisfy none.


6️⃣ The Death of Inductive Bias

RNNs hard-code temporal order.

Transformers learn structure from data.

This flexibility allows:

  • language
  • code
  • math
  • vision
  • multimodal reasoning

One architecture. Many domains.


7️⃣ Final Verdict

Transformers did not replace RNNs because they are newer.

They replaced RNNs because they are:

  • structurally superior
  • computationally aligned with modern hardware
  • compatible with scale

This replacement is irreversible.


Transformers are not better RNNs.
They are a different species entirely.


⚠️ Failure Modes of Reasoning Models

LLMs can reason.

But they can also fail—quietly, confidently, and convincingly.

This chapter dissects why reasoning models fail, even at large scale.


1️⃣ Reasoning Is Approximate Inference

LLMs estimate:

$$ P(x_t \mid x_{<t}) $$

They do not verify truth. They maximize likelihood.

This creates systematic failure modes.


2️⃣ Hallucination as Probability Maximization

Hallucination occurs when:

$$ \arg\max_x P(x \mid \text{context}) \neq \text{truth} $$

If the model has seen similar patterns, it may confidently invent details.


3️⃣ Shortcut Reasoning

Models often learn:

  • surface heuristics
  • dataset biases
  • shallow correlations

Instead of reasoning:

“This looks like problem type X, answer is usually Y.”

This works—until it doesn’t.


4️⃣ Chain-of-Thought Collapse

Long reasoning chains can drift.

Each step compounds error:

$$ \epsilon_{\text{total}} \approx \sum_t \epsilon_t $$

This leads to:

  • incorrect conclusions
  • internally consistent nonsense

5️⃣ Symbolic Fragility

LLMs struggle with:

  • exact arithmetic
  • variable binding
  • stateful reasoning

Why?

Because symbols are distributed, not discrete.


6️⃣ Out-of-Distribution Reasoning

Reasoning degrades sharply when:

  • assumptions shift
  • constraints change
  • rules are inverted

LLMs interpolate well. They extrapolate poorly.


7️⃣ Alignment vs Reasoning Tension

Safety training can:

  • suppress exploration
  • bias outputs
  • reduce uncertainty expression

This can mask reasoning errors instead of fixing them.


8️⃣ Summary of Failure Modes

| Failure        | Root Cause              |
|----------------|-------------------------|
| Hallucination  | Likelihood ≠ Truth      |
| Logical error  | Approximate inference   |
| Overconfidence | Entropy minimization    |
| Math failure   | Symbolic mismatch       |
| OOD collapse   | Lack of world grounding |

🧠 Key Insight

LLMs reason statistically, not causally.

Understanding failure modes is not a weakness— it is a prerequisite for building better systems.


Reasoning models are powerful—but not infallible.

