DK-009 — From Pretraining to MoE and Agentic Systems
🧩 MoE, Finetuning, RAG, and Training: How LLMs Are Actually Built
Modern LLM development is full of buzzwords.
Mixture-of-Experts (MoE).
Finetuning.
RAG.
Pretraining.
Most confusion comes from mixing what these techniques do with what people hope they do.
This chapter explains:
- what MoE really is
- what each LLM improvement method actually changes
- when each method works
- when it fundamentally cannot work
1️⃣ What Is Mixture-of-Experts (MoE)?
MoE is conditional computation.
Instead of activating the entire model for every token, MoE activates only a subset.
1.1 Dense Models (Baseline)
In a dense transformer:
$$ \mathbf{y} = f(\mathbf{x}; \theta) $$
All parameters are used every time.
Pros:
- simple
- stable
Cons:
- expensive
- hard to scale beyond hardware limits
1.2 MoE Core Idea
MoE decomposes the model:
- many experts: E_1, E_2, …, E_N
- a router (gating network)
For each token:
$$ \mathbf{y} = \sum_{i \in \mathcal{S}} g_i(\mathbf{x}) E_i(\mathbf{x}) $$
Where:
- S = selected experts (e.g. 8 out of 384)
- g_i = routing weights from the router
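To make the routing formula concrete, here is a minimal NumPy sketch. The sizes (16-dim tokens, 8 experts, top-2 routing) are toy values chosen for readability, not the "8 of 384" in the text above, and each "expert" is just a single matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 16, 8, 2           # toy sizes, not real-model values

# Each "expert" is a tiny linear layer; the router is another linear layer.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route one token vector x through its top-k experts."""
    logits = x @ router_w                       # router scores, shape (n_experts,)
    top = np.argsort(logits)[-top_k:]           # indices of the selected experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()   # softmax over selected
    # Weighted sum of the selected experts' outputs: y = Σ_{i ∈ S} g_i · E_i(x)
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)                 # (16,) — same shape, but only 2 of 8 experts ran
```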
1.3 Why MoE Exists
MoE lets us build:
- 1T parameter models
- while computing only ~30B parameters per token
This is how:
- GPT-style models scale
- training cost stays (barely) manageable
1.4 MoE Is Not Free
MoE introduces:
- routing instability
- load imbalance
- expert collapse
- communication overhead
MoE improves capacity, not intelligence.
2️⃣ Pretraining: The Foundation
Pretraining teaches a model language itself.
What it learns:
- grammar
- facts
- patterns
- implicit reasoning heuristics
What it does NOT learn:
- your private data
- your business rules
- real-time information
Pretraining is expensive and irreversible.
3️⃣ Finetuning: Changing Behavior, Not Knowledge
Finetuning continues training on new data:
$$ \theta' = \theta - \alpha \nabla_\theta \mathcal{L}_{\text{task}} $$
3.1 What Finetuning Is Good At
- style adaptation
- instruction following
- tone control
- domain biasing
Example:
- legal writing
- medical summarization
- customer support tone
3.2 What Finetuning Is Bad At
- memorizing large knowledge bases
- updating fast-changing facts
- precise retrieval
Finetuning compresses information into weights.
Compression causes forgetting.
3.3 LoRA: Efficient Finetuning
LoRA freezes base weights and adds low-rank adapters:
$$ W' = W + BA $$
Pros:
- cheap
- reversible
- modular
Cons:
- limited expressivity
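A minimal NumPy sketch of the idea. The shapes are toy values, and one factor is zero-initialized so the adapter starts as a no-op (this matches the usual LoRA initialization).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                                    # hidden size and LoRA rank (toy values)

W = rng.standard_normal((d, d)) * 0.02          # frozen base weight
A = rng.standard_normal((d, r)) * 0.02          # trainable low-rank factor
B = np.zeros((r, d))                            # zero-initialized: adapter starts as a no-op

def linear_with_lora(x):
    # Effectively W' = W + (low-rank update), computed without materializing it
    return x @ W + (x @ A) @ B

x = rng.standard_normal(d)
print(np.allclose(linear_with_lora(x), x @ W))  # True: training starts from base behavior
```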
4️⃣ Retrieval-Augmented Generation (RAG)
RAG separates:
- knowledge storage
- language generation
4.1 RAG Architecture
- Embed documents
- Store in vector database
- Retrieve relevant chunks
- Inject into prompt
- Generate answer
4.2 Embedding Step
Each chunk is mapped to an embedding vector.
Similarity:
$$ \text{sim}(\mathbf{q}, \mathbf{e}_i) = \cos(\mathbf{q}, \mathbf{e}_i) $$
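A minimal retrieval sketch. Random vectors stand in for a real embedding model here, so the similarity scores are only illustrative; in practice the query and the chunks would be embedded by the same model.

```python
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
chunks = ["Messi won the 2022 World Cup.",
          "Photosynthesis converts light into energy.",
          "Barcelona is a football club in Spain."]
chunk_vecs = rng.standard_normal((len(chunks), 8))           # stand-in for real embeddings
query_vec = chunk_vecs[0] + 0.1 * rng.standard_normal(8)     # a query "near" chunk 0

# Retrieve the top-k most similar chunks and inject them into the prompt.
scores = [cosine_sim(query_vec, v) for v in chunk_vecs]
top = np.argsort(scores)[::-1][:2]
context = "\n".join(chunks[i] for i in top)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: Who won the 2022 World Cup?"
print(prompt)
```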
4.3 Why RAG Works
- avoids catastrophic forgetting
- updates instantly
- preserves factual accuracy
- auditable sources
4.4 Why RAG Fails
RAG does NOT:
- reason across many documents
- plan multi-step solutions
- resolve contradictions
RAG retrieves text. The model still guesses.
5️⃣ Training vs Finetuning vs RAG (Truth Table)
| Method | Changes Weights | Adds Knowledge | Real-Time Update |
|---|---|---|---|
| Pretraining | ✅ | ✅ | ❌ |
| Finetuning | ✅ | ⚠️ | ❌ |
| LoRA | ⚠️ | ⚠️ | ❌ |
| RAG | ❌ | ✅ | ✅ |
No method replaces the others.
6️⃣ Which Method Is “Best”?
Wrong question.
The correct question:
What part of the system is broken?
6.1 Use Pretraining When
- building a foundation model
- massive compute available
- general capability needed
6.2 Use Finetuning When
- behavior is wrong
- format is wrong
- style is wrong
6.3 Use RAG When
- knowledge changes
- traceability matters
- data is large and sparse
6.4 Use Agents When
- tasks span time
- decisions affect future states
- planning is required
7️⃣ The Real LLM Stack (Systems View)
A real system looks like:
$$ \text{LLM} + \text{Memory} + \text{Retrieval} + \text{Tools} + \text{Control Loop} $$
No single technique is sufficient.
8️⃣ Final Reality Check
- MoE scales capacity, not understanding
- Finetuning shapes behavior, not truth
- RAG supplies facts, not reasoning
- Training teaches language, not agency
🧠 Closing Insight
Intelligence is not in the model.
Intelligence emerges from the system.
🧩 Vocabulary, Tokens, and Embeddings (The Real Basics)
Before talking about LLMs, agents, or reasoning,
we must understand how text becomes numbers.
Everything starts here.
1️⃣ Is One Vocabulary Entry One Word?
Not necessarily.
A vocabulary (vocab) is a list of tokens, not words.
A token can be:
- a full word
- part of a word
- punctuation
- a space
- a symbol
Example: “Lionel Messi”
Depending on the tokenizer, this can become:
"Lionel"+" Messi""Lion"+"el"+" Mess"+"i""Li"+"on"+"el"+" Mess"+"i"
So:
❌ 1 vocabulary entry ≠ 1 word
✅ 1 vocabulary entry = 1 token unit
2️⃣ What Is a Token?
A token is the smallest unit the model processes.
Formally:
$$ \text{text} \rightarrow \text{tokens} = \{t_1, t_2, \dots, t_n\} $$
Example sentence:
“Lionel Messi is the best footballer”
Might tokenize as:
["Lionel", " Messi", " is", " the", " best", " football", "er"]
The model never sees letters or words —
it sees token IDs.
Example:
"Lionel" → 18372
" Messi" → 91827
3️⃣ Why Tokens Instead of Words?
Because:
- languages differ
- words are ambiguous
- new words appear
Tokens allow:
- multilingual support
- efficient compression
- shared subwords across languages
Example:
- English: football
- Spanish: fútbol
- Name: Messi
The tokenizer learns reusable pieces.
4️⃣ From Token to Vector
Once we have token IDs, we convert them to vectors.
This is done using an embedding matrix.
Embedding Lookup
Each token ID maps to a vector:
$$ \text{Embedding}: \text{token ID} \rightarrow \mathbf{v} \in \mathbb{R}^d $$
Example (simplified):
"Lionel" → [0.12, -0.88, 0.34, ..., 0.05]
"Messi" → [0.91, 0.11, 0.77, ..., -0.42]
Typical dimensions:
- 768
- 1024
- 4096+
- even 8192+
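A minimal sketch of the lookup: the embedding matrix is just a table, and converting IDs to vectors is row indexing. Sizes and IDs below are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 8                  # toy sizes (real models: 100K+ vocab, 4096+ dims)

E = rng.standard_normal((vocab_size, d_model)) * 0.02   # embedding matrix, shape (|V|, d)

token_ids = [42, 7, 901]                       # output of a tokenizer (illustrative IDs)
vectors = E[token_ids]                         # embedding lookup is just row indexing

print(vectors.shape)                           # (3, 8): one d-dimensional vector per token
```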
5️⃣ What Is an Embedding?
An embedding is a vector that represents meaning.
It encodes:
- semantics
- relationships
- context potential
Important:
Embeddings do NOT store definitions
They store positions in meaning space
Lionel Messi in Embedding Space
Conceptually, “Messi” is close to:
- “football”
- “Argentina”
- “Barcelona”
- “GOAT”
and far from:
- “quantum mechanics”
- “cooking recipe”
- “neural network”
Distance matters.
6️⃣ Vector, Embedding, Latent Space (Same Family)
- Vector: a list of numbers
- Embedding: a vector with learned meaning
- Latent space: the space where embeddings live
Formally:
$$ \mathbf{v}_{\text{Messi}} \in \mathbb{R}^d $$
Where closeness implies semantic similarity.
7️⃣ What Does an Embedding Look Like?
It looks like nothing human-readable.
Example (tiny fake embedding):
Messi →
[ 0.83, -1.12, 0.44, 0.09, -0.31 ]
8️⃣ Key Takeaways (Do Not Skip)
- Vocabulary ≠ words
- Tokens are subword units
- Tokens become vectors
- Vectors live in embedding space
- Meaning = geometry, not text
🧠 Final Intuition
Humans read words.
Models navigate vector spaces.
🔍 Attention: How Vectors Talk to Each Other
After tokenization and embeddings,
we have vectors — but isolated vectors mean nothing.
Attention is the mechanism that lets vectors interact.
This chapter explains attention numerically and intuitively.
1️⃣ The Core Problem
We start with embeddings:
$$ \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n $$
Each vector represents a token.
Problem:
- each vector knows only itself
- language meaning depends on relationships
Example:
“Lionel Messi scored a goal”
The meaning of “scored” depends on “Messi”.
2️⃣ Attention Is Weighted Interaction
Attention answers one question:
Which other tokens matter for this token?
Mathematically:
$$ \text{Attention}(\mathbf{q}, \mathbf{k}, \mathbf{v}) $$
Where:
- Query (Q) = what I am looking for
- Key (K) = what I offer
- Value (V) = what I contribute
3️⃣ Creating Q, K, V
Each embedding is linearly projected:
$$ \mathbf{q}_i = W_Q \mathbf{x}_i $$
$$ \mathbf{k}_i = W_K \mathbf{x}_i $$
$$ \mathbf{v}_i = W_V \mathbf{x}_i $$
Same vector → three different roles.
4️⃣ Attention Scores (Who Listens to Whom)
For token ( i ), compute similarity to token ( j ):
$$ \alpha_{ij} = \frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_k}} $$
This measures relevance.
Example:
- “scored” attends strongly to “Messi”
- weakly to “a”
5️⃣ Softmax: Turning Scores into Weights
Normalize scores:
$$ w_{ij} = \frac{\exp(\alpha_{ij})}{\sum_{j'} \exp(\alpha_{ij'})} $$
Now:
- weights sum to 1
- attention becomes probabilistic focus
6️⃣ Mixing Values (Talking Happens Here)
Each token's new representation is a weighted sum of the value vectors:
$$ \mathbf{z}_i = \sum_j w_{ij} \mathbf{v}_j $$
Interpretation:
“Token i becomes a mixture of other tokens.”
This is contextualization.
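Putting sections 3–6 together, here is a minimal NumPy sketch of single-head self-attention with toy sizes. The weights are random, so the attention pattern is meaningless, but the shapes and steps match the formulas above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 8              # toy sizes

X = rng.standard_normal((n_tokens, d_model))   # token embeddings
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # project each token into three roles
scores = Q @ K.T / np.sqrt(d_k)                # α_ij: relevance of token j to token i
weights = softmax(scores, axis=-1)             # each row sums to 1
Z = weights @ V                                # z_i = Σ_j w_ij · v_j (contextualized tokens)

print(weights[1].round(2))                     # how token 1 distributes its attention
print(Z.shape)                                 # (5, 8)
```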
7️⃣ Lionel Messi Example
Sentence:
“Lionel Messi won the Ballon d’Or”
Token “won”:
- attends strongly to “Messi”
- moderately to “Ballon”
- weakly to “the”
Result:
- “won” now encodes who won what
8️⃣ Self-Attention vs Cross-Attention
- Self-attention: tokens attend to each other
- Cross-attention: tokens attend to external memory (e.g. RAG)
Same math. Different source.
9️⃣ Multi-Head Attention
One attention head is limited.
So we use many:
$$ \text{MultiHead} = \text{Concat}(\text{head}_1, \dots, \text{head}_h) $$
Each head learns:
- syntax
- coreference
- semantics
- long-range relations
🧠 Key Insight
Attention does not store meaning.
It routes meaning.
🧠 From Embeddings to Thought
Embeddings are static.
Thought is dynamic.
This chapter explains how repeated attention + transformation turns vectors into reasoning.
1️⃣ One Layer Is Not Thought
After one attention layer:
$$ \mathbf{H}^{(1)} = \text{Attention}(\mathbf{X}) $$
We get:
- contextualized vectors
- shallow understanding
But no reasoning yet.
2️⃣ Stacking Layers = Iterative Refinement
Transformers stack layers:
$$ \mathbf{H}^{(l+1)} = \text{FFN}(\text{Attention}(\mathbf{H}^{(l)})) $$
Each layer:
- refines representations
- integrates broader context
- abstracts meaning
3️⃣ Feedforward Networks (Nonlinearity)
FFN:
$$ \text{FFN}(\mathbf{x}) = W_2 \sigma(W_1 \mathbf{x}) $$
Purpose:
- mix features
- create nonlinear concepts
- enable abstraction
4️⃣ Thought as Trajectory in Latent Space
A “thought” is not a symbol.
It is a path:
$$ \mathbf{x}^{(0)} \rightarrow \mathbf{x}^{(1)} \rightarrow \dots \rightarrow \mathbf{x}^{(L)} $$
Each layer moves the vector through latent space.
5️⃣ Example: Simple Reasoning
Prompt:
“Messi is older than Neymar. Who is younger?”
Early layers:
- identify entities
Middle layers:
- encode comparison
Later layers:
- resolve answer
No rule engine.
Just geometry evolving.
6️⃣ Why This Feels Like Thinking
Because:
- representations become more abstract
- irrelevant details are suppressed
- relationships dominate
This mirrors human cognition functionally, not biologically.
7️⃣ Why It Sometimes Fails
Because:
- reasoning is approximate
- errors compound
- no explicit truth-checking
Thought is simulated, not guaranteed.
8️⃣ Chain-of-Thought Is Externalized Latent Process
CoT exposes:
$$ \text{hidden transformations} \rightarrow \text{text} $$
Open models allow this because:
- transparency
- debuggability
- research value
🧠 Final Insight
LLMs do not “think” symbolically.
They evolve representations.
🤖 Foundations of Agentic AI and Large Language Models
Artificial Intelligence today is no longer about single predictions.
Modern AI systems:
- reason
- plan
- act
- observe
- adapt
At the center of this shift is the combination of:
- Large Language Models (LLMs)
- Agentic workflows
- System-level design
This chapter is a true foundation, written for nerds who want to understand, not just use.
1️⃣ What Is an AI Agent?
An AI agent is not a model.
An agent is a system that uses a model to interact with an environment over time.
Formally, an agent:
- receives observations
- maintains internal state
- selects actions
- receives feedback
- updates itself
A minimal agent loop:
Observation → Reasoning → Action → Environment → Observation
Agent vs Model
| Model | Agent |
|---|---|
| Stateless | Stateful |
| Single output | Continuous loop |
| No tools | Tool-using |
| No memory | Memory-aware |
An LLM becomes agentic only when embedded in this loop.
2️⃣ What Does an Agent Actually Do?
An agent typically performs tasks like:
- searching information
- writing code
- running programs
- analyzing data
- making decisions
- coordinating tools
Example:
User: "Analyze my dataset and plot trends"
Agent:
1. Understand task
2. Plan steps
3. Load data
4. Run code
5. Inspect output
6. Adjust
7. Respond
This is goal-directed behavior.
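A minimal sketch of such a loop in Python. `call_llm` is a placeholder for whatever chat-completion API you use, the two tools are fake, and the plain-text tool-call format is an assumption made for brevity.

```python
def call_llm(messages):
    """Placeholder: in a real system this calls an LLM and returns its reply text."""
    raise NotImplementedError

TOOLS = {
    "load_data": lambda path: f"loaded {path}",      # fake tools for illustration
    "run_code":  lambda code: f"executed: {code}",
}

def run_agent(goal, max_steps=5):
    state = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = call_llm(state)                      # Reasoning
        state.append({"role": "assistant", "content": reply})
        if reply.startswith("FINAL:"):               # the agent decides it is done
            return reply
        tool, _, arg = reply.partition(" ")          # e.g. "run_code print(df.head())"
        observation = TOOLS.get(tool, lambda a: "unknown tool")(arg)           # Action
        state.append({"role": "user", "content": f"Observation: {observation}"})  # Observation
    return "stopped: step limit reached"
```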
3️⃣ What Is a Token?
LLMs do not see words.
They see tokens.
A token is a discrete unit produced by a tokenizer.
Example:
"Artificial intelligence is powerful"
→ ["Artificial", " intelligence", " is", " powerful"]
Sometimes:
- one word = one token
- one word = many tokens
- symbols, code, spaces are tokens
Are More Tokens Better?
No.
Token count affects:
- cost
- latency
- memory usage
What matters is information density, not raw token count.
4️⃣ Token → Vector → Embedding
Each token is mapped to a vector.
This mapping is called an embedding.
Formally:
$$ \text{Embedding}: \mathcal{V} \rightarrow \mathbb{R}^d $$
Where:
- V = vocabulary
- d = embedding dimension
What Is a Vector?
A vector is just a list of numbers:
$$ \mathbf{v} = [v_1, v_2, \dots, v_d] $$
Each dimension encodes latent features:
- syntax
- semantics
- style
- function
5️⃣ What Is Latent Space?
Latent space is the geometry of meaning.
In latent space:
- similar concepts are close
- different concepts are far apart
Distance is often measured by cosine similarity:
$$ \text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} $$
What Is “z”?
In ML papers, ( z ) usually denotes a latent variable.
In LLMs:
- embeddings
- hidden states
- attention outputs
are all forms of latent representations.
6️⃣ Probability in LLMs
LLMs model probability distributions over tokens.
At each step:
$$ P(x_t \mid x_1, x_2, \dots, x_{t-1}) $$
The model predicts:
“Given everything so far, what token is most likely next?”
Training minimizes cross-entropy loss:
$$ \mathcal{L} = -\sum_t \log P(x_t \mid x_{<t}) $$
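A minimal sketch of that loss for a single position, with a toy five-token vocabulary and made-up logits.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0, 0.1, 1.2])   # model's scores over a 5-token vocabulary
target_id = 0                                    # the token that actually came next

probs = softmax(logits)
loss = -np.log(probs[target_id])                 # cross-entropy for this position

print(probs.round(3))                            # predicted distribution over the vocab
print(round(loss, 3))                            # small when the true token gets high probability
```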
7️⃣ Why LLMs Feel “Smart”
Because reasoning emerges from:
- scale
- representation
- optimization
Not because the model “understands” like a human.
8️⃣ Architecture: Mixture-of-Experts (MoE)
MoE replaces one big network with many specialists.
Instead of using all parameters every time, the model selects a few experts per token.
Key Parameters Explained
Architecture: Mixture-of-Experts (MoE)
A sparse architecture with expert routing.
Total Parameters: 1T
$$ 1T = 1{,}000{,}000{,}000{,}000 $$
These include all experts combined.
Activated Parameters: 32B
Only a subset is used per token:
$$ \text{Compute} \propto 32B \ll 1T $$
This makes MoE scalable.
Number of Layers: 61
Total transformer layers.
Number of Dense Layers: 1
The first transformer layer uses a standard dense FFN instead of MoE routing.
Attention Hidden Dimension: 7168
Size of token representation inside attention.
MoE Hidden Dimension (per Expert): 2048
Each expert is smaller and specialized.
Number of Experts: 384
Total pool of experts.
Selected Experts per Token: 8
Router chooses 8 experts per token.
Number of Shared Experts: 1
Always-active expert for stability.
Vocabulary Size: 160K
Number of tokens the model understands.
Context Length: 256K
Maximum tokens in one forward pass.
Attention Mechanism: MLA
Multi-head Latent Attention, which compresses keys and values to keep long-context memory manageable.
Activation Function: SwiGLU
$$ \text{SwiGLU}(x) = (\text{Swish}(xW_1) \odot xW_2)W_3 $$
Smooth, stable, expressive.
Vision Encoder: MoonViT
Visual encoder for multimodal inputs.
Parameters of Vision Encoder: 400M
Separate vision model feeding into LLM.
9️⃣ What Tasks Are LLMs Trained For?
Primary objective:
- Next-token prediction
Emergent abilities:
- reasoning
- coding
- translation
- summarization
- planning
- tool use
These arise from generalization, not explicit programming.
🔟 How Are LLMs Evaluated?
Evaluation uses benchmarks:
- MMLU
- GSM8K (math)
- HumanEval (code)
- BIG-bench
- Agent task suites
Metrics include:
- accuracy
- pass@k
- reasoning depth
- tool success rate
1️⃣1️⃣ Hardware Required to Train Models
Training requires massive compute.
Example for 100B+ models:
- NVIDIA H100 GPUs
- Thousands of GPUs
- Millions of GPU-hours
Approximate relation:
$$ \text{Training Cost} \propto \text{Parameters} \times \text{Tokens} $$
1️⃣2️⃣ Why You Can Ask Anything and Get Answers
LLMs work because they learn:
- patterns
- abstractions
- relationships
They do not retrieve answers like a database.
They generate answers probabilistically.
1️⃣3️⃣ From LLMs to Agentic AI
An agent wraps the LLM with:
- memory
- tools
- control logic
- safety constraints
This transforms:
language modeling
into
decision-making
🧠 Final Mental Model
- Tokens are symbols
- Embeddings are meaning
- Latent space is geometry
- Probability drives generation
- MoE enables scale
- Agents enable action
🧭 Closing Thought
LLMs are not magic.
They are:
- mathematics
- optimization
- systems engineering
But when combined correctly, they form agentic AI systems.
And that is where modern AI truly begins.
🔤 Tokens, Embeddings, and How Language Becomes Numbers
Large Language Models do not understand words.
They do not understand sentences.
They do not understand languages.
They understand numbers.
This chapter explains—step by step—how:
- text becomes tokens
- tokens become vectors
- vectors become embeddings
- embeddings are trained
- multiple languages coexist in one model
This is the core mechanical foundation of LLMs.
1️⃣ What Is a Token?
A token is the smallest unit of text that a language model processes.
A token is not:
- necessarily a word
- necessarily a character
- necessarily a syllable
It is a unit defined by a tokenizer.
2️⃣ Examples: Words, Subwords, and Symbols
Consider the sentence:
"Language models are powerful."
A tokenizer might produce:
["Language", " models", " are", " powerful", "."]
Each item above is one token.
Subword Tokenization Example
Now consider:
"tokenization"
This may become:
["token", "ization"]
Or even:
["tok", "en", "ization"]
Why?
Because tokenizers optimize for frequency and efficiency, not linguistics.
3️⃣ Why Not Just Use Words?
Using full words causes problems:
- Vocabulary explodes
- Rare words are unseen
- New words cannot be handled
Instead, modern LLMs use subword tokenization.
4️⃣ Byte Pair Encoding (BPE)
Most LLMs use BPE or variants.
The idea is simple:
- Start with characters
- Merge frequent pairs
- Repeat until vocabulary size is reached
Informally, BPE greedily reduces the total number of tokens needed to encode the corpus:
$$ \text{Total Tokens} = \sum_{i} \text{Length}(x_i) $$
subject to a fixed vocabulary size.
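A minimal sketch of the greedy merge procedure on a classic toy word list; real BPE implementations work on word frequencies and byte-level symbols, which this omits.

```python
from collections import Counter

def merge_pair(word, a, b):
    """Replace every adjacent (a, b) pair in a symbol list with the merged symbol a+b."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(a + b); i += 2
        else:
            out.append(word[i]); i += 1
    return out

def bpe_merges(words, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = [list(w) for w in words]                 # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]           # most frequent adjacent pair
        merges.append(a + b)
        corpus = [merge_pair(w, a, b) for w in corpus]
    return merges, corpus

merges, corpus = bpe_merges(["lower", "lowest", "newer", "newest"], num_merges=6)
print(merges)    # learned subword units (order depends on pair frequencies)
print(corpus)    # each word re-expressed with the merged symbols
```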
5️⃣ Tokens Across Languages
LLMs do not have separate vocabularies per language.
They use one shared vocabulary.
Example:
| Language | Tokenization |
|---|---|
| English | Subwords |
| Thai | Character-like chunks |
| Chinese | Characters |
| Japanese | Mixed Kanji + Kana |
| Code | Keywords + symbols |
Example: Chinese
"人工智能"
→ ["人", "工", "智", "能"]
Each character is already meaningful, so tokenization is straightforward.
Example: Thai
Thai has no spaces:
"ภาษาไทยยากไหม"
Tokenizer output may look like:
["ภาษา", "ไทย", "ยาก", "ไหม"]
Learned statistically, not linguistically.
6️⃣ Vocabulary Size
Vocabulary size determines how many unique tokens exist.
Typical values:
- 32K
- 50K
- 100K
- 160K
- 200K+
A larger vocabulary means:
- fewer tokens per sentence
- larger embedding tables
- higher memory cost
7️⃣ From Token to ID
Each token is mapped to an integer ID.
Example:
"language" → 48321
This is just a lookup.
8️⃣ Token → Vector (Embedding)
Token IDs are mapped to vectors via an embedding matrix.
Formally:
$$ E \in \mathbb{R}^{|\mathcal{V}| \times d} $$
Where:
- |V| = vocabulary size
- d = embedding dimension
Embedding Lookup
Given token ID ( i ):
$$ \mathbf{e}_i = E[i] $$
This vector represents the token in continuous space.
9️⃣ What Is an Embedding?
An embedding is:
- a dense vector
- learned during training
- representing semantic and syntactic properties
🔟 What Is a Vector?
A vector is a list of real numbers:
$$ \mathbf{v} = [v_1, v_2, \dots, v_d] $$
Each dimension has no human-interpretable meaning alone.
Meaning emerges from relative geometry.
1️⃣1️⃣ What Is Latent Space?
Latent space is the space formed by embeddings.
In latent space:
- distance encodes similarity
- directions encode relationships
Distance is often measured by cosine similarity:
$$ \text{cosine}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} $$
1️⃣2️⃣ Tokens in Context: Position Matters
Embeddings alone ignore order.
Transformers add positional information:
$$ \mathbf{h}_t = \mathbf{e}_t + \mathbf{p}_t $$
Where:
- p_t = positional encoding for position t
This allows the model to distinguish:
"dog bites man"
vs
"man bites dog"
1️⃣3️⃣ How Tokens Are Used During Training
Training objective:
$$ P(x_t \mid x_1, \dots, x_{t-1}) $$
For a sequence:
"The cat sat"
Training pairs are:
"The" → "cat"
"The cat" → "sat"
Loss Function
Cross-entropy loss:
$$ \mathcal{L} = -\log P(x_t \mid x_{<t}) $$
The model is penalized if the correct next token has low probability.
1️⃣4️⃣ How Embeddings Are Learned
Embeddings are not pretrained separately.
They are learned end-to-end.
During backpropagation:
- gradients flow into embedding vectors
- frequent tokens update often
- rare tokens update less
This is why:
- common words are well-shaped
- rare words are noisier
1️⃣5️⃣ Multilingual Training
Training data mixes languages.
The model learns:
- shared structure (logic, syntax)
- language-specific patterns
This creates cross-lingual embeddings.
1️⃣6️⃣ Are More Tokens Better?
No.
More tokens means:
- more compute
- more memory
- slower inference
Better tokenization means:
- fewer tokens
- richer embeddings
Quality beats quantity.
1️⃣7️⃣ Summary Mental Model
- Text → tokens
- Tokens → IDs
- IDs → vectors
- Vectors → latent space
- Latent space → probability
- Probability → language generation
🧠 Final Intuition
Language models do not store sentences.
They store geometric relationships between tokens.
Meaning is not memorized.
It is emergent.
🧠 Modern Large Language Models (LLMs)
Large Language Models are no longer just “big neural networks that predict the next word”.
They are:
- Reasoning engines
- Tool-using agents
- Modular systems
- Open-weight infrastructures
This article is a deep but foundational recap of modern LLMs—written for people who already speak ML, but feel the ecosystem is moving too fast to track.
If you’ve ever asked:
“Wait… what exactly is MoE, LoRA, RAG, quantization, or agentic LLMs?”
This is for you.
1️⃣ What Does “Model Tree” Mean?
When browsing open models (e.g. on HuggingFace), you often see a structure like:
openai/gpt-oss-120b
├── Adapters
├── Finetunes
├── Merges
├── Quantizations
This is not chaos.
It is evolution.
Think of a model tree as a genetic family:
- Base model → the pretrained brain
- Adapters → plug-in skills
- Finetunes → retrained personalities
- Merges → hybrid offspring
- Quantizations → compressed forms
2️⃣ Base Model: gpt-oss-120b
Model size: 120B parameters
Tensor type: BF16 / U8
License: Apache 2.0
What is “120B parameters”?
A parameter is a learned scalar value inside the neural network.
120B = 120,000,000,000 parameters
Memory footprint (roughly):
$$ \text{Memory} \approx \text{Parameters} \times \text{Bytes per parameter} $$
- BF16 → ~2 bytes → ~240 GB
- FP32 → ~4 bytes → ~480 GB
This is why compression, sharding, and MoE exist.
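A quick back-of-envelope check of those numbers (weights only; activations, KV cache, and optimizer state are ignored):

```python
params = 120e9                      # 120B parameters

for name, bytes_per_param in [("FP32", 4), ("BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:,.0f} GB")  # FP32 ≈ 480 GB, BF16 ≈ 240 GB, INT8 ≈ 120 GB, INT4 ≈ 60 GB
```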
3️⃣ Tensor Types: BF16 vs U8
| Type | Meaning | Usage |
|---|---|---|
| BF16 | Brain Floating Point 16 | Training / high-quality inference |
| FP16 | IEEE Half Precision | Legacy GPU inference |
| U8 / INT8 | 8-bit Integer | Fast & cheap inference |
| INT4 | 4-bit Integer | Extreme compression |
Quantization trades precision for efficiency.
$$ \text{Smaller precision} \Rightarrow \text{Less memory} \Rightarrow \text{Faster inference} $$
4️⃣ Adapters and LoRA (Low-Rank Adaptation)
Adapters are modular fine-tuning layers.
LoRA works by injecting a low-rank update:
$$ W' = W + \Delta W $$
Where:
$$ \Delta W = A B \quad \text{with} \quad \operatorname{rank}(\Delta W) \le r \ll \operatorname{rank}(W) $$
Why LoRA matters
- Base model is frozen
- Only a few million parameters trained
- Easy to distribute
- Easy to swap
This is why open-source LLMs scale socially.
5️⃣ Finetuning vs Pretraining
Pretraining
Pretraining teaches the model language itself:
$$ \mathcal{L} = -\sum_{t} \log P(x_t \mid x_{<t}) $$
- Trillions of tokens
- Next-token prediction
- Costs millions of GPU-hours
- One-time process
Finetuning
Finetuning teaches the model how to behave.
- Instruction following
- Reasoning style
- Domain specialization
- Safety alignment
This is where “chat”, “assistant”, and “expert” personalities come from.
6️⃣ Quantized Models
Quantized models are the same brain, cheaper hardware.
Examples:
- 120B → INT8 → fits on multi-GPU servers
- 7B → INT4 → fits on a laptop GPU
Trade-off:
$$ \text{Compression} \uparrow \Rightarrow \text{Accuracy} \downarrow \ (\text{slightly}) $$
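A minimal sketch of symmetric per-tensor INT8 quantization; production quantizers use per-channel or per-group scales, calibration data, and lower-bit formats, none of which appear here.

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor INT8 quantization: store int8 weights plus one float scale."""
    scale = np.abs(W).max() / 127.0
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_q, scale

def dequantize(W_q, scale):
    return W_q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)

W_q, scale = quantize_int8(W)
error = np.abs(W - dequantize(W_q, scale)).mean()

print(W.nbytes, "->", W_q.nbytes, "bytes")          # 4x smaller storage
print("mean abs error:", round(float(error), 4))    # small, but not zero: precision is traded away
```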
7️⃣ What is gpt-oss-safeguard-120b?
This is not the main language model.
It is a safety enforcement model.
Responsibilities:
- Input filtering
- Output moderation
- Policy enforcement
- Risk classification
In proprietary APIs, this layer is invisible.
In open-weight systems, you must build it yourself.
8️⃣ Mixture-of-Experts (MoE) Architecture
MoE is the defining architecture of modern large-scale LLMs.
Dense model:
Every parameter is used every time.
MoE model:
Only some experts activate per token.
MoE at scale
| Property | Value |
|---|---|
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Experts | 384 |
| Experts per Token | 8 |
This means:
$$ \text{Compute cost} \propto 32B \quad \text{not} \quad 1T $$
Intuition
Instead of:
“Use the whole brain for everything”
We get:
“Call 8 specialists per word”
9️⃣ Architecture Details Explained
Attention Heads
64 heads
Each head attends to different relationships:
- Syntax
- Semantics
- Long-range dependencies
SwiGLU Activation
$$ \text{SwiGLU}(x) = (\text{Swish}(xW_1) \odot xW_2)W_3 $$
Why it’s used:
- Smooth gradients
- Better expressivity
- Stable training
Context Length: 256K
This enables:
- Whole-codebase reasoning
- Long legal / medical documents
- Multi-step agent planning
🔟 Tokenization (o200k_harmony)
Vocabulary size:
$$ |\mathcal{V}| \approx 200{,}000 $$
This includes:
- Natural language
- Code tokens
- Math symbols
- Tool-call syntax
- Agent control tokens
Tokenizer design is not trivial—it directly affects reasoning quality.
1️⃣1️⃣ RAG (Retrieval-Augmented Generation)
RAG adds external memory:
User → Retriever → Documents → LLM → Answer
Strengths:
- Fresh knowledge
- Enterprise data
- Auditable sources
Limitations:
- Weak for deep reasoning
- Brittle retrieval
- Latency overhead
RAG is evolving—not obsolete.
1️⃣2️⃣ Agentic LLMs
Modern LLMs are not single-shot generators.
They operate in loops:
Plan → Act → Observe → Reflect → Repeat
Tools include:
- Web search
- Python execution
- Databases
- APIs
This is where LLMs become systems, not models.
1️⃣3️⃣ Putting It All Together
A modern LLM stack looks like:
Pretrained MoE LLM
↓
Finetune / LoRA
↓
Quantization
↓
RAG + Tools
↓
Agent Loop
↓
Safety Guards
🧠 Final Thoughts
If LLMs feel overwhelming, that’s because:
They are no longer just models.
They are:
- Architectures
- Systems
- Ecosystems
- Infrastructures
Understanding them today means thinking like:
- A machine learning researcher
- A systems engineer
- A product architect
And yes — the pace is brutal.
But now, you’re back in control.
🤖 Agentic Large Language Models
Modern LLMs are no longer passive text generators.
They are agents.
They:
- Plan actions
- Call tools
- Observe results
- Reflect and revise
- Loop until completion
Understanding agentic workflows is now a core literacy for anyone working with LLMs.
1️⃣ What Does “Agentic” Actually Mean?
An agentic LLM is a model embedded inside a control loop.
At minimum, the loop looks like:
Thought → Action → Observation → Thought → ...
This is not metaphorical.
It is a programmatic execution cycle.
The canonical agent loop
1. Parse user intent
2. Plan intermediate steps
3. Decide which tool to call
4. Execute tool
5. Observe results
6. Update internal state
7. Continue or stop
This loop transforms LLMs from:
“Answer machines”
into:
“Task-completing systems”
2️⃣ ReAct, Plan-Act-Reflect, and Toolformers
Most modern agents descend from three ideas:
🔹 ReAct (Reason + Act)
The model alternates between reasoning and acting.
Thought: I need recent data.
Action: WebSearch(query="...")
Observation: ...
Reasoning grounds tool usage.
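A minimal sketch of a ReAct-style loop. `call_llm`, the `WebSearch` tool, and the plain-text `Action:` format are placeholders; real agents usually exchange structured tool-call messages rather than regex-parsed text.

```python
import re

def call_llm(transcript):
    """Placeholder for an LLM call that continues the ReAct transcript."""
    raise NotImplementedError

TOOLS = {"WebSearch": lambda q: f"(search results for {q!r})"}   # fake tool for illustration

def react(question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)                      # model emits Thought / Action / Answer
        transcript += step + "\n"
        if "Answer:" in step:
            return step.split("Answer:", 1)[1].strip()
        match = re.search(r'Action:\s*(\w+)\((.*)\)', step)   # e.g. Action: WebSearch("...")
        if match:
            tool, arg = match.group(1), match.group(2).strip('"')
            obs = TOOLS.get(tool, lambda a: "unknown tool")(arg)
            transcript += f"Observation: {obs}\n"        # fed back for the next Thought
    return None
```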
🔹 Plan–Act–Reflect
More structured agent loop:
Plan → Act → Observe → Reflect → Re-plan
Reflection is critical:
- Error correction
- Long-horizon reasoning
- Self-debugging
🔹 Toolformers
LLMs trained to decide when tools are useful, not just how to use them.
This is why modern models expose:
- Web search
- Python
- File systems
- APIs
3️⃣ Why Chain-of-Thought (CoT) Matters
Chain-of-Thought is not verbosity.
It is externalized intermediate computation.
Formally, CoT approximates:
$$ P(y \mid x) = \sum_{z} P(y \mid z, x) P(z \mid x) $$
Where:
- ( x ) = input
- ( z ) = latent reasoning steps
- ( y ) = output
4️⃣ Why Open Models Expose CoT
This is a philosophical and architectural difference.
Proprietary models
- CoT is hidden or summarized
- Exposed reasoning is filtered
- System-level alignment is enforced upstream
Open-weight models
- CoT is part of the artifact
- Debuggable
- Inspectable
- Modifiable
This is intentional.
Why the open community wants CoT
- Debugging
- Inspect failure modes
- Research
- Analyze reasoning depth
- Alignment
- Study safety trade-offs
- Education
- Teach reasoning, not just answers
Open models treat CoT as:
a feature, not a liability
5️⃣ Adjustable Reasoning Effort
Modern reasoning models expose a control variable:
- Think fast (cheap)
- Think slow (deep)
Conceptually:
$$ \text{Compute} \propto \text{Reasoning Depth} $$
This enables:
- Cost-aware deployment
- Adaptive intelligence
- Agent-level optimization
6️⃣ Why CoT Is Risky — and Still Open
CoT can leak:
- Sensitive heuristics
- Attack strategies
- Unsafe reasoning paths
Open models accept this risk because:
- They prioritize transparency
- Safety is enforced at the system level, not hidden logic
This shifts responsibility to the system designer.
7️⃣ Mapping the Modern LLM Ecosystem
Let’s zoom out.
🔵 OpenAI
Philosophy: System-level safety, agent-first APIs
- Strong reasoning models
- Deep tool integration
- Hidden CoT by default
- Heavy alignment layers
Strengths:
- Production-grade agents
- Robust safety
- Best-in-class reasoning
Trade-off:
- Limited transparency
- No weight access
🟣 Meta (LLaMA family)
Philosophy: Open weights, scalable infrastructure
- Dense + MoE research
- Strong multilingual support
- Community-driven fine-tuning
Strengths:
- Foundation for OSS ecosystem
- Research-friendly
- Broad adoption
Trade-off:
- Safety is DIY
- Tooling varies by implementation
🟢 Mistral
Philosophy: Efficiency + elegance
- MoE-first designs
- Strong small/medium models
- European regulatory awareness
Strengths:
- High performance per parameter
- Clean architecture
- Excellent for on-prem
Trade-off:
- Smaller ecosystem (for now)
⚫ Open-Source Community (OSS)
This is not one actor — it is an ecosystem.
Includes:
- Weight merges
- Custom LoRA adapters
- Experimental architectures
- Specialized agents
OSS prioritizes:
- Transparency
- Modularity
- Hackability
Risk:
- Inconsistent safety
- Fragmentation
8️⃣ Comparative Mental Model
| Axis | OpenAI | Meta | Mistral | OSS |
|---|---|---|---|---|
| Weights | Closed | Open | Open | Open |
| CoT | Hidden | Exposed | Exposed | Exposed |
| Safety | Centralized | Optional | Optional | DIY |
| Agents | Native | External | External | Experimental |
| Research | Controlled | Open | Focused | Chaotic |
9️⃣ Why This Ecosystem Exists
No single model can optimize for:
- Safety
- Transparency
- Performance
- Cost
- Control
Different actors choose different trade-offs.
This diversity is healthy.
🔟 The New Role of the Practitioner
Working with LLMs today means you are no longer just a user.
You are:
- A system designer
- A safety engineer
- A reasoning architect
Understanding agentic workflows and CoT exposure is mandatory.
🧠 Final Synthesis
LLMs are no longer:
“Models you query”
They are:
“Systems you design”
Agentic workflows provide agency.
Chain-of-Thought provides cognition.
Open ecosystems provide freedom.
And freedom always comes with responsibility.
🧠 Attention, Scaling Laws, and the Emergence of Reasoning
If embeddings explain what language means,
attention explains how meaning is composed.
This chapter answers five fundamental questions:
- What is attention—numerically?
- Why self-attention works so well
- How training dynamics and scaling laws shape intelligence
- When tokenization breaks models
- How embeddings turn into reasoning
This is where LLMs stop being “vector machines”
and start behaving like reasoning systems.
1️⃣ Attention Explained with Numbers
Attention is a weighted averaging mechanism.
Each token decides:
“Which other tokens matter to me right now?”
Step 1: From Embeddings to Q, K, V
For each token embedding:
$$ \mathbf{q} = \mathbf{x}W_Q,\quad \mathbf{k} = \mathbf{x}W_K,\quad \mathbf{v} = \mathbf{x}W_V $$
Where:
- q = query
- k = key
- v = value
All are vectors.
Step 2: Similarity Scores
For token ( i ) attending to token ( j ):
$$ s_{ij} = \frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_k}} $$
This measures relevance.
Step 3: Softmax = Probability Distribution
$$ \alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{j'} \exp(s_{ij'})} $$
Attention is probabilistic focus.
Step 4: Weighted Sum
$$ \mathbf{z}_i = \sum_j \alpha_{ij} \mathbf{v}_j $$
The output representation of token ( i ) is a mixture of the value vectors of the tokens it attends to.
2️⃣ Why Self-Attention Works
Self-attention allows every token to:
- access global context
- dynamically reweight importance
- adapt per task and per position
This solves three core problems at once:
- long-range dependency
- variable structure
- parallel computation
Key Insight
Self-attention is content-addressable memory.
Instead of indexing by position, tokens index by meaning.
3️⃣ Multi-Head Attention: Many Views of Meaning
In practice, attention is multi-headed.
$$ \text{Attention} = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W_O $$
Each head learns different relationships:
- syntax
- semantics
- coreference
- arithmetic
- code structure
This is distributed reasoning.
4️⃣ Why Attention Enables Reasoning
Reasoning requires:
- variable binding
- relational comparison
- composition
Attention enables:
$$ \text{Reasoning} \approx \text{Iterative Context Mixing} $$
Each layer refines representations by recontextualizing tokens.
5️⃣ Training Dynamics: How Models Actually Learn
LLMs are trained with gradient descent.
Each update minimizes:
$$ \mathcal{L} = -\sum_t \log P(x_t \mid x_{<t}) $$
Learning emerges from:
- many small updates
- massive data
- overparameterization
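A minimal sketch of one such update in PyTorch. The "model" here is just an embedding plus a linear head, not a transformer; only the training step itself (loss, backward, optimizer step) is the point.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model = 100, 32
model = nn.Sequential(nn.Embedding(vocab, d_model), nn.Linear(d_model, vocab))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

tokens = torch.randint(0, vocab, (1, 16))          # one toy sequence of 16 token IDs
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # predict each next token from its prefix

logits = model(inputs)                             # (1, 15, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))

loss.backward()                                    # gradients of -Σ log P(x_t | x_<t)
optimizer.step()                                   # one small update; repeat billions of times
optimizer.zero_grad()
```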
Optimization Intuition
Early training:
- learns token statistics
Mid training:
- learns syntax and patterns
Late training:
- learns abstractions and reasoning heuristics
6️⃣ Scaling Laws
Empirically, loss falls as a power law in each resource (when the others are not the bottleneck).
$$ \mathcal{L}(N) \propto N^{-\alpha_N}, \quad \mathcal{L}(D) \propto D^{-\alpha_D}, \quad \mathcal{L}(C) \propto C^{-\alpha_C} $$
Where:
- ( N ) = parameters
- ( D ) = training tokens
- ( C ) = compute
Consequence
- Bigger models → better reasoning
- More data → better generalization
- More compute → smoother optimization
There is no sharp intelligence threshold—only scale.
7️⃣ Why Bigger Models Reason Better
Large models:
- store more abstractions
- represent deeper hierarchies
- maintain longer dependencies
Reasoning is not programmed. It emerges when capacity is sufficient.
8️⃣ When Tokenization Goes Wrong
Tokenization is a silent failure mode.
Bad tokenization causes:
- excessive token counts
- broken morphemes
- semantic fragmentation
Example: Over-Fragmentation
"electromagnetism"
→ ["elec", "tro", "mag", "net", "ism"]
Meaning is diluted across tokens.
Multilingual Failure
Low-resource languages may:
- use many tokens per word
- receive fewer gradient updates
- have poorer embeddings
This directly harms performance.
9️⃣ Tokenization and Reasoning Errors
Reasoning depends on stable symbols.
If numbers, variables, or operators are split poorly:
- math fails
- code fails
- logic fails
This is why modern tokenizers:
- include digits
- include operators
- include code tokens
🔟 From Embeddings to Reasoning
Embeddings alone do not reason.
Reasoning emerges from:
- attention
- depth
- recurrence across layers
Each layer computes:
$$ \mathbf{H}^{(l+1)} = \text{TransformerBlock}(\mathbf{H}^{(l)}) $$
This is iterative refinement.
Reasoning as Trajectory in Latent Space
A reasoning chain is a path:
$$ \mathbf{z}_0 \rightarrow \mathbf{z}_1 \rightarrow \dots \rightarrow \mathbf{z}_T $$
Each step refines belief.
1️⃣1️⃣ Why Chain-of-Thought Helps
Explicit reasoning externalizes latent steps.
It:
- stabilizes trajectories
- reduces entropy
- improves correctness
But the real reasoning happens inside the vectors.
1️⃣2️⃣ Summary Mental Model
- Tokens are symbols
- Embeddings are points
- Attention is interaction
- Layers are refinement
- Scale is capacity
- Reasoning is emergence
🧠 Final Intuition
LLMs do not “think” like humans.
They:
- transform vectors
- mix context
- optimize probabilities
Yet from this process, reasoning emerges.
That is the core miracle of modern AI.
🔄 Why Transformers Replace RNNs Forever
Recurrent Neural Networks (RNNs) were once the backbone of sequence modeling.
Transformers ended that era.
This chapter explains why this replacement is permanent, not a trend.
1️⃣ What RNNs Were Trying to Solve
Language is sequential.
RNNs model sequences by recurrence:
$$ \mathbf{h}_t = f(\mathbf{h}_{t-1}, \mathbf{x}_t) $$
This looks elegant:
- memory through hidden state
- time-aware processing
But elegance does not scale.
2️⃣ The Fundamental Limits of RNNs
2.1 Vanishing and Exploding Gradients
Backpropagation through time multiplies Jacobians.
This product either:
- shrinks to zero
- explodes to infinity
No architecture tweak fully fixes this.
2.2 Sequential Bottleneck
RNNs must compute:
$$ \mathbf{h}_1 \rightarrow \mathbf{h}_2 \rightarrow \dots \rightarrow \mathbf{h}_T $$
This is inherently serial.
GPUs hate serial computation.
2.3 Memory is Compressed Too Early
RNNs force all past context into a fixed-size vector.
This causes:
- information loss
- interference
- forgetting long-range dependencies
3️⃣ Why Attention Is a Structural Upgrade
Transformers remove recurrence entirely.
$$ \mathbf{H}^{(l+1)} = \text{Attention}(\mathbf{H}^{(l)}) $$
Key properties:
- full context access
- parallel computation
- content-based memory
4️⃣ Attention vs Recurrence: A Direct Comparison
| Property | RNN | Transformer |
|---|---|---|
| Memory access | Compressed | Explicit |
| Parallelism | ❌ | ✅ |
| Long-range dependency | Weak | Strong |
| Training stability | Fragile | Stable |
| Scaling behavior | Poor | Excellent |
5️⃣ Why Transformers Scale and RNNs Do Not
Scaling requires:
- predictable gradients
- efficient hardware use
- stable optimization
Transformers satisfy all three.
RNNs satisfy none.
6️⃣ The Death of Inductive Bias
RNNs hard-code temporal order.
Transformers learn structure from data.
This flexibility allows:
- language
- code
- math
- vision
- multimodal reasoning
One architecture. Many domains.
7️⃣ Final Verdict
Transformers did not replace RNNs because they are newer.
They replaced RNNs because they are:
- structurally superior
- computationally aligned with modern hardware
- compatible with scale
This replacement is irreversible.
Transformers are not better RNNs.
They are a different species entirely.
⚠️ Failure Modes of Reasoning Models
LLMs can reason.
But they can also fail—quietly, confidently, and convincingly.
This chapter dissects why reasoning models fail, even at large scale.
1️⃣ Reasoning Is Approximate Inference
LLMs estimate:
$$ P(x_t \mid x_{<t}) $$
They do not verify truth. They maximize likelihood.
This creates systematic failure modes.
2️⃣ Hallucination as Probability Maximization
Hallucination occurs when:
$$ \arg\max_x P(x \mid \text{context}) \neq \text{truth} $$
If the model has seen similar patterns, it may confidently invent details.
3️⃣ Shortcut Reasoning
Models often learn:
- surface heuristics
- dataset biases
- shallow correlations
Instead of reasoning:
“This looks like problem type X, answer is usually Y.”
This works—until it doesn’t.
4️⃣ Chain-of-Thought Collapse
Long reasoning chains can drift.
Each step compounds error:
$$ \epsilon_{\text{total}} \approx \sum_t \epsilon_t $$
This leads to:
- incorrect conclusions
- internally consistent nonsense
5️⃣ Symbolic Fragility
LLMs struggle with:
- exact arithmetic
- variable binding
- stateful reasoning
Why?
Because symbols are distributed, not discrete.
6️⃣ Out-of-Distribution Reasoning
Reasoning degrades sharply when:
- assumptions shift
- constraints change
- rules are inverted
LLMs interpolate well. They extrapolate poorly.
7️⃣ Alignment vs Reasoning Tension
Safety training can:
- suppress exploration
- bias outputs
- reduce uncertainty expression
This can mask reasoning errors instead of fixing them.
8️⃣ Summary of Failure Modes
| Failure | Root Cause |
|---|---|
| Hallucination | Likelihood ≠ Truth |
| Logical error | Approximate inference |
| Overconfidence | Entropy minimization |
| Math failure | Symbolic mismatch |
| OOD collapse | Lack of world grounding |
🧠 Key Insight
LLMs reason statistically, not causally.
Understanding failure modes is not a weakness— it is a prerequisite for building better systems.
Reasoning models are powerful—but not infallible.