Lecture 03 — Training Paradigms: Pretraining, Fine-tuning, and Training from Scratch
~3–4 hours (core learning lecture)
🌍 Why This Lecture Matters
Every modern AI system answers one critical question:
How was this intelligence created?
Understanding training paradigms means understanding:
- capability
- limitation
- cost
- risk
- ethics
This lecture teaches you how intelligence is shaped, not just deployed.
🧠 The Three Ways Machines Learn
All modern multimodal systems are trained using one (or more) of these paradigms:
- 🧱 Training from Scratch
- 🔧 Fine-tuning
- 🧠 Pretraining (Foundation Models)
🧩 Big Picture Comparison
| Paradigm | Data Size | Cost | Flexibility | Risk |
|---|---|---|---|---|
| Scratch | Massive | 💰💰💰💰 | High | Very High |
| Pretraining | Huge | 💰💰💰 | Medium | Medium |
| Fine-tuning | Small–Medium | 💰 | Low–Medium | Low |
Most real systems use pretrained + fine-tuned models.
🧱 Paradigm 1 — Training From Scratch
What It Means
Training all weights from random initialization.
No prior knowledge. No shortcuts.
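In code, "from scratch" means nothing is loaded: every weight starts random, and all knowledge must come from your own data. A minimal PyTorch sketch of the idea (the model and data here are hypothetical stand-ins):

```python
import torch
import torch.nn as nn

# Hypothetical tiny model: every weight starts from random initialization.
class TinyClassifier(nn.Module):
    def __init__(self, in_dim: int = 784, n_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier()          # no pretrained weights loaded anywhere
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative step on random data; real scratch training needs
# massive datasets and a very large number of such steps.
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```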
When It Makes Sense (Rare)
- New modality (e.g., brain signals)
- Fundamental research
- Extreme domain shift
- National-scale infrastructure
Why It’s Dangerous
- Requires massive data
- Requires massive compute
- High chance of bias
- Easy to fail silently
Training from scratch is not bravery — it’s responsibility.
🧪 Knowledge Check — Scratch Training
Q1 (True / False)
Training from scratch is the best choice for most applications.
Answer
False.
Q2 (Objective)
Name one valid reason to train from scratch.
Answer
When no pretrained model exists for the modality or domain.
🧠 Paradigm 2 — Pretraining (Foundation Models)
What Is Pretraining?
Learning general-purpose representations from massive unlabeled or weakly labeled data.
Examples:
- GPT (text)
- CLIP (image–text)
- Whisper (audio)
Why Pretraining Works
Because the world is:
- repetitive
- structured
- statistically learnable
Pretraining captures world regularities.
Multimodal Pretraining
Typical objectives:
- contrastive alignment of paired (image, text) data, as in CLIP
- predicting one modality from the other (e.g., a caption from its image)
Both create cross-modal alignment: matched inputs map to nearby representations.
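To make the first objective concrete, here is a minimal sketch of a CLIP-style contrastive alignment loss. The embeddings are random stand-ins for encoder outputs; this illustrates the idea, not any library's actual API:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style loss: the matching (image, text) pair should have the
    highest similarity in the batch, in both directions."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(img_emb))           # diagonal = true pairs
    loss_i2t = F.cross_entropy(logits, targets)    # image -> text matching
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image matching
    return (loss_i2t + loss_t2i) / 2

# Stand-in embeddings from hypothetical image and text encoders.
img_emb = torch.randn(8, 512)
txt_emb = torch.randn(8, 512)
print(contrastive_alignment_loss(img_emb, txt_emb))
```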
🧪 Knowledge Check — Pretraining
Q3 (MCQ)
What is the main goal of pretraining?
A) Task accuracy
B) Memorization
C) General representation learning
D) Deployment speed
Correct Answer
C) General representation learning
🔧 Paradigm 3 — Fine-tuning
What Is Fine-tuning?
Adapting a pretrained model to:
- a specific task
- a specific domain
- a specific behavior
Types of Fine-tuning
| Type | Description |
|---|---|
| Full fine-tuning | Update all weights |
| Partial | Update some layers |
| PEFT | Train small added modules (LoRA, adapters) |
| Instruction tuning | Align model behavior with instructions |
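What separates these types in practice is simply which parameters receive gradients. A minimal PyTorch sketch of partial fine-tuning, using a hypothetical backbone and head:

```python
import torch.nn as nn

# Hypothetical pretrained model: a backbone plus a task-specific head.
backbone = nn.Sequential(nn.Linear(512, 512), nn.ReLU())   # pretrained part
head = nn.Linear(512, 10)                                  # new task head
model = nn.Sequential(backbone, head)

# Full fine-tuning: leave everything trainable (the default).
# Partial fine-tuning: freeze the backbone, train only the head.
for param in backbone.parameters():
    param.requires_grad = False

n_train = sum(p.numel() for p in model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in model.parameters())
print(f"training {n_train} of {n_total} parameters")
```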
Why Fine-tuning Is Powerful
- Low data requirement
- Low compute
- Fast iteration
- Safer behavior
Fine-tuning is how most intelligence is specialized.
🧪 Knowledge Check — Fine-tuning
Q4 (True / False)
Fine-tuning always requires large datasets.
Answer
False.
Q5 (Objective)
Why is PEFT (e.g., LoRA) popular?
Answer
It trains only a small set of added parameters, cutting memory and compute while preserving most of the base model's performance.
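To make that concrete, here is a minimal sketch of the LoRA idea: freeze the large pretrained weight matrix and learn only a small low-rank update. Dimensions are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W plus a trainable low-rank update B @ A."""
    def __init__(self, in_dim, out_dim, rank=8):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.W.weight.requires_grad = False              # pretrained, frozen
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))  # zero init: no-op at start

    def forward(self, x):
        # W x + B A x: only A and B receive gradients.
        return self.W(x) + x @ self.A.t() @ self.B.t()

layer = LoRALinear(4096, 4096, rank=8)
frozen = layer.W.weight.numel()
trainable = layer.A.numel() + layer.B.numel()
print(f"trainable fraction: {trainable / frozen:.4%}")   # ~0.39%
```

With rank 8 on a 4096×4096 layer, the trainable update is well under 1% of the frozen weights, which is why PEFT fits on modest hardware.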
🧠 Pretraining vs Fine-tuning (Mental Model)
Think of a human:
- Pretraining = education
- Fine-tuning = job training
- Scratch = growing up alone on an island
🧠 Multimodal-Specific Considerations
What Can Be Fine-tuned?
- Encoders
- Projection layers
- LLM
- Output heads
Common Strategy (Best Practice)
| Component | Strategy |
|---|---|
| Encoder | Freeze |
| Projection | Train |
| LLM | PEFT |
| Head | Train |
Alignment layers are usually the sweet spot.
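A sketch of how that strategy table might look in code, assuming a hypothetical multimodal model whose submodules match the table's components; the PEFT step is indicated schematically:

```python
import torch.nn as nn

# Hypothetical multimodal model whose submodules match the table's rows.
class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(768, 768)   # stand-in image encoder
        self.projection = nn.Linear(768, 1024)      # vision -> LLM space
        self.llm = nn.Linear(1024, 1024)            # stand-in language model
        self.head = nn.Linear(1024, 10)             # task-specific output head

model = ToyVLM()

for p in model.vision_encoder.parameters():
    p.requires_grad = False   # Encoder: freeze
for p in model.llm.parameters():
    p.requires_grad = False   # LLM: freeze the base weights; in practice you
                              # would attach LoRA adapters here (sketch above)
# Projection and head stay trainable (PyTorch's default), matching the table.

print([name for name, p in model.named_parameters() if p.requires_grad])
# ['projection.weight', 'projection.bias', 'head.weight', 'head.bias']
```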
🧪 Knowledge Check — Multimodal Strategy
Q6 (MCQ)
Which component is most commonly fine-tuned first?
A) Tokenizer
B) Vision encoder
C) Projection layer
D) Dataset
Correct Answer
C) Projection layer
⚠️ Overfitting & Catastrophic Forgetting
Fine-tuning risks:
- forgetting general knowledge
- over-specialization
- bias amplification
Mitigations:
- low learning rate
- freezing layers
- mixed data
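Two of these mitigations are easy to show in code: a deliberately low learning rate, and mixing general data back into the fine-tuning set. Everything below (the model and both datasets) is a stand-in:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

model = torch.nn.Linear(16, 2)   # stand-in for a pretrained model

# Mitigation 1: a low learning rate (1e-5 rather than a typical 1e-3)
# keeps updates small, so pretrained knowledge is disturbed less.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Mitigation 3: mixed data. Blend the new task data with general data
# so the model keeps rehearsing what it already knows.
new_task_data = TensorDataset(torch.randn(100, 16), torch.randint(0, 2, (100,)))
general_data = TensorDataset(torch.randn(400, 16), torch.randint(0, 2, (400,)))
loader = DataLoader(ConcatDataset([new_task_data, general_data]),
                    batch_size=32, shuffle=True)
```

The 1:4 ratio of new to general data here is arbitrary; the right mix depends on how much specialization the task actually needs.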
🧠 Paradigm Comparison by Task
| Task | Best Paradigm |
|---|---|
| Image captioning | Pretrained model + light fine-tuning |
| Medical QA | Fine-tuning + human-in-the-loop (HITL) review |
| New sensor modality | Training from scratch |
| Internal company data | RAG (retrieval-augmented generation) or fine-tuning |
🧪 Knowledge Check — Decision Making
Q7 (Objective)
Why might RAG be preferred over fine-tuning?
Answer
When knowledge changes frequently, or when the data should stay outside the model's weights (e.g., private or frequently updated documents).
🧠 Training Is Not Just Optimization
Training choices encode:
- values
- assumptions
- power
Who chooses the data chooses the behavior.
🌱 Ethical Considerations
- Pretraining data bias
- Fine-tuning reinforcement of norms
- Scratch training without safeguards
Ethics begins before training starts.
🧪 Knowledge Check — Ethics
Q8 (True / False)
Bias can only be fixed during deployment.
Answer
False. Bias enters during data collection and training, long before deployment.
🧠 Architect’s Decision Tree (Practical)
Ask:
- Is there a pretrained model?
- Is the domain stable?
- Is data private?
- Is behavior critical?
Then choose:
- RAG
- Fine-tuning
- Scratch (rare)
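As a toy illustration, this decision tree can be encoded as a function. The questions and recommendations mirror the lists above and are simplified, not a complete policy:

```python
def choose_paradigm(pretrained_exists: bool,
                    domain_stable: bool,
                    knowledge_changes_often: bool) -> str:
    """Toy version of the architect's decision tree above."""
    if not pretrained_exists:
        return "train from scratch (rare; budget for data, compute, audits)"
    if knowledge_changes_often:
        return "RAG (keep the knowledge outside the weights)"
    if domain_stable:
        return "fine-tuning (PEFT first, full fine-tuning only if needed)"
    return "RAG + light fine-tuning"

print(choose_paradigm(pretrained_exists=True,
                      domain_stable=False,
                      knowledge_changes_often=True))
```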
✅ Final Takeaways
- Training defines intelligence
- Pretraining builds foundations
- Fine-tuning specializes behavior
- Scratch training is exceptional
- Ethics is embedded in data
🌍 Final Reflection
If a model learns from biased data, who is accountable?
The humans who selected, curated, and approved the data.