Lecture 03 — Training Paradigms: Pretraining, Fine-tuning, and Training from Scratch
~3–4 hours (core learning lecture)
🌍 Why This Lecture Matters
Every modern AI system answers one critical question:
How was this intelligence created?
Understanding training paradigms means understanding:
- capability
- limitation
- cost
- risk
- ethics
This lecture teaches you how intelligence is shaped, not just deployed.
🧠 The Three Ways Machines Learn
All modern multimodal systems are trained using one (or more) of these paradigms:
- 🧱 Training from Scratch
- 🔧 Fine-tuning
- 🧠 Pretraining (Foundation Models)
🧩 Big Picture Comparison
| Paradigm | Data Size | Cost | Flexibility | Risk |
|---|---|---|---|---|
| Scratch | Massive | 💰💰💰💰 | High | Very High |
| Pretraining | Huge | 💰💰💰 | Medium | Medium |
| Fine-tuning | Small–Medium | 💰 | Low–Medium | Low |
Most real systems use pretrained + fine-tuned models.
🧱 Paradigm 1 — Training From Scratch
What It Means
Training all weights from random initialization.
No prior knowledge. No shortcuts.
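In code, "from scratch" means nothing is loaded: every weight starts random, and all knowledge must come from your own data. A minimal PyTorch sketch of the idea (the model and data here are hypothetical stand-ins):

```python
import torch
import torch.nn as nn

# Hypothetical tiny model: every weight starts from random initialization.
class TinyClassifier(nn.Module):
    def __init__(self, in_dim: int = 784, n_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier()          # no pretrained weights loaded anywhere
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative step on random data; real scratch training needs
# massive datasets and a very large number of such steps.
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```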
When It Makes Sense (Rare)
- New modality (e.g., brain signals)
- Fundamental research
- Extreme domain shift
- National-scale infrastructure
Why It’s Dangerous
- Requires massive data
- Requires massive compute
- High chance of bias
- Easy to fail silently
Training from scratch is not bravery — it’s responsibility.
🧪 Knowledge Check — Scratch Training
Q1 (True / False)
Training from scratch is the best choice for most applications.
Answer
False.
Q2 (Objective)
Name one valid reason to train from scratch.
Answer
When no pretrained model exists for the modality or domain.
🧠 Paradigm 2 — Pretraining (Foundation Models)
What Is Pretraining?
Learning general-purpose representations from massive unlabeled or weakly labeled data.
Examples:
- GPT (text)
- CLIP (image–text)
- Whisper (audio)
Why Pretraining Works
Because the world is:
- repetitive
- structured
- statistically learnable
Pretraining captures world regularities.
Multimodal Pretraining
Typical objectives:
- contrastive alignment of paired (image, text) data, as in CLIP
- predicting one modality from the other (e.g., a caption from its image)
Both create cross-modal alignment: matched inputs map to nearby representations.
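To make the first objective concrete, here is a minimal sketch of a CLIP-style contrastive alignment loss. The embeddings are random stand-ins for encoder outputs; this illustrates the idea, not any library's actual API:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style loss: the matching (image, text) pair should have the
    highest similarity in the batch, in both directions."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(img_emb))           # diagonal = true pairs
    loss_i2t = F.cross_entropy(logits, targets)    # image -> text matching
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image matching
    return (loss_i2t + loss_t2i) / 2

# Stand-in embeddings from hypothetical image and text encoders.
img_emb = torch.randn(8, 512)
txt_emb = torch.randn(8, 512)
print(contrastive_alignment_loss(img_emb, txt_emb))
```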
🧪 Knowledge Check — Pretraining
Q3 (MCQ)
What is the main goal of pretraining?
A) Task accuracy
B) Memorization
C) General representation learning
D) Deployment speed
Correct Answer
C) General representation learning
🔧 Paradigm 3 — Fine-tuning
What Is Fine-tuning?
Adapting a pretrained model to:
- a specific task
- a specific domain
- a specific behavior
Types of Fine-tuning
| Type | Description |
|---|---|
| Full fine-tuning | Update all weights |
| Partial | Update some layers |
| PEFT | Train small added modules (LoRA, adapters) |
| Instruction tuning | Align model behavior with instructions |
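What separates these types in practice is simply which parameters receive gradients. A minimal PyTorch sketch of partial fine-tuning, using a hypothetical backbone and head:

```python
import torch.nn as nn

# Hypothetical pretrained model: a backbone plus a task-specific head.
backbone = nn.Sequential(nn.Linear(512, 512), nn.ReLU())   # pretrained part
head = nn.Linear(512, 10)                                  # new task head
model = nn.Sequential(backbone, head)

# Full fine-tuning: leave everything trainable (the default).
# Partial fine-tuning: freeze the backbone, train only the head.
for param in backbone.parameters():
    param.requires_grad = False

n_train = sum(p.numel() for p in model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in model.parameters())
print(f"training {n_train} of {n_total} parameters")
```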
Why Fine-tuning Is Powerful
- Low data requirement
- Low compute
- Fast iteration
- Safer behavior
Fine-tuning is how most intelligence is specialized.
🧪 Knowledge Check — Fine-tuning
Q4 (True / False)
Fine-tuning always requires large datasets.
Answer
False.
Q5 (Objective)
Why is PEFT (e.g., LoRA) popular?
Answer
It trains only a small set of added parameters, cutting memory and compute while preserving most of the base model's performance.
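To make that concrete, here is a minimal sketch of the LoRA idea: freeze the large pretrained weight matrix and learn only a small low-rank update. Dimensions are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W plus a trainable low-rank update B @ A."""
    def __init__(self, in_dim, out_dim, rank=8):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.W.weight.requires_grad = False              # pretrained, frozen
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))  # zero init: no-op at start

    def forward(self, x):
        # W x + B A x: only A and B receive gradients.
        return self.W(x) + x @ self.A.t() @ self.B.t()

layer = LoRALinear(4096, 4096, rank=8)
frozen = layer.W.weight.numel()
trainable = layer.A.numel() + layer.B.numel()
print(f"trainable fraction: {trainable / frozen:.4%}")   # ~0.39%
```

With rank 8 on a 4096×4096 layer, the trainable update is well under 1% of the frozen weights, which is why PEFT fits on modest hardware.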
🧠 Pretraining vs Fine-tuning (Mental Model)
Think of a human:
- Pretraining = education
- Fine-tuning = job training
- Scratch = growing up alone on an island
🧠 Multimodal-Specific Considerations
What Can Be Fine-tuned?
- Encoders
- Projection layers
- LLM
- Output heads
Common Strategy (Best Practice)
| Component | Strategy |
|---|---|
| Encoder | Freeze |
| Projection | Train |
| LLM | PEFT |
| Head | Train |
Alignment layers are usually the sweet spot.
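A sketch of how that strategy table might look in code, assuming a hypothetical multimodal model whose submodules match the table's components; the PEFT step is indicated schematically:

```python
import torch.nn as nn

# Hypothetical multimodal model whose submodules match the table's rows.
class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(768, 768)   # stand-in image encoder
        self.projection = nn.Linear(768, 1024)      # vision -> LLM space
        self.llm = nn.Linear(1024, 1024)            # stand-in language model
        self.head = nn.Linear(1024, 10)             # task-specific output head

model = ToyVLM()

for p in model.vision_encoder.parameters():
    p.requires_grad = False   # Encoder: freeze
for p in model.llm.parameters():
    p.requires_grad = False   # LLM: freeze the base weights; in practice you
                              # would attach LoRA adapters here (sketch above)
# Projection and head stay trainable (PyTorch's default), matching the table.

print([name for name, p in model.named_parameters() if p.requires_grad])
# ['projection.weight', 'projection.bias', 'head.weight', 'head.bias']
```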
🧪 Knowledge Check — Multimodal Strategy
Q6 (MCQ)
Which component is most commonly fine-tuned first?
A) Tokenizer
B) Vision encoder
C) Projection layer
D) Dataset
Correct Answer
C) Projection layer
⚠️ Overfitting & Catastrophic Forgetting
Fine-tuning risks:
- forgetting general knowledge
- over-specialization
- bias amplification
Mitigations:
- low learning rate
- freezing layers
- mixed data
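Two of these mitigations are easy to show in code: a deliberately low learning rate, and mixing general data back into the fine-tuning set. Everything below (the model and both datasets) is a stand-in:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

model = torch.nn.Linear(16, 2)   # stand-in for a pretrained model

# Mitigation 1: a low learning rate (1e-5 rather than a typical 1e-3)
# keeps updates small, so pretrained knowledge is disturbed less.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Mitigation 3: mixed data. Blend the new task data with general data
# so the model keeps rehearsing what it already knows.
new_task_data = TensorDataset(torch.randn(100, 16), torch.randint(0, 2, (100,)))
general_data = TensorDataset(torch.randn(400, 16), torch.randint(0, 2, (400,)))
loader = DataLoader(ConcatDataset([new_task_data, general_data]),
                    batch_size=32, shuffle=True)
```

The 1:4 ratio of new to general data here is arbitrary; the right mix depends on how much specialization the task actually needs.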
🧠 Paradigm Comparison by Task
| Task | Best Paradigm |
|---|---|
| Image captioning | Pretrained model + light fine-tuning |
| Medical QA | Fine-tuning + human-in-the-loop (HITL) review |
| New sensor modality | Training from scratch |
| Internal company data | RAG (retrieval-augmented generation) or fine-tuning |
🧪 Knowledge Check — Decision Making
Q7 (Objective)
Why might RAG be preferred over fine-tuning?
Answer
When knowledge changes frequently, or when the data should stay outside the model's weights (e.g., private or frequently updated documents).
🧠 Training Is Not Just Optimization
Training choices encode:
- values
- assumptions
- power
Who chooses the data chooses the behavior.
🌱 Ethical Considerations
- Pretraining data bias
- Fine-tuning reinforcement of norms
- Scratch training without safeguards
Ethics begins before training starts.
🧪 Knowledge Check — Ethics
Q8 (True / False)
Bias can only be fixed during deployment.
Answer
False. Bias enters during data collection and training, long before deployment.
🧠 Architect’s Decision Tree (Practical)
Ask:
- Is there a pretrained model?
- Is the domain stable?
- Is data private?
- Is behavior critical?
Then choose:
- RAG
- Fine-tuning
- Scratch (rare)
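As a toy illustration, this decision tree can be encoded as a function. The questions and recommendations mirror the lists above and are simplified, not a complete policy:

```python
def choose_paradigm(pretrained_exists: bool,
                    domain_stable: bool,
                    knowledge_changes_often: bool) -> str:
    """Toy version of the architect's decision tree above."""
    if not pretrained_exists:
        return "train from scratch (rare; budget for data, compute, audits)"
    if knowledge_changes_often:
        return "RAG (keep the knowledge outside the weights)"
    if domain_stable:
        return "fine-tuning (PEFT first, full fine-tuning only if needed)"
    return "RAG + light fine-tuning"

print(choose_paradigm(pretrained_exists=True,
                      domain_stable=False,
                      knowledge_changes_often=True))
```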
✅ Final Takeaways
- Training defines intelligence
- Pretraining builds foundations
- Fine-tuning specializes behavior
- Scratch training is exceptional
- Ethics is embedded in data
🌍 Final Reflection
If a model learns from biased data, who is accountable?
The humans who selected, curated, and approved the data.