Lecture 04 — Data: The Fuel of AI
~1.5–2 hours (core understanding lecture)
📦 What Is Data? (Simple but Deep)
Data is recorded experience.
Everything AI knows comes from data.
Examples:
- Images 🖼️ → what the world looks like
- Text 📚 → how humans think & speak
- Audio 🎧 → how humans sound
- Numbers 🔢 → measurements, behavior, signals
No data = no learning
Better data = better intelligence
🧠 A Powerful Truth
AI does not see reality.
AI sees data.
This difference explains many AI failures.
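To make "AI sees data" concrete, here is a minimal sketch of what a model actually receives. The tiny image and the code-point encoding are assumptions chosen for illustration. Real systems use far larger arrays and learned tokenizers.

```python
# A tiny 3x3 grayscale "image": each number is a pixel brightness (0-255).
# The values are made up for illustration.
image = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]

# A short text turned into Unicode code points. This is one simple encoding;
# real language models use learned tokenizers instead.
text = "dog"
text_as_numbers = [ord(ch) for ch in text]

# The model never receives "a cross" or "a dog", only these numbers.
flat_image = [pixel for row in image for pixel in row]
print(flat_image)       # [0, 255, 0, 255, 255, 255, 0, 255, 0]
print(text_as_numbers)  # [100, 111, 103]
```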
🧩 Types of Data (With Intuition)
🖼️ Image Data
- Photos
- Medical scans
- Satellite images
AI learns:
- shapes
- textures
- patterns
But not meaning.
📚 Text Data
- Books
- Articles
- Code
- Conversations
AI learns:
- grammar
- logic patterns
- style
Not truth. Not intention.
🎧 Audio Data
- Speech
- Music
- Noise
AI learns:
- frequency patterns
- pronunciation
- rhythm
🔢 Tabular / Numeric Data
- Excel sheets
- Databases
- Logs
Used in:
- finance
- healthcare
- business
🎥 Video Data
- Images over time
- Motion + sound
Often the hardest data type: huge files, plus change over time.
🔀 Multimodal Data (Modern AI)
Several data types together, for example text + image, sometimes audio too.
Examples:
- GPT-4V
- Gemini
- CLIP
Multimodal AI sees the world more like humans — but still imperfectly.
🧠 Data vs Information vs Knowledge
| Concept | Meaning |
|---|---|
| Data | Raw facts |
| Information | Structured data |
| Knowledge | Patterns learned |
| Wisdom | Human judgment |
AI stops at patterns.
Humans add meaning.
😄 Funny Example: Teaching a Kid vs AI
You show a kid:
- 1 picture of a dog 🐶
The kid understands.
You show an AI:
- 1 picture of a dog
An AI trained from scratch learns almost nothing.
AI needs:
- thousands
- millions
- billions of examples
Why? Because the kid already knows a lot about the world. An AI trained from scratch starts with nothing but the examples.
⚠️ Garbage In = Garbage Out (GIGO)
Classic rule:
Bad data → bad AI
Examples:
- Biased hiring data → biased hiring AI
- Incomplete medical data → dangerous predictions
- Noisy labels → confused models
Models don't fix data problems. They reproduce them, and often amplify them.
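A minimal sketch of GIGO, assuming a toy one-feature "fever" task with made-up temperatures. The helper names (`train_centroids`, `predict`, `accuracy`) are hypothetical, and the nearest-centroid model is only a stand-in for any learner. The same model is trained twice: once on correct labels, once on labels from an imaginary labeler who marks fever only at 40 °C.

```python
def train_centroids(values, labels):
    """Return the mean value per label (a 1-D nearest-centroid 'model')."""
    sums, counts = {}, {}
    for v, y in zip(values, labels):
        sums[y] = sums.get(y, 0.0) + v
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, value):
    """Pick the label whose centroid is closest to the value."""
    return min(model, key=lambda y: abs(value - model[y]))

def accuracy(model, values, labels):
    """Fraction of values whose predicted label matches the true label."""
    hits = sum(predict(model, v) == y for v, y in zip(values, labels))
    return hits / len(values)

# Made-up body temperatures in Celsius; 1 = fever, 0 = no fever.
train_temps = [36.0, 36.5, 37.0, 38.5, 39.0, 40.0]
good_labels = [0, 0, 0, 1, 1, 1]
bad_labels = [0, 0, 0, 0, 0, 1]   # labeler only marks fever at 40

test_temps = [36.2, 37.2, 38.2, 39.5]
test_labels = [0, 0, 1, 1]

good_model = train_centroids(train_temps, good_labels)
bad_model = train_centroids(train_temps, bad_labels)

print(accuracy(good_model, test_temps, test_labels))  # 1.0
print(accuracy(bad_model, test_temps, test_labels))   # 0.75
```

Same code, same features. Only the labels changed, and the model faithfully copied the labeler's mistake.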
🧠 What Is Bias? (Clear Definition)
Bias means:
Data does not represent reality fairly.
Sources:
- Historical inequality
- Sampling errors
- Human prejudice
- Missing groups
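One way to spot sampling errors and missing groups is to compare group shares in the dataset against a reference population. A minimal sketch follows. The group names, counts, and reference shares are all made up, and `representation_gap` is a hypothetical helper, not a standard library function.

```python
from collections import Counter

def representation_gap(samples, reference_shares):
    """Dataset share minus reference share, per group (rounded to 2 places)."""
    counts = Counter(samples)
    total = len(samples)
    return {
        group: round(counts.get(group, 0) / total - share, 2)
        for group, share in reference_shares.items()
    }

# Hypothetical dataset: group C is missing entirely.
dataset_groups = ["A"] * 80 + ["B"] * 20
reference = {"A": 0.5, "B": 0.4, "C": 0.1}

gaps = representation_gap(dataset_groups, reference)
print(gaps)  # {'A': 0.3, 'B': -0.2, 'C': -0.1}
```

Positive numbers mean over-represented, negative numbers mean under-represented. Choosing the reference population is itself a human decision.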
AI learns our past, not our ideals.
😬 Real-World Example (Easy to Feel)
If the training images labeled “CEO” are:
- mostly men
AI learns:
CEO = man
Not because AI is evil —
but because data told it so.
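A minimal sketch of how this happens, assuming the simplest possible "model": one that just counts what it saw. The 90/10 split is a made-up number for illustration.

```python
from collections import Counter

# Hypothetical annotations on training images tagged "CEO".
ceo_images = ["man"] * 90 + ["woman"] * 10

counts = Counter(ceo_images)

# What a frequency-based model "learns": the share of each group.
learned = {group: n / len(ceo_images) for group, n in counts.items()}

# Asked for its single best guess, it always gives the majority answer.
best_guess = counts.most_common(1)[0][0]

print(learned)     # {'man': 0.9, 'woman': 0.1}
print(best_guess)  # man
```

Note the amplification: the data says 90%, but a model that always returns its top guess answers "man" 100% of the time.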
🧒 Ethics for Kids (Simple Question)
Would you trust a robot that learned everything from one person?
Of course not.
Diverse data is a first step toward fairness.
⚖️ Ethics for Adults (Serious Truth)
Data decisions are power decisions:
- Who is included?
- Who is missing?
- Who defines labels?
These choices shape society.
🧠 Data Is Not Neutral
Important myth:
“Data is objective.”
Reality:
- Data is collected by humans
- Labeled by humans
- Filtered by humans
Therefore:
Data contains values.
🧪 Data Lifecycle (How AI Is Really Built)
1. Collect data
2. Clean data
3. Label data
4. Split data (train / val / test)
5. Train model
6. Evaluate
7. Deploy
8. Monitor drift
Most failures happen at steps 1–4.
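The "Split data" step can be sketched in a few lines. This is a minimal version, assuming a simple random split; `split_dataset` and its default fractions are hypothetical choices, and real projects often need grouped or time-based splits instead.

```python
import random

def split_dataset(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a copy of items and cut it into train / val / test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split every run
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Every example lands in exactly one of the three lists. The test list is set aside and only used for the final evaluation.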
🤖 Why Big Models Need Big Data
ChatGPT was trained on:
- massive text corpora
- diverse sources
Not because the model starts out smart,
but because it only gets good at scale.
More data → better generalization (to a point).
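One classical illustration of "to a point", offered as an analogy rather than a law about large models: when estimating an average from n noisy samples, the typical error shrinks like 1/sqrt(n). The function name and the noise level `sigma = 1.0` are assumptions for this sketch.

```python
import math

def standard_error(sigma, n):
    """Typical error of a sample mean from n independent samples."""
    return sigma / math.sqrt(n)

sigma = 1.0  # assumed noise level of a single sample
for n in [100, 10_000, 1_000_000]:
    print(n, standard_error(sigma, n))
# 100 0.1
# 10000 0.01
# 1000000 0.001
```

Each 100x increase in data buys only a 10x reduction in error. More data keeps helping, but each extra example helps less.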
🧠 Small Data Still Matters
In:
- medicine
- law
- robotics
We must:
- be careful
- use domain knowledge
- avoid blind automation
🔍 Data Leakage (Silent Killer)
Data leakage = information from the test data sneaks into training.
Result:
- fake high accuracy
- real-world failure
Always separate data properly.
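A minimal sketch of one basic leakage check: look for exact duplicates shared between the training and test sets. `find_leaks` is a hypothetical helper and the sentences are made up. Exact matching catches only the most obvious case; near-duplicates and leaked features need other checks.

```python
def find_leaks(train, test):
    """Return test examples that also appear in the training set."""
    train_set = set(train)
    return [example for example in test if example in train_set]

train = ["the cat sat", "dogs bark", "birds fly"]
test = ["dogs bark", "fish swim"]

leaks = find_leaks(train, test)
print(leaks)  # ['dogs bark']

# Drop the leaked examples so the test score means something.
clean_test = [example for example in test if example not in leaks]
print(clean_test)  # ['fish swim']
```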
🌍 Data Drift (Why Models Fail Over Time)
World changes. Data changes. Models decay.
Examples:
- language evolves
- behavior changes
- environments shift
AI must be maintained, not worshiped.
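Monitoring drift can start very simply. This sketch assumes one numeric feature and asks how far the live average has moved, measured in training standard deviations. `drift_score`, the sample values, and the alert threshold of 3 are all hypothetical choices.

```python
from statistics import mean, stdev

def drift_score(train_values, live_values):
    """How many training standard deviations the live mean has moved."""
    shift = abs(mean(live_values) - mean(train_values))
    return shift / stdev(train_values)

# Made-up values of one feature at training time and after deployment.
train_values = [10, 12, 11, 9, 10, 11, 12, 9]
live_values = [15, 16, 14, 17, 15, 16]

score = drift_score(train_values, live_values)
ALERT_THRESHOLD = 3  # assumed cut-off; tune per project
print(score > ALERT_THRESHOLD)  # True
```

A high score does not say what changed, only that the model is now seeing data unlike what it learned from.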
🤖 Data Is the Memory of AI
AI has no childhood. No intuition. No experience.
Data is its memory of the world.
Choose memory wisely.
🧠 Big Takeaways
- Data shapes intelligence
- Bias is unavoidable, but manageable
- Ethics starts at data collection
- Models reflect us
🌱 Final Reflection
If AI reflects society, what responsibility do we have as data creators?
To be careful, inclusive, and wise.