Lecture 04 — Data: The Fuel of AI

~1.5–2 hours (core understanding lecture)


📦 What Is Data? (Simple but Deep)

Data is recorded experience.

Everything AI knows comes from data.

Examples:

  • Images 🖼️ → what the world looks like
  • Text 📚 → how humans think & speak
  • Audio 🎧 → how humans sound
  • Numbers 🔢 → measurements, behavior, signals

No data = no learning
Better data = better intelligence


🧠 A Powerful Truth

AI does not see reality.
AI sees data.

This difference explains almost every AI failure.
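
To make this concrete, here is a minimal sketch of what a model actually receives. The tiny 3×3 "image" and the word "dog" are made-up illustrations, and turning characters into code points is a simplification (real systems typically use learned tokenizers).

```python
# A tiny grayscale "image": each number is a pixel brightness
# (0 = black, 255 = white). The values are invented for illustration.
image = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]

# Text reaches a model as numbers too. Here each character simply
# becomes its Unicode code point.
text = "dog"
text_as_numbers = [ord(ch) for ch in text]

print(image[1])         # the middle row of pixels: [255, 255, 255]
print(text_as_numbers)  # [100, 111, 103]
```

Nothing in these numbers says "plus sign" or "animal". Whatever the model learns has to come from patterns in the numbers.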


🧩 Types of Data (With Intuition)

🖼️ Image Data

  • Photos
  • Medical scans
  • Satellite images

AI learns:

  • shapes
  • textures
  • patterns

But not meaning.


📚 Text Data

  • Books
  • Articles
  • Code
  • Conversations

AI learns:

  • grammar
  • logic patterns
  • style

Not truth. Not intention.


🎧 Audio Data

  • Speech
  • Music
  • Noise

AI learns:

  • frequency patterns
  • pronunciation
  • rhythm

🔢 Tabular / Numeric Data

  • Excel sheets
  • Databases
  • Logs

Used in:

  • finance
  • healthcare
  • business

🎥 Video Data

  • Images over time
  • Motion + sound

Often the hardest data type to work with: it combines images and sound, and adds time.


🔀 Multimodal Data (Modern AI)

Two or more data types together, such as text + image + audio.

Examples:

  • GPT-4V
  • Gemini
  • CLIP

Multimodal AI sees the world more like humans — but still imperfectly.


🧠 Data vs Information vs Knowledge

Concept       Meaning
Data          Raw facts
Information   Structured data
Knowledge     Patterns learned
Wisdom        Human judgment

AI stops at patterns.

Humans add meaning.


😄 Funny Example: Teaching a Kid vs AI

You show a kid:

  • 1 picture of a dog 🐶

The kid understands.

You show an AI:

  • 1 picture of a dog

The AI learns almost nothing useful.

AI needs:

  • thousands
  • millions
  • billions of examples

Why? Because AI starts with no prior experience of the world, so it has to infer each pattern from many examples.


⚠️ Garbage In = Garbage Out (GIGO)

Classic rule:

Bad data → bad AI

Examples:

  • Biased hiring data → biased hiring AI
  • Incomplete medical data → dangerous predictions
  • Noisy labels → confused models

Models don’t fix data problems. They amplify them.
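
A minimal sketch of the "noisy labels" case, assuming a toy 1-nearest-neighbor classifier and invented animal weights. The function name and the data are hypothetical; the point is only that one wrong label changes the answer.

```python
def nearest_neighbor_predict(training_data, x):
    """Return the label of the training point whose value is closest to x."""
    _, closest_label = min(training_data, key=lambda pair: abs(pair[0] - x))
    return closest_label

# Hypothetical training data: (weight in kg, label).
clean = [(4, "cat"), (5, "cat"), (30, "dog"), (35, "dog")]
noisy = [(4, "cat"), (5, "dog"), (30, "dog"), (35, "dog")]  # one wrong label

print(nearest_neighbor_predict(clean, 6))  # cat
print(nearest_neighbor_predict(noisy, 6))  # dog
```

The algorithm is identical in both runs. Only the data changed, and the prediction changed with it.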


🧠 What Is Bias? (Clear Definition)

Bias means:

Data does not represent reality fairly.

Sources:

  • Historical inequality
  • Sampling errors
  • Human prejudice
  • Missing groups

AI learns our past, not our ideals.
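
One simple check for "missing groups" is to compare each group's share of the dataset with the share you expect in the real population. The sketch below assumes a hypothetical "region" field and made-up reference shares; real audits need carefully chosen reference numbers.

```python
from collections import Counter

def group_shares(records, key):
    """Return each group's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}

# Hypothetical dataset: 8 records from the north, 2 from the south.
dataset = [{"region": "north"}] * 8 + [{"region": "south"}] * 2
# Hypothetical reference: the real population is split evenly.
reference = {"north": 0.5, "south": 0.5}

shares = group_shares(dataset, "region")
for group, expected in reference.items():
    actual = shares.get(group, 0.0)
    print(f"{group}: dataset {actual:.0%}, reference {expected:.0%}")
```

A large gap between the two columns is a warning sign, not proof of a problem. It tells you where to look.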


😬 Real-World Example (Easy to Feel)

If the training images labeled “CEO” show:

  • mostly men

AI learns:

CEO = man

Not because AI is evil —
but because data told it so.
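
The same effect in a few lines, using a deliberately naive "model" that just predicts the most frequent label it has seen. The 9-to-1 split is invented for illustration, and real models are far more complex, but the pull toward the majority is similar.

```python
from collections import Counter

# Hypothetical labels attached to training images tagged "CEO".
ceo_training_labels = ["man"] * 9 + ["woman"] * 1

def most_common_label(labels):
    """A naive 'model': always predict the label seen most often."""
    return Counter(labels).most_common(1)[0][0]

print(most_common_label(ceo_training_labels))  # man
```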


🧒 Ethics for Kids (Simple Question)

Would you trust a robot that learned everything from one person?

Of course not.

Diversity of data is a first step toward fairness.


⚖️ Ethics for Adults (Serious Truth)

Data decisions are power decisions:

  • Who is included?
  • Who is missing?
  • Who defines labels?

These choices shape society.


🧠 Data Is Not Neutral

Important myth:

“Data is objective.”

Reality:

  • Data is collected by humans
  • Labeled by humans
  • Filtered by humans

Therefore:

Data contains values.


🧪 Data Lifecycle (How AI Is Really Built)

  1. Collect data
  2. Clean data
  3. Label data
  4. Split data (train / val / test)
  5. Train model
  6. Evaluate
  7. Deploy
  8. Monitor drift

Many failures trace back to steps 1–4, before any model is trained.
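
Step 4 can be sketched in a few lines. The function name, the 80/10/10 fractions, and the fixed seed are assumptions chosen for illustration; libraries such as scikit-learn offer ready-made helpers for the same job.

```python
import random

def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle a copy of items, then cut it into train / val / test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever is left over
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Each example lands in exactly one of the three lists, so the test set stays unseen during training.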


🤖 Why Big Models Need Big Data

ChatGPT was trained on:

  • massive text corpora
  • diverse sources

Not because it’s smart —
but because scale matters.

More data → better generalization (to a point).
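
One way to feel the "to a point" part is a loose analogy from basic statistics, not a statement about how large models actually scale. When you estimate an average from n samples, the typical error shrinks like 1 / sqrt(n), so each extra factor of 100 in data buys only a factor of 10 in accuracy.

```python
import math

def standard_error(std_dev, n):
    """Standard error of a sample mean: std_dev / sqrt(n)."""
    return std_dev / math.sqrt(n)

# Assumed spread of 1.0, chosen only to keep the numbers easy to read.
for n in [100, 10_000, 1_000_000]:
    print(n, standard_error(1.0, n))
```

More data keeps helping, but each new batch helps less than the one before.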


🧠 Small Data Still Matters

In:

  • medicine
  • law
  • robotics

We must:

  • be careful
  • use domain knowledge
  • avoid blind automation

🔍 Data Leakage (Silent Killer)

Data leakage = information from the test set (or about the answer itself) sneaks into training.

Result:

  • fake high accuracy
  • real-world failure

Always separate data properly.
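
A minimal sketch of one leakage check: look for test examples that also appear in the training set. The function name and the sample sentences are hypothetical, and this catches exact duplicates only. Near-duplicates and leaked features need other checks.

```python
def find_leaked_examples(train, test):
    """Return the test examples that also appear in the training set."""
    train_set = set(train)  # a set makes each lookup fast
    return [example for example in test if example in train_set]

# Hypothetical data: one sentence appears in both sets.
train = ["the cat sat", "dogs bark", "birds sing"]
test = ["fish swim", "dogs bark"]

leaked = find_leaked_examples(train, test)
print(leaked)  # ['dogs bark']
```

An empty list is what you want to see before trusting a test score.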


🌍 Data Drift (Why Models Fail Over Time)

World changes. Data changes. Models decay.

Examples:

  • language evolves
  • behavior changes
  • environments shift

AI must be maintained, not worshiped.
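
Monitoring can start very simply. The sketch below compares the average of one feature at training time with its average on recent data. The function name, the numbers, and the threshold of 2.0 are all assumptions for illustration; production systems usually use more careful statistical tests.

```python
from statistics import mean, stdev

def drift_score(reference, live):
    """How many reference standard deviations the live mean has moved."""
    return abs(mean(live) - mean(reference)) / stdev(reference)

# Hypothetical values of one feature, at training time and today.
reference = [10, 12, 11, 13, 9, 11]
live = [15, 16, 14, 17, 15, 16]

score = drift_score(reference, live)
if score > 2.0:  # arbitrary threshold for this example
    print("Possible drift: time to re-check the model.")
```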


🤖 Data Is the Memory of AI

AI has no childhood. No intuition. No experience.

Data is its memory of the world.

Choose memory wisely.


🧠 Big Takeaways

  • Data shapes intelligence
  • Bias is unavoidable, but manageable
  • Ethics starts at data collection
  • Models reflect us

🌱 Final Reflection

If AI reflects society, what responsibility do we have as data creators?

To be careful, inclusive, and wise.
