DE101-PD03 — Data Cleaning & Transformation (Production Mindset)
~60 minutes
🎯 Why Data Cleaning Is the Real Job
In the real world:
- 70–80% of data work = cleaning & transformation
- Models fail more often because of bad data than because of bad algorithms
At companies like Google, Meta, OpenAI:
If your data is wrong, your model is wrong.
🧠 What Is Data Cleaning?
Data cleaning means:
- Making data usable
- Making assumptions explicit
- Removing ambiguity
- Preventing silent bugs
🧹 Common Real-World Data Problems
| Problem | Example |
|---|---|
| Missing values | NaN age, null clicks |
| Wrong data type | “30” instead of 30 |
| Inconsistent labels | “USA”, “U.S.”, “United States” |
| Duplicate rows | Same user logged twice |
| Outliers | Age = 999 |
📊 Example Dataset (User Analytics)
import pandas as pd
import numpy as np
data = {
    "user_id": [1, 2, 3, 3],
    "age": [25, None, "30", 30],
    "country": ["US", "USA", "us", "USA"],
    "clicks": [10, np.nan, 5, 5]
}
df = pd.DataFrame(data)
df
🔍 Step 1: Inspect Before Cleaning (Critical)
Never clean blindly.
df.info()                     # column dtypes and non-null counts
df.describe(include="all")    # summary stats for every column
df.isna().sum()               # missing values per column
This answers:
- Which columns are broken?
- How many missing values?
- Are types correct?
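On the example frame above, these checks reveal one missing age and one missing clicks value.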
🗑️ Removing Missing Data (dropna)
df.dropna()   # drops every row that contains any missing value
⚠️ Dangerous in production:
- You may silently delete important users
- Always know why you drop a row
Better: drop only when a specific, required column is missing:
df.dropna(subset=["age"])
Either way, dropna returns a new DataFrame; assign the result if you want to keep it.
🩹 Filling Missing Data (fillna)
df["clicks"] = df["clicks"].fillna(0)
Common strategies:
- Mean / median (numeric)
- Mode (categorical)
- Zero (counts)
- Forward fill (time series)
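As a sketch, here is what each strategy looks like on the example frame (in practice you pick one strategy per column; some of these are no-ops on this tiny dataset):
df["clicks"] = df["clicks"].fillna(0)                           # zero for counts
df["clicks"] = df["clicks"].fillna(df["clicks"].median())       # median for numeric columns
df["country"] = df["country"].fillna(df["country"].mode()[0])   # mode for categoricals
df["clicks"] = df["clicks"].ffill()                             # forward fill for ordered / time-series data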
🔢 Fixing Data Types (astype)
df["age"] = df["age"].astype(int)
This often fails → must clean first:
df["age"] = pd.to_numeric(df["age"], errors="coerce")
Engineers always expect type issues.
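To see what errors="coerce" actually does, here is a tiny hypothetical example:
pd.to_numeric(pd.Series(["30", "31", "abc"]), errors="coerce")
# 0    30.0
# 1    31.0
# 2     NaN
# dtype: float64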
🏷️ Renaming for Clarity
df = df.rename(columns={"clicks": "num_clicks"})   # rename returns a copy, so assign it back
Why this matters:
- Self-documenting code
- Fewer misunderstandings
- Cleaner downstream pipelines
🧠 Deduplication (Very Common)
df = df.drop_duplicates(subset=["user_id"])   # keeps the first occurrence by default
Used for:
- User logs
- Event streams
- Experiment tracking
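For event streams you usually want the most recent record per user, not the first. A sketch, assuming a hypothetical event_time column (not in the example dataset):
df_latest = (
    df
    .sort_values("event_time")                          # oldest first
    .drop_duplicates(subset=["user_id"], keep="last")   # keep the newest row per user
)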
🌍 Normalizing Categorical Data
df["country"] = (
df["country"]
.str.lower()
.replace({"us": "usa"})
)
Small inconsistency → huge analytics bug.
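A fuller pass handles every variant from the problem table earlier. A sketch:
# Mapping table for the country variants seen earlier
country_map = {"us": "usa", "u.s.": "usa", "united states": "usa"}
df["country"] = (
    df["country"]
    .str.strip()            # drop stray whitespace
    .str.lower()            # make matching case-insensitive
    .replace(country_map)
)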
🔄 Feature Transformation
Create new features from old ones:
df["is_active"] = df["num_clicks"] > 0
This is feature engineering.
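Another common pattern is bucketing a numeric column. A sketch; the bin edges and labels here are illustrative, not a rule:
df["click_level"] = pd.cut(
    df["num_clicks"],
    bins=[-1, 0, 10, float("inf")],   # none / low / high
    labels=["none", "low", "high"]
)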
🧪 Outlier Handling
df = df[df["age"] < 100]
Outliers:
- Break averages
- Break models
- Must be justified, not guessed
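Hard-coded cutoffs like 100 are fine when the domain justifies them. An alternative sketch filters by percentile (the 99th here is an assumption, not a rule):
upper = df["age"].quantile(0.99)   # data-driven upper bound
df = df[df["age"] <= upper]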
🧩 .assign() — Functional Style
df = df.assign(
    age_next_year=lambda x: x["age"] + 1,
    clicks_per_age=lambda x: x["num_clicks"] / x["age"]
)
No mutation. Clear logic.
🔗 Method Chaining (Professional Style)
df = (
    df
    .drop_duplicates("user_id")
    .assign(
        age=lambda x: pd.to_numeric(x["age"], errors="coerce"),
        country=lambda x: x["country"].str.lower()
    )
    .dropna(subset=["age"])
    .query("age < 100")
)
This is how production Pandas code looks.
Why is method chaining preferred in real projects?
Because it is readable, debuggable, testable, and reproducible.
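One reason chains are debuggable: you can drop a checkpoint between any two steps. A sketch using a hypothetical checkpoint helper:
def checkpoint(d, label):
    # Print the shape at this step, then pass the frame through unchanged
    print(label, d.shape)
    return d

df = (
    df
    .drop_duplicates("user_id")
    .pipe(checkpoint, "after dedup")
    .dropna(subset=["age"])
    .pipe(checkpoint, "after dropna")
)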
🧠 Pipeline Thinking (Interview Gold)
Engineers think in pipelines:
- Load data
- Inspect
- Clean
- Transform
- Validate
- Export
Each step should be:
- Explicit
- Reproducible
- Testable
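One way to get there is to write each step as a small function and compose them with .pipe(). A minimal sketch, assuming the raw example frame from the top of this chapter:
def clean(d):
    # Deduplicate and fix types, then drop unusable rows
    return (
        d
        .drop_duplicates(subset=["user_id"])
        .assign(age=lambda x: pd.to_numeric(x["age"], errors="coerce"))
        .dropna(subset=["age"])
    )

def transform(d):
    # Derive features from cleaned columns
    return d.assign(is_active=lambda x: x["clicks"] > 0)

def validate(d):
    # Fail loudly if an assumption is violated
    assert d["user_id"].is_unique
    return d

raw = pd.DataFrame(data)   # the example dataset from the start of this chapter
clean_df = raw.pipe(clean).pipe(transform).pipe(validate)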
🧪 Validation Checks (Often Missed)
assert df["age"].min() > 0
assert df["user_id"].is_unique
These prevent silent failures.
🏢 Real-World Case (Meta / OpenAI)
Before training models:
- Pandas pipeline cleans data
- Validates assumptions
- Logs anomalies
- Produces clean tables
Only then → Spark / TensorFlow / PyTorch.
⚠️ Common Beginner Mistakes
❌ Overusing dropna()
❌ Ignoring data types
❌ Modifying DataFrame in-place blindly
❌ No validation checks
🏁 Summary
- Cleaning is the real job
- Inspect before modifying
- Be explicit, not clever
- Method chaining = professional style
- Pipelines prevent bugs
Good data → good decisions → good models.
🚀 Next Chapter
👉 DE101-PD04 — GroupBy, Aggregation, and Business Analytics
This is where insights are born.