DE101-PD03 — Data Cleaning & Transformation (Production Mindset)

~60 minutes

🎯 Why Data Cleaning Is the Real Job

In the real world:

  • 70–80% of data work is cleaning and transformation
  • Models fail more often because of bad data than because of bad algorithms

At companies like Google, Meta, OpenAI:

If your data is wrong, your model is wrong.


🧠 What Is Data Cleaning?

Data cleaning means:

  • Making data usable
  • Making assumptions explicit
  • Removing ambiguity
  • Preventing silent bugs

🧹 Common Real-World Data Problems

Problem              Example
Missing values       NaN age, null clicks
Wrong data type      "30" instead of 30
Inconsistent labels  "USA", "U.S.", "United States"
Duplicate rows       Same user logged twice
Outliers             Age = 999

📊 Example Dataset (User Analytics)

import pandas as pd
import numpy as np

data = {
    "user_id": [1, 2, 3, 3],
    "age": [25, None, "30", 30],
    "country": ["US", "USA", "us", "USA"],
    "clicks": [10, np.nan, 5, 5]
}

df = pd.DataFrame(data)
df

🔍 Step 1: Inspect Before Cleaning (Critical)

Never clean blindly.

df.info()
df.describe(include="all")
df.isna().sum()

This answers:

  • Which columns are broken?
  • How many missing values?
  • Are types correct?

🗑️ Removing Missing Data (dropna)

df.dropna()

⚠️ Dangerous in production:

  • You may delete important users
  • Always know why you drop

Better:

df.dropna(subset=["age"])

🩹 Filling Missing Data (fillna)

df["clicks"] = df["clicks"].fillna(0)

Common strategies:

  • Mean / median (numeric)
  • Mode (categorical)
  • Zero (counts)
  • Forward fill (time series)
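The strategies above can be sketched on a small example. The column names here are illustrative, not from the lesson's dataset:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25.0, np.nan, 30.0, 30.0],       # numeric -> median
    "country": ["usa", None, "usa", "de"],   # categorical -> mode
    "clicks": [10.0, np.nan, 5.0, 5.0],      # count -> zero
    "temp": [20.0, np.nan, 22.0, 23.0],      # time series -> forward fill
})

df["age"] = df["age"].fillna(df["age"].median())               # median of the known ages
df["country"] = df["country"].fillna(df["country"].mode()[0])  # most frequent label
df["clicks"] = df["clicks"].fillna(0)                          # a missing count is zero
df["temp"] = df["temp"].ffill()                                # carry the last value forward
```

Pick the strategy per column, based on what the column means, never one blanket fill for the whole frame.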

🔢 Fixing Data Types (astype)

df["age"] = df["age"].astype(int)

This fails when the column still contains NaN or non-numeric strings, so coerce first:

df["age"] = pd.to_numeric(df["age"], errors="coerce")

Engineers always expect type issues.
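One subtlety: `to_numeric(..., errors="coerce")` produces a float column, because NaN cannot live in a plain int dtype. If you need integers with missing values, pandas' nullable `"Int64"` dtype handles it. A minimal sketch:

```python
import pandas as pd

age = pd.Series([25, None, "30", 30])

# astype(int) would raise here: None and the string "30" block the cast
cleaned = pd.to_numeric(age, errors="coerce")   # float64; invalid values become NaN
as_int = cleaned.astype("Int64")                # nullable integer dtype; keeps the missing value
```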


🏷️ Renaming for Clarity

df = df.rename(columns={"clicks": "num_clicks"})

Why this matters:

  • Self-documenting code
  • Fewer misunderstandings
  • Cleaner downstream pipelines

🧠 Deduplication (Very Common)

df = df.drop_duplicates(subset=["user_id"])

Used for:

  • User logs
  • Event streams
  • Experiment tracking
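By default `drop_duplicates` keeps the first occurrence. For event streams you usually want the latest record per user instead: sort, then keep the last. The columns below are illustrative:

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 2, 2],
    "ts":      ["2024-01-01", "2024-01-01", "2024-01-02"],
    "plan":    ["free", "free", "pro"],
})

# Keep each user's most recent row: sort by timestamp, keep the last duplicate
latest = (
    events
    .sort_values("ts")
    .drop_duplicates(subset=["user_id"], keep="last")
)
```

Which occurrence you keep is a business decision; make it explicit with `keep=` rather than relying on row order.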

🌍 Normalizing Categorical Data

df["country"] = (
    df["country"]
    .str.lower()
    .replace({"us": "usa"})
)

Small inconsistency → huge analytics bug.
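In practice, normalization usually combines whitespace stripping, lowercasing, and an explicit mapping table that grows as new variants show up. A sketch (the `CANONICAL` dict is an assumed helper, not part of pandas):

```python
import pandas as pd

country = pd.Series([" US", "USA", "us ", "U.S."])

CANONICAL = {"us": "usa", "u.s.": "usa"}   # extend as new variants appear

normalized = (
    country
    .str.strip()    # " US" and "us " hide leading/trailing spaces
    .str.lower()
    .replace(CANONICAL)
)
# every variant now maps to a single label: "usa"
```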


🔄 Feature Transformation

Create new features from old ones:

df["is_active"] = df["num_clicks"] > 0

This is feature engineering.


🧪 Outlier Handling

df = df[df["age"] < 100]

Outliers:

  • Break averages
  • Break models
  • Must be justified, not guessed
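A hard cutoff like `age < 100` should come from domain knowledge. When no such rule exists, one common statistical heuristic is the 1.5 × IQR fence; a sketch, not the only valid choice:

```python
import pandas as pd

ages = pd.DataFrame({"age": [25, 30, 30, 28, 999]})

q1 = ages["age"].quantile(0.25)
q3 = ages["age"].quantile(0.75)
iqr = q3 - q1

# Classic 1.5 * IQR fences: values outside are treated as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

filtered = ages[ages["age"].between(lower, upper)]
```

Whatever rule you use, record it: a silent filter is a silent assumption.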

🧩 .assign() — Functional Style

df = df.assign(
    age_next_year=lambda x: x["age"] + 1,
    clicks_per_age=lambda x: x["num_clicks"] / x["age"]
)

No mutation. Clear logic.


🔗 Method Chaining (Professional Style)

df = (
    df
    .drop_duplicates("user_id")
    .assign(
        age=lambda x: pd.to_numeric(x["age"], errors="coerce"),
        country=lambda x: x["country"].str.lower()
    )
    .dropna(subset=["age"])
    .query("age < 100")
)

This is how production Pandas code looks.

Why is method chaining preferred in real projects? Each step is readable, debuggable, testable, and reproducible.
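Long chains can be awkward to debug; `.pipe()` lets you drop a checkpoint between steps without breaking the flow. The `log_shape` helper below is a hypothetical example, not a pandas built-in:

```python
import pandas as pd

def log_shape(df, label):
    """Print the shape at this point in the chain, then pass the frame through."""
    print(f"{label}: {df.shape}")
    return df

df = pd.DataFrame({
    "user_id": [1, 2, 3, 3],
    "age": [25, None, 30, 30],
})

clean = (
    df
    .pipe(log_shape, "raw")
    .drop_duplicates(subset=["user_id"])
    .pipe(log_shape, "deduped")
    .dropna(subset=["age"])
    .pipe(log_shape, "no-missing-age")
)
```

Remove the `.pipe()` lines once the chain is trusted, or swap the helper for real logging.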


🧠 Pipeline Thinking (Interview Gold)

Engineers think in pipelines:

  1. Load data
  2. Inspect
  3. Clean
  4. Transform
  5. Validate
  6. Export

Each step should be:

  • Explicit
  • Reproducible
  • Testable
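The steps above can be sketched as named functions composed in one place. The function names and checks here are illustrative:

```python
import pandas as pd

def load(raw: dict) -> pd.DataFrame:
    return pd.DataFrame(raw)

def clean(df: pd.DataFrame) -> pd.DataFrame:
    return (
        df
        .drop_duplicates(subset=["user_id"])
        .assign(age=lambda x: pd.to_numeric(x["age"], errors="coerce"))
        .dropna(subset=["age"])
    )

def validate(df: pd.DataFrame) -> pd.DataFrame:
    assert df["user_id"].is_unique, "duplicate user_id after cleaning"
    assert df["age"].gt(0).all(), "non-positive age slipped through"
    return df

raw = {"user_id": [1, 2, 3, 3], "age": [25, None, "30", 30]}
result = validate(clean(load(raw)))
```

Each function can be unit-tested on its own, which is exactly what "testable" means in the list above.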

🧪 Validation Checks (Often Missed)

assert df["age"].min() > 0
assert df["user_id"].is_unique

These prevent silent failures.
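Bare asserts stop at the first failure with no context. A small checker that collects every violation gives better error reports; a sketch with illustrative names:

```python
import pandas as pd

def run_checks(df: pd.DataFrame) -> list:
    """Return human-readable violations instead of dying on the first one."""
    errors = []
    if not (df["age"] > 0).all():
        errors.append(f"{(~(df['age'] > 0)).sum()} rows with non-positive age")
    if not df["user_id"].is_unique:
        errors.append("duplicate user_id values")
    return errors

df = pd.DataFrame({"user_id": [1, 2, 2], "age": [25, -1, 30]})
problems = run_checks(df)   # both violations reported at once
```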


🏢 Real-World Case (Meta / OpenAI)

Before training models:

  • Pandas pipeline cleans data
  • Validates assumptions
  • Logs anomalies
  • Produces clean tables

Only then → Spark / TensorFlow / PyTorch.


⚠️ Common Beginner Mistakes

❌ Overusing dropna()
❌ Ignoring data types
❌ Modifying DataFrames in place blindly
❌ Skipping validation checks


🏁 Summary

  • Cleaning is the real job
  • Inspect before modifying
  • Be explicit, not clever
  • Method chaining = professional style
  • Pipelines prevent bugs

Good data → good decisions → good models.


🚀 Next Chapter

👉 DE101-PD04 — GroupBy, Aggregation, and Business Analytics

This is where insights are born.
