DE101-PD04 — GroupBy, Aggregation & Joins (Analytics & Interview Mastery)
~75–90 minutes
🎯 Why This Chapter Is Critical
If Pandas were a weapon:
- Cleaning = sharpening
- GroupBy & Join = actual power
Most real-world questions are:
- “Which group performs best?”
- “How do metrics change by segment?”
- “How do we combine datasets correctly?”
🧠 GroupBy Mental Model (Most Important Idea)
GroupBy = Split → Apply → Combine
- Split data into groups
- Apply aggregation or transformation
- Combine results into a new table
This mindset works in:
- Pandas
- SQL
- Spark
- BigQuery
📊 Example Dataset (E-Commerce Analytics)
```python
import pandas as pd

data = {
    "user_id": [1, 2, 1, 3, 2, 3, 1],
    "country": ["US", "US", "US", "FR", "US", "FR", "US"],
    "revenue": [100, 200, 150, 80, 120, 60, 90],
    "device": ["mobile", "desktop", "mobile", "mobile", "mobile", "desktop", "desktop"],
}
df = pd.DataFrame(data)
df
```
📦 Basic GroupBy
```python
df.groupby("country")["revenue"].sum()
```
Output
```text
country
FR    140
US    660
Name: revenue, dtype: int64
```
This answers:
“How much revenue comes from each country?”
🔢 Multiple Aggregations
```python
df.groupby("country")["revenue"].agg(["sum", "mean", "count"])
```
Used daily in:
- KPI dashboards
- Business reports
- Model evaluation
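For reports and dashboards, named aggregation is often cleaner, because you choose the output column names yourself. A minimal sketch, using a trimmed version of the chapter's dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "FR"],
    "revenue": [100, 200, 80],
})

# Named aggregation: keyword = output column name, value = function
summary = df.groupby("country")["revenue"].agg(
    total="sum", average="mean", orders="count"
)
```

The result has columns `total`, `average`, and `orders` instead of raw function names, which saves a rename step before the numbers reach a report.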
🧠 GroupBy Multiple Columns
```python
df.groupby(["country", "device"])["revenue"].sum()
```
Now you are thinking like an analyst.
🧮 Resetting Index (Very Common)
```python
df.groupby("country", as_index=False)["revenue"].sum()
```
Or:
```python
df.groupby("country")["revenue"].sum().reset_index()
```
This makes output usable for joins.
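To make that concrete, here is one way the flat result can be joined back onto the original rows (a sketch with a trimmed dataset; the `country_revenue` column name is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", "US", "FR"],
    "revenue": [100, 200, 80],
})

# as_index=False returns a flat DataFrame instead of an indexed Series
country_totals = df.groupby("country", as_index=False)["revenue"].sum()
country_totals = country_totals.rename(columns={"revenue": "country_revenue"})

# The flat frame merges cleanly onto the original rows
enriched = df.merge(country_totals, on="country", how="left")
```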
🧩 .transform() — Keep Original Shape
```python
df["country_avg_revenue"] = (
    df.groupby("country")["revenue"].transform("mean")
)
```
Use when:
- You need group stats per row
- Feature engineering for ML
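A typical feature-engineering pattern built on `transform` is comparing each row to its group statistic (the ratio column name below is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "FR"],
    "revenue": [100, 200, 80],
})

# transform returns one value per original row, so assignment lines up
df["country_avg_revenue"] = df.groupby("country")["revenue"].transform("mean")

# ML feature: how does this row compare to its country's average?
df["revenue_vs_country_avg"] = df["revenue"] / df["country_avg_revenue"]
```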
🔁 .apply() — Flexible but Dangerous
```python
df.groupby("country")["revenue"].apply(lambda s: s.max() - s.min())
```
⚠️ Powerful but slower. Select the column before `.apply()` so the lambda receives a Series rather than the whole group DataFrame, and use it only when no built-in aggregation can express the logic.
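The max-minus-min "range" above is a case where `.apply()` is avoidable. A sketch of two faster alternatives:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "FR", "FR"],
    "revenue": [100, 200, 80, 60],
})

# Option 1: a lambda inside .agg() on the selected column
ranges = df.groupby("country")["revenue"].agg(lambda s: s.max() - s.min())

# Option 2: fully vectorized, using two built-in aggregations
g = df.groupby("country")["revenue"]
ranges_fast = g.max() - g.min()
```

Built-in aggregations like `max` and `min` run in optimized code paths, so option 2 scales best on large data.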
🧠 SQL vs Pandas GroupBy
| SQL | Pandas |
|---|---|
| GROUP BY | groupby() |
| SUM | sum() |
| AVG | mean() |
| COUNT | count() |
Conceptually they are identical; only the syntax differs.
🔗 Why Joins Matter
Real-world data is never in one table.
You often have:
- users table
- orders table
- events table
- experiments table
Joins connect the story.
📊 Example Tables
Users
```python
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", "US", "FR"],
})
```
Orders
```python
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "user_id": [1, 2, 1, 4],
    "amount": [250, 300, 150, 500],
})
```
🔗 Inner Join
```python
pd.merge(users, orders, on="user_id", how="inner")
```
Keeps only matching rows.
🔗 Left Join (Most Common)
```python
pd.merge(users, orders, on="user_id", how="left")
```
Used when:
- Users are primary
- Orders may be missing
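When orders are missing, the left join fills the order columns with NaN. A common follow-up, sketched with the chapter's tables, is to treat missing revenue as zero:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["US", "US", "FR"]})
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "user_id": [1, 2, 1, 4],
    "amount": [250, 300, 150, 500],
})

merged = users.merge(orders, on="user_id", how="left")

# User 3 placed no orders, so amount is NaN after the left join;
# for revenue sums, missing usually means zero
merged["amount"] = merged["amount"].fillna(0)
```

Whether NaN should become 0 depends on the metric; for averages you would usually leave the NaN in place.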
🔗 Right Join
```python
pd.merge(users, orders, on="user_id", how="right")
```
Less common but useful for audits.
🔗 Outer Join
```python
pd.merge(users, orders, on="user_id", how="outer")
```
Used for:
- Data reconciliation
- Finding missing relationships
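For reconciliation work, `merge` accepts `indicator=True`, which labels where each row came from. A sketch using the chapter's tables:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["US", "US", "FR"]})
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "user_id": [1, 2, 1, 4],
    "amount": [250, 300, 150, 500],
})

# indicator=True adds a _merge column: "left_only", "right_only", or "both"
full = users.merge(orders, on="user_id", how="outer", indicator=True)

# Rows that failed to match on either side
unmatched = full[full["_merge"] != "both"]
```

Here user 3 has no orders (`left_only`) and order 104 references an unknown user 4 (`right_only`), exactly the missing relationships an outer join is meant to surface.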
🧠 Join on Different Column Names
In the sample tables both sides happen to share the name user_id, so to demonstrate mismatched keys we first rename the column on one side (a hypothetical rename for illustration):
```python
orders_alt = orders.rename(columns={"user_id": "uid"})

pd.merge(
    users,
    orders_alt,
    left_on="user_id",
    right_on="uid",
)
```
In real data, column names rarely match.
⚠️ Join Explosion (Critical Warning)
Many-to-many joins can multiply rows: if a key appears m times on the left and n times on the right, the join produces m × n rows for that key.
```python
users.merge(orders, on="user_id")
```
Always ask:
"What is the cardinality of the join key on each side?"
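`merge` can enforce the answer for you via the `validate` parameter, which raises a `MergeError` if the actual cardinality differs from what you declared. A sketch with the chapter's tables:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["US", "US", "FR"]})
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "user_id": [1, 2, 1, 4],
    "amount": [250, 300, 150, 500],
})

# Declare the expected cardinality: user_id is unique in users,
# but may repeat in orders
merged = users.merge(orders, on="user_id", validate="one_to_many")

# If users contained a duplicated user_id, this merge would raise
# pandas.errors.MergeError instead of silently exploding rows
```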
🧪 Validation After Join
```python
merged = pd.merge(users, orders, on="user_id", how="left")

# Every user must survive a left join, and the key must never be null
assert len(merged) >= len(users)
assert merged["user_id"].notna().all()
```
Production engineers always validate joins.
🧠 Real-World Case (Google / Meta / OpenAI)
Example:
- Join user metadata
- Join experiment assignment
- Aggregate metrics per variant
- Decide product direction
Bad join → wrong decision → $$$ lost
🧠 Interview Patterns
Interviewers test:
- GroupBy logic
- Correct join type
- Index reset
- Avoiding row explosion
🏁 Summary
- GroupBy = Split → Apply → Combine
- .agg() for metrics
- .transform() for ML features
- Joins connect reality
- Validate after join
If you master this chapter, you can survive most analytics interviews.
🚀 Next Chapter
👉 DE101-PD05 — Time Series & Window Operations
This is where senior-level thinking begins.