DE101-PD04 — GroupBy, Aggregation & Joins (Analytics & Interview Mastery)
~75–90 minutes
🎯 Why This Chapter Is Critical
If Pandas were a weapon:
- Cleaning = sharpening
- GroupBy & Join = actual power
Most real-world questions are:
- “Which group performs best?”
- “How do metrics change by segment?”
- “How do we combine datasets correctly?”
🧠 GroupBy Mental Model (Most Important Idea)
GroupBy = Split → Apply → Combine
- Split data into groups
- Apply aggregation or transformation
- Combine results into a new table
This mindset works in:
- Pandas
- SQL
- Spark
- BigQuery
📊 Example Dataset (E-Commerce Analytics)
```python
import pandas as pd

data = {
    "user_id": [1, 2, 1, 3, 2, 3, 1],
    "country": ["US", "US", "US", "FR", "US", "FR", "US"],
    "revenue": [100, 200, 150, 80, 120, 60, 90],
    "device": ["mobile", "desktop", "mobile", "mobile", "mobile", "desktop", "desktop"],
}
df = pd.DataFrame(data)
df
```
📦 Basic GroupBy
```python
df.groupby("country")["revenue"].sum()
```
Output
```text
country
FR    140
US    660
Name: revenue, dtype: int64
```
This answers:
“How much revenue comes from each country?”
🔢 Multiple Aggregations
```python
df.groupby("country")["revenue"].agg(["sum", "mean", "count"])
```
Used daily in:
- KPI dashboards
- Business reports
- Model evaluation
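For reports and dashboards, named aggregation is often cleaner, because you choose the output column names yourself. A minimal sketch, using a trimmed version of the chapter's dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "FR"],
    "revenue": [100, 200, 80],
})

# Named aggregation: keyword = output column name, value = function
summary = df.groupby("country")["revenue"].agg(
    total="sum", average="mean", orders="count"
)
```

The result has columns `total`, `average`, and `orders` instead of raw function names, which saves a rename step before the numbers reach a report.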
🧠 GroupBy Multiple Columns
```python
df.groupby(["country", "device"])["revenue"].sum()
```
Now you are thinking like an analyst.
🧮 Resetting Index (Very Common)
```python
df.groupby("country", as_index=False)["revenue"].sum()
```
Or:
```python
df.groupby("country")["revenue"].sum().reset_index()
```
This makes output usable for joins.
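To make that concrete, here is one way the flat result can be joined back onto the original rows (a sketch with a trimmed dataset; the `country_revenue` column name is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", "US", "FR"],
    "revenue": [100, 200, 80],
})

# as_index=False returns a flat DataFrame instead of an indexed Series
country_totals = df.groupby("country", as_index=False)["revenue"].sum()
country_totals = country_totals.rename(columns={"revenue": "country_revenue"})

# The flat frame merges cleanly onto the original rows
enriched = df.merge(country_totals, on="country", how="left")
```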
🧩 .transform() — Keep Original Shape
```python
df["country_avg_revenue"] = (
    df.groupby("country")["revenue"].transform("mean")
)
```
Use when:
- You need group stats per row
- Feature engineering for ML
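A typical feature-engineering pattern built on `transform` is comparing each row to its group statistic (the ratio column name below is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "FR"],
    "revenue": [100, 200, 80],
})

# transform returns one value per original row, so assignment lines up
df["country_avg_revenue"] = df.groupby("country")["revenue"].transform("mean")

# ML feature: how does this row compare to its country's average?
df["revenue_vs_country_avg"] = df["revenue"] / df["country_avg_revenue"]
```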
🔁 .apply() — Flexible but Dangerous
```python
df.groupby("country")["revenue"].apply(lambda s: s.max() - s.min())
```
⚠️ Powerful but slower. Select the column before `.apply()` so the lambda receives a Series rather than the whole group DataFrame, and use it only when no built-in aggregation can express the logic.
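The max-minus-min "range" above is a case where `.apply()` is avoidable. A sketch of two faster alternatives:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "FR", "FR"],
    "revenue": [100, 200, 80, 60],
})

# Option 1: a lambda inside .agg() on the selected column
ranges = df.groupby("country")["revenue"].agg(lambda s: s.max() - s.min())

# Option 2: fully vectorized, using two built-in aggregations
g = df.groupby("country")["revenue"]
ranges_fast = g.max() - g.min()
```

Built-in aggregations like `max` and `min` run in optimized code paths, so option 2 scales best on large data.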
🧠 SQL vs Pandas GroupBy
| SQL | Pandas |
|---|---|
| GROUP BY | groupby() |
| SUM | sum() |
| AVG | mean() |
| COUNT | count() |
Conceptually they are identical; only the syntax differs.
🔗 Why Joins Matter
Real-world data is never in one table.
You often have:
- users table
- orders table
- events table
- experiments table
Joins connect the story.
📊 Example Tables
Users
```python
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", "US", "FR"],
})
```
Orders
```python
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "user_id": [1, 2, 1, 4],
    "amount": [250, 300, 150, 500],
})
```
🔗 Inner Join
```python
pd.merge(users, orders, on="user_id", how="inner")
```
Keeps only matching rows.
🔗 Left Join (Most Common)
```python
pd.merge(users, orders, on="user_id", how="left")
```
Used when:
- Users are primary
- Orders may be missing
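When orders are missing, the left join fills the order columns with NaN. A common follow-up, sketched with the chapter's tables, is to treat missing revenue as zero:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["US", "US", "FR"]})
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "user_id": [1, 2, 1, 4],
    "amount": [250, 300, 150, 500],
})

merged = users.merge(orders, on="user_id", how="left")

# User 3 placed no orders, so amount is NaN after the left join;
# for revenue sums, missing usually means zero
merged["amount"] = merged["amount"].fillna(0)
```

Whether NaN should become 0 depends on the metric; for averages you would usually leave the NaN in place.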
🔗 Right Join
```python
pd.merge(users, orders, on="user_id", how="right")
```
Less common but useful for audits.
🔗 Outer Join
```python
pd.merge(users, orders, on="user_id", how="outer")
```
Used for:
- Data reconciliation
- Finding missing relationships
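For reconciliation work, `merge` accepts `indicator=True`, which labels where each row came from. A sketch using the chapter's tables:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["US", "US", "FR"]})
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "user_id": [1, 2, 1, 4],
    "amount": [250, 300, 150, 500],
})

# indicator=True adds a _merge column: "left_only", "right_only", or "both"
full = users.merge(orders, on="user_id", how="outer", indicator=True)

# Rows that failed to match on either side
unmatched = full[full["_merge"] != "both"]
```

Here user 3 has no orders (`left_only`) and order 104 references an unknown user 4 (`right_only`), exactly the missing relationships an outer join is meant to surface.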
🧠 Join on Different Column Names
In the sample tables both sides happen to share the name user_id, so to demonstrate mismatched keys we first rename the column on one side (a hypothetical rename for illustration):
```python
orders_alt = orders.rename(columns={"user_id": "uid"})

pd.merge(
    users,
    orders_alt,
    left_on="user_id",
    right_on="uid",
)
```
In real data, column names rarely match.
⚠️ Join Explosion (Critical Warning)
Many-to-many joins can multiply rows: if a key appears m times on the left and n times on the right, the join produces m × n rows for that key.
```python
users.merge(orders, on="user_id")
```
Always ask:
"What is the cardinality of the join key on each side?"
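`merge` can enforce the answer for you via the `validate` parameter, which raises a `MergeError` if the actual cardinality differs from what you declared. A sketch with the chapter's tables:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["US", "US", "FR"]})
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "user_id": [1, 2, 1, 4],
    "amount": [250, 300, 150, 500],
})

# Declare the expected cardinality: user_id is unique in users,
# but may repeat in orders
merged = users.merge(orders, on="user_id", validate="one_to_many")

# If users contained a duplicated user_id, this merge would raise
# pandas.errors.MergeError instead of silently exploding rows
```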
🧪 Validation After Join
```python
merged = pd.merge(users, orders, on="user_id", how="left")

# Every user must survive a left join, and the key must never be null
assert len(merged) >= len(users)
assert merged["user_id"].notna().all()
```
Production engineers always validate joins.
🧠 Real-World Case (Google / Meta / OpenAI)
Example:
- Join user metadata
- Join experiment assignment
- Aggregate metrics per variant
- Decide product direction
Bad join → wrong decision → $$$ lost
🧠 Interview Patterns
Interviewers test:
- GroupBy logic
- Correct join type
- Index reset
- Avoiding row explosion
🏁 Summary
- GroupBy = Split → Apply → Combine
- .agg() for metrics
- .transform() for ML features
- Joins connect reality
- Validate after join
If you master this chapter, you can survive most analytics interviews.
🚀 Next Chapter
👉 DE101-PD05 — Time Series & Window Operations
This is where senior-level thinking begins.